This is an exploratory data analysis of the mHealth data from:
Digital Health: Tracking Physiomes and Activity Using Wearable Biosensors Reveals Useful Health-Related Information Xiao Li , Jessilyn Dunn , Denis Salins , Gao Zhou, Wenyu Zhou, Sophia Miryam Schüssler-Fiorenza Rose, Dalia Perelman, Elizabeth Colbert, Ryan Runge, Shannon Rego, Ria Sonecha, Somalee Datta, Tracey McLaughlin, Michael P. Snyder Published: January 12, 2017 https://doi.org/10.1371/journal.pbio.2001402
The argument that I’m trying to make is that the SeeClickFix or crime data are similar to those from the Snyder data. They have seasonal variability and lots of things changing, so it can be difficult to isolate the effect of an intervention. Let’s take the example of an exercise intervention, where individuals are asked to increase their activity and monitor the effect on some other outcome, like heart rate. You can use the variability to which different individuals followed the intervention as a means to identify the signal from that intervention.
The two that I expect to relate are MOVES and heart rate.
Activity is likely to have an effect on heart rate, but are there some activities that have a larger effect than others? Can we figure out which is most effective at reducing heart rate by comparing how much the different activities are performed by different people?
This is similar to SeeClickFix and Crime, in that there are many different categories of posts, and they may effect crime to varying degrees.
I suppose I should just check a standard regression after an intervention and see if the amount of seeclickfix use was significant.
These are the devices for which data are available on the paper’s website:
HR and SpO2 data collected using four devices (Scanadu Scout, iHealth-finger, Masimo, and Basis) were very close to that of the WA instrument over a wide range of values using the Bland–Altman method of comparison [16,17]
So I’ll compare the MOVES data, as an analog of SeeClickFix, to heart rate, collected by any of four devices. First I’ll check out the MOVES data.
files <- list.files("wearables/Subject1_rawdata/MOVES/", full.names = TRUE)
head(files)
## [1] "wearables/Subject1_rawdata/MOVES//activities.csv"
moves <- read.csv(files)
There are a series of different activites with dates and duration time stamps.
summary(moves)
## Date Activity Group
## 3/18/15 : 48 walking :14011 : 7
## 7/24/14 : 43 transport : 1874 cycling : 880
## 8/23/14 : 43 cycling : 880 running : 184
## 3/24/15 : 40 running : 184 transport: 1923
## 1/5/16 : 39 airplane : 49 walking :14011
## 12/27/15: 39 basketball: 3
## (Other) :16753 (Other) : 4
## Start End
## 2014-09-18T12:32:55-07:00: 2 2014-03-21T15:26:27-07:00: 1
## 2014-03-21T15:26:22-07:00: 1 2014-03-21T16:04:14-07:00: 1
## 2014-03-21T16:02:33-07:00: 1 2014-03-21T16:26:22-07:00: 1
## 2014-03-21T16:05:20-07:00: 1 2014-03-21T16:30:20-07:00: 1
## 2014-03-21T16:26:43-07:00: 1 2014-03-21T17:15:32-07:00: 1
## 2014-03-21T17:15:04-07:00: 1 2014-03-21T18:19:28-07:00: 1
## (Other) :16998 (Other) :16999
## Duration Distance Steps Calories
## Min. : 0.0 Min. : 0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.: 90.0 1st Qu.: 0.040 1st Qu.: 36.0 1st Qu.: 2.0
## Median : 203.0 Median : 0.138 Median : 173.0 Median : 8.0
## Mean : 513.9 Mean : 19.219 Mean : 334.2 Mean : 17.6
## 3rd Qu.: 434.0 3rd Qu.: 0.367 3rd Qu.: 403.0 3rd Qu.: 18.0
## Max. :50207.0 Max. :7572.527 Max. :8170.0 Max. :567.0
##
Each date has ~40 entries, comprised of different activities like cycling and running. I’ll collapse all of the different activities into a single entry per day.
moves1 <- moves %>%
ddply(c("Date", "Group"), summarize,
duration = sum(Duration),
distance = sum(Distance),
calories = sum(Calories))
# I'm going to exclude this for now, since "transport" is also in the list, and
# the total amount of moving + transport time doesn't really make sense. I should do the total active time.
movesActive <- moves1[-grep("transport", moves1$Group),]
movesActive <- movesActive %>%
ddply("Date", summarize,
duration = sum(duration),
distance = sum(distance),
calories = sum(calories))
movesActive$Group <- "active"
moves2 <- rbind(moves1, movesActive)
moves2$Date <- as.Date(as.character(moves2$Date), format = "%m/%d/%y")
moves2 <- moves2[order(moves2$Date),]
moves2 %>%
ggplot(aes(x = Date, y = duration)) +
geom_line(aes(group = Group, color = Group))
Why is that so spikey? Maybe I have to smooth by collapsing to weeks? Also, what’s that red line without a label?
table(moves2$Group)
##
## cycling running transport walking active
## 6 321 167 529 740 740
moves2[grep("^$", moves2$Group),]
## Date Group duration distance calories
## 1647 2014-09-15 5100 0 351
## 1662 2014-09-18 2100 0 199
## 1666 2014-09-19 3900 0 323
## 167 2014-10-12 1200 0 140
## 553 2014-12-31 1800 0 93
## 1660 2015-09-17 2100 0 199
Those have a huge duration! Since I don’t know what they are I’ll remove them.
moves2 <- moves2[-grep("^$", moves2$Group),]
moves2 %>%
ggplot(aes(x = Date, y = duration)) +
geom_line(aes(group = Group, color = Group))
Still too spikey. I’ll aggregate to week.
moves2$week <- format(moves2$Date, format = "%Y-%U")
moves3 <- moves2 %>%
ddply(c("week", "Group"), summarize,
duration = sum(duration),
distance = sum(distance),
calories = sum(calories))
moves3 %>%
ggplot(aes(x = week, y = duration)) +
geom_line(aes(group = Group, color = Group)) +
theme(
axis.text.x = element_blank()
)
What if I exclude transport?
moves_notrans <- moves3[-grep("transport", moves3$Group),]
pmoves <- moves_notrans %>%
ggplot(aes(x = week, y = duration)) +
geom_line(aes(group = Group, color = Group)) +
theme(
axis.text.x = element_blank()
) +
labs(title = "MOVES data",
subtitle = "Activity over time")
pmoves
How about by month?
moves2$month <- format(moves2$Date, format = "%Y-%m")
moves4 <- moves2 %>%
ddply(c("month", "Group"), summarize,
duration = sum(duration),
distance = sum(distance),
calories = sum(calories))
moves4 %>%
ggplot(aes(x = month, y = duration)) +
geom_line(aes(group = Group, color = Group)) +
theme(
axis.text.x = element_blank()
) +
labs(title = "MOVES data",
subtitle = "Activity over time")
I like the weeks best.
Ok, now I need the heart rate information. First I’ll check masimo.
files <- list.files("wearables/Subject1_rawdata/Masimo/", full.names = TRUE)
head(files)
## [1] "wearables/Subject1_rawdata/Masimo//Subject1_MPSnyder_masimohealth_010516_to_011916.csv"
## [2] "wearables/Subject1_rawdata/Masimo//Subject1_MPSnyder_masimohealth_010516_to_012816.csv"
## [3] "wearables/Subject1_rawdata/Masimo//Subject1_MPSnyder_masimohealth_012816_to_021116.csv"
## [4] "wearables/Subject1_rawdata/Masimo//Subject1_MPSnyder_masimohealth_021216_to_021216.csv"
## [5] "wearables/Subject1_rawdata/Masimo//Subject1_MPSnyder_masimohealth_021716_to_021716.csv"
## [6] "wearables/Subject1_rawdata/Masimo//Subject1_MPSnyder_masimohealth_021916_to_030316.csv"
masimo <- lapply(files, function(x) read.csv(x, skip = 1, header = FALSE, as.is = TRUE))
# Create a single dataframe
df <- do.call(rbind, masimo)
# Change column names
names(df) <- c("Session", "Index", "Timestamp", "Day", "Date", "Year", "Time",
"SpO2", "Pulse Rate")
# Convert to long format and visualize
df[, c("Timestamp", "SpO2", "Pulse Rate")] %>%
gather(measurement, value, SpO2, "Pulse Rate", -Timestamp) %>%
ggplot(aes(x = Timestamp, y = value)) +
geom_line(aes(group = measurement, color = measurement)) +
labs(title = "Masimo data",
subtitle = "Heart rate and Blood Oxygen Saturation over time")
Wow there are a lot of dropouts, I should remove outliers and view the mean per hour, or day or something like that. First I’ll have to wrangle the time into an easier format. I wonder if this would be easier if I used one of the other device’s data? I’ll check a few others for easier date measurements.
files <- list.files("wearables/Subject1_rawdata/ihealth-finger/", full.names = TRUE)
head(files)
## [1] "wearables/Subject1_rawdata/ihealth-finger//ihealth.txt"
x <- read.table(files, sep = "\t", header = TRUE, as.is = TRUE)
head(x)
## Date Time TotalTime SpO2 Pulse X
## 1 6/19/15 14:44 6/19/15 14:44 97% 101 bpm
## 2 6/19/15 14:45 6/19/15 14:45 96% 90 bpm
## 3 6/19/15 15:56 6/19/15 15:56 96% 98 bpm
## 4 6/19/15 15:58 6/19/15 15:58 97% 93 bpm
## 5 6/19/15 16:08 6/19/15 16:08 97% 93 bpm
## 6 6/19/15 16:20 6/19/15 16:20 98% 95 bpm
dim(x)
## [1] 5151 6
# Reformat columns
x$time <- as.POSIXct(x$TotalTime, tz = "America/Los_Angeles", format = "%m/%d/%y %H:%M")
x$SpO2 <- as.numeric(gsub("%", "", x$SpO2))
x$Pulse <- as.numeric(gsub(" bpm", "", x$Pulse))
x[, c("time", "SpO2", "Pulse")] %>%
gather(measurement, value, SpO2, Pulse, -time) %>%
ggplot(aes(x = time, y = value)) +
geom_line(aes(group = measurement, color = measurement)) +
labs(title = "ihealth-finger Data",
subtitle = "Heart rate and Blood Oxygen Saturation over time")
There are about 100 fold fewer observations for ihealth finger than for Masimo. I’ll go back to Masimo and clean it up a bit.
# Format the date
df$datetime <- paste(df$Date, df$Year, df$Time, sep = ", ") %>%
gsub(" Pacific Standard Time", "", .) %>%
as.POSIXct(datetime, tz = "America/Los_Angeles", format = " %B %d, %Y, %I:%M:%S %p")
dfp <- df[, c("datetime", "SpO2", "Pulse Rate")]
# Drop zeros
# sd3p <- which(dfp$Pulse < mean(dfp$Pulse) - sd(dfp$Pulse)*4)
sd3p <- which(dfp$Pulse < 20)
dfp$Pulse[sd3p] <- NA
# sd3s <- which(dfp$SpO2 < mean(dfp$SpO2) - sd(dfp$SpO2)*4)
sd3s <- which(dfp$SpO2 < 20)
dfp$SpO2[sd3s] <- NA
dfp %>%
gather(measurement, value, SpO2, "Pulse Rate", -datetime, na.rm = TRUE) %>%
ggplot(aes(x = datetime, y = value)) +
geom_line(aes(group = measurement, color = measurement)) +
labs(title = "Masimo data",
subtitle = "Heart rate and Blood Oxygen Saturation over time")
Maybe the Withings data would be better? Weight vs activity would be easier for people to connect with?
files <- list.files("wearables/Subject1_rawdata/Withings/", full.names = TRUE)
head(files)
## [1] "wearables/Subject1_rawdata/Withings//Withings_activity_and_calories.csv"
## [2] "wearables/Subject1_rawdata/Withings//Withings_weight.xlsx"
withings <- readxl::read_xlsx(files[2])
Oh man that looks so clean. The date is already formatted correctly.
withings %>%
ggplot(aes(x = Date, y = Weight)) +
geom_line() +
theme_bw() +
labs(title = "Withings data",
subtitle = "Weight over time")
There are a few extremely low values – I’ll remove those
withings2 <- withings[!(withings$Weight < 120 | withings$Weight > 153), ]
pwithings <- withings2 %>%
ggplot(aes(x = Date, y = Weight)) +
geom_line() +
theme_bw() +
labs(title = "Withings data",
subtitle = "Weight over time")
pwithings
gridExtra::grid.arrange(pmoves, pwithings, ncol = 1)
files <- list.files("wearables/Subject1_rawdata/Withings", full.names = TRUE)
head(files)
## [1] "wearables/Subject1_rawdata/Withings/Withings_activity_and_calories.csv"
## [2] "wearables/Subject1_rawdata/Withings/Withings_weight.xlsx"
withingso <- read_csv(files[1])
## Parsed with column specification:
## cols(
## Date = col_character(),
## Steps = col_integer(),
## `Active calories` = col_integer()
## )
names(withingso)
## [1] "Date" "Steps" "Active calories"
withingso$Date <- as.Date(withingso$Date, format = "%m/%d/%y")
withingso %>%
gather(measurement, value, -Date) %>%
ggplot(aes(x = Date, y = value)) +
geom_line(aes(group = measurement, color = measurement)) +
labs(title = "Withings data",
subtitle = "Steps and Active Calories over time")
## Warning: Removed 116 rows containing missing values (geom_path).
files <- list.files("wearables/Subject1_rawdata/Scanadu/", full.names = TRUE)
head(files)
## [1] "wearables/Subject1_rawdata/Scanadu//ScanaduData.tsv"
scanadu <- read.table(files, sep = "\t", header = TRUE)
names(scanadu)
## [1] "Data" "Time" "HR" "SpO2"
# Plot the number of readings per day for the devices reporting HR and SpO2
rbind(data.frame(table(scanadu$Data), device = "Scanadu"),
data.frame(table(df$Date), device = "Masimo"),
data.frame(table(x$Date), device = "iHealth-finger")) %>%
ggplot(aes(Freq)) +
geom_density(stat = "bin", aes(fill = device), alpha = 0.5) +
labs(x = "Mesurements per day",
fill = "Device",
title = "Number of recorded measurements per day",
subtitle = "The three devices for HR and SpO2") +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
files <- list.files("wearables/Subject1_rawdata/Radiation/", full.names = TRUE)
head(files)
## [1] "wearables/Subject1_rawdata/Radiation//Stanford_Wearables_Radiation_06072016.xlsx"
rad <- readxl::read_xlsx(files)
radlab <- rad[1,2]
rad <- rad[-1,]
names(rad) <- c("datetime", "rad")
rad$rad <- as.numeric(rad$rad)
rad$datetime <- as.POSIXct(rad$datetime, tz = "America/Los_Angeles", format = "%Y-%m-%d %H")
ggplot(rad, aes(x = datetime, y = rad)) +
geom_line() +
theme_bw() +
labs(title = "RadTarge radiation measurements",
y = radlab,
x = "Date")
files <- list.files("wearables/Subject1_rawdata/Basis/Basis_Peak/", full.names = TRUE)
head(files)
## [1] "wearables/Subject1_rawdata/Basis/Basis_Peak//bodymetrics_to_03-31-16.csv"
## [2] "wearables/Subject1_rawdata/Basis/Basis_Peak//sleep_to_03-31-16.csv"
body <- read.csv(files[1])
body$date <- gsub("Z$", "", body$date)
head(body$date)
## [1] "2015-05-20 18:50" "2015-05-20 18:51" "2015-05-20 18:52"
## [4] "2015-05-20 18:53" "2015-05-20 18:54" "2015-05-20 18:55"
body$date <- as.POSIXct(body$date, tz = "America/Los_Angeles", format = "%Y-%m-%d %H:%M")
body[,c("date", "heart.rate", "skin.temp")] %>%
gather(measurement, value, -date) %>%
ggplot(aes(date, value)) +
geom_line(aes(group = measurement, color = measurement)) +
theme_bw() +
labs(title = "Basis Data",
subtitle = "Heart Rate and Skin Temperature")
## Warning: Removed 128 rows containing missing values (geom_path).
body %>%
ggplot(aes(date, gsr)) +
geom_line() +
theme_bw() +
labs(title = "Basis Data",
subtitle = "Galvanic Skin Response")
## Warning: Removed 64 rows containing missing values (geom_path).
sleep <- read.csv(files[2], as.is = TRUE)
summary(sleep)
## local_start_time local_end_time heart_rate_avg calories
## Length:711 Length:711 Min. : 51.15 Min. : 1.964
## Class :character Class :character 1st Qu.: 62.21 1st Qu.: 49.100
## Mode :character Mode :character Median : 67.35 Median : 147.900
## Mean : 67.65 Mean : 162.758
## 3rd Qu.: 72.32 3rd Qu.: 244.700
## Max. :106.18 Max. :1331.600
## actual_minutes light_minutes deep_minutes rem_minutes
## Min. : 20.0 Min. : 20.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 50.5 1st Qu.: 45.50 1st Qu.: 0.00 1st Qu.: 0.00
## Median :150.0 Median : 91.00 Median : 21.00 Median : 33.00
## Mean :163.6 Mean : 94.93 Mean : 30.53 Mean : 37.78
## 3rd Qu.:248.0 3rd Qu.:131.50 3rd Qu.: 52.00 3rd Qu.: 65.00
## Max. :632.0 Max. :398.00 Max. :168.00 Max. :145.00
## interruption_minutes unknown_minutes interruptions toss_and_turn
## Min. : 0.000 Min. : 0.000 Min. :0.0000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.: 4.00
## Median : 0.000 Median : 0.000 Median :0.0000 Median :10.00
## Mean : 1.778 Mean : 0.121 Mean :0.3446 Mean :11.74
## 3rd Qu.: 0.000 3rd Qu.: 0.000 3rd Qu.:1.0000 3rd Qu.:17.00
## Max. :28.000 Max. :23.000 Max. :4.0000 Max. :63.00
## start_timestamp start_time_iso start_time_timezone
## Min. :1.433e+09 Length:711 Length:711
## 1st Qu.:1.440e+09 Class :character Class :character
## Median :1.445e+09 Mode :character Mode :character
## Mean :1.446e+09
## 3rd Qu.:1.451e+09
## Max. :1.459e+09
## start_time_offset end_timestamp end_time_iso
## Min. :-480.0 Min. :1.433e+09 Length:711
## 1st Qu.:-480.0 1st Qu.:1.440e+09 Class :character
## Median :-420.0 Median :1.445e+09 Mode :character
## Mean :-344.6 Mean :1.446e+09
## 3rd Qu.:-300.0 3rd Qu.:1.451e+09
## Max. : 120.0 Max. :1.459e+09
## end_time_timezone end_time_offset X
## Length:711 Min. :-480.0 Mode:logical
## Class :character 1st Qu.:-480.0 NA's:711
## Mode :character Median :-420.0
## Mean :-343.8
## 3rd Qu.:-300.0
## Max. : 120.0
time <- strsplit(sleep$local_start_time, split = " ")
time <- data.frame(do.call(rbind, time))
time[,2] <- toupper(time[,2])
y <- apply(time, 1, function(x) paste(x, collapse = " "))
sleep$time <- as.POSIXct(y, tz = "America/Los_Angeles", format = "%d %b %Y %H:%M:%S")
names(sleep)
## [1] "local_start_time" "local_end_time" "heart_rate_avg"
## [4] "calories" "actual_minutes" "light_minutes"
## [7] "deep_minutes" "rem_minutes" "interruption_minutes"
## [10] "unknown_minutes" "interruptions" "toss_and_turn"
## [13] "start_timestamp" "start_time_iso" "start_time_timezone"
## [16] "start_time_offset" "end_timestamp" "end_time_iso"
## [19] "end_time_timezone" "end_time_offset" "X"
## [22] "time"
sleep[,c("time", "actual_minutes", "light_minutes", "deep_minutes", "rem_minutes", "interruption_minutes", "unknown_minutes")] %>%
gather(type, minutes, -time) %>%
ggplot(aes(x = time, y = minutes)) +
geom_line(aes(group = type, color = type))
## Warning: Removed 6 rows containing missing values (geom_path).
sleep$day <- format(sleep$time, format = "%Y-%m-%d")
sleep$day <- as.Date(sleep$time, format = "%Y-%m-%d")
sleep[,c("day", "actual_minutes", "light_minutes", "deep_minutes", "rem_minutes", "interruption_minutes", "unknown_minutes")] %>%
ddply("day", summarize,
actual = sum(actual_minutes),
light = sum(light_minutes),
deep = sum(deep_minutes),
rem = sum(rem_minutes),
interruption = sum(interruption_minutes),
unknown = sum(unknown_minutes)) %>%
gather(type, minutes, -day) %>%
ggplot(aes(x = day, y = minutes)) +
geom_line(aes(group = type, color = type)) +
theme_bw() +
labs(title = "BASIS sleep data")
## Warning: Removed 6 rows containing missing values (geom_path).
files <- list.files("wearables/Subject1_rawdata/Basis/MPS_B1/", full.names = TRUE)
head(files)
## [1] "wearables/Subject1_rawdata/Basis/MPS_B1//mpsnyder_20140523.csv"
## [2] "wearables/Subject1_rawdata/Basis/MPS_B1//mpsnyder_20140524.csv"
## [3] "wearables/Subject1_rawdata/Basis/MPS_B1//mpsnyder_20140525.csv"
## [4] "wearables/Subject1_rawdata/Basis/MPS_B1//mpsnyder_20140526.csv"
## [5] "wearables/Subject1_rawdata/Basis/MPS_B1//mpsnyder_20140527.csv"
## [6] "wearables/Subject1_rawdata/Basis/MPS_B1//mpsnyder_20140528.csv"
basis <- lapply(files[1], read.csv)
basis <- do.call(rbind, basis)
# Remove a big pile of NAs
basis <- basis[!is.na(basis$hr),]
time <- gsub("..7:00", "", basis$time)
basis$time <- as.POSIXct(time, tz = "America/Los_Angeles", format = "%Y-%m-%d %H:%M:%S")
ggplot(basis, aes(x = time, y = accel_magnitude)) +
geom_line() +
labs(title = "Basis data",
subtitle = "Acceleration") +
theme_bw()
## Warning: Removed 99 rows containing missing values (geom_path).