Wearables EDA

This is an exploratory data analysis of the mHealth data from:

Digital Health: Tracking Physiomes and Activity Using Wearable Biosensors Reveals Useful Health-Related Information Xiao Li , Jessilyn Dunn , Denis Salins , Gao Zhou, Wenyu Zhou, Sophia Miryam Schüssler-Fiorenza Rose, Dalia Perelman, Elizabeth Colbert, Ryan Runge, Shannon Rego, Ria Sonecha, Somalee Datta, Tracey McLaughlin, Michael P. Snyder Published: January 12, 2017 https://doi.org/10.1371/journal.pbio.2001402

The argument that I’m trying to make is that the SeeClickFix or crime data are similar to those from the Snyder data. They have seasonal variability and lots of things changing, so it can be difficult to isolate the effect of an intervention. Let’s take the example of an exercise intervention, where individuals are asked to increase their activity and monitor the effect on some other outcome, like heart rate. You can use the variability to which different individuals followed the intervention as a means to identify the signal from that intervention.

The two that I expect to relate are MOVES and heart rate.

Activity is likely to have an effect on heart rate, but are there some activities that have a larger effect than others? Can we figure out which is most effective at reducing heart rate by comparing how much the different activities are performed by different people?

This is similar to SeeClickFix and Crime, in that there are many different categories of posts, and they may effect crime to varying degrees.

I suppose I should just check a standard regression after an intervention and see if the amount of seeclickfix use was significant.

These are the devices for which data are available on the paper’s website:

Basis_Peak - a fitness tracker and heart rate monitor recently recalled for catching fire and burning users. Engaget describes it as, “an under-rated fitness tracker with great battery life but a terrible companion app.”
Basis_MPS_B1 - Looks like it’s the same as Basis_Peak, but also includes accelerometer data and has many more time points.
MOVES - an app that tracks number of steps as well as the duration and location of movement. It estimates time spent running, walking, in transportation, biking, etc. A similar app called Human is also popular for iOS.
Masimo - a device targeted at patient monitoring; here the data include pulse and oxygen saturation.
Radiation - RadTarge II, electronic personal dosimeter. Reports Hp(10), deep dose equivalents.
Scanadu - Scout device, also measures heart rate and oxygen saturation over time.
Withings - bathroom scale that reports weight, but there is also data for steps and active calories.
ihealth-finger - finger clamp that measures pulse and oxygen saturation.

HR and SpO2 data collected using four devices (Scanadu Scout, iHealth-finger, Masimo, and Basis) were very close to that of the WA instrument over a wide range of values using the Bland–Altman method of comparison [16,17]

So I’ll compare the MOVES data, as an analog of SeeClickFix, to heart rate, collected by any of four devices. First I’ll check out the MOVES data.

files <- list.files("wearables/Subject1_rawdata/MOVES/", full.names = TRUE)
head(files)

## [1] "wearables/Subject1_rawdata/MOVES//activities.csv"

moves <- read.csv(files)

There are a series of different activites with dates and duration time stamps.

summary(moves)

##        Date             Activity           Group      
##  3/18/15 :   48   walking   :14011            :    7  
##  7/24/14 :   43   transport : 1874   cycling  :  880  
##  8/23/14 :   43   cycling   :  880   running  :  184  
##  3/24/15 :   40   running   :  184   transport: 1923  
##  1/5/16  :   39   airplane  :   49   walking  :14011  
##  12/27/15:   39   basketball:    3                    
##  (Other) :16753   (Other)   :    4                    
##                        Start                              End       
##  2014-09-18T12:32:55-07:00:    2   2014-03-21T15:26:27-07:00:    1  
##  2014-03-21T15:26:22-07:00:    1   2014-03-21T16:04:14-07:00:    1  
##  2014-03-21T16:02:33-07:00:    1   2014-03-21T16:26:22-07:00:    1  
##  2014-03-21T16:05:20-07:00:    1   2014-03-21T16:30:20-07:00:    1  
##  2014-03-21T16:26:43-07:00:    1   2014-03-21T17:15:32-07:00:    1  
##  2014-03-21T17:15:04-07:00:    1   2014-03-21T18:19:28-07:00:    1  
##  (Other)                  :16998   (Other)                  :16999  
##     Duration          Distance            Steps           Calories    
##  Min.   :    0.0   Min.   :   0.000   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.:   90.0   1st Qu.:   0.040   1st Qu.:  36.0   1st Qu.:  2.0  
##  Median :  203.0   Median :   0.138   Median : 173.0   Median :  8.0  
##  Mean   :  513.9   Mean   :  19.219   Mean   : 334.2   Mean   : 17.6  
##  3rd Qu.:  434.0   3rd Qu.:   0.367   3rd Qu.: 403.0   3rd Qu.: 18.0  
##  Max.   :50207.0   Max.   :7572.527   Max.   :8170.0   Max.   :567.0  
##

Each date has ~40 entries, comprised of different activities like cycling and running. I’ll collapse all of the different activities into a single entry per day.

moves1 <- moves %>%
  ddply(c("Date", "Group"), summarize, 
          duration = sum(Duration),
          distance = sum(Distance),
          calories = sum(Calories))

# I'm going to exclude this for now, since "transport" is also in the list, and 
# the total amount of moving + transport time doesn't really make sense. I should do the total active time.
movesActive <- moves1[-grep("transport", moves1$Group),]

movesActive <- movesActive %>%
  ddply("Date", summarize,
          duration = sum(duration),
          distance = sum(distance),
          calories = sum(calories))
  
movesActive$Group <- "active"

moves2 <- rbind(moves1, movesActive)

moves2$Date <- as.Date(as.character(moves2$Date), format = "%m/%d/%y")

moves2 <- moves2[order(moves2$Date),]

moves2 %>%
  ggplot(aes(x = Date, y = duration)) +
    geom_line(aes(group = Group, color = Group))

Why is that so spikey? Maybe I have to smooth by collapsing to weeks? Also, what’s that red line without a label?

table(moves2$Group)

## 
##             cycling   running transport   walking    active 
##         6       321       167       529       740       740

moves2[grep("^$", moves2$Group),]

##            Date Group duration distance calories
## 1647 2014-09-15           5100        0      351
## 1662 2014-09-18           2100        0      199
## 1666 2014-09-19           3900        0      323
## 167  2014-10-12           1200        0      140
## 553  2014-12-31           1800        0       93
## 1660 2015-09-17           2100        0      199

Those have a huge duration! Since I don’t know what they are I’ll remove them.

moves2 <- moves2[-grep("^$", moves2$Group),]

moves2 %>%
  ggplot(aes(x = Date, y = duration)) +
    geom_line(aes(group = Group, color = Group))

Still too spikey. I’ll aggregate to week.

moves2$week <- format(moves2$Date, format = "%Y-%U")

moves3 <- moves2 %>%
  ddply(c("week", "Group"), summarize,
          duration = sum(duration),
          distance = sum(distance),
          calories = sum(calories))

moves3 %>%
  ggplot(aes(x = week, y = duration)) +
    geom_line(aes(group = Group, color = Group)) +
    theme(
      axis.text.x = element_blank()
    )

What if I exclude transport?

moves_notrans <- moves3[-grep("transport", moves3$Group),]

pmoves <- moves_notrans %>%
  ggplot(aes(x = week, y = duration)) +
    geom_line(aes(group = Group, color = Group)) +
    theme(
      axis.text.x = element_blank()
    ) +
  labs(title = "MOVES data",
         subtitle = "Activity over time") 
pmoves

How about by month?

moves2$month <- format(moves2$Date, format = "%Y-%m")

moves4 <- moves2 %>%
  ddply(c("month", "Group"), summarize,
          duration = sum(duration),
          distance = sum(distance),
          calories = sum(calories))

moves4 %>%
  ggplot(aes(x = month, y = duration)) +
    geom_line(aes(group = Group, color = Group)) +
    theme(
      axis.text.x = element_blank()
    ) +
  labs(title = "MOVES data",
         subtitle = "Activity over time")

I like the weeks best.

Ok, now I need the heart rate information. First I’ll check masimo.

files <- list.files("wearables/Subject1_rawdata/Masimo/", full.names = TRUE)
head(files)

## [1] "wearables/Subject1_rawdata/Masimo//Subject1_MPSnyder_masimohealth_010516_to_011916.csv"
## [2] "wearables/Subject1_rawdata/Masimo//Subject1_MPSnyder_masimohealth_010516_to_012816.csv"
## [3] "wearables/Subject1_rawdata/Masimo//Subject1_MPSnyder_masimohealth_012816_to_021116.csv"
## [4] "wearables/Subject1_rawdata/Masimo//Subject1_MPSnyder_masimohealth_021216_to_021216.csv"
## [5] "wearables/Subject1_rawdata/Masimo//Subject1_MPSnyder_masimohealth_021716_to_021716.csv"
## [6] "wearables/Subject1_rawdata/Masimo//Subject1_MPSnyder_masimohealth_021916_to_030316.csv"

masimo <- lapply(files, function(x) read.csv(x, skip = 1, header = FALSE, as.is = TRUE))

# Create a single dataframe
df <- do.call(rbind, masimo)

# Change column names
names(df) <- c("Session", "Index", "Timestamp", "Day", "Date", "Year", "Time",
                "SpO2", "Pulse Rate")

# Convert to long format and visualize
df[, c("Timestamp", "SpO2", "Pulse Rate")] %>%
  gather(measurement, value, SpO2, "Pulse Rate", -Timestamp) %>%
  ggplot(aes(x = Timestamp, y = value)) +
    geom_line(aes(group = measurement, color = measurement)) +
    labs(title = "Masimo data",
         subtitle = "Heart rate and Blood Oxygen Saturation over time")

Wow there are a lot of dropouts, I should remove outliers and view the mean per hour, or day or something like that. First I’ll have to wrangle the time into an easier format. I wonder if this would be easier if I used one of the other device’s data? I’ll check a few others for easier date measurements.

files <- list.files("wearables/Subject1_rawdata/ihealth-finger/", full.names = TRUE)
head(files)

## [1] "wearables/Subject1_rawdata/ihealth-finger//ihealth.txt"

x <- read.table(files, sep = "\t", header = TRUE, as.is = TRUE)
head(x)

##      Date  Time     TotalTime SpO2   Pulse X
## 1 6/19/15 14:44 6/19/15 14:44  97% 101 bpm  
## 2 6/19/15 14:45 6/19/15 14:45  96%  90 bpm  
## 3 6/19/15 15:56 6/19/15 15:56  96%  98 bpm  
## 4 6/19/15 15:58 6/19/15 15:58  97%  93 bpm  
## 5 6/19/15 16:08 6/19/15 16:08  97%  93 bpm  
## 6 6/19/15 16:20 6/19/15 16:20  98%  95 bpm

dim(x)

## [1] 5151    6

# Reformat columns
x$time <- as.POSIXct(x$TotalTime, tz = "America/Los_Angeles", format = "%m/%d/%y %H:%M")
x$SpO2 <- as.numeric(gsub("%", "", x$SpO2))
x$Pulse <- as.numeric(gsub(" bpm", "", x$Pulse))

x[, c("time", "SpO2", "Pulse")] %>%
  gather(measurement, value, SpO2, Pulse, -time) %>%
  ggplot(aes(x = time, y = value)) +
    geom_line(aes(group = measurement, color = measurement)) +
    labs(title = "ihealth-finger Data",
         subtitle = "Heart rate and Blood Oxygen Saturation over time")

There are about 100 fold fewer observations for ihealth finger than for Masimo. I’ll go back to Masimo and clean it up a bit.

# Format the date
df$datetime <- paste(df$Date, df$Year, df$Time, sep = ", ") %>%
            gsub(" Pacific Standard Time", "", .) %>%
            as.POSIXct(datetime, tz = "America/Los_Angeles", format = " %B %d, %Y, %I:%M:%S %p")  

dfp <- df[, c("datetime", "SpO2", "Pulse Rate")]

# Drop zeros
# sd3p <- which(dfp$Pulse < mean(dfp$Pulse) - sd(dfp$Pulse)*4)
sd3p <- which(dfp$Pulse < 20)
dfp$Pulse[sd3p] <- NA

# sd3s <- which(dfp$SpO2 < mean(dfp$SpO2) - sd(dfp$SpO2)*4)
sd3s <- which(dfp$SpO2 < 20)
dfp$SpO2[sd3s] <- NA

dfp %>%
  gather(measurement, value, SpO2, "Pulse Rate", -datetime, na.rm = TRUE) %>%
  ggplot(aes(x = datetime, y = value)) +
    geom_line(aes(group = measurement, color = measurement)) +
    labs(title = "Masimo data",
         subtitle = "Heart rate and Blood Oxygen Saturation over time")

Maybe the Withings data would be better? Weight vs activity would be easier for people to connect with?

files <- list.files("wearables/Subject1_rawdata/Withings/", full.names = TRUE)
head(files)

## [1] "wearables/Subject1_rawdata/Withings//Withings_activity_and_calories.csv"
## [2] "wearables/Subject1_rawdata/Withings//Withings_weight.xlsx"

withings <- readxl::read_xlsx(files[2])

Oh man that looks so clean. The date is already formatted correctly.

withings %>%
  ggplot(aes(x = Date, y = Weight)) +
    geom_line() +
    theme_bw() +
    labs(title = "Withings data",
         subtitle = "Weight over time")

There are a few extremely low values – I’ll remove those

withings2 <- withings[!(withings$Weight < 120 | withings$Weight > 153), ]

pwithings <- withings2 %>%
  ggplot(aes(x = Date, y = Weight)) +
    geom_line() +
    theme_bw() +
    labs(title = "Withings data",
         subtitle = "Weight over time")
pwithings

gridExtra::grid.arrange(pmoves, pwithings, ncol = 1)

files <- list.files("wearables/Subject1_rawdata/Withings", full.names = TRUE)
head(files)

## [1] "wearables/Subject1_rawdata/Withings/Withings_activity_and_calories.csv"
## [2] "wearables/Subject1_rawdata/Withings/Withings_weight.xlsx"

withingso <- read_csv(files[1])

## Parsed with column specification:
## cols(
##   Date = col_character(),
##   Steps = col_integer(),
##   `Active calories` = col_integer()
## )

names(withingso)

## [1] "Date"            "Steps"           "Active calories"

withingso$Date <- as.Date(withingso$Date, format = "%m/%d/%y")

withingso %>%
  gather(measurement, value, -Date) %>%
  ggplot(aes(x = Date, y = value)) +
    geom_line(aes(group = measurement, color = measurement)) +
    labs(title = "Withings data",
         subtitle = "Steps and Active Calories over time")

## Warning: Removed 116 rows containing missing values (geom_path).

files <- list.files("wearables/Subject1_rawdata/Scanadu/", full.names = TRUE)
head(files)

## [1] "wearables/Subject1_rawdata/Scanadu//ScanaduData.tsv"

scanadu <- read.table(files, sep = "\t", header = TRUE)
names(scanadu)

## [1] "Data" "Time" "HR"   "SpO2"

# Plot the number of readings per day for the devices reporting HR and SpO2
rbind(data.frame(table(scanadu$Data), device = "Scanadu"),
              data.frame(table(df$Date), device = "Masimo"),
              data.frame(table(x$Date), device = "iHealth-finger")) %>%
  ggplot(aes(Freq)) +
    geom_density(stat = "bin", aes(fill = device), alpha = 0.5) +
    labs(x = "Mesurements per day",
         fill = "Device",
         title = "Number of recorded measurements per day",
         subtitle = "The three devices for HR and SpO2") +
    theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

files <- list.files("wearables/Subject1_rawdata/Radiation/", full.names = TRUE)
head(files)

## [1] "wearables/Subject1_rawdata/Radiation//Stanford_Wearables_Radiation_06072016.xlsx"

rad <- readxl::read_xlsx(files)
radlab <- rad[1,2]
rad <- rad[-1,]
names(rad) <- c("datetime", "rad")

rad$rad <- as.numeric(rad$rad)
rad$datetime <- as.POSIXct(rad$datetime, tz = "America/Los_Angeles", format = "%Y-%m-%d %H")

ggplot(rad, aes(x = datetime, y = rad)) +
  geom_line() +
  theme_bw() +
  labs(title = "RadTarge radiation measurements",
       y = radlab,
       x = "Date")

files <- list.files("wearables/Subject1_rawdata/Basis/Basis_Peak/", full.names = TRUE)
head(files)

## [1] "wearables/Subject1_rawdata/Basis/Basis_Peak//bodymetrics_to_03-31-16.csv"
## [2] "wearables/Subject1_rawdata/Basis/Basis_Peak//sleep_to_03-31-16.csv"

body <- read.csv(files[1])
body$date <- gsub("Z$", "", body$date)
head(body$date)

## [1] "2015-05-20 18:50" "2015-05-20 18:51" "2015-05-20 18:52"
## [4] "2015-05-20 18:53" "2015-05-20 18:54" "2015-05-20 18:55"

body$date <- as.POSIXct(body$date, tz = "America/Los_Angeles", format = "%Y-%m-%d %H:%M")

body[,c("date", "heart.rate", "skin.temp")] %>%
  gather(measurement, value, -date) %>%
  ggplot(aes(date, value)) +
    geom_line(aes(group = measurement, color = measurement)) +
    theme_bw() +
    labs(title = "Basis Data",
         subtitle = "Heart Rate and Skin Temperature")

## Warning: Removed 128 rows containing missing values (geom_path).

body %>%
  ggplot(aes(date, gsr)) +
    geom_line() +
    theme_bw() +
    labs(title = "Basis Data",
         subtitle = "Galvanic Skin Response")

## Warning: Removed 64 rows containing missing values (geom_path).

sleep <- read.csv(files[2], as.is = TRUE)
summary(sleep)

##  local_start_time   local_end_time     heart_rate_avg      calories       
##  Length:711         Length:711         Min.   : 51.15   Min.   :   1.964  
##  Class :character   Class :character   1st Qu.: 62.21   1st Qu.:  49.100  
##  Mode  :character   Mode  :character   Median : 67.35   Median : 147.900  
##                                        Mean   : 67.65   Mean   : 162.758  
##                                        3rd Qu.: 72.32   3rd Qu.: 244.700  
##                                        Max.   :106.18   Max.   :1331.600  
##  actual_minutes  light_minutes     deep_minutes     rem_minutes    
##  Min.   : 20.0   Min.   : 20.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 50.5   1st Qu.: 45.50   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median :150.0   Median : 91.00   Median : 21.00   Median : 33.00  
##  Mean   :163.6   Mean   : 94.93   Mean   : 30.53   Mean   : 37.78  
##  3rd Qu.:248.0   3rd Qu.:131.50   3rd Qu.: 52.00   3rd Qu.: 65.00  
##  Max.   :632.0   Max.   :398.00   Max.   :168.00   Max.   :145.00  
##  interruption_minutes unknown_minutes  interruptions    toss_and_turn  
##  Min.   : 0.000       Min.   : 0.000   Min.   :0.0000   Min.   : 0.00  
##  1st Qu.: 0.000       1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.: 4.00  
##  Median : 0.000       Median : 0.000   Median :0.0000   Median :10.00  
##  Mean   : 1.778       Mean   : 0.121   Mean   :0.3446   Mean   :11.74  
##  3rd Qu.: 0.000       3rd Qu.: 0.000   3rd Qu.:1.0000   3rd Qu.:17.00  
##  Max.   :28.000       Max.   :23.000   Max.   :4.0000   Max.   :63.00  
##  start_timestamp     start_time_iso     start_time_timezone
##  Min.   :1.433e+09   Length:711         Length:711         
##  1st Qu.:1.440e+09   Class :character   Class :character   
##  Median :1.445e+09   Mode  :character   Mode  :character   
##  Mean   :1.446e+09                                         
##  3rd Qu.:1.451e+09                                         
##  Max.   :1.459e+09                                         
##  start_time_offset end_timestamp       end_time_iso      
##  Min.   :-480.0    Min.   :1.433e+09   Length:711        
##  1st Qu.:-480.0    1st Qu.:1.440e+09   Class :character  
##  Median :-420.0    Median :1.445e+09   Mode  :character  
##  Mean   :-344.6    Mean   :1.446e+09                     
##  3rd Qu.:-300.0    3rd Qu.:1.451e+09                     
##  Max.   : 120.0    Max.   :1.459e+09                     
##  end_time_timezone  end_time_offset     X          
##  Length:711         Min.   :-480.0   Mode:logical  
##  Class :character   1st Qu.:-480.0   NA's:711      
##  Mode  :character   Median :-420.0                 
##                     Mean   :-343.8                 
##                     3rd Qu.:-300.0                 
##                     Max.   : 120.0

time <- strsplit(sleep$local_start_time, split = " ")
time <- data.frame(do.call(rbind, time))
time[,2] <- toupper(time[,2])
y <- apply(time, 1, function(x) paste(x, collapse = " "))
sleep$time <- as.POSIXct(y, tz = "America/Los_Angeles", format = "%d %b %Y %H:%M:%S")
names(sleep)

##  [1] "local_start_time"     "local_end_time"       "heart_rate_avg"      
##  [4] "calories"             "actual_minutes"       "light_minutes"       
##  [7] "deep_minutes"         "rem_minutes"          "interruption_minutes"
## [10] "unknown_minutes"      "interruptions"        "toss_and_turn"       
## [13] "start_timestamp"      "start_time_iso"       "start_time_timezone" 
## [16] "start_time_offset"    "end_timestamp"        "end_time_iso"        
## [19] "end_time_timezone"    "end_time_offset"      "X"                   
## [22] "time"

sleep[,c("time", "actual_minutes", "light_minutes", "deep_minutes", "rem_minutes", "interruption_minutes", "unknown_minutes")] %>%
  gather(type, minutes, -time) %>%
  ggplot(aes(x = time, y = minutes)) +
    geom_line(aes(group = type, color = type))

## Warning: Removed 6 rows containing missing values (geom_path).

sleep$day <- format(sleep$time, format = "%Y-%m-%d")
sleep$day <- as.Date(sleep$time, format = "%Y-%m-%d")

sleep[,c("day", "actual_minutes", "light_minutes", "deep_minutes", "rem_minutes", "interruption_minutes", "unknown_minutes")] %>%
  ddply("day", summarize,
        actual = sum(actual_minutes),
        light = sum(light_minutes),
        deep = sum(deep_minutes),
        rem = sum(rem_minutes),
        interruption = sum(interruption_minutes),
        unknown = sum(unknown_minutes)) %>%
  gather(type, minutes, -day) %>%
  ggplot(aes(x = day, y = minutes)) +
    geom_line(aes(group = type, color = type)) +
    theme_bw() +
    labs(title = "BASIS sleep data")

## Warning: Removed 6 rows containing missing values (geom_path).

files <- list.files("wearables/Subject1_rawdata/Basis/MPS_B1/", full.names = TRUE)
head(files)

## [1] "wearables/Subject1_rawdata/Basis/MPS_B1//mpsnyder_20140523.csv"
## [2] "wearables/Subject1_rawdata/Basis/MPS_B1//mpsnyder_20140524.csv"
## [3] "wearables/Subject1_rawdata/Basis/MPS_B1//mpsnyder_20140525.csv"
## [4] "wearables/Subject1_rawdata/Basis/MPS_B1//mpsnyder_20140526.csv"
## [5] "wearables/Subject1_rawdata/Basis/MPS_B1//mpsnyder_20140527.csv"
## [6] "wearables/Subject1_rawdata/Basis/MPS_B1//mpsnyder_20140528.csv"

basis <- lapply(files[1], read.csv)
basis <- do.call(rbind, basis)
# Remove a big pile of NAs
basis <- basis[!is.na(basis$hr),]
time <- gsub("..7:00", "", basis$time)
basis$time <- as.POSIXct(time, tz = "America/Los_Angeles", format = "%Y-%m-%d %H:%M:%S")


ggplot(basis, aes(x = time, y = accel_magnitude)) +
  geom_line() +
  labs(title = "Basis data",
       subtitle = "Acceleration") +
  theme_bw()

## Warning: Removed 99 rows containing missing values (geom_path).

Wearables EDA

Daniel Spakowicz

8/14/2017