---
title: "Wearables EDA"
author: "Daniel Spakowicz"
date: "8/14/2017"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(plyr)
```

This is an exploratory data analysis of the mHealth data from:

> __Digital Health: Tracking Physiomes and Activity Using Wearable Biosensors Reveals Useful Health-Related Information__ _Xiao Li , Jessilyn Dunn , Denis Salins , Gao Zhou, Wenyu Zhou, Sophia Miryam Schüssler-Fiorenza Rose, Dalia Perelman, Elizabeth Colbert, Ryan Runge, Shannon Rego, Ria Sonecha, Somalee Datta, Tracey McLaughlin, Michael P. Snyder_ Published: January 12, 2017 https://doi.org/10.1371/journal.pbio.2001402

The argument that I’m trying to make is that the SeeClickFix or crime data are similar to those from the Snyder data. They have seasonal variability and lots of things changing, so it can be difficult to isolate the effect of an intervention. Let’s take the example of an exercise intervention, where individuals are asked to increase their activity and monitor the effect on some other outcome, like heart rate or weight. You can use the variability to which different individuals followed the intervention as a means to identify the signal from that intervention. 

This is similar to SeeClickFix and Crime, in that there are many different categories of posts, and they may effect crime to varying degrees. 

These are the devices for which data are available on the paper's website:

* Basis_Peak - a fitness tracker and heart rate monitor recently recalled for catching fire and burning users. [Engaget describes it](https://www.engadget.com/2016/08/09/basis-peak-obituary/) as, "an under-rated fitness tracker with great battery life but a terrible companion app."
* Basis_MPS_B1 - Looks like it's the same as Basis_Peak, but also includes accelerometer data and has many more time points. 
* MOVES - an app that tracks number of steps as well as the duration and location of movement. It estimates time spent running, walking, in transportation, biking, etc. A similar app called Human is also popular for iOS. 
* Masimo - a device targeted at patient monitoring; here the data include pulse and oxygen saturation.
* Radiation - RadTarge II, electronic personal dosimeter. Reports Hp(10), deep dose equivalents.
* Scanadu - Scout device, also measures heart rate and oxygen saturation over time.
* Withings - bathroom scale that reports weight, but there is also data for steps and active calories.
* ihealth-finger - finger clamp that measures pulse and oxygen saturation.

> HR and SpO2 data collected using four devices (Scanadu Scout, iHealth-finger, Masimo, and Basis) were very close to that of the WA instrument over a wide range of values using the Bland–Altman method of comparison [16,17]

So I'll compare the MOVES data, as an analog of SeeClickFix, to heart rate, collected by any of four devices. First I'll check out the MOVES data.

```{r MOVES}
files <- list.files("wearables/Subject1_rawdata/MOVES/", full.names = TRUE)
head(files)
moves <- read.csv(files)
```

There are a series of different activites with dates and duration time stamps. 

```{r}
summary(moves)
```

Each date has ~40 entries, comprised of different activities like cycling and running. I'll collapse all of the different activities into a single entry per day. 

```{r}
moves1 <- moves %>%
  ddply(c("Date", "Group"), summarize, 
          duration = sum(Duration),
          distance = sum(Distance),
          calories = sum(Calories))

# I'm going to exclude this for now, since "transport" is also in the list, and 
# the total amount of moving + transport time doesn't really make sense. I should do the total active time.
movesActive <- moves1[-grep("transport", moves1$Group),]

movesActive <- movesActive %>%
  ddply("Date", summarize,
          duration = sum(duration),
          distance = sum(distance),
          calories = sum(calories))
  
movesActive$Group <- "active"

moves2 <- rbind(moves1, movesActive)

moves2$Date <- as.Date(as.character(moves2$Date), format = "%m/%d/%y")

moves2 <- moves2[order(moves2$Date),]

moves2 %>%
  ggplot(aes(x = Date, y = duration)) +
    geom_line(aes(group = Group, color = Group))
```

Why is that so spikey? Maybe I have to smooth by collapsing to weeks? Also, what's that red line without a label? 

```{r}
table(moves2$Group)
moves2[grep("^$", moves2$Group),]
```

Those have a huge duration! Since I don't know what they are I'll remove them.

```{r}
moves2 <- moves2[-grep("^$", moves2$Group),]

moves2 %>%
  ggplot(aes(x = Date, y = duration)) +
    geom_line(aes(group = Group, color = Group))
```

Still too spikey. I'll aggregate to week.

```{r}
moves2$week <- format(moves2$Date, format = "%Y-%U")

moves3 <- moves2 %>%
  ddply(c("week", "Group"), summarize,
          duration = sum(duration),
          distance = sum(distance),
          calories = sum(calories))

moves3 %>%
  ggplot(aes(x = week, y = duration)) +
    geom_line(aes(group = Group, color = Group)) +
    theme(
      axis.text.x = element_blank()
    )
```

What if I exclude `transport`?

```{r}
moves_notrans <- moves3[-grep("transport", moves3$Group),]

pmoves <- moves_notrans %>%
  ggplot(aes(x = week, y = duration)) +
    geom_line(aes(group = Group, color = Group)) +
    theme(
      axis.text.x = element_blank()
    ) +
  labs(title = "MOVES data",
         subtitle = "Activity over time") 
pmoves
```

How about by month?

```{r}
moves2$month <- format(moves2$Date, format = "%Y-%m")

moves4 <- moves2 %>%
  ddply(c("month", "Group"), summarize,
          duration = sum(duration),
          distance = sum(distance),
          calories = sum(calories))

moves4 %>%
  ggplot(aes(x = month, y = duration)) +
    geom_line(aes(group = Group, color = Group)) +
    theme(
      axis.text.x = element_blank()
    ) +
  labs(title = "MOVES data",
         subtitle = "Activity over time")

```

I like the weeks best. 

Ok, now I need the heart rate information. First I'll check masimo.

```{r masimo}
files <- list.files("wearables/Subject1_rawdata/Masimo/", full.names = TRUE)
head(files)

masimo <- lapply(files, function(x) read.csv(x, skip = 1, header = FALSE, as.is = TRUE))

# Create a single dataframe
df <- do.call(rbind, masimo)

# Change column names
names(df) <- c("Session", "Index", "Timestamp", "Day", "Date", "Year", "Time",
                "SpO2", "Pulse Rate")

# Convert to long format and visualize
df[, c("Timestamp", "SpO2", "Pulse Rate")] %>%
  gather(measurement, value, SpO2, "Pulse Rate", -Timestamp) %>%
  ggplot(aes(x = Timestamp, y = value)) +
    geom_line(aes(group = measurement, color = measurement)) +
    labs(title = "Masimo data",
         subtitle = "Heart rate and Blood Oxygen Saturation over time")
```

Wow there are a lot of dropouts, I should remove outliers and view the mean per hour, or day or something like that. First I'll have to wrangle the time into an easier format. I wonder if this would be easier if I used one of the other device's data? I'll check a few others for easier date measurements.

```{r ihealth-finger}
files <- list.files("wearables/Subject1_rawdata/ihealth-finger/", full.names = TRUE)
head(files)

x <- read.table(files, sep = "\t", header = TRUE, as.is = TRUE)
head(x)
dim(x)

# Reformat columns
x$time <- as.POSIXct(x$TotalTime, tz = "America/Los_Angeles", format = "%m/%d/%y %H:%M")
x$SpO2 <- as.numeric(gsub("%", "", x$SpO2))
x$Pulse <- as.numeric(gsub(" bpm", "", x$Pulse))

x[, c("time", "SpO2", "Pulse")] %>%
  gather(measurement, value, SpO2, Pulse, -time) %>%
  ggplot(aes(x = time, y = value)) +
    geom_line(aes(group = measurement, color = measurement)) +
    labs(title = "ihealth-finger Data",
         subtitle = "Heart rate and Blood Oxygen Saturation over time")
```

There are about 100 fold fewer observations for ihealth finger than for Masimo. I'll go back to Masimo and clean it up a bit. 

```{r masimo 2}
# Format the date
df$datetime <- paste(df$Date, df$Year, df$Time, sep = ", ") %>%
            gsub(" Pacific Standard Time", "", .) %>%
            as.POSIXct(datetime, tz = "America/Los_Angeles", format = " %B %d, %Y, %I:%M:%S %p")  

dfp <- df[, c("datetime", "SpO2", "Pulse Rate")]

# Drop zeros
# sd3p <- which(dfp$Pulse < mean(dfp$Pulse) - sd(dfp$Pulse)*4)
sd3p <- which(dfp$Pulse < 20)
dfp$Pulse[sd3p] <- NA

# sd3s <- which(dfp$SpO2 < mean(dfp$SpO2) - sd(dfp$SpO2)*4)
sd3s <- which(dfp$SpO2 < 20)
dfp$SpO2[sd3s] <- NA

dfp %>%
  gather(measurement, value, SpO2, "Pulse Rate", -datetime, na.rm = TRUE) %>%
  ggplot(aes(x = datetime, y = value)) +
    geom_line(aes(group = measurement, color = measurement)) +
    labs(title = "Masimo data",
         subtitle = "Heart rate and Blood Oxygen Saturation over time")
```

Maybe the Withings data would be better? Weight vs activity would be easier for people to connect with?

```{r withings}
files <- list.files("wearables/Subject1_rawdata/Withings/", full.names = TRUE)
head(files)
withings <- readxl::read_xlsx(files[2])
```

Oh man that looks so clean. The date is already formatted correctly.

```{r}
withings %>%
  ggplot(aes(x = Date, y = Weight)) +
    geom_line() +
    theme_bw() +
    labs(title = "Withings data",
         subtitle = "Weight over time")
```

There are a few extremely low values -- I'll remove those

```{r}
withings2 <- withings[!(withings$Weight < 120 | withings$Weight > 153), ]

pwithings <- withings2 %>%
  ggplot(aes(x = Date, y = Weight)) +
    geom_line() +
    theme_bw() +
    labs(title = "Withings data",
         subtitle = "Weight over time")
pwithings  
```

```{r}
gridExtra::grid.arrange(pmoves, pwithings, ncol = 1)
```

```{r other withings}
files <- list.files("wearables/Subject1_rawdata/Withings", full.names = TRUE)
head(files)
withingso <- read_csv(files[1])

names(withingso)
withingso$Date <- as.Date(withingso$Date, format = "%m/%d/%y")

withingso %>%
  gather(measurement, value, -Date) %>%
  ggplot(aes(x = Date, y = value)) +
    geom_line(aes(group = measurement, color = measurement)) +
    labs(title = "Withings data",
         subtitle = "Steps and Active Calories over time")
```

```{r scanadu}
files <- list.files("wearables/Subject1_rawdata/Scanadu/", full.names = TRUE)
head(files)
scanadu <- read.table(files, sep = "\t", header = TRUE)
names(scanadu)

# Plot the number of readings per day for the devices reporting HR and SpO2
rbind(data.frame(table(scanadu$Data), device = "Scanadu"),
              data.frame(table(df$Date), device = "Masimo"),
              data.frame(table(x$Date), device = "iHealth-finger")) %>%
  ggplot(aes(Freq)) +
    geom_density(stat = "bin", aes(fill = device), alpha = 0.5) +
    labs(x = "Mesurements per day",
         fill = "Device",
         title = "Number of recorded measurements per day",
         subtitle = "The three devices for HR and SpO2") +
    theme_bw()
```

```{r radiation}
files <- list.files("wearables/Subject1_rawdata/Radiation/", full.names = TRUE)
head(files)
rad <- readxl::read_xlsx(files)
radlab <- rad[1,2]
rad <- rad[-1,]
names(rad) <- c("datetime", "rad")

rad$rad <- as.numeric(rad$rad)
rad$datetime <- as.POSIXct(rad$datetime, tz = "America/Los_Angeles", format = "%Y-%m-%d %H")

ggplot(rad, aes(x = datetime, y = rad)) +
  geom_line() +
  theme_bw() +
  labs(title = "RadTarge radiation measurements",
       y = radlab,
       x = "Date")


```

```{r basis peak}
files <- list.files("wearables/Subject1_rawdata/Basis/Basis_Peak/", full.names = TRUE)
head(files)
body <- read.csv(files[1])
body$date <- gsub("Z$", "", body$date)
head(body$date)

body$date <- as.POSIXct(body$date, tz = "America/Los_Angeles", format = "%Y-%m-%d %H:%M")

body[,c("date", "heart.rate", "skin.temp")] %>%
  gather(measurement, value, -date) %>%
  ggplot(aes(date, value)) +
    geom_line(aes(group = measurement, color = measurement)) +
    theme_bw() +
    labs(title = "Basis Data",
         subtitle = "Heart Rate and Skin Temperature")
```

```{r basis gsr}
body %>%
  ggplot(aes(date, gsr)) +
    geom_line() +
    theme_bw() +
    labs(title = "Basis Data",
         subtitle = "Galvanic Skin Response")

```


```{r basis sleep}
sleep <- read.csv(files[2], as.is = TRUE)
summary(sleep)

time <- strsplit(sleep$local_start_time, split = " ")
time <- data.frame(do.call(rbind, time))
time[,2] <- toupper(time[,2])
y <- apply(time, 1, function(x) paste(x, collapse = " "))
sleep$time <- as.POSIXct(y, tz = "America/Los_Angeles", format = "%d %b %Y %H:%M:%S")
names(sleep)

sleep[,c("time", "actual_minutes", "light_minutes", "deep_minutes", "rem_minutes", "interruption_minutes", "unknown_minutes")] %>%
  gather(type, minutes, -time) %>%
  ggplot(aes(x = time, y = minutes)) +
    geom_line(aes(group = type, color = type))
```

```{r sleep per day}
sleep$day <- format(sleep$time, format = "%Y-%m-%d")
sleep$day <- as.Date(sleep$time, format = "%Y-%m-%d")

sleep[,c("day", "actual_minutes", "light_minutes", "deep_minutes", "rem_minutes", "interruption_minutes", "unknown_minutes")] %>%
  ddply("day", summarize,
        actual = sum(actual_minutes),
        light = sum(light_minutes),
        deep = sum(deep_minutes),
        rem = sum(rem_minutes),
        interruption = sum(interruption_minutes),
        unknown = sum(unknown_minutes)) %>%
  gather(type, minutes, -day) %>%
  ggplot(aes(x = day, y = minutes)) +
    geom_line(aes(group = type, color = type)) +
    theme_bw() +
    labs(title = "BASIS sleep data")
```

```{r basis mps}
files <- list.files("wearables/Subject1_rawdata/Basis/MPS_B1/", full.names = TRUE)
head(files)
basis <- lapply(files[1], read.csv)
basis <- do.call(rbind, basis)
# Remove a big pile of NAs
basis <- basis[!is.na(basis$hr),]
time <- gsub("..7:00", "", basis$time)
basis$time <- as.POSIXct(time, tz = "America/Los_Angeles", format = "%Y-%m-%d %H:%M:%S")


ggplot(basis, aes(x = time, y = accel_magnitude)) +
  geom_line() +
  labs(title = "Basis data",
       subtitle = "Acceleration") +
  theme_bw()
```