wrangling

Author

Kelly Hatfield

Step 1: Opening Data

Locating and loading the data

library(here)
here() starts at /Users/kellymccormickhatfield/Documents/MADA 2023/kellyhatfield-MADA-portfolio
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.3.0      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
#Build the path relative to the project root so the script also runs on other machines
raw_data <- readRDS(file = here("fluanalysis", "Data", "SympAct_Any_Pos.Rda"))

Step 2: Wrangling Data

Part 1: Subsetting Data

  • Remove all variables that have Score or Total or FluA or FluB or Dxname or Activity in their name. 

  • Also remove the variable Unique.Visit. You should be left with 32 variables coding for presence or absence of some symptom. Only one, temperature, is continuous. A few have multiple categories.

  • Remove any observations with NA values; there aren’t many.
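The subsetting rules above can be tried out on a small toy data frame before running them on the real data. This is a minimal sketch, not the analysis itself: the `toy` data frame and its values are hypothetical, and only a few of the name patterns are used.

```r
library(dplyr)

# Toy data frame; columns are hypothetical stand-ins for the real ones.
toy <- data.frame(
  Unique.Visit = c("a", "b", "c"),
  DxName1      = c("flu", "flu", "cold"),
  TotalSymp1   = c(3, 5, 2),
  ImpactScore  = c(1, 2, 3),
  Nausea       = factor(c("No", "Yes", NA)),
  BodyTemp     = c(98.2, 99.1, 100.4)
)

# contains() ignores case by default, so "Dxname" also matches "DxName1".
toy_subset <- toy %>%
  select(-contains(c("Score", "Total", "Dxname")), -Unique.Visit)

names(toy_subset)

# na.omit() drops every row that has at least one missing value.
toy_complete <- na.omit(toy_subset)

# Number of rows removed because of missing values.
nrow(toy_subset) - nrow(toy_complete)
```

Comparing `nrow()` before and after `na.omit()` is a quick way to confirm how many observations were dropped.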

#List variable names

ls(raw_data)
 [1] "AbPain"            "ActivityLevel"     "ActivityLevelF"   
 [4] "BodyTemp"          "Breathless"        "ChestCongestion"  
 [7] "ChestPain"         "ChillsSweats"      "CoughIntensity"   
[10] "CoughYN"           "CoughYN2"          "Diarrhea"         
[13] "DxName1"           "DxName2"           "DxName3"          
[16] "DxName4"           "DxName5"           "EarPn"            
[19] "EyePn"             "Fatigue"           "Headache"         
[22] "Hearing"           "ImpactScore"       "ImpactScore2"     
[25] "ImpactScore2F"     "ImpactScore3"      "ImpactScore3F"    
[28] "ImpactScoreF"      "ImpactScoreFD"     "Insomnia"         
[31] "ItchyEye"          "Myalgia"           "MyalgiaYN"        
[34] "NasalCongestion"   "Nausea"            "PCRFluA"          
[37] "PCRFluB"           "Pharyngitis"       "RapidFluA"        
[40] "RapidFluB"         "RunnyNose"         "Sneeze"           
[43] "SubjectiveFever"   "SwollenLymphNodes" "ToothPn"          
[46] "TotalSymp1"        "TotalSymp1F"       "TotalSymp2"       
[49] "TotalSymp3"        "TransScore1"       "TransScore1F"     
[52] "TransScore2"       "TransScore2F"      "TransScore3"      
[55] "TransScore3F"      "TransScore4"       "TransScore4F"     
[58] "Unique.Visit"      "Vision"            "Vomit"            
[61] "Weakness"          "WeaknessYN"        "Wheeze"           
#contains() is case-insensitive by default, so "Dxname" also matches DxName1 to DxName5
raw_data2 <- raw_data %>%
  select(-contains(c("Score", "Total", "FluA", "FluB", "Dxname", "Activity"))) %>%
  select(-Unique.Visit)

#Contains 32 variables. Yay!

#Remove missing observations
raw_data3 <- na.omit(raw_data2)
#Only dropped 5 observations. 

summary(raw_data3)
 SwollenLymphNodes ChestCongestion ChillsSweats NasalCongestion CoughYN  
 No :418           No :323         No :130      No :167         No : 75  
 Yes:312           Yes:407         Yes:600      Yes:563         Yes:655  
                                                                         
                                                                         
                                                                         
                                                                         
 Sneeze    Fatigue   SubjectiveFever Headache      Weakness   WeaknessYN
 No :339   No : 64   No :230         No :115   None    : 49   No : 49   
 Yes:391   Yes:666   Yes:500         Yes:615   Mild    :223   Yes:681   
                                               Moderate:338             
                                               Severe  :120             
                                                                        
                                                                        
  CoughIntensity CoughYN2      Myalgia    MyalgiaYN RunnyNose AbPain   
 None    : 47    No : 47   None    : 79   No : 79   No :211   No :639  
 Mild    :154    Yes:683   Mild    :213   Yes:651   Yes:519   Yes: 91  
 Moderate:357              Moderate:325                                
 Severe  :172              Severe  :113                                
                                                                       
                                                                       
 ChestPain Diarrhea  EyePn     Insomnia  ItchyEye  Nausea    EarPn    
 No :497   No :631   No :617   No :315   No :551   No :475   No :568  
 Yes:233   Yes: 99   Yes:113   Yes:415   Yes:179   Yes:255   Yes:162  
                                                                      
                                                                      
                                                                      
                                                                      
 Hearing   Pharyngitis Breathless ToothPn   Vision    Vomit     Wheeze   
 No :700   No :119     No :436    No :565   No :711   No :652   No :510  
 Yes: 30   Yes:611     Yes:294    Yes:165   Yes: 19   Yes: 78   Yes:220  
                                                                         
                                                                         
                                                                         
                                                                         
    BodyTemp     
 Min.   : 97.20  
 1st Qu.: 98.20  
 Median : 98.50  
 Mean   : 98.94  
 3rd Qu.: 99.30  
 Max.   :103.10  

Step 3: Manipulating Data

Categorical/Ordinal predictors

Deleting Repetitive Variables

#Delete yes/no variables that duplicate a multi-level variable
Fludata1 <- select(raw_data3,-c(CoughYN, WeaknessYN, CoughYN2, MyalgiaYN))
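Before dropping a yes/no variable as redundant, a cross-tabulation against its multi-level counterpart can confirm that it carries no extra information. The sketch below uses hypothetical toy values; the variable names mirror `CoughIntensity` and `CoughYN2` only for illustration.

```r
# Toy data; values are hypothetical.
toy <- data.frame(
  CoughIntensity = factor(
    c("None", "Mild", "Moderate", "Severe", "None", "Mild"),
    levels = c("None", "Mild", "Moderate", "Severe")
  )
)
toy$CoughYN2 <- factor(ifelse(toy$CoughIntensity == "None", "No", "Yes"))

# Cross-tabulate the multi-level variable against the yes/no variable.
xtab <- table(toy$CoughIntensity, toy$CoughYN2)
xtab

# TRUE when every intensity level maps to a single yes/no value, meaning
# the yes/no variable is fully determined by the intensity variable.
redundant <- all(rowSums(xtab > 0) <= 1)
redundant
```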

#Drop binary variables with fewer than 50 observations in either category

Fludata2 <- select(Fludata1, -c(Hearing, Vision))
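Sparse binary variables can also be identified programmatically instead of by reading the summary table. This is a minimal base R sketch on hypothetical toy data; `min_count` is an assumed name, set to a small value here only because the toy data has 100 rows.

```r
# Toy data; names and counts are hypothetical.
toy <- data.frame(
  Hearing  = factor(c(rep("No", 95), rep("Yes", 5))),
  Nausea   = factor(c(rep("No", 40), rep("Yes", 60))),
  BodyTemp = seq(97, 103, length.out = 100)
)

min_count <- 10  # hypothetical threshold; the step above uses 50

# TRUE for two-level factors in which either level has fewer than min_count rows.
is_sparse <- vapply(
  toy,
  function(x) is.factor(x) && nlevels(x) == 2 && min(table(x)) < min_count,
  logical(1)
)

sparse_vars <- names(is_sparse)[is_sparse]
sparse_vars

# Keep only the columns that were not flagged.
toy_reduced <- toy[, !is_sparse, drop = FALSE]
names(toy_reduced)
```

Restricting the check to two-level factors leaves multi-level variables such as `Weakness` untouched even if one of their levels is rare.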

Step 4: Save Wrangled Data as RDS Files

here()
[1] "/Users/kellymccormickhatfield/Documents/MADA 2023/kellyhatfield-MADA-portfolio"
#Need to add some files to here


path <- here("fluanalysis","Data","CleanSymp.Rds")
saveRDS(raw_data3, file = path)

path <- here("fluanalysis", "Data", "FinalDataML.Rds")
saveRDS(Fludata2, file = path)
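A quick round trip can confirm that a saved file restores the same object. The sketch below is an illustration only: it writes a hypothetical toy data frame to a temporary file rather than to the project's `fluanalysis/Data` folder.

```r
# Toy object and a temporary file stand in for the real data and project paths.
toy <- data.frame(
  Nausea   = factor(c("No", "Yes", "Yes"), levels = c("No", "Yes")),
  BodyTemp = c(98.2, 99.1, 100.4)
)

tmp <- tempfile(fileext = ".Rds")
saveRDS(toy, file = tmp)
restored <- readRDS(tmp)

# all.equal() checks that values, column types, and factor levels match.
round_trip_ok <- isTRUE(all.equal(toy, restored))
round_trip_ok

# Remove the temporary file.
unlink(tmp)
```

Unlike a CSV export, an RDS file preserves factor levels, so the cleaned symptom variables do not need to be re-coded after loading.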