Data Handling with R

Overview

  • Raw vs. processed data
  • Reading data into R
  • Pre-processing
  • Summary analysis
  • Simulation, sequencing and sampling
  • Useful data sources

Raw vs. processed data

  • Raw data:
    • Original source of data
    • Comes in a wide variety of formats
    • Hard to use directly for data analysis.
    • Needs to be processed.
  • Processed data
    • Ready for data analysis.
    • Processing involves transforming, subsetting, merging, etc.
    • Processing should follow agreed standards.
    • All processing steps should be recorded.
  • Ingredients of a data analysis pipeline:
    • Raw data
    • Tidy (processed) data ready for analysis
    • Codebook describing each variable and its values in the tidy dataset, variables not included in the dataset, the summary choices made, the experimental study design, etc.
    • An explicit and exact step-by-step procedure for the data analysis, written against the stated objectives.

Reading data into R

  • Downloading from Internet:
    • download.file() function: download.file(url="fileurl", destfile="filename", method="method-name")
    • methods: auto (default), internal, libcurl, wget, curl (lynx is no longer supported in current R)
    #fileurl<-"https://data.gov.in/resources/weekly-wholesale-price-turarhar-dal-upto-2012/download"
    #download.file(url=fileurl, destfile="tur-dal-price-upto-2012.csv", method="auto")
    #list.files("./")
  • Reading local files: use the read.table() and read.csv() functions.
    dataset<-read.csv("fdata.csv")
    class(dataset)
    ## [1] "data.frame"
    dim(dataset)
    ## [1] 14931   239
    • Important parameters: sep, header, quote, na.strings, nrows, skip.
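
The download and read steps above can be sketched together as follows. This is only an illustration under stated assumptions: `fetch_once` is a hypothetical helper (not part of R), and the file contents, column names and the "missing" marker are made up so that the example runs without network access.

```r
# Hypothetical helper: download a file only if it is not already on disk,
# so that re-running a script does not fetch the data again.
fetch_once <- function(url, destfile, method = "auto") {
  if (!file.exists(destfile)) {
    download.file(url = url, destfile = destfile, method = method)
  }
  invisible(destfile)
}

# Illustrative local file: one junk line, a header, then four records.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "exported file - do not edit",
  "id;name;score",
  "1;Ram;81",
  "2;Shyam;missing",
  "3;Laxman;77",
  "4;Sita;90"
), tmp)

scores <- read.table(
  tmp,
  sep = ";",               # fields are separated by semicolons
  header = TRUE,           # first line read (after skip) holds column names
  quote = "\"",            # only double quotes delimit strings
  na.strings = "missing",  # treat the text "missing" as NA
  skip = 1,                # ignore the junk line at the top
  nrows = 3,               # read only the first three records
  stringsAsFactors = FALSE
)
str(scores)
```

Here `nrows` and `skip` are handy for peeking at the start of a very large file before committing to a full read.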

Contd…

  • R can also read from many other data sources, such as Excel, XML, JSON, MySQL, PostgreSQL, web pages and APIs.

Pre-processing

  • Subsetting revisited
    • Using logical ANDs and ORs
    student_id<-c(1,2,3)
    student_names<-c("Ram","Shyam","Laxman")
    position<-c("First","Second","Third")
    data<-data.frame(student_id,student_names,position) #using data.frame() function
    data[data$student_id>=2 & data$position=="Third",]
    ##   student_id student_names position
    ## 3          3        Laxman    Third
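
The slide above shows a logical AND; a short sketch of the OR case and two common alternatives follows. The data frame is rebuilt here so the block runs on its own, and the variable names `first_or_third`, `same_rows` and `names_only` are illustrative.

```r
# Same toy data frame as in the slide above.
data <- data.frame(
  student_id = c(1, 2, 3),
  student_names = c("Ram", "Shyam", "Laxman"),
  position = c("First", "Second", "Third"),
  stringsAsFactors = FALSE
)

# OR: keep rows where either condition holds.
first_or_third <- data[data$position == "First" | data$position == "Third", ]

# %in% is a compact alternative to a chain of ORs on one column.
same_rows <- data[data$position %in% c("First", "Third"), ]

# AND plus column selection: row conditions go before the comma,
# the columns to keep go after it.
names_only <- data[data$student_id >= 2 & data$position != "Third", "student_names"]

# subset() evaluates the condition inside the data frame, so data$ can be dropped.
subset(data, student_id >= 2 | position == "First")
```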

Contd…

  • Sorting and ordering: use the sort() and order() functions.
    sort(data$student_names)
    ## [1] Laxman Ram    Shyam 
    ## Levels: Laxman Ram Shyam
    data[order(data$student_names),]
    ##   student_id student_names position
    ## 3          3        Laxman    Third
    ## 1          1           Ram    First
    ## 2          2         Shyam   Second
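
As a further sketch of the difference between the two functions, the example below sorts in decreasing order and orders a data frame by two keys. The `marks` data frame and its values are made up for illustration.

```r
marks <- data.frame(
  name = c("Ram", "Shyam", "Laxman", "Sita"),
  section = c("B", "A", "B", "A"),
  score = c(81, 92, 77, 92),
  stringsAsFactors = FALSE
)

# sort() returns the sorted values themselves.
top_down <- sort(marks$score, decreasing = TRUE)
top_down

# order() returns row positions, so it can reorder a whole data frame.
# Here: section ascending, then score descending within each section.
by_section <- marks[order(marks$section, -marks$score), ]
by_section
```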

Contd…

  • Handling missing values: NA – missing value, NaN – result of an undefined mathematical operation (e.g. 0/0)
x<-c(1,2,NA,20,55,NaN)
#checking for NAs,NaNs
is.na(x)
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
is.nan(x)
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE
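
A small sketch of where the two values come from and how they propagate through calculations; the variable names are illustrative.

```r
# NaN arises from undefined arithmetic; NA marks a value that is simply absent.
undefined <- 0 / 0
absent <- NA_real_

is.nan(undefined)  # TRUE
is.na(undefined)   # TRUE: is.na() also flags NaN
is.nan(absent)     # FALSE: NA is not NaN

# Missing values propagate through arithmetic unless explicitly removed.
y <- c(1, NA, 3)
sum(y)                 # NA
sum(y, na.rm = TRUE)   # 4
mean(y, na.rm = TRUE)  # 2
```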

Contd…

#removing NAs
bad<-is.na(x)
x[!bad]
## [1]  1  2 20 55
#taking subset with no missing values
good<-complete.cases(x) #logical vector: TRUE where the case has no missing values
good
## [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE
x[good]
## [1]  1  2 20 55
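
complete.cases() is most useful on data frames, where a case is a whole row. A minimal sketch with made-up measurements:

```r
# Illustrative measurements with gaps (made-up values).
df <- data.frame(
  id = 1:4,
  height = c(150, NA, 165, 172),
  weight = c(50, 61, NA, 70)
)

# Number of missing values per column.
missing_per_column <- colSums(is.na(df))
missing_per_column

# Keep only rows with no NA in any column.
complete <- df[complete.cases(df), ]
complete

# na.omit() drops the same rows in one step.
na.omit(df)
```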

Contd…

  • Reshaping data: data often needs to be converted from one format to another (e.g. wide to long).
    • Using reshape2 package:
    library(reshape2)
    ## Warning: package 'reshape2' was built under R version 3.0.3
    #converts to long (molten) format: one row per unique id-variable combination
    mdata<-melt(data)
    ## Using student_names, position as id variables
    mdata
    ##   student_names position   variable value
    ## 1           Ram    First student_id     1
    ## 2         Shyam   Second student_id     2
    ## 3        Laxman    Third student_id     3
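
With only one measured column, the effect of melting is hard to see. The sketch below uses two measured columns and names the id and measure variables explicitly, then casts the result back to wide form. It assumes the reshape2 package is installed; the `wide` data frame and the column names `subject` and `score` are made up for illustration.

```r
library(reshape2)

# Wide format: one row per student, one column per subject.
wide <- data.frame(
  student = c("Ram", "Shyam"),
  maths = c(80, 65),
  science = c(72, 90),
  stringsAsFactors = FALSE
)

# Long format: one row per student-subject combination.
long <- melt(
  wide,
  id.vars = "student",
  measure.vars = c("maths", "science"),
  variable.name = "subject",
  value.name = "score"
)
long

# Cast back to wide format: students in rows, subjects in columns.
back <- dcast(long, student ~ subject, value.var = "score")
back
```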

Contd…

dcast(mdata, student_names~variable) #casts a molten data frame to a data frame or array
##   student_names student_id
## 1        Laxman          3
## 2           Ram          1
## 3         Shyam          2
split(data,data$student_id) #splits data into groups
## $`1`
##   student_id student_names position
## 1          1           Ram    First
## 
## $`2`
##   student_id student_names position
## 2          2         Shyam   Second
## 
## $`3`
##   student_id student_names position
## 3          3        Laxman    Third
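
split() becomes more useful when a function is then applied to each group. A minimal base-R sketch with made-up scores:

```r
marks <- data.frame(
  section = c("A", "B", "A", "B", "A"),
  score = c(92, 81, 70, 77, 60)
)

# Split the scores into one vector per section.
by_section <- split(marks$score, marks$section)
by_section

# Apply a function to each group and combine the results.
section_means <- sapply(by_section, mean)
section_means  # A = 74, B = 79
```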

Contd…

#adding a new variable
data$year<-c(2015,2015,2015)
data
##   student_id student_names position year
## 1          1           Ram    First 2015
## 2          2         Shyam   Second 2015
## 3          3        Laxman    Third 2015
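
New variables are often derived from existing columns rather than typed in. A short sketch follows; the `score` column, the pass mark of 40 and the grade boundaries are assumptions made for illustration.

```r
data <- data.frame(student_id = c(1, 2, 3), score = c(35, 62, 88))

# Logical column from a comparison (pass mark of 40 is an assumption).
data$passed <- data$score >= 40

# Categorical column from nested conditions (boundaries are illustrative).
data$grade <- ifelse(data$score >= 75, "A",
                     ifelse(data$score >= 40, "B", "F"))

# transform() adds columns computed from existing ones.
data <- transform(data, score_fraction = score / 100)
data
```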

Contd…

  • Another important package for reshaping is plyr (split-apply-combine paradigm for R).
library(plyr)
## Warning: package 'plyr' was built under R version 3.0.3
#ddply() function - takes a data frame as input, returns a data frame
ddply(data,"student_id",count) #group by the student_id column, named as a string
##   student_id student_names position year freq
## 1          1           Ram    First 2015    1
## 2          2         Shyam   Second 2015    1
## 3          3        Laxman    Third 2015    1
  • Merging – merge(), intersect() etc.
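
A minimal sketch of merging two data frames on a shared key. The `students` and `marks` data frames are made up for illustration.

```r
students <- data.frame(
  student_id = c(1, 2, 3),
  name = c("Ram", "Shyam", "Laxman"),
  stringsAsFactors = FALSE
)
marks <- data.frame(
  student_id = c(2, 3, 4),
  score = c(65, 77, 90)
)

# Inner join: only ids present in both data frames.
inner <- merge(students, marks, by = "student_id")
inner

# Left join: keep every student; unmatched scores become NA.
left <- merge(students, marks, by = "student_id", all.x = TRUE)
left

# intersect() shows which key values the two data frames share.
common_ids <- intersect(students$student_id, marks$student_id)
common_ids
```

Setting `all = TRUE` instead of `all.x = TRUE` keeps unmatched rows from both sides.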

Summary analysis

  • Datasets are often very large, so it is important to compute summary statistics.
dim(ToothGrowth) #dimensions of dataset
## [1] 60  3
head(ToothGrowth) #shows the first rows of the dataset; try tail() for the last rows
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Contd…

summary(ToothGrowth) #reports summary of dataset
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
str(ToothGrowth) #more information
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

Contd…

#applies a function to each subset of the data (here class(), grouped by dose)
aggregate(ToothGrowth,by=list(ToothGrowth$dose),class)
##   Group.1     len   supp    dose
## 1     0.5 numeric factor numeric
## 2     1.0 numeric factor numeric
## 3     2.0 numeric factor numeric
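
A more typical use of aggregate() is to compute a numeric summary per group. The sketch below uses the formula interface on the built-in ToothGrowth dataset; the result names `mean_len` and `by_dose` are illustrative.

```r
# Mean tooth length for every combination of supplement and dose.
mean_len <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
mean_len

# The same idea with the by= interface, grouping on dose only.
by_dose <- aggregate(ToothGrowth$len,
                     by = list(dose = ToothGrowth$dose),
                     FUN = mean)
by_dose
```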

Contd…

quantile(ToothGrowth$len, na.rm=TRUE) #quantiles
##     0%    25%    50%    75%   100% 
##  4.200 13.075 19.250 25.275 33.900
table(ToothGrowth$dose, ToothGrowth$supp) #cross-tabulates counts by dose and supplement type
##      
##       OJ VC
##   0.5 10 10
##   1   10 10
##   2   10 10
object.size(ToothGrowth) #size of dataset
## 2568 bytes

Contd…

mean(ToothGrowth$len) #mean
## [1] 18.81333
median(ToothGrowth$len) #median
## [1] 19.25
var(ToothGrowth$len) #variance
## [1] 58.51202
sd(ToothGrowth$len) #standard deviation
## [1] 7.649315
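
As a check on what var() and sd() compute, the sample variance can be worked out by hand with the n - 1 denominator. The variable names below are illustrative.

```r
x <- ToothGrowth$len
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
manual_var <- sum((x - mean(x))^2) / (n - 1)

# The standard deviation is the square root of the variance.
manual_sd <- sqrt(manual_var)

all.equal(manual_var, var(x))  # TRUE
all.equal(manual_sd, sd(x))    # TRUE
```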

Contd…

range(ToothGrowth$len) #range
## [1]  4.2 33.9
  • Some other important functions to try: xtabs(), ftable(), prop.table(), margin.table() etc.
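
A brief sketch of the functions listed above, applied to the built-in ToothGrowth dataset; the object names are illustrative.

```r
# Contingency table of counts from a formula.
counts <- xtabs(~ dose + supp, data = ToothGrowth)
counts

# Proportions within each row (margin = 1) of the table.
row_props <- prop.table(counts, margin = 1)
row_props

# Totals over the first dimension (dose).
dose_totals <- margin.table(counts, 1)
dose_totals

# Flat, compact printout of the same table.
ftable(counts)
```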

Simulation, sequencing and sampling

  • Simulation: useful for drawing inferences from the results of a data analysis
    • Functions for the normal probability distribution: rnorm(), dnorm(), pnorm(), qnorm()
    • r – random number generation, d – density, p – cumulative distribution, q – quantile
set.seed(3) #sets the random number seed so results are reproducible
x<-rnorm(5)
x
## [1] -0.9619334 -0.2925257  0.2587882 -1.1521319  0.1957828
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.1520 -0.9619 -0.2925 -0.3904  0.1958  0.2588
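
The remaining distribution functions, together with basic sequencing and sampling, can be sketched as follows. The variable names and the chosen arguments (for example mean = 170, sd = 10) are illustrative assumptions.

```r
# Density, cumulative probability and quantile of the standard normal.
density_at_zero <- dnorm(0)      # 1 / sqrt(2 * pi), about 0.3989
prob_below <- pnorm(1.96)        # P(X <= 1.96), about 0.975
cutoff <- qnorm(0.975)           # inverse of pnorm(), about 1.96

# Resetting the seed reproduces exactly the same random draws.
set.seed(3)
a <- rnorm(5)
set.seed(3)
b <- rnorm(5)
identical(a, b)  # TRUE

# Draws from a normal distribution with a chosen mean and standard deviation.
heights <- rnorm(1000, mean = 170, sd = 10)

# Sequencing: regular sequences of numbers.
quarters <- seq(0, 1, by = 0.25)  # 0.00 0.25 0.50 0.75 1.00
first_five <- seq_len(5)          # 1 2 3 4 5

# Sampling: without replacement by default, with replacement on request.
three_of_ten <- sample(1:10, 3)
coin_flips <- sample(c("H", "T"), 10, replace = TRUE)
```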


References