Data Handling with R

Overview

  • Raw Vs Processed data
  • Reading data into R
  • Pre-processing
  • Summary analysis
  • Useful data sources

Raw Vs Processed data

  • Raw data:
    • Original source of data
    • Comes in a wide variety of forms.
    • Hard to use directly for data analysis.
    • Needs to be processed.
  • Processed data
    • Ready for data analysis.
    • Processing involves transforming, subsetting, merging etc.
    • Processing should be performed as per set standards.
    • All processing steps should be recorded.
  • Ingredients of data analysis pipeline:
    • Raw data
    • Tidy (processed) data ready for analysis
    • Codebook describing each variable and its values in tidy dataset, other variables not in dataset, summary choices, experimental study design etc.
    • Explicit and exact step-by-step approach for data analysis against said objectives.

Reading data into R

  • Downloading from Internet:
    • download.file() function: download.file(url="fileurl", destfile="filename", method="method-name")
    • methods: curl, wget, lynx, internal
    #fileurl<-"https://data.gov.in/resources/weekly-wholesale-price-turarhar-dal-upto-2012/download"
    #download.file(url=fileurl, destfile="tur-dal-price-upto-2012.csv", method="auto")
    #list.files("./")
  • Reading local files: Use read.table(), read.csv() functions.
    dataset<-read.csv("fdata.csv")
    class(dataset)
    ## [1] "data.frame"
    dim(dataset)
    ## [1] 14931   239
    • Important parameters: sep, header, quote, na.strings, nrows, skip.
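
The parameters listed above can be tried on a small throwaway file. This is only a sketch: the file contents, the column names and the "-" missing-value marker are made up for illustration and are not taken from fdata.csv.

```r
# Write a small semicolon-separated file with one junk line on top (hypothetical data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("exported by some tool",
             "id;name;score",
             "1;Ram;90",
             "2;Shyam;-",
             "3;Laxman;75"), tmp)

# skip = 1 drops the junk line, header = TRUE uses the next line as column names,
# na.strings = "-" turns "-" into NA, nrows = 2 reads only the first two data rows
d <- read.table(tmp, sep = ";", header = TRUE, skip = 1,
                na.strings = "-", nrows = 2)
d
is.na(d$score)
```

read.csv() accepts the same arguments; it is read.table() with sep="," and header=TRUE as defaults.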

Contd…

  • There are many more methods in R to read from different data sources such as Excel, XML, JSON, MySQL, PostgreSQL, from web and APIs etc.

Pre-processing

  • Subsetting revisited
    • Using logical ANDs and ORs
    student_id<-c(1,2,3)
    student_names<-c("Ram","Shyam","Laxman")
    position<-c("First","Second","Third")
    data<-data.frame(student_id,student_names,position) #using data.frame() function
    data[data$student_id>=2 & data$position=="Third",]
    ##   student_id student_names position
    ## 3          3        Laxman    Third
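
The example above uses a logical AND. A minimal sketch of the OR case follows; the data frame is rebuilt under the hypothetical name students so that the snippet runs on its own.

```r
# Same toy data as above, rebuilt under an illustrative name
students <- data.frame(student_id = c(1, 2, 3),
                       student_names = c("Ram", "Shyam", "Laxman"),
                       position = c("First", "Second", "Third"))

# Logical OR (|): keep rows where at least one condition holds
either <- students[students$student_id == 1 | students$position == "Third", ]
either

# subset() expresses the same filter without repeating the data frame name
subset(students, student_id == 1 | position == "Third")
```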

Contd…

  • Sorting and ordering: by using the sort() and order() functions. sort() returns the sorted values; order() returns the positions that would sort them.
    sort(data$student_names)
    ## [1] Laxman Ram    Shyam 
    ## Levels: Laxman Ram Shyam
    data[order(data$student_names),]
    ##   student_id student_names position
    ## 3          3        Laxman    Third
    ## 1          1           Ram    First
    ## 2          2         Shyam   Second
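
A short sketch of descending order and of ordering by more than one column. The scores data frame is hypothetical and exists only for this example.

```r
# Hypothetical data for illustration
scores <- data.frame(name  = c("Ram", "Shyam", "Laxman", "Bharat"),
                     year  = c(2015, 2014, 2015, 2014),
                     marks = c(70, 85, 90, 60))

# sort() with decreasing = TRUE returns the values from largest to smallest
sort(scores$marks, decreasing = TRUE)

# order() accepts several keys: year ascending, then marks descending within a year
sorted <- scores[order(scores$year, -scores$marks), ]
sorted
```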

Contd…

  • Handling missing values: NA – missing value, NaN – result of an undefined mathematical expression ("not a number"). is.na() is TRUE for both; is.nan() is TRUE only for NaN.
x<-c(1,2,NA,20,55,NaN)
#checking for NAs,NaNs
is.na(x)
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
is.nan(x)
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE

Contd…

#removing NAs
bad<-is.na(x)
x[!bad]
## [1]  1  2 20 55
#taking subset with no missing values
good<-complete.cases(x) #logical vector: TRUE for each element (or row) with no missing value
good
## [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE
x[good]
## [1]  1  2 20 55
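
complete.cases() is most useful on data frames, where it checks each row. A minimal sketch, using a made-up data frame df:

```r
# Hypothetical data frame with missing values in two different columns
df <- data.frame(id     = 1:4,
                 height = c(150, NA, 165, 170),
                 weight = c(50, 60, NA, 72))

complete.cases(df)                 # TRUE only for rows without any NA
clean <- df[complete.cases(df), ]  # keep the complete rows
clean

na.omit(df)                        # drops the incomplete rows in one step
mean(df$height, na.rm = TRUE)      # many summary functions can skip NAs
```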

Contd…

  • Reshaping data: data often needs to be changed from one format to another.
    • Using reshape2 package:
    library(reshape2)
    ## Warning: package 'reshape2' was built under R version 3.0.3
    #converts to flat format, unique id-variable combination
    mdata<-melt(data)
    ## Using student_names, position as id variables
    mdata
    ##   student_names position   variable value
    ## 1           Ram    First student_id     1
    ## 2         Shyam   Second student_id     2
    ## 3        Laxman    Third student_id     3

Contd…

dcast(mdata, student_names~variable) #casts a molten data frame to a data frame or array
##   student_names student_id
## 1        Laxman          3
## 2           Ram          1
## 3         Shyam          2
split(data,data$student_id) #splits data into groups
## $`1`
##   student_id student_names position
## 1          1           Ram    First
## 
## $`2`
##   student_id student_names position
## 2          2         Shyam   Second
## 
## $`3`
##   student_id student_names position
## 3          3        Laxman    Third

Contd…

#adding a new variable
data$year<-c(2015,2015,2015)
data
##   student_id student_names position year
## 1          1           Ram    First 2015
## 2          2         Shyam   Second 2015
## 3          3        Laxman    Third 2015

Contd…

  • Another important package for reshaping is plyr (split-apply-combine paradigm for R).
library(plyr)
## Warning: package 'plyr' was built under R version 3.0.3
#ddply() function - takes a data frame as input, returns a data frame
ddply(data,"student_id",count) #the grouping variable is passed by name
##   student_id student_names position year freq
## 1          1           Ram    First 2015    1
## 2          2         Shyam   Second 2015    1
## 3          3        Laxman    Third 2015    1
  • Merging – merge(), intersect() etc.
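
A minimal sketch of merge() and intersect(). The two data frames and the marks column are hypothetical, chosen so that only some IDs match.

```r
# Two hypothetical tables sharing the key column student_id
students <- data.frame(student_id = c(1, 2, 3),
                       student_names = c("Ram", "Shyam", "Laxman"))
marks <- data.frame(student_id = c(2, 3, 4),
                    marks = c(85, 90, 60))

# Default merge keeps only IDs present in both tables (inner join)
both <- merge(students, marks, by = "student_id")
both

# all = TRUE keeps unmatched rows from both sides and fills the gaps with NA
all_rows <- merge(students, marks, by = "student_id", all = TRUE)
all_rows

# intersect() works on vectors: the IDs common to both tables
common_ids <- intersect(students$student_id, marks$student_id)
common_ids
```

all.x = TRUE or all.y = TRUE keeps unmatched rows from only one side.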

Summary analysis

  • Datasets are often very large, so it is important to collect summary statistics.
dim(ToothGrowth) #dimensions of dataset
## [1] 60  3
head(ToothGrowth) #shows first part of dataset. try tail().
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Contd…

summary(ToothGrowth) #reports summary of dataset
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
str(ToothGrowth) #more information
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

Contd…

#computes summary statistics on subsets of data
aggregate(ToothGrowth,by=list(ToothGrowth$dose),class)
##   Group.1     len   supp    dose
## 1     0.5 numeric factor numeric
## 2     1.0 numeric factor numeric
## 3     2.0 numeric factor numeric
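
The call above applies class() to each group, which shows the mechanics of aggregate() rather than a statistic. A sketch of the more typical use, with mean() and the formula interface on the same built-in dataset:

```r
# Mean tooth length for each dose
by_dose <- aggregate(len ~ dose, data = ToothGrowth, FUN = mean)
by_dose

# Mean tooth length for each dose and supplement combination
by_dose_supp <- aggregate(len ~ dose + supp, data = ToothGrowth, FUN = mean)
by_dose_supp
```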

Contd…

quantile(ToothGrowth$len, na.rm=TRUE) #quantiles
##     0%    25%    50%    75%   100% 
##  4.200 13.075 19.250 25.275 33.900
table(ToothGrowth$dose, ToothGrowth$supp) #tabulate data based on parameters
##      
##       OJ VC
##   0.5 10 10
##   1   10 10
##   2   10 10
object.size(ToothGrowth) #size of dataset
## 2568 bytes

Contd…

mean(ToothGrowth$len) #mean
## [1] 18.81333
median(ToothGrowth$len) #median
## [1] 19.25
var(ToothGrowth$len) #variance
## [1] 58.51202
sd(ToothGrowth$len) #standard deviation
## [1] 7.649315

Contd…

range(ToothGrowth$len) #range
## [1]  4.2 33.9
  • Some other important functions to try: xtabs(), ftable(), prop.table(), margin.table() etc.
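
A brief sketch of the functions named above, again on ToothGrowth:

```r
# xtabs() builds a contingency table from a formula
tab <- xtabs(~ dose + supp, data = ToothGrowth)
tab

margin.table(tab, 1)  # totals over rows: number of observations per dose
prop.table(tab, 1)    # proportions within each row (each row sums to 1)
ftable(tab)           # "flat" layout, handy for tables with 3 or more factors
```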

Simulation, sequencing and sampling

  • Simulation: Useful for drawing inferences from the results of data analysis
    • Functions for probability (normal) distribution: rnorm(), dnorm(), pnorm(), qnorm()
    • r – random no. generation, d – density, p – cumulative distribution, q – quantile
set.seed(3) #sets random no. seed
x<-rnorm(5)
x
## [1] -0.9619334 -0.2925257  0.2587882 -1.1521319  0.1957828
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.1520 -0.9619 -0.2925 -0.3904  0.1958  0.2588
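
The d, p and q functions can be sketched the same way, for the standard normal distribution (mean 0, sd 1, which are the defaults):

```r
dnorm(0)       # density at 0, equal to 1/sqrt(2*pi)
pnorm(1.96)    # P(X <= 1.96), about 0.975
qnorm(0.975)   # inverse of pnorm: the 97.5% quantile, about 1.96

# Setting the same seed again reproduces the same random numbers
set.seed(3); a <- rnorm(5)
set.seed(3); b <- rnorm(5)
identical(a, b)
```

Other distributions follow the same naming pattern, for example runif(), dbinom() and ppois().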

Getting started with R

Overview

  • What is R?
  • R’s correspondence with S
  • R features
  • Useful URLs
  • Installing R, RStudio
  • R and Statistics
  • Using R – Getting Started

What is R?

Contd…

  • Useful R books:
    • R in Action by Robert I. Kabacoff. Pub.: Manning Publications
    • Statistical Analysis with R by John M. Quick. Pub.: PACKT Publishing
    • Many more R e-books available through Books24X7 (available to CDAC through MCIT consortium).

Contd…

  • R and statistics:
    • A comprehensive statistical platform providing all sorts of data analytics techniques.
    • Strong graphics capabilities to visualize complex data.
    • Designed to support interactive data analysis and exploration.
    • Capable of reading data from variety of sources.
    • Facility to program new statistical methods and packages.
  • Some disadvantages too…
    • Objects stored in primary memory. May impose performance bottlenecks in case of large datasets.
    • No built-in dynamic or 3D graphics, but external packages such as plot3D and scatterplot3d are available.
    • Similarly, no built-in support for web-based processing. Can be done through third-party packages.
    • Functionality scattered among packages

Using R – Getting started

  • Launch R Interface/RStudio depending on your platform.
  • Utility commands/functions:
    • setwd() – sets working directory.
    setwd("C:/RDemo1")
    • getwd() – gets current working directory.
    getwd()
    ## [1] "C:/RDemo1"
    • dir() – lists the contents of current working directory.
    dir()
    ## [1] "R-Basics.html" "R-Basics.Rmd"
    • ls() – lists names of objects in R environment
    ls()
    ## [1] "metadata"

Contd…

  • help.start() – provides general help.
  • help("foo") or ?foo – help on function "foo". For ex. help("mean") or ?mean.
  • help.search("foo") or ??foo – search for string "foo" in help system. For ex. help.search("mean") or ??mean
  • example("foo") – shows examples of function "foo".
    example("mean")
    ## 
    ## mean> x <- c(0:10, 50)
    ## 
    ## mean> xm <- mean(x)
    ## 
    ## mean> c(xm, mean(x, trim = 0.10))
    ## [1] 8.75 5.50
  • data() – lists all example datasets in currently loaded packages.
  • library() – lists all available packages

Contd…

  • data(foo) – loads dataset “foo” in R. For ex. data(mtcars)
  • library(foo) – load package “foo” in R. For ex. library(plyr).
  • rm(objectlist) – removes one or more objects from R workspace.
  • options() – shows/sets current options for workspace.
  • history(#) – lists last # commands. default 25.
  • install.packages("foo") – installs package "foo". For ex. install.packages("reshape2").
  • help(package="package-name") – provides brief description of package, an index of functions and datasets in package.
  • print(x) or x – prints object 'x' on terminal.
  • q() – quits current R session.

Using R – Data types

  • Five basic types in R are – character, numeric, integer, complex, logical (TRUE/FALSE).
  • Common data objects are – vector, matrix, list, factor, data frame, table.
  • Creating and assigning to a variable:
x<-1
  • Checking the type of variable:
class(x)
## [1] "numeric"

Contd…

  • Printing a variable:
x #auto-printing
## [1] 1
print(x) #explicit printing
## [1] 1
  • Creating Vector: contains objects of same class.
x<-c(1,2,3) #using c() function
y<-vector("logical", length=10) #using vector() function
length(x) #length of vector x
## [1] 3

Contd…

  • Vector operations: Various arithmetic operations can be performed member-wise.
y<-c(4,5,6)
5*x #multiplication by a scalar
## [1]  5 10 15
x+y #addition of two vectors
## [1] 5 7 9
x*y #multiplication of two vectors
## [1]  4 10 18
x^y #x to the power y
## [1]   1  32 729

Contd…

  • Creating Matrix: Two-dimensional array having elements of same class.
m<-matrix(c(1,2,3,11,12,13), nrow=2,ncol=3) #using matrix() function.
m
##      [,1] [,2] [,3]
## [1,]    1    3   12
## [2,]    2   11   13
dim(m) #dimensions of matrix m
## [1] 2 3
attributes(m) #attributes of matrix m
## $dim
## [1] 2 3

Contd…

  • By default, elements in a matrix are filled by column. The "byrow" argument of matrix() can be used to fill elements by row.
m<-matrix(c(1,2,3,11,12,13), nrow=2,ncol=3, byrow = TRUE)
m
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]   11   12   13

Contd…

  • cbind-ing and rbind-ing: By using cbind() and rbind() functions
x<-c(1,2,3)
y<-c(11,12,13)
cbind(x,y)
##      x  y
## [1,] 1 11
## [2,] 2 12
## [3,] 3 13
rbind(x,y)
##   [,1] [,2] [,3]
## x    1    2    3
## y   11   12   13

Contd…

  • Matrix operations/functions:
p<-3*m #multiplication by a scalar
n<-matrix(c(4,5,6,14,15,16), nrow=2,ncol=3)
q<-m+n #addition of two matrices
o<-matrix(c(4,5,6,14,15,16), nrow=3,ncol=2)
r<-m %*% o #matrix multiplication by using %*%
mdash<-t(m) #transpose of matrix
s<-matrix(c(4,5,6,14,15,16,24,25,26), nrow=3,ncol=3,
          byrow=TRUE)
s_det<-det(s) #determinant of s
m_row_sum<-rowSums(m)
m_col_sum<-colSums(m)

Contd…

p
##      [,1] [,2] [,3]
## [1,]    3    6    9
## [2,]   33   36   39
q
##      [,1] [,2] [,3]
## [1,]    5    8   18
## [2,]   16   26   29
r
##      [,1] [,2]
## [1,]   32   92
## [2,]  182  542

Contd…

mdash
##      [,1] [,2]
## [1,]    1   11
## [2,]    2   12
## [3,]    3   13
s_det
## [1] 1.110223e-14
m_row_sum
## [1]  6 36
m_col_sum
## [1] 12 14 16

Contd…

  • List: A special type of vector containing elements of different classes
x<-list(1,"p",TRUE,2+4i) #using list() function
x
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "p"
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] 2+4i

Contd…

  • Factor: Represents categorical data. Can be ordered or unordered.
    status<-c("low","high","medium","high","low")
    x<-factor(status, ordered=TRUE,
            levels=c("low","medium","high")) #using factor() function
    x
    ## [1] low    high   medium high   low   
    ## Levels: low < medium < high
    • ‘levels’ argument is used to set the order of levels.
    • First level forms the baseline level.
    • Without any order, levels are called nominal. Ex. – Type1, Type2, …
    • With order, levels are called ordinal. Ex. – low, medium, …
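
A small sketch of what the ordering buys in practice, reusing the status vector from above:

```r
status <- c("low", "high", "medium", "high", "low")
x <- factor(status, ordered = TRUE, levels = c("low", "medium", "high"))

table(x)        # counts per level, reported in level order
as.integer(x)   # underlying integer codes: 1 3 2 3 1
x >= "medium"   # comparisons are meaningful only for ordered factors

# Without 'levels', the levels default to sorted (alphabetical) order
u <- factor(status)
levels(u)
```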

Contd…

  • Data frame: Used to store tabular data. Can contain different classes
student_id<-c(1,2,3)
student_names<-c("Ram","Shyam","Laxman")
position<-c("First","Second","Third")
data<-data.frame(student_id,student_names,position) #using data.frame() function
data
##   student_id student_names position
## 1          1           Ram    First
## 2          2         Shyam   Second
## 3          3        Laxman    Third
data$student_id #accessing a particular column
## [1] 1 2 3

Contd…

nrow(data) #no. of rows in data
## [1] 3
ncol(data) #no. of columns in data
## [1] 3
names(data) #column names of data
## [1] "student_id"    "student_names" "position"

Using R – Control structures

  • R provides all types of control structures: if-else, for, while, repeat, break, next, return.
  • Mainly used within functions/scripts.
x<-5
if(x > 7) { #if-else structure
  y<-TRUE
} else {
  y<-FALSE
}
y
## [1] FALSE
for(i in 1:10) #for loop
  print(i)
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

Contd…

count<-0
while(count < 10) #while loop
  count<-count+1
count
## [1] 10
  • repeat is used to create an infinite loop. It can be terminated only through a call to break.
  • next is used to skip an iteration in a loop.
  • return is used to return a value from a function.
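
The three keywords above can be sketched together. The names evens and first_negative are illustrative only.

```r
# repeat + break + next: collect the even numbers from 1 to 10
i <- 0
evens <- c()
repeat {
  i <- i + 1
  if (i > 10) break        # break is the only way out of repeat
  if (i %% 2 == 1) next    # skip odd numbers, go to the next iteration
  evens <- c(evens, i)
}
evens

# return: leave a function early with a value
first_negative <- function(v) {
  for (e in v) {
    if (e < 0) return(e)   # stop at the first negative element
  }
  NA                       # reached only if no element is negative
}
first_negative(c(3, -1, -5))
```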

Using R – looping functions

  • These functions can be used to loop over various types of objects.
  • lapply – loops over a list and evaluates a function on each element.
  • sapply – same as lapply but tries to simplify the result.
  • apply – apply a function over the margins of an array
  • tapply – apply a function over the subsets of a vector
x<-list(a=1:5,b=rnorm(20))
lapply(x,sum) #lapply returns a list
## $a
## [1] 15
## 
## $b
## [1] -0.8801658
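
For comparison, a minimal sketch of sapply() next to lapply(). The list v uses fixed values (chosen for illustration) so that the result is reproducible.

```r
v <- list(a = 1:5, b = c(2.5, 3.5), c = 10)

lapply(v, mean)        # always a list, one element per input element
s <- sapply(v, mean)   # simplified to a named numeric vector
s
```

sapply() simplifies only when it can, for example when every result has length 1; otherwise it falls back to a list.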

Contd…

x<-matrix(c(1,2,3,11,12,13), nrow=2, ncol=3,byrow=TRUE)
# MARGIN=1 for rows, MARGIN=2 for columns
apply(x,MARGIN=1,FUN=sum)
## [1]  6 36
y<-c(rnorm(20),runif(20),rnorm(20,1))
f<-gl(3,20) #generate factor levels as per given pattern
tapply(y,f,mean)
##          1          2          3 
## -0.2668254  0.5382292  0.9893389

Using R – Subsetting

  • Refers to extracting a sub-segment of data from R objects.
  • Important while working with large datasets.
  • There are various operators.
  • [ returns an object of the same class as the original; generally used on a vector or matrix, and can select more than one element.
  • [[ extracts a single element of a list or data frame; the result need not be a list.
  • $ extracts an element of a list or data frame by name.
x<-c(1,2,3,4)
x[2]
## [1] 2
x[1:3]
## [1] 1 2 3

Contd…

  • Subsetting a matrix:
x<-matrix(c(1,2,3,11,12,13), nrow=2, ncol=3,byrow=TRUE)
x[1,2]
## [1] 2
x[1,]
## [1] 1 2 3
x[,2]
## [1]  2 12
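
Note that x[1,] above comes back as a plain vector. A short sketch of keeping the matrix shape with drop = FALSE, and of subsetting with a logical condition:

```r
x <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3, byrow = TRUE)

r1 <- x[1, ]                  # dimensions are dropped: a vector of length 3
r1m <- x[1, , drop = FALSE]   # stays a 1 x 3 matrix
dim(r1m)

big <- x[x > 2]               # logical subsetting; values come out column by column
big
```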

Contd…

  • Subsetting a list:
x<-list(a=1,b="p",c=TRUE,d=2+4i)
x[[1]]
## [1] 1
x$d
## [1] 2+4i
x[["c"]]
## [1] TRUE
x["b"]
## $b
## [1] "p"
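
The difference between [ and [[ on a list is easiest to see from the class of the result. A minimal sketch on the same list; the variable nm is illustrative.

```r
x <- list(a = 1, b = "p", c = TRUE, d = 2+4i)

class(x["b"])      # "list": [ returns a sub-list
class(x[["b"]])    # "character": [[ returns the element itself

x[c("a", "c")]     # [ can select several elements at once

nm <- "d"
x[[nm]]            # [[ accepts a computed name; x$nm would look for an element called "nm"
```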

Contd…

  • Subsetting a data frame
data[1,]
##   student_id student_names position
## 1          1           Ram    First
data$student_names
## [1] Ram    Shyam  Laxman
## Levels: Laxman Ram Shyam
data[data$position=="Second",]
##   student_id student_names position
## 2          2         Shyam   Second
  • Using logical ANDs and ORs
    data[data$student_id>=2 & data$position=="Third",]
    ##   student_id student_names position
    ## 3          3        Laxman    Third

Using R – Functions

  • Created using the function() directive.
  • Can be passed as arguments to other functions. Can be nested.
  • Return value is the last expression to be evaluated inside function body.
  • Have named arguments with default values.
  • Some arguments can be missing during function calls.
add<-function(a=1,b=2,c=3) {
   s = a+b+c
   print(s)
  }
add()
## [1] 6
add(10,11,12)
## [1] 33
add(10)
## [1] 15
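
A sketch of the remaining points from the list above: returning a value instead of printing it, calling with named arguments, and passing a function as an argument. The names add3 and apply_twice are hypothetical.

```r
# The value of the last expression is returned, so it can be stored and reused
add3 <- function(a = 1, b = 2, c = 3) {
  a + b + c
}
total <- add3(b = 10)   # named argument; a and c keep their defaults
total                   # 14

# Functions are ordinary objects and can be passed to other functions
apply_twice <- function(f, v) f(f(v))
apply_twice(sqrt, 16)   # sqrt(sqrt(16)) = 2
```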

R Source files

  • Should be saved/created with .R extension.
  • Can be used to store functions, commands required to be executed sequentially etc.
  • source() function used to load such R scripts into R workspace.
source("C:/RDemo/test.R")
add()
## [1] 6

Contd…

source("C:/RDemo/test1.R", echo=TRUE)
## 
## > x <- 1
## 
## > y <- 2
## 
## > x + y
## [1] 3
source("C:/RDemo/test1.R", print.eval=TRUE)
## [1] 3
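
The files under C:/RDemo are not shown here, so the sketch below writes its own script to a temporary file before sourcing it. The script contents and the names square and result are made up for illustration.

```r
# Create a small script file on the fly (hypothetical contents)
script <- tempfile(fileext = ".R")
writeLines(c("square <- function(v) v^2",
             "result <- square(4)"), script)

# source() runs the file; objects it creates become available in the workspace
source(script)
result
square(5)
```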
