Data Handling with R

Overview

  • Raw Vs Processed data
  • Reading data into R
  • Pre-processing
  • Summary analysis
  • Useful data sources

Raw Vs Processed data

  • Raw data:
    • Original source of data
    • Comes in wide varieties
    • Hard to use directly for data analysis.
    • Needs to be processed.
  • Processed data
    • Ready for data analysis.
    • Processing involves transforming, subsetting, merging etc.
    • Processing should be performed as per set standards.
    • All processing steps should be recorded.
  • Ingredients of data analysis pipeline:
    • Raw data
    • Tidy (processed) data ready for analysis
    • Codebook describing each variable and its values in tidy dataset, other variables not in dataset, summary choices, experimental study design etc.
    • Explicit and exact step-by-step approach for data analysis against said objectives.

Reading data into R

  • Downloading from Internet:
    • download.file() function: download.file(url="fileurl", destfile="filename", method="method-name")
    • methods: curl, wget, lynx, internal
    #fileurl<-"https://data.gov.in/resources/weekly-wholesale-price-turarhar-dal-upto-2012/download"
    #download.file(url=fileurl, destfile="tur-dal-price-upto-2012.csv", method="auto")
    #list.files("./")
  • Reading local files: Use read.table(), read.csv() functions.
    dataset<-read.csv("fdata.csv")
    class(dataset)
    ## [1] "data.frame"
    dim(dataset)
    ## [1] 14931   239
    • Important parameters: sep, header, quote, na.strings, nrows, skip.
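
The parameters listed above can be tried without any file on disk. The sketch below is only an illustration: the semicolon-separated text, the column names and the -999 missing-value marker are all made up for the example, and read.table() reads it through its text argument.

```r
# Hypothetical semicolon-separated data held in a string (not from the slides).
csv_text <- "exported by a hypothetical tool
id;name;score
1;Ram;78
2;Shyam;NA
3;Laxman;-999
4;Sita;91"

scores <- read.table(
  text = csv_text,
  sep = ";",                      # fields are separated by semicolons
  header = TRUE,                  # first line read holds the column names
  skip = 1,                       # ignore the free-text line at the top
  na.strings = c("NA", "-999"),   # both markers are treated as missing
  nrows = 3,                      # read only the first three data rows
  quote = "\"",                   # only double quotes delimit strings
  stringsAsFactors = FALSE        # keep names as character, not factor
)
scores
str(scores)
```

Because nrows is 3, the fourth data row is never read, and both "NA" and "-999" end up as NA in the score column.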

Contd…

  • There are many more methods in R to read from different data sources such as Excel, XML, JSON, MySQL, PostgreSQL, from web and APIs etc.

Pre-processing

  • Subsetting revisited
    • Using logical ANDs and ORs
    student_id<-c(1,2,3)
    student_names<-c("Ram","Shyam","Laxman")
    position<-c("First","Second","Third")
    data<-data.frame(student_id,student_names,position) #using data.frame() function
    data[data$student_id>=2 & data$position=="Third",]
    ##   student_id student_names position
    ## 3          3        Laxman    Third
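
The example above shows only a logical AND. A minimal sketch of the OR case on the same student data follows; the %in% and subset() variants are added here as common alternatives, not as something the slides prescribe.

```r
student_id <- c(1, 2, 3)
student_names <- c("Ram", "Shyam", "Laxman")
position <- c("First", "Second", "Third")
data <- data.frame(student_id, student_names, position)

# OR: keep rows where at least one condition holds (rows 1 and 3)
either <- data[data$student_id == 1 | data$position == "Third", ]

# %in% is a compact alternative to chaining several == tests with |
picked <- data[data$position %in% c("First", "Third"), ]

# subset() evaluates the condition inside the data frame (row 3 only)
both <- subset(data, student_id >= 2 & position == "Third")
either
```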

Contd…

  • Sorting and ordering: using the sort() and order() functions.
    sort(data$student_names)
    ## [1] Laxman Ram    Shyam 
    ## Levels: Laxman Ram Shyam
    data[order(data$student_names),]
    ##   student_id student_names position
    ## 3          3        Laxman    Third
    ## 1          1           Ram    First
    ## 2          2         Shyam   Second

Contd…

  • Handling missing values: NA – missing value; NaN – result of an undefined mathematical operation (e.g. 0/0)
x<-c(1,2,NA,20,55,NaN)
#checking for NAs,NaNs
is.na(x)
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
is.nan(x)
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE

Contd…

#removing NAs
bad<-is.na(x)
x[!bad]
## [1]  1  2 20 55
#taking subset with no missing values
good<-complete.cases(x) #logical vector: TRUE where a case has no missing values
good
## [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE
x[good]
## [1]  1  2 20 55
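
The same functions work row-wise on a data frame, which is where complete.cases() is most useful. The sketch below uses a small made-up data frame (the column names and values are assumptions for illustration only).

```r
# Hypothetical measurements with gaps
df <- data.frame(id = 1:4,
                 height = c(150, NA, 165, 170),
                 weight = c(50, 60, NaN, 72))

na_per_column <- colSums(is.na(df))   # NaN is counted as missing too
clean <- df[complete.cases(df), ]     # keeps rows 1 and 4 only
same <- na.omit(df)                   # drops the same incomplete rows

mean(df$height)                       # NA, because one value is missing
avg_height <- mean(df$height, na.rm = TRUE)  # ignores the missing value
```

Many summary functions (mean(), sum(), quantile() and others) accept na.rm=TRUE, which is often simpler than filtering the data first.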

Contd…

  • Reshaping data: data often needs to be converted from one format to another.
    • Using reshape2 package:
    library(reshape2)
    ## Warning: package 'reshape2' was built under R version 3.0.3
    #converts to flat format, unique id-variable combination
    mdata<-melt(data)
    ## Using student_names, position as id variables
    mdata
    ##   student_names position   variable value
    ## 1           Ram    First student_id     1
    ## 2         Shyam   Second student_id     2
    ## 3        Laxman    Third student_id     3

Contd…

dcast(mdata, student_names~variable) #casts a molten data frame to a data frame or array
##   student_names student_id
## 1        Laxman          3
## 2           Ram          1
## 3         Shyam          2
split(data,data$student_id) #splits data into groups
## $`1`
##   student_id student_names position
## 1          1           Ram    First
## 
## $`2`
##   student_id student_names position
## 2          2         Shyam   Second
## 
## $`3`
##   student_id student_names position
## 3          3        Laxman    Third

Contd…

#adding a new variable
data$year<-c(2015,2015,2015)
data
##   student_id student_names position year
## 1          1           Ram    First 2015
## 2          2         Shyam   Second 2015
## 3          3        Laxman    Third 2015

Contd…

  • Another important package for reshaping is plyr (split-apply-combine paradigm for R).
library(plyr)
## Warning: package 'plyr' was built under R version 3.0.3
#ddply() function - takes a data frame as input, returns a data frame
ddply(data,"student_id",count) #group by the student_id column, then count rows
##   student_id student_names position year freq
## 1          1           Ram    First 2015    1
## 2          2         Shyam   Second 2015    1
## 3          3        Laxman    Third 2015    1
  • Merging – merge(), intersect() etc.
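
As a rough illustration of merging, the sketch below joins two small data frames on a shared key. The marks table and its values are hypothetical and exist only for this example.

```r
students <- data.frame(student_id = c(1, 2, 3),
                       student_names = c("Ram", "Shyam", "Laxman"))
marks <- data.frame(student_id = c(2, 3, 4),
                    marks = c(67, 82, 90))   # hypothetical marks

# Inner join: only ids present in both tables (2 and 3)
inner <- merge(students, marks, by = "student_id")

# Left join: keep every student, fill missing marks with NA
left <- merge(students, marks, by = "student_id", all.x = TRUE)

# intersect() works on plain vectors: ids common to both tables
common_ids <- intersect(students$student_id, marks$student_id)
inner
left
```

Setting all.y=TRUE or all=TRUE in merge() gives a right or full outer join in the same way.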

Summary analysis

  • Datasets are often very large, so it is important to compute summary statistics.
dim(ToothGrowth) #dimensions of dataset
## [1] 60  3
head(ToothGrowth) #shows first part of dataset. try tail().
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Contd…

summary(ToothGrowth) #reports summary of dataset
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
str(ToothGrowth) #more information
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

Contd…

#computes summary statistics on subsets of data
aggregate(ToothGrowth,by=list(ToothGrowth$dose),class)
##   Group.1     len   supp    dose
## 1     0.5 numeric factor numeric
## 2     1.0 numeric factor numeric
## 3     2.0 numeric factor numeric
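
The call above reports the class of each column per group. For an actual summary statistic, a formula-based call such as the following sketch may be more typical; choosing mean of len by dose and supp is just an illustrative assumption.

```r
# Mean tooth length for every dose/supplement combination
group_means <- aggregate(len ~ dose + supp, data = ToothGrowth, FUN = mean)
group_means

# All six groups have 10 observations, so the average of the group means
# equals the overall mean of len
mean(group_means$len)
mean(ToothGrowth$len)
```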

Contd…

quantile(ToothGrowth$len, na.rm=TRUE) #quantiles
##     0%    25%    50%    75%   100% 
##  4.200 13.075 19.250 25.275 33.900
table(ToothGrowth$dose, ToothGrowth$supp) #tabulate data based on parameters
##      
##       OJ VC
##   0.5 10 10
##   1   10 10
##   2   10 10
object.size(ToothGrowth) #size of dataset
## 2568 bytes

Contd…

mean(ToothGrowth$len) #mean
## [1] 18.81333
median(ToothGrowth$len) #median
## [1] 19.25
var(ToothGrowth$len) #variance
## [1] 58.51202
sd(ToothGrowth$len) #standard deviation
## [1] 7.649315

Contd…

range(ToothGrowth$len) #range
## [1]  4.2 33.9
  • Some other important functions to try: xtabs(), ftable(), prop.table(), margin.table() etc.
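
A short sketch of how these functions fit together, using the built-in ToothGrowth data from the earlier slides:

```r
# Contingency table from a formula; same counts as table() above
tab <- xtabs(~ dose + supp, data = ToothGrowth)

# Share of each supplement within a dose (margin = 1 means "per row")
row_share <- prop.table(tab, margin = 1)

# Total number of observations per dose
dose_totals <- margin.table(tab, margin = 1)

# Flat, printable layout of the same table
ftable(tab)
row_share
dose_totals
```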

Simulation, sequencing and sampling

  • Simulation: Useful for drawing inferences from data analysis
    • Functions for the normal probability distribution: rnorm(), dnorm(), pnorm(), qnorm()
    • r – random number generation, d – density, p – cumulative distribution, q – quantile
set.seed(3) #sets random no. seed
x<-rnorm(5)
x
## [1] -0.9619334 -0.2925257  0.2587882 -1.1521319  0.1957828
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.1520 -0.9619 -0.2925 -0.3904  0.1958  0.2588
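
The other three prefixes can be sketched in the same way. The seq() and sample() calls are added on the assumption that they are what "sequencing and sampling" in the section title refers to; the slides themselves do not show them.

```r
density_at_mean <- dnorm(0)   # height of the standard normal curve at 0
prob_below <- pnorm(1.96)     # P(X <= 1.96), roughly 0.975
cutoff <- qnorm(0.975)        # inverse of pnorm, roughly 1.96

# Sequencing: evenly spaced values from -3 to 3
grid <- seq(-3, 3, by = 0.5)

# Sampling: set the seed first so the draw can be reproduced
set.seed(3)
draw <- sample(1:10, size = 4)                          # without replacement
tosses <- sample(c("H", "T"), size = 10, replace = TRUE) # with replacement
```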


Getting started with R

Overview

  • What is R?
  • R’s correspondence with S
  • R features
  • Useful URLs
  • Installing R, RStudio
  • R and Statistics
  • Using R – Getting Started

What is R?

Contd…

  • Useful R books:
    • R in Action by Robert I. Kabacoff. Pub.: Manning Publications
    • Statistical Analysis with R by John M. Quick. Pub.: PACKT Publishing
    • Many more R e-books available through Books24X7 (available to CDAC through MCIT consortium).

  • R and statistics:
    • A comprehensive statistical platform providing all sorts of data analytics techniques.
    • Strong graphics capabilities to visualize complex data.
    • Designed to support interactive data analysis and exploration.
    • Capable of reading data from variety of sources.
    • Facility to program new statistical methods and packages.
  • Some disadvantages too…
    • Objects stored in primary memory. May impose performance bottlenecks in case of large datasets.
    • No built-in dynamic or 3D graphics, but external packages such as plot3D and scatterplot3d are available.
    • Similarly, no built-in support for web-based processing. Can be done through third-party packages.
    • Functionality scattered among packages

Using R – Getting started

  • Launch R Interface/RStudio depending on your platform.
  • Utility commands/functions:
    • setwd() – sets working directory.
    setwd("C:/RDemo1")
    • getwd() – gets current working directory.
    getwd()
    ## [1] "C:/RDemo1"
    • dir() – lists the contents of current working directory.
    dir()
    ## [1] "R-Basics.html" "R-Basics.Rmd"
    • ls() – lists names of objects in R environment
    ls()
    ## [1] "metadata"

Contd…

  • help.start() – provides general help.
  • help("foo") or ?foo – help on function "foo". For ex. help("mean") or ?mean.
  • help.search("foo") or ??foo – search for string "foo" in help system. For ex. help.search("mean") or ??mean
  • example("foo") – shows examples of function "foo".
    example("mean")
    ## 
    ## mean> x <- c(0:10, 50)
    ## 
    ## mean> xm <- mean(x)
    ## 
    ## mean> c(xm, mean(x, trim = 0.10))
    ## [1] 8.75 5.50
  • data() – lists all example datasets in currently loaded packages.
  • library() – lists all available packages

Contd…

  • data(foo) – loads dataset "foo" in R. For ex. data(mtcars)
  • library(foo) – loads package "foo" in R. For ex. library(plyr).
  • rm(objectlist) – removes one or more objects from R workspace.
  • options() – shows/sets current options for workspace.
  • history(#) – lists last # commands. default 25.
  • install.packages("foo") – installs package "foo". For ex. install.packages("reshape2").
  • help(package="package-name") – provides brief description of package, an index of functions and datasets in package.
  • print(x) or x – prints object 'x' on the terminal.
  • q() – quits current R session.

Using R – Data types

  • Five basic types in R are – character, numeric, integer, complex, logical (TRUE/FALSE).
  • Common data objects are – vector, matrix, list, factor, data frame, table.
  • Creating and assigning to a variable:
x<-1
  • Checking the type of variable:
class(x)
## [1] "numeric"

Contd…

  • Printing a variable:
x #auto-printing
## [1] 1
print(x) #explicit printing
## [1] 1
  • Creating Vector: contains objects of same class.
x<-c(1,2,3) #using c() function
y<-vector("logical", length=10) #using vector() function
length(x) #length of vector x
## [1] 3
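
Because a vector holds a single class, mixing types makes R convert the elements to a common one. A small sketch of that behaviour (the variable names are arbitrary):

```r
mixed <- c(1, "a", TRUE)      # everything becomes character
class(mixed)

nums <- c(1.5, TRUE, FALSE)   # logicals become 1 and 0
class(nums)

ints <- 1:3                   # the : operator produces integers
class(ints)

whole <- as.integer("7")      # explicit conversion from character
```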

Contd…

  • Vector operations: Various arithmetic operations can be performed member-wise.
y<-c(4,5,6)
5*x #multiplication by a scalar
## [1]  5 10 15
x+y #addition of two vectors
## [1] 5 7 9
x*y #multiplication of two vectors
## [1]  4 10 18
x^y #x to the power y
## [1]   1  32 729

Contd…

  • Creating Matrix: Two-dimensional array having elements of same class.
m<-matrix(c(1,2,3,11,12,13), nrow=2,ncol=3) #using matrix() function.
m
##      [,1] [,2] [,3]
## [1,]    1    3   12
## [2,]    2   11   13
dim(m) #dimensions of matrix m
## [1] 2 3
attributes(m) #attributes of matrix m
## $dim
## [1] 2 3

Contd…

  • By default, elements in a matrix are filled by column. The byrow argument of matrix() can be used to fill elements by row.
m<-matrix(c(1,2,3,11,12,13), nrow=2,ncol=3, byrow = TRUE)
m
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]   11   12   13

Contd…

  • cbind-ing and rbind-ing: By using cbind() and rbind() functions
x<-c(1,2,3)
y<-c(11,12,13)
cbind(x,y)
##      x  y
## [1,] 1 11
## [2,] 2 12
## [3,] 3 13
rbind(x,y)
##   [,1] [,2] [,3]
## x    1    2    3
## y   11   12   13

Contd…

  • Matrix operations/functions:
p<-3*m #multiplication by a scalar
n<-matrix(c(4,5,6,14,15,16), nrow=2,ncol=3)
q<-m+n #addition of two matrices
o<-matrix(c(4,5,6,14,15,16), nrow=3,ncol=2)
r<-m %*% o #matrix multiplication by using %*%
mdash<-t(m) #transpose of matrix
s<-matrix(c(4,5,6,14,15,16,24,25,26), nrow=3,ncol=3,
          byrow=TRUE)
s_det<-det(s) #determinant of s (s is singular, so the result is zero up to floating-point error)
m_row_sum<-rowSums(m)
m_col_sum<-colSums(m)

Contd…

p
##      [,1] [,2] [,3]
## [1,]    3    6    9
## [2,]   33   36   39
q
##      [,1] [,2] [,3]
## [1,]    5    8   18
## [2,]   16   26   29
r
##      [,1] [,2]
## [1,]   32   92
## [2,]  182  542

Contd…

mdash
##      [,1] [,2]
## [1,]    1   11
## [2,]    2   12
## [3,]    3   13
s_det
## [1] 1.110223e-14
m_row_sum
## [1]  6 36
m_col_sum
## [1] 12 14 16
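
The value printed for s_det is tiny but not exactly zero. A sketch of why: consecutive rows of s differ by the same constant vector, so the rows are linearly dependent and the exact determinant is 0. The matrix a and the use of solve() are additions for contrast, not part of the slides.

```r
s <- matrix(c(4, 5, 6, 14, 15, 16, 24, 25, 26), nrow = 3, ncol = 3,
            byrow = TRUE)
s[2, ] - s[1, ]               # 10 10 10
s[3, ] - s[2, ]               # 10 10 10, so the rows are dependent
s_rank <- qr(s)$rank          # rank 2, less than 3
near_zero <- isTRUE(all.equal(det(s), 0))  # compare with a tolerance

# A non-singular matrix for contrast (hypothetical values)
a <- matrix(c(2, 1, 1, 3), nrow = 2, ncol = 2)
a_det <- det(a)               # 2*3 - 1*1
a_inv <- solve(a)             # inverse exists because det(a) is not 0
```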

Contd…

  • List: A special type of vector containing elements of different classes
x<-list(1,"p",TRUE,2+4i) #using list() function
x
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "p"
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] 2+4i

Contd…

  • Factor: Represents categorical data. Can be ordered or unordered.
    status<-c("low","high","medium","high","low")
    x<-factor(status, ordered=TRUE,
            levels=c("low","medium","high")) #using factor() function
    x
    ## [1] low    high   medium high   low   
    ## Levels: low < medium < high
    • ‘levels’ argument is used to set the order of levels.
    • First level forms the baseline level.
    • Without any order, levels are called nominal. Ex. – Type1, Type2, …
    • With order, levels are called ordinal. Ex. – low, medium, …
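
A brief sketch of what the ordering buys in practice, reusing the status vector from above:

```r
status <- c("low", "high", "medium", "high", "low")
x <- factor(status, ordered = TRUE, levels = c("low", "medium", "high"))

codes <- as.integer(x)          # position of each value in the level order
at_least_medium <- x >= "medium" # comparisons are allowed on ordered factors
counts <- table(x)              # counts are reported in level order

# Without 'levels', the levels are simply sorted alphabetically
nominal <- factor(status)
levels(nominal)
```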

Contd…

  • Data frame: Used to store tabular data. Can contain different classes
student_id<-c(1,2,3)
student_names<-c("Ram","Shyam","Laxman")
position<-c("First","Second","Third")
data<-data.frame(student_id,student_names,position) #using data.frame() function
data
##   student_id student_names position
## 1          1           Ram    First
## 2          2         Shyam   Second
## 3          3        Laxman    Third
data$student_id #accessing a particular column
## [1] 1 2 3

Contd…

nrow(data) #no. of rows in data
## [1] 3
ncol(data) #no. of columns in data
## [1] 3
names(data) #column names of data
## [1] "student_id"    "student_names" "position"

Using R – Control structures

  • R provides all types of control structures: if-else, for, while, repeat, break, next, return.
  • Mainly used within functions/scripts.
x<-5
if(x > 7) { #if-else structure
  y<-TRUE
} else {
  y<-FALSE
}
y
## [1] FALSE
for(i in 1:10) #for loop
  print(i)
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

Contd…

count<-0
while(count < 10) #while loop
  count<-count+1
count
## [1] 10
  • repeat is used to create an infinite loop. It is normally terminated through a call to break.
  • next is used to skip an iteration in a loop.
  • return is used to return a value from a function.
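
A minimal sketch that combines repeat, break, next and return in one place. The function name sum_odd_upto is hypothetical and chosen only for this illustration.

```r
# Adds up the odd numbers from 1 to 'limit'
sum_odd_upto <- function(limit) {
  total <- 0
  i <- 0
  repeat {
    i <- i + 1
    if (i > limit) break      # leave the loop once the limit is passed
    if (i %% 2 == 0) next     # skip even numbers
    total <- total + i
  }
  return(total)               # hand the result back to the caller
}

sum_odd_upto(7)   # 1 + 3 + 5 + 7 = 16
```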

Using R – looping functions

  • These functions can be used to loop over various types of objects.
  • lapply – loops over a list and evaluates a function on each element.
  • sapply – same as lapply but tries to simplify the result.
  • apply – apply a function over the margins of an array
  • tapply – apply a function over the subsets of a vector
x<-list(a=1:5,b=rnorm(20))
lapply(x,sum) #lapply returns a list
## $a
## [1] 15
## 
## $b
## [1] -0.8801658

Contd…

x<-matrix(c(1,2,3,11,12,13), nrow=2, ncol=3,byrow=TRUE)
# MARGIN=1 for rows, MARGIN=2 for columns
apply(x,MARGIN=1,FUN=sum)
## [1]  6 36
y<-c(rnorm(20),runif(20),rnorm(20,1))
f<-gl(3,20) #generate factor levels as per given pattern
tapply(y,f,mean)
##          1          2          3 
## -0.2668254  0.5382292  0.9893389

Using R – Subsetting

  • Refers to extracting a sub-segment of data from R objects.
  • Important while working with large datasets.
  • There are various operators.
  • [ returns an object of the same class as the original; generally used with a vector or matrix.
  • [[ extracts a single element of a list or data frame.
  • $ extracts an element of a list or data frame by name.
x<-c(1,2,3,4)
x[2]
## [1] 2
x[1:3]
## [1] 1 2 3

Contd…

  • Subsetting a matrix:
x<-matrix(c(1,2,3,11,12,13), nrow=2, ncol=3,byrow=TRUE)
x[1,2]
## [1] 2
x[1,]
## [1] 1 2 3
x[,2]
## [1]  2 12

Contd…

  • Subsetting a list:
x<-list(a=1,b="p",c=TRUE,d=2+4i)
x[[1]]
## [1] 1
x$d
## [1] 2+4i
x[["c"]]
## [1] TRUE
x["b"]
## $b
## [1] "p"
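
The difference between [ and [[ on a list is easiest to see from the class of the result. A short sketch using the same list:

```r
x <- list(a = 1, b = "p", c = TRUE, d = 2 + 4i)

class(x["b"])      # "list": single brackets return a sub-list
class(x[["b"]])    # "character": double brackets return the element itself

two <- x[c("a", "c")]   # single brackets can pick several elements

# [[ accepts a computed name; $ takes the name literally
name <- "d"
by_variable <- x[[name]]   # 2+4i
literal <- x$name          # NULL, there is no element called "name"
```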

Contd…

  • Subsetting a data frame
data[1,]
##   student_id student_names position
## 1          1           Ram    First
data$student_names
## [1] Ram    Shyam  Laxman
## Levels: Laxman Ram Shyam
data[data$position=="Second",]
##   student_id student_names position
## 2          2         Shyam   Second
  • Using logical ANDs and ORs
    data[data$student_id>=2 & data$position=="Third",]
    ##   student_id student_names position
    ## 3          3        Laxman    Third

Using R – Functions

  • Created using the function() directive.
  • Can be passed as arguments to other functions. Can be nested.
  • Return value is the last expression to be evaluated inside function body.
  • Have named arguments with default values.
  • Some arguments can be missing during function calls.
add<-function(a=1,b=2,c=3) {
   s = a+b+c
   print(s)
  }
add()
## [1] 6
add(10,11,12)
## [1] 33
add(10)
## [1] 15
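
The remaining points in the list above (named arguments, functions as arguments, nesting, missing arguments) can be sketched as follows. All function names here are hypothetical.

```r
# Default value and named arguments in any order
power <- function(base, exponent = 2) base^exponent
power(3)                        # 9
power(exponent = 3, base = 2)   # 8

# A function passed as an argument to another function
apply_twice <- function(f, x) f(f(x))
apply_twice(power, 3)           # (3^2)^2 = 81

# A function defined inside, and returned by, another function
make_adder <- function(n) function(x) x + n
add_five <- make_adder(5)
add_five(10)                    # 15

# missing() tells whether the caller supplied an argument
describe <- function(x, label) {
  if (missing(label)) label <- "value"
  paste(label, x)
}
describe(1)                     # "value 1"
```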

R Source files

  • Should be saved/created with .R extension.
  • Can be used to store functions, commands required to be executed sequentially etc.
  • source() function used to load such R scripts into R workspace.
source("C:/RDemo/test.R")
add()
## [1] 6

Contd…

source("C:/RDemo/test1.R", echo=TRUE)
## 
## > x <- 1
## 
## > y <- 2
## 
## > x + y
## [1] 3
source("C:/RDemo/test1.R", print.eval=TRUE)
## [1] 3
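
The paths above exist only on the author's machine. A self-contained sketch of the same idea writes a small script to a temporary file and sources it into a separate environment; the script contents are hypothetical and merely mirror the add() function shown earlier.

```r
# Write a tiny script to a temporary .R file
script_path <- tempfile(fileext = ".R")
writeLines(c("add <- function(a=1, b=2, c=3) a + b + c",
             "result <- add(10)"),
           script_path)

# Source it into its own environment to keep the workspace clean
env <- new.env()
source(script_path, local = env)

env$result          # 10 + 2 + 3 = 15
ls(env)             # objects created by the script
unlink(script_path) # remove the temporary file
```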

Best Practices for Using R Securely

If you download R (or R packages) using an unencrypted Internet connection, there is a possibility that a malicious actor could modify the code in transit (or substitute their own file), if they have access to the connection linking you and the CRAN server delivering the code. (This is possible, for example, when you download R using an unsecured Wi-Fi network.) This could potentially give an attacker the same rights you have to execute code on your system.

To eliminate the possibility of such an attack, the R Consortium recommends that all R users always download R and R packages over an encrypted HTTPS connection from a secure server. Read about Best Practices for Using R Securely.
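
One way to follow this advice inside R is to point the package repository at an HTTPS mirror before installing anything. The sketch below assumes the CRAN cloud mirror; any other HTTPS mirror would serve equally well.

```r
# Use an HTTPS CRAN mirror for this session (mirror choice is an assumption)
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Check that the configured repository really uses HTTPS
repo_url <- getOption("repos")[["CRAN"]]
uses_https <- startsWith(repo_url, "https://")
uses_https

# install.packages("reshape2") would now download from that mirror
```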

Open House at CDAC on the occasion of National Science Day

CDAC Mumbai is organising an Open House on the occasion of National Science Day (28th February, 2014) at its Kharghar, Navi Mumbai campus. The Open House will showcase CDAC products and projects and will also include a quiz show, a programming competition and much more.

Sangrah – Knowledge Repository for FOSS in Education from CDAC, Mumbai

CDAC, Mumbai has announced the beta release of the portal Sangrah – Knowledge Repository for FOSS in Education. The portal contains resources in categories such as Learning Management Systems and Content Management Systems. It also offers user experiences for these categories, comparative analyses of tools within them, specialised search, and a collaboration facility for community-supported content updates.

The portal is maintained with minimal manual intervention, as most tasks, including resource collection, categorization, user experience identification and comparative analysis, are largely automated.

The portal is intended for academic institutions, entrepreneurs and others, to help them adopt Free and Open Source Software (FOSS).

The portal is still evolving; feedback and suggestions for improvement can be submitted through the feedback section on the portal.

Users can visit and register on the portal at – http://nrcfoss.cdacmumbai.in/sangrah

Release of new version of GNU/Linux distribution for Cognitively Challenged by CDAC, Mumbai

Centre for Development of Advanced Computing (CDAC) has released a new version (version 0.1.2) of its GNU/Linux distribution for the Cognitively Challenged. Cognitively challenged people face problems such as memory loss, forgetfulness and attention difficulties. The major objective of this distribution is therefore to provide an accessible desktop environment suited to such users. Its highlights include a simplified and accessible desktop environment, simplified applications, a tagged file system, tag-based searching, a user activity log and a reminder facility, all aimed at reducing distraction and memory load during computer interaction. These features can be of immense help to such users and their caretakers. The distribution is based on Ubuntu 10.04 and offers a number of improvements over the previously released version (version 0.1.1), incorporated on the basis of feedback and suggestions received from various organisations and users.

Major highlights in the current release:

  • Faster tag-based searching
  • Facility to add new user-defined image tags
  • Enhanced tag control center to edit/delete existing tags (both textual and image)
  • Enhanced tag control center to add new file extensions for which the tag setting option should be enabled
  • New educational games included (The Number Race and Tux Type)

GNU/Linux distribution for Cognitively Challenged-0.1.2 can be downloaded from here.

More details about the distribution can be accessed at http://www.cdacmumbai.in/glcc.

Details of various enhancements made in the current version can be found at http://nrcfoss.cdacmumbai.in/access/LinuxForCC-0.1.2-docs/ChangeLog_0.1.2.pdf.

Feedback and suggestions about the distribution can be sent to ossd[at]cdac[dot]in.

ALViC – Accessible Linux for Visually Challenged launched

ALViC – Accessible Linux for Visually Challenged was launched on 11th February, 2013 by Prof R. Chidambaram, Principal Scientific Adviser, Govt. of India, and Shri J. Satyanarayana, Secretary, DeitY, Ministry of Comm. & I.T., Govt. of India, during the CDAC Technology Conclave at Indian Habitat Centre, New Delhi. During the two-day conclave, a number of technologies and products developed by CDAC under various thematic areas were showcased.

ALViC is a complete desktop environment which provides a comprehensive solution for Visually Challenged users. This is a GNU/Linux distribution based on Ubuntu 10.04; and uses Orca 3.2.0 xdesktop screen reader as the main interaction mechanism for visually challenged users. They can use it out of the box because accessibility features suitable for fully blind as well as for partially blind users are enabled by default.

Main Features :

  • Free and open source desktop environment
  • Enhanced Orca with skim read, sentence navigation, list shortcut and structural navigation of text documents
  • PDF documents made accessible in Linux environment
  • Easy navigation and search facility on Desktop icon view
  • Accessible login for visually challenged users
  • Desktop themes suitable for partially blind users
  • Other assistive tools like OCRFeeder, Audio book converter, Emerson DAISY reader, sound converter etc. useful for visually challenged users are also included.

This product has been released under the project ‘Enhancing Accessibility for FOSS Desktops’ at CDAC, Mumbai being carried out under NRCFOSS-Phase II. The research and development activities under this project are aimed at developing software-based assistive technologies/solutions for the differently-abled people.

Download:

ALViC can be downloaded from here.

Launch of ALViC can be watched here.

More details and documentation about ALViC can be accessed here.

Anumaan listed on Softpedia

Anumaan, the open source predictive text entry system from CDAC, Mumbai, has been added to the Softpedia database. Softpedia is a library of over 400,000 free and free-to-try software programs. Anumaan on Softpedia can be accessed from here. Anumaan has also received Softpedia's “100% Free” award, signifying that it is a clean product.

Anumaan home page