‘Strong Signal’ Stirs Interest in Hunt for Alien Life

A “strong signal” detected by a radio telescope in Russia that is scanning the heavens for signs of extraterrestrial life has stirred interest among the scientific community. The signal is from the direction of a HD164595, a star about 95 light-years from Earth. Read more…

AlphaGo: Possible repercussions and India

ET’s editorial on March 11, 2016 talks about AlphaGo and draws some interesting sketches. At one point, “An AI-run factory, goes a joke, employs just a man and a dog. The dog’s job is to keep the man away from the factory. Why have the man at all, in that case? Someone has to feed the dog.”,
From the same editorial – A possible scenario for India: “AI will enhance productivity and profits for all companies that can master it and deploy it. Much of India’s advanced IT services industry might get replaced by AI, unless industry itself deploys AI. Indian universities have to teach and advance AI in all its myriad forms. India’s human intelligence potential must be realized, for the Indian economy to benefit from AI rather than be its victim.”
Lets wait and watch how it unfolds.

Data Handling with R

Overview

  • Raw Vs Processed data
  • Reading data into R
  • Pre-processing
  • Summary analysis
  • Useful data sources

Raw Vs Processed data

  • Raw data:
    • Original source of data
    • Comes in wide varieties
    • Hard to use directly for data analysis exercise.
    • Needs to be processed.
  • Processed data
    • Ready for data analysis.
    • Processing involves transforming, subsetting, merging etc.
    • Processing should be performed as per set standards.
    • All processing steps should be recorded.
  • Ingredients of data analysis pipeline:
    • Raw data
    • Tidy (processed) data ready for analysis
    • Codebook describing each variable and its values in tidy dataset, other variables not in dataset, summary choices, experimental study design etc.
    • Explicit and exact step-by-step approach for data analysis against said objectives.

Reading data into R

  • Downloading from Internet:
    • download.file() function: download.file(url=”fileurl”, destfile=”filename”,method=”method-name”)
    • methods: curl, wget,lynx,internal
    #fileurl<-"https://data.gov.in/resources/weekly-wholesale-price-turarhar-dal-upto-2012/download"
    #download.file(url=fileurl, destfile="tur-dal-price-upto-2012.csv", method="auto")
    #list.files("./")
  • Reading local files: Use read.table(), read.csv() functions.
    dataset<-read.csv("fdata.csv")
    class(dataset)
    ## [1] "data.frame"
    dim(dataset)
    ## [1] 14931   239
    • Important parameters: sep, header, quote, na.strings, nrows, skip.

Contd…

  • There are many more methods in R to read from different data sources such as Excel, XML, JSON, MySQL, PostgreSQL, from web and APIs etc.

Pre-processing

  • Subsetting revisited
    • Using logical ANDs and ORs
    student_id<-c(1,2,3)
    student_names<-c("Ram","Shyam","Laxman")
    position<-c("First","Second","Third")
    data<-data.frame(student_id,student_names,position) #using data.frame() function
    data[data$student_id>=2 & data$position=="Third",]
    ##   student_id student_names position
    ## 3          3        Laxman    Third

Contd…

  • Sorting and ordering: by using sort() and order() function.
    sort(data$student_names)
    ## [1] Laxman Ram    Shyam 
    ## Levels: Laxman Ram Shyam
    data[order(data$student_names),]
    ##   student_id student_names position
    ## 3          3        Laxman    Third
    ## 1          1           Ram    First
    ## 2          2         Shyam   Second

Contd…

  • Handling with missing values: NA – missing value, NaN – undefined mathematical expressions
x<-c(1,2,NA,20,55,NaN)
#checking for NAs,NaNs
is.na(x)
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
is.nan(x)
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE

Contd…

#removing NAs
bad<-is.na(x)
x[!bad]
## [1]  1  2 20 55
#taking subset with no missing values
good<-complete.cases(x) #returns all complete cases with no NAs.
good
## [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE
x[good]
## [1]  1  2 20 55

Contd…

  • Reshaping data: data needs to be changed from one format to other.
    • Using reshape2 package:
    library(reshape2)
    ## Warning: package 'reshape2' was built under R version 3.0.3
    #converts to flat format, unique id-variable combination
    mdata<-melt(data)
    ## Using student_names, position as id variables
    mdata
    ##   student_names position   variable value
    ## 1           Ram    First student_id     1
    ## 2         Shyam   Second student_id     2
    ## 3        Laxman    Third student_id     3

Contd…

dcast(mdata, student_names~variable) #casts a molten data frame to a data frame or array
##   student_names student_id
## 1        Laxman          3
## 2           Ram          1
## 3         Shyam          2
split(data,data$student_id) #splits data into groups
## $`1`
##   student_id student_names position
## 1          1           Ram    First
## 
## $`2`
##   student_id student_names position
## 2          2         Shyam   Second
## 
## $`3`
##   student_id student_names position
## 3          3        Laxman    Third

Contd…

#adding a new variable
data$year<-c(2015,2015,2015)
data
##   student_id student_names position year
## 1          1           Ram    First 2015
## 2          2         Shyam   Second 2015
## 3          3        Laxman    Third 2015

Contd…

  • Another important package for reshaping is plyr (split-apply-combine paradign for R).
library(plyr)
## Warning: package 'plyr' was built under R version 3.0.3
#ddply() function - takes data frame is input, returns a data frame
ddply(data,c(student_id),count)
##   student_id student_names position year freq
## 1          1           Ram    First 2015    1
## 2          2         Shyam   Second 2015    1
## 3          3        Laxman    Third 2015    1
  • Merging – merge(), intersect() etc.

Summary analysis

  • Datasets often very large. Its important to collect summary statistics
dim(ToothGrowth) #dimensions of dataset
## [1] 60  3
head(ToothGrowth) #shows first part of dataset. try tail().
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Contd…

summary(ToothGrowth) #reports summary of dataset
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
str(ToothGrowth) #more information
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

Contd…

#computes summary statistics on subsets of data
aggregate(ToothGrowth,by=list(ToothGrowth$dose),class)
##   Group.1     len   supp    dose
## 1     0.5 numeric factor numeric
## 2     1.0 numeric factor numeric
## 3     2.0 numeric factor numeric

Contd…

quantile(ToothGrowth$len, na.rm=TRUE) #quantiles
##     0%    25%    50%    75%   100% 
##  4.200 13.075 19.250 25.275 33.900
table(ToothGrowth$dose, ToothGrowth$supp) #tabulate data based on parameters
##      
##       OJ VC
##   0.5 10 10
##   1   10 10
##   2   10 10
object.size(ToothGrowth) #size of dataset
## 2568 bytes

Contd…

mean(ToothGrowth$len) #mean
## [1] 18.81333
median(ToothGrowth$len) #median
## [1] 19.25
var(ToothGrowth$len) #variance
## [1] 58.51202
sd(ToothGrowth$len) #standard deviation
## [1] 7.649315

Contd…

range(ToothGrowth$len) #range
## [1]  4.2 33.9
  • Some other important functions to try: xtabs(), ftable(), prop.table(), margin.table() etc.

Simulation, sequencing and sampling

  • Simulation: Useful for inferencing results from data analysis
    • Functions for probability (normal) distribution: rnorm(), dnorm(), pnorm(), qnorm()
    • r – randon no. generation, d – density, p – cummulative distribution, q – quantile
set.seed(3) #sets random no. seed
x<-rnorm(5)
x
## [1] -0.9619334 -0.2925257  0.2587882 -1.1521319  0.1957828
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.1520 -0.9619 -0.2925 -0.3904  0.1958  0.2588

Contd…

 

References