Text Analytics – Getting insights from text

The technology revolution is changing every aspect of human life, and media and marketing are no different. Among all the technologies contributing to this advancement, Data Science is at the forefront. As text data is continuously generated and consumed in various formats and sizes from many varied sources, it is becoming an important asset to organizations. But this asset can be leveraged only if it is stored, processed and analysed efficiently with the help of intelligent algorithms. There is growing interest in using such data for the improvement of business, health, education, society, etc. There are many ways to process and analyse such data, covering broad techniques such as text visualisation, classification, named entity recognition and sentiment analysis. Applied effectively, these techniques can give organisations valuable insights leading to competitive advantage, efficient service delivery and, above all, higher customer satisfaction.

With this in view, CDAC Mumbai is conducting a series of short-term courses in Data Science and Machine Learning. This is the second series of such courses, and the latest course in it, “Text Analytics”, will be conducted on May 18-20, 2017. Registrations for the course are open. More details can be accessed at http://www.kbcs.in/datascience.

‘Strong Signal’ Stirs Interest in Hunt for Alien Life

A “strong signal” detected by a radio telescope in Russia that is scanning the heavens for signs of extraterrestrial life has stirred interest among the scientific community. The signal comes from the direction of HD164595, a star about 95 light-years from Earth.

R – your tool for data analysis

R is a language and environment for statistical computing and graphics. It provides a free, open-source, cross-platform, object-oriented environment for data analysis and visualisation tasks. The strength of R lies in its vibrant community, robust package repository and strong graphics capabilities.

R provides the tools required for the various stages of a data analysis project: data acquisition and processing as well as data analysis and visualisation. Its capabilities range from reading data in various formats (CSV, XML…) and manipulating it (tabulation, aggregation…) to rich support for graphics (histograms, box plots, etc.) and statistical models (regression, ANOVA…).

It is not always necessary to rely on built-in functions and existing packages. Depending on the requirements, one can also develop one's own functions, scripts and packages.

CDAC Mumbai recently announced a 3-day course on R entitled “Using R for Data Visualisation and Analytics”. The course aims to cover in detail the features of R related to data analysis and visualisation. More details can be accessed here.

AlphaGo: Possible repercussions and India

ET’s editorial of March 11, 2016 talks about AlphaGo and draws some interesting sketches. At one point it says: “An AI-run factory, goes a joke, employs just a man and a dog. The dog’s job is to keep the man away from the factory. Why have the man at all, in that case? Someone has to feed the dog.”
From the same editorial, a possible scenario for India: “AI will enhance productivity and profits for all companies that can master it and deploy it. Much of India’s advanced IT services industry might get replaced by AI, unless industry itself deploys AI. Indian universities have to teach and advance AI in all its myriad forms. India’s human intelligence potential must be realized, for the Indian economy to benefit from AI rather than be its victim.”
Let’s wait and watch how it unfolds.

Call for Participation – Short-term Courses on Data Science at CDAC, Kharghar, Navi Mumbai

We are living in a Data Age. Data is being continuously generated and consumed in various formats and sizes from many varied sources. This data can be a big asset if it is stored, processed and analysed efficiently in real time with the help of intelligent algorithms. There is growing interest in using such data for the improvement of business, health, education, society, etc. There are many ways to process and analyse such data, spanning techniques like data visualisation, text analysis, predictions and recommendations. Applying these techniques can give companies and organisations valuable insights leading to competitive advantage, efficient service delivery and, above all, customer satisfaction. As a result, the demand for skilled people in these fields is growing day by day.

With this in view, CDAC Mumbai is announcing the following short-term courses in Data Science and Machine Learning.

  1. Using R for data visualization and analytics: This course introduces R – a language and environment for Statistical Computing and Visualisation. In recent years, R has become very popular due to its open-source, cross-platform nature, robust package repository and strong graphics capabilities. During the course, participants will learn not only the basics of R but also techniques for data acquisition and processing. The course will also cover in detail the features of R related to data analysis and visualisation.
  2. Text Analytics: The course aims to give learners an understanding of the methods for text analytics. It will cover major techniques for mining and analyzing text data to discover interesting patterns, extract useful knowledge, and support decision making. The techniques will include Named Entity Recognition, Sentiment Analysis and Text Categorization, among others. Learners will also be introduced to various open source utilities for developing text analytics applications.
  3. Predictive Analytics and Recommender Systems: The course covers various methods of Predictive Analytics and Recommender Systems drawn from Statistics, Data Mining, and Machine Learning. We will discuss popular algorithms in the domain and their use in various applications. The course emphasizes a hands-on approach for better understanding of the techniques used in the domain. Mainly open-source tools will be used for the illustrations and labs.

Target Audience: Individuals, students, and professionals from government, industry, and academia working / interested in Data Science

Courses Schedule:

| Course Name | Course Dates | Final Registration Date |
|---|---|---|
| Using R for data visualization and analytics | May 19 – 21, 2016 | May 04, 2016 |
| Text Analytics | June 16 – 18, 2016 | June 01, 2016 |
| Predictive Analytics and Recommender Systems | July 14 – 16, 2016 | June 30, 2016 |

Registration Process: Registration fee per course for a candidate is Rs. 7500/-. For more details about registration and payment process, please visit http://www.kbcs.in/datascience.

Note: Registration will be on a first-come, first-served basis. Final participation in any of the courses is subject to realization of the applicable registration fee payment.

For more details, please contact:

Centre for Development of Advanced Computing (Formerly NCST)

Near Bharati Vidyapeeth, Raintree Marg, Sector 7, CBD Belapur,

Navi Mumbai – 400614, Maharashtra, INDIA

Telephone: +91-22-27565303/304/305

Fax: +91-22-27565004

email: kbcs@cdac.in

URL: http://www.kbcs.in/datascience

Data Handling with R

Overview

  • Raw Vs Processed data
  • Reading data into R
  • Pre-processing
  • Summary analysis
  • Useful data sources

Raw Vs Processed data

  • Raw data:
    • Original source of data
    • Comes in wide varieties
    • Hard to use directly for data analysis exercise.
    • Needs to be processed.
  • Processed data
    • Ready for data analysis.
    • Processing involves transforming, subsetting, merging etc.
    • Processing should be performed as per set standards.
    • All processing steps should be recorded.
  • Ingredients of data analysis pipeline:
    • Raw data
    • Tidy (processed) data ready for analysis
    • Codebook describing each variable and its values in tidy dataset, other variables not in dataset, summary choices, experimental study design etc.
    • Explicit and exact step-by-step approach for data analysis against said objectives.

Reading data into R

  • Downloading from Internet:
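    • download.file() function: download.file(url="fileurl", destfile="filename", method="method-name")
    • methods: auto, internal, libcurl, wget, curl (availability depends on the platform and R version; lynx was accepted only by older versions)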
    #fileurl<-"https://data.gov.in/resources/weekly-wholesale-price-turarhar-dal-upto-2012/download"
    #download.file(url=fileurl, destfile="tur-dal-price-upto-2012.csv", method="auto")
    #list.files("./")
  • Reading local files: Use read.table(), read.csv() functions.
    dataset<-read.csv("fdata.csv")
    class(dataset)
    ## [1] "data.frame"
    dim(dataset)
    ## [1] 14931   239
    • Important parameters: sep, header, quote, na.strings, nrows, skip.
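
The parameters listed above can be sketched as follows. This is only an illustration: it reads from an inline string through the text argument instead of a file so that it runs on its own, and the column names and values are made up.

```r
# Hypothetical semicolon-separated input: one junk line at the top,
# and "?" used as the marker for missing values.
raw <- "exported by some tool
id;city;price
1;Mumbai;45.5
2;Pune;?
3;Nagpur;39
4;Nashik;41.2"

prices <- read.table(text = raw,
                     sep = ";",          # field separator
                     header = TRUE,      # first line read holds the column names
                     skip = 1,           # ignore the first line of the input
                     na.strings = "?",   # treat "?" as a missing value
                     nrows = 3,          # read only the first 3 data rows
                     stringsAsFactors = FALSE)

dim(prices)           # 3 rows, 3 columns
is.na(prices$price)   # only the second price is missing
```

read.csv() accepts the same arguments; it simply defaults to sep="," and header=TRUE.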

  • There are many more methods in R to read from different data sources such as Excel, XML, JSON, MySQL, PostgreSQL, from web and APIs etc.

Pre-processing

  • Subsetting revisited
    • Using logical ANDs and ORs
    student_id<-c(1,2,3)
    student_names<-c("Ram","Shyam","Laxman")
    position<-c("First","Second","Third")
    data<-data.frame(student_id,student_names,position) #using data.frame() function
    data[data$student_id>=2 & data$position=="Third",]
    ##   student_id student_names position
    ## 3          3        Laxman    Third
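
The example above uses a logical AND. A small sketch of the OR case, and of subset() as an alternative to bracket indexing, might look like this (the data frame is rebuilt so the block runs on its own):

```r
# Same toy data frame as above.
data <- data.frame(student_id = c(1, 2, 3),
                   student_names = c("Ram", "Shyam", "Laxman"),
                   position = c("First", "Second", "Third"))

# Logical OR: keep rows where either condition holds.
either <- data[data$student_id == 1 | data$position == "Third", ]

# subset() evaluates the condition inside the data frame, so the
# data$ prefix is not needed; select= chooses the columns to keep.
later <- subset(data, student_id >= 2, select = c(student_names, position))

either   # rows for Ram and Laxman
later    # rows for Shyam and Laxman, two columns
```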

  • Sorting and ordering: use the sort() and order() functions.
    sort(data$student_names)
    ## [1] Laxman Ram    Shyam 
    ## Levels: Laxman Ram Shyam
    data[order(data$student_names),]
    ##   student_id student_names position
    ## 3          3        Laxman    Third
    ## 1          1           Ram    First
    ## 2          2         Shyam   Second
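
The difference between the two functions is easy to miss: sort() returns the sorted values, while order() returns the positions that would sort them. A minimal sketch on the same toy data:

```r
data <- data.frame(student_id = c(1, 2, 3),
                   student_names = c("Ram", "Shyam", "Laxman"),
                   position = c("First", "Second", "Third"))

# sort() returns the values themselves, here in descending order.
ids_desc <- sort(data$student_id, decreasing = TRUE)

# order() returns row indices, which is why it can reorder
# every column of a data frame together.
idx <- order(data$student_names)

# Whole data frame in descending order of student_id.
by_id_desc <- data[order(data$student_id, decreasing = TRUE), ]

ids_desc   # 3 2 1
idx        # 3 1 2  (Laxman, Ram, Shyam)
```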

  • Handling missing values: NA – missing value, NaN – result of an undefined mathematical operation (e.g. 0/0). Note that is.na() is also TRUE for NaN, while is.nan() is TRUE only for NaN.
x<-c(1,2,NA,20,55,NaN)
#checking for NAs,NaNs
is.na(x)
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
is.nan(x)
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE

#removing NAs
bad<-is.na(x)
x[!bad]
## [1]  1  2 20 55
#taking subset with no missing values
good<-complete.cases(x) #logical vector: TRUE where the value is not missing (neither NA nor NaN)
good
## [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE
x[good]
## [1]  1  2 20 55
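
Two related idioms are worth sketching here: the na.rm argument of summary functions, and complete.cases() applied to a data frame rather than a vector. The small data frame df is made up for illustration.

```r
x <- c(1, 2, NA, 20, 55, NaN)

# Most summary functions return a missing value if any input is missing...
mean(x)
# ...unless na.rm = TRUE is passed, which drops NA and NaN first.
mean(x, na.rm = TRUE)   # (1 + 2 + 20 + 55) / 4 = 19.5

# On a data frame, complete.cases() works row-wise:
# a row is complete only if none of its columns is missing.
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
complete.cases(df)      # TRUE FALSE FALSE
clean <- df[complete.cases(df), ]   # keeps only the first row
```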

  • Reshaping data: data often needs to be converted from one layout to another (e.g. wide to long).
    • Using reshape2 package:
    library(reshape2)
    ## Warning: package 'reshape2' was built under R version 3.0.3
    #melts to long format: one row per unique id-variable combination
    mdata<-melt(data)
    ## Using student_names, position as id variables
    mdata
    ##   student_names position   variable value
    ## 1           Ram    First student_id     1
    ## 2         Shyam   Second student_id     2
    ## 3        Laxman    Third student_id     3

dcast(mdata, student_names~variable) #casts a molten data frame to a data frame or array
##   student_names student_id
## 1        Laxman          3
## 2           Ram          1
## 3         Shyam          2
split(data,data$student_id) #splits data into groups
## $`1`
##   student_id student_names position
## 1          1           Ram    First
## 
## $`2`
##   student_id student_names position
## 2          2         Shyam   Second
## 
## $`3`
##   student_id student_names position
## 3          3        Laxman    Third

#adding a new variable
data$year<-c(2015,2015,2015)
data
##   student_id student_names position year
## 1          1           Ram    First 2015
## 2          2         Shyam   Second 2015
## 3          3        Laxman    Third 2015

  • Another important package for reshaping is plyr (the split-apply-combine paradigm for R).
library(plyr)
## Warning: package 'plyr' was built under R version 3.0.3
#ddply() function - takes a data frame as input, returns a data frame
ddply(data,"student_id",count) #group by the student_id column, named as a string
##   student_id student_names position year freq
## 1          1           Ram    First 2015    1
## 2          2         Shyam   Second 2015    1
## 3          3        Laxman    Third 2015    1
  • Merging – merge(), intersect() etc.
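
A minimal sketch of merging with base R, assuming a second, made-up table of marks keyed by student_id:

```r
students <- data.frame(student_id = c(1, 2, 3),
                       student_names = c("Ram", "Shyam", "Laxman"))
# Hypothetical marks table: student 3 has no entry, student 4 is unknown.
marks <- data.frame(student_id = c(1, 2, 4),
                    marks = c(82, 75, 90))

# Inner join (the default): only ids present in both tables.
inner <- merge(students, marks, by = "student_id")

# Left join: keep every student, filling missing marks with NA.
left <- merge(students, marks, by = "student_id", all.x = TRUE)

# intersect() works on plain vectors and returns the common values.
common_ids <- intersect(students$student_id, marks$student_id)

inner        # students 1 and 2
left         # students 1, 2 and 3; marks is NA for student 3
common_ids   # 1 2
```

Passing all.y=TRUE or all=TRUE instead gives a right or full outer join.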

Summary analysis

  • Datasets are often very large, so it is important to compute summary statistics.
dim(ToothGrowth) #dimensions of dataset
## [1] 60  3
head(ToothGrowth) #shows first part of dataset. try tail().
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

summary(ToothGrowth) #reports summary of dataset
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
str(ToothGrowth) #more information
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

#applies a function (here class()) to each column within subsets of the data
aggregate(ToothGrowth,by=list(ToothGrowth$dose),class)
##   Group.1     len   supp    dose
## 1     0.5 numeric factor numeric
## 2     1.0 numeric factor numeric
## 3     2.0 numeric factor numeric
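
The call above passes class() only to show the mechanics. A sketch of aggregate() computing an actual summary statistic per group, using its formula interface, could look like this:

```r
# Mean tooth length for each dose.
by_dose <- aggregate(len ~ dose, data = ToothGrowth, FUN = mean)

# Two grouping variables: one row per dose/supplement combination.
by_dose_supp <- aggregate(len ~ dose + supp, data = ToothGrowth, FUN = mean)

# tapply() gives the same per-dose means as a named vector.
dose_means <- tapply(ToothGrowth$len, ToothGrowth$dose, mean)

by_dose        # 3 rows, one per dose
by_dose_supp   # 6 rows, one per dose/supp pair
```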

quantile(ToothGrowth$len, na.rm=TRUE) #quantiles
##     0%    25%    50%    75%   100% 
##  4.200 13.075 19.250 25.275 33.900
table(ToothGrowth$dose, ToothGrowth$supp) #tabulate data based on parameters
##      
##       OJ VC
##   0.5 10 10
##   1   10 10
##   2   10 10
object.size(ToothGrowth) #size of dataset
## 2568 bytes

mean(ToothGrowth$len) #mean
## [1] 18.81333
median(ToothGrowth$len) #median
## [1] 19.25
var(ToothGrowth$len) #variance
## [1] 58.51202
sd(ToothGrowth$len) #standard deviation
## [1] 7.649315

range(ToothGrowth$len) #range
## [1]  4.2 33.9
  • Some other important functions to try: xtabs(), ftable(), prop.table(), margin.table() etc.
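
As a starting point for trying these functions, here is a short sketch on the same ToothGrowth dataset:

```r
# xtabs() builds a contingency table from a formula.
tab <- xtabs(~ dose + supp, data = ToothGrowth)

# margin.table() sums over one dimension: 1 = rows (dose), 2 = columns (supp).
dose_totals <- margin.table(tab, 1)

# prop.table() converts counts to proportions;
# margin = 1 makes each row sum to 1.
row_props <- prop.table(tab, margin = 1)

# ftable() prints a multi-way table in a flat, compact layout.
ftable(tab)

dose_totals   # 20 observations per dose
row_props     # 0.5 in every cell, since OJ and VC are balanced
```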

Simulation, sequencing and sampling

  • Simulation: useful for drawing inferences from the results of a data analysis
    • Functions for the normal probability distribution: rnorm(), dnorm(), pnorm(), qnorm()
    • r – random number generation, d – density, p – cumulative distribution, q – quantile
set.seed(3) #sets random no. seed
x<-rnorm(5)
x
## [1] -0.9619334 -0.2925257  0.2587882 -1.1521319  0.1957828
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.1520 -0.9619 -0.2925 -0.3904  0.1958  0.2588
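
The example above covers only the r prefix. A brief sketch of the d, p and q functions, together with seq() for sequencing and sample() for sampling as named in this section's title, might look like this:

```r
# d: density of the standard normal at 0, equal to 1/sqrt(2*pi).
d0 <- dnorm(0)

# p: cumulative probability P(X <= 1.96), close to 0.975.
p <- pnorm(1.96)

# q: the quantile function is the inverse of p.
q <- qnorm(p)   # recovers 1.96 up to floating-point error

# Sequencing: regularly spaced values from 0 to 1.
s <- seq(0, 1, by = 0.25)

# Sampling: draw 3 of those values without replacement.
# set.seed() makes the draw reproducible within the same R version.
set.seed(3)
picked <- sample(s, size = 3)
```

The same four prefixes apply to other distributions, for example runif() and punif() for the uniform or rbinom() and dbinom() for the binomial.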


ICT for School workshop at CDAC, Kharghar, Navi Mumbai on November 20, 2015

A half-day workshop on “ICT for Schools” will be conducted on Friday, 20th November 2015 [2:00 PM-5:30 PM] at the Kharghar campus of C-DAC Mumbai. The workshop will include presentations and demos on:

  1. Online Labs
  2. eBasta – School Books to eBooks and
  3. Assessment and Monitoring Framework.

CBSE and State board teachers of Classes VIII, IX and X who teach Science, Maths or English are invited to participate in this workshop. The workshop will be conducted by the Educational Technology Unit of C-DAC Mumbai.

Please comment or write to etu[AT]cdac[DOT]in, if you are interested, or want more information.