DATA606 - Statistics & Probability for Data Analytics

February 2, 2017

Agenda

Introduction
- Syllabus
- Assignments
  - Homework
  - Labs
  - Data Project
  - Final exam
  - Meetup Presentation
- The DATA606 R Package
- Using R Markdown
Intro to Data (Chapter 1)

Introduction

A little about me:

Currently Executive Director at Excelsior College
- Principal Investigator for a Department of Education Grant (part of their FIPSE First in the World program) to develop a Diagnostic Assessment and Achievement of College Skills (www.DAACS.net)
Authored over a dozen R packages including:
- likert
- sqlutils
- timeline
Specialize in propensity score methods. Three new methods/R packages developed include:

Also a Father…

And photographer.

Syllabus

Syllabus and course materials are here: http://data606.net

We will use Blackboard to submit assignments.

I would like to use GitHub's issue tracker for course discussions (this is my first semester trying this, so tell me how it goes).

Please submit PDF files and, if you used R Markdown, the Rmd file too.

Course Calendar

See http://data606.net/schedule/ for up-to-date calendar.

Start	Due Date	Chapter	Topic
Jan-30	Feb-5	1	Intro to Data
Feb-6	Feb-12	2	Probability
Feb-13	Feb-26	3	Distributions
Feb-27	Mar-12	4	Foundation for Inference
Mar-13	Apr-2	5	Inference for Numerical Data
Mar-13	Apr-2	6	Inference for Categorical Data
Apr-3	Apr-23	7	Linear Regression
Apr-24	May-7	8	Multiple & Logistic Regression
May-8	May-18	Navarro	Introduction to Bayesian Analysis
May-19	May-25		Final Exam

Assignments

Getting Acquainted (1%)
Homework (16%)
- Approximately six problems per chapter.
- Answers can be handwritten or typed (I suggest using R Markdown)
- Submit a PDF on Blackboard.
Labs (40%)
- Labs are designed to introduce you to doing statistics with R.
- Answer the questions in the main text as well as the "On Your Own" section.
- Submit both the R Markdown file and PDF of the output on Blackboard.
Data Project (20%)
- This allows you to analyze a dataset of your choosing. Projects will be shared with the class. This provides an opportunity for everyone to see different approaches to analyzing different datasets.
- Proposal is due March 7th (5%); Final project is due May 16th (15%).
Final exam (18%)
Meetup Presentation (5%)
- Present one practice problem during our weekly meetups. Sign up using the Google Spreadsheet.

The `DATA606` R Package

The package can be installed from GitHub using the devtools package.

devtools::install_github('jbryer/DATA606')

Important Functions

library('DATA606') - Load the package
vignette(package='DATA606') - Lists vignettes in the DATA606 package
vignette('os3') - Loads a PDF of the OpenIntro Statistics book
data(package='DATA606') - Lists data available in the package
getLabs() - Returns a list of the available labs
viewLab('Lab0') - Opens Lab0 in the default web browser
startLab('Lab0') - Starts Lab0 (copies to getwd()), opens the Rmd file
shiny_demo() - Lists available Shiny apps

Using R Markdown

R Markdown files are provided for all the labs. You can start a lab using the DATA606::startLab function.

You can also create new R Markdown files in RStudio by clicking File > New File > R Markdown.

Working Directories

When working with files in R, there are two ways to specify paths: absolute paths (i.e. starting with C:/ on Windows or / on Mac/Linux) and relative paths (possibly without any directory information). When working with the latter, R resolves the path against the working directory. You can get the working directory with getwd() and set the working directory with setwd(). In RStudio, you can also set the working directory on the Files tab by clicking More, then Set as Working Directory.
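
As a minimal sketch of the difference, the following writes a file using a relative path and then reaches the same file through an absolute path. The temporary directory and the file name example.csv are assumptions made for illustration; any folder you can write to would behave the same way.

```r
# Illustration only: tempdir() and 'example.csv' are hypothetical stand-ins.
old_wd   <- getwd()                      # remember where we started
work_dir <- normalizePath(tempdir())     # an absolute path to a scratch folder
setwd(work_dir)                          # change the working directory

# Relative path: the file lands in the current working directory.
write.csv(data.frame(x = 1:3), "example.csv", row.names = FALSE)

# Absolute path to the same file, built from the directory name.
abs_path  <- file.path(work_dir, "example.csv")
same_file <- normalizePath("example.csv") == normalizePath(abs_path)

dat <- read.csv("example.csv")           # found because of the working directory
setwd(old_wd)                            # restore the original working directory
```

Restoring the original working directory at the end keeps the rest of a script or R Markdown document from silently reading and writing in the wrong place.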

Intro to Data

We will use the lego R package in this class, which contains information about every Lego set manufactured from 1970 to 2014, a total of 5710 sets.

devtools::install_github("seankross/lego")

library(lego)
data(legosets)

Types of Variables

Numerical (quantitative)
- Continuous
- Discrete
Categorical (qualitative)
- Regular categorical
- Ordinal
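
One way these four types might be represented in R is sketched below. The data frame and its values are made up for illustration; the point is only which R class typically goes with which type of variable.

```r
# Hypothetical toy data, made up for illustration.
sets <- data.frame(
  price  = c(9.99, 59.99, 159.99),                  # numerical, continuous
  pieces = c(13L, 898L, 2262L),                     # numerical, discrete (counts)
  theme  = factor(c("City", "Star Wars", "City")),  # categorical, no natural order
  size   = factor(c("small", "medium", "large"),
                  levels = c("small", "medium", "large"),
                  ordered = TRUE)                   # ordinal
)

str(sets)
is.ordered(sets$size)     # TRUE
is.ordered(sets$theme)    # FALSE
sets$size < "large"       # comparisons are meaningful only for ordered factors
```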

Types of Variables

str(legosets)

## Classes 'tbl_df', 'tbl' and 'data.frame':    6172 obs. of  14 variables:
##  $ Item_Number : chr  "10246" "10247" "10248" "10249" ...
##  $ Name        : chr  "Detective's Office" "Ferris Wheel" "Ferrari F40" "Toy Shop" ...
##  $ Year        : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ Theme       : chr  "Advanced Models" "Advanced Models" "Advanced Models" "Advanced Models" ...
##  $ Subtheme    : chr  "Modular Buildings" "Fairground" "Vehicles" "Winter Village" ...
##  $ Pieces      : int  2262 2464 1158 898 13 39 32 105 13 11 ...
##  $ Minifigures : int  6 10 NA NA 1 2 2 3 2 2 ...
##  $ Image_URL   : chr  "http://images.brickset.com/sets/images/10246-1.jpg" "http://images.brickset.com/sets/images/10247-1.jpg" "http://images.brickset.com/sets/images/10248-1.jpg" "http://images.brickset.com/sets/images/10249-1.jpg" ...
##  $ GBP_MSRP    : num  132.99 149.99 69.99 59.99 9.99 ...
##  $ USD_MSRP    : num  159.99 199.99 99.99 79.99 9.99 ...
##  $ CAD_MSRP    : num  200 230 120 NA 13 ...
##  $ EUR_MSRP    : num  149.99 179.99 89.99 69.99 9.99 ...
##  $ Packaging   : chr  "Box" "Box" "Box" "Box" ...
##  $ Availability: chr  "Retail - limited" "Retail - limited" "LEGO exclusive" "LEGO exclusive" ...

Qualitative Variables

Descriptive statistics:

Contingency Tables
Proportional Tables

Plot types:

Bar plot
Mosaic plot

Contingency Tables

table(legosets$Availability, useNA='ifany')

## 
##        LEGO exclusive    LEGOLAND exclusive         Not specified 
##                   695                     2                  1795 
##           Promotional Promotional (Airline)                Retail 
##                   141                    12                  3120 
##      Retail - limited               Unknown 
##                   403                     4

table(legosets$Availability, legosets$Packaging, useNA='ifany')

##                        
##                         Blister pack  Box Box with backing card Bucket
##   LEGO exclusive                  45  147                     0      1
##   LEGOLAND exclusive               0    2                     0      0
##   Not specified                    0   20                     0      0
##   Promotional                      0   44                     0      0
##   Promotional (Airline)            0   11                     0      0
##   Retail                          53 2575                    16     30
##   Retail - limited                 2  302                     1      5
##   Unknown                          0    1                     0      0
##                        
##                         Canister Foil pack Loose Parts Not specified Other
##   LEGO exclusive               0         0          71             7     5
##   LEGOLAND exclusive           0         0           0             0     0
##   Not specified                0         5           0          1739     0
##   Promotional                  0         0           1             0     3
##   Promotional (Airline)        0         0           0             1     0
##   Retail                      78       285           0             0    28
##   Retail - limited             0         1           0             0     0
##   Unknown                      0         0           0             0     0
##                        
##                         Plastic box Polybag Shrink-wrapped  Tag  Tub
##   LEGO exclusive                  1     412              0    6    0
##   LEGOLAND exclusive              0       0              0    0    0
##   Not specified                   6      24              0    0    1
##   Promotional                     2      90              0    0    1
##   Promotional (Airline)           0       0              0    0    0
##   Retail                          0       4             18    0   33
##   Retail - limited                1      86              0    0    5
##   Unknown                         0       3              0    0    0

Proportional Tables

prop.table(table(legosets$Availability))

## 
##        LEGO exclusive    LEGOLAND exclusive         Not specified 
##          0.1126053143          0.0003240441          0.2908295528 
##           Promotional Promotional (Airline)                Retail 
##          0.0228451069          0.0019442644          0.5055087492 
##      Retail - limited               Unknown 
##          0.0652948801          0.0006480881
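
For a two-way table, prop.table() also accepts a margin argument, which controls whether proportions are taken over the whole table, within rows, or within columns. The sketch below uses a few made-up values rather than the legosets data so that the arithmetic is easy to follow.

```r
# Hypothetical toy data, made up for illustration.
availability <- c("Retail", "Retail", "Retail", "Promotional", "Promotional", NA)
packaging    <- c("Box", "Box", "Polybag", "Polybag", "Polybag", "Box")

counts  <- table(availability, packaging)    # rows with NA are dropped by default
overall <- prop.table(counts)                # each cell divided by the grand total
by_row  <- prop.table(counts, margin = 1)    # each row sums to 1
by_col  <- prop.table(counts, margin = 2)    # each column sums to 1

round(by_row, 2)
```

Row proportions answer "of the retail sets, what share came in a box?", while column proportions answer "of the boxed sets, what share were retail?".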

Bar Plots

barplot(table(legosets$Availability), las=3)

Bar Plots

barplot(prop.table(table(legosets$Availability)), las=3)

Mosaic Plot

library(vcd)
mosaic(HairEyeColor, shade=TRUE, legend=TRUE)

Quantitative Variables

Descriptive statistics:

Mean
Median
Quartiles
Variance: \(s^2 = \frac{\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2}{n-1}\)
Standard deviation: \(s=\sqrt{s^2}\)
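
To connect the variance and standard deviation formulas to R, the sketch below computes both by hand on a small made-up vector and compares them with the built-in var() and sd() functions.

```r
x <- c(2, 4, 4, 5, 7, 8)             # hypothetical values, made up for illustration
n <- length(x)
x_bar <- mean(x)                     # sample mean

s2 <- sum((x - x_bar)^2) / (n - 1)   # sample variance, dividing by n - 1
s  <- sqrt(s2)                       # sample standard deviation

c(by_hand = s2, built_in = var(x))
c(by_hand = s,  built_in = sd(x))
```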

Plot types:

Dot plots
Histograms
Density plots
Box plots
Scatterplots

Measures of Center

mean(legosets$Pieces, na.rm=TRUE)

## [1] 215.1686

median(legosets$Pieces, na.rm=TRUE)

## [1] 82

Measures of Spread

var(legosets$Pieces, na.rm=TRUE)

## [1] 126876.8

sqrt(var(legosets$Pieces, na.rm=TRUE))

## [1] 356.1976

sd(legosets$Pieces, na.rm=TRUE)

## [1] 356.1976

fivenum(legosets$Pieces, na.rm=TRUE)

## [1]    0.0   30.0   82.0  256.5 5922.0

IQR(legosets$Pieces, na.rm=TRUE)

## [1] 226.25

The `summary` Function

summary(legosets$Pieces)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    30.0    82.0   215.2   256.2  5922.0     112
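
Note that fivenum() reports a third quartile of 256.5 while summary() shows 256.2 (and IQR() returns 226.25). A likely explanation is that fivenum() returns Tukey's hinges, whereas summary() and IQR() rely on quantile() with its default interpolation method. The small made-up vector below shows the two definitions disagreeing.

```r
x <- 1:10                       # hypothetical values, made up for illustration

hinges    <- fivenum(x)                    # min, lower hinge, median, upper hinge, max
quartiles <- quantile(x, c(0.25, 0.75))    # default method (type = 7)

hinges       # 1.0 3.0 5.5 8.0 10.0
quartiles    # 3.25 and 7.75
IQR(x)       # 4.5, based on quantile() rather than the hinges
```

Either definition is acceptable for describing a distribution; just be aware of which one a function uses before comparing numbers across outputs.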

The `psych` Package

library(psych)
describe(legosets$Pieces, skew=FALSE)

##    vars    n   mean    sd min  max range   se
## X1    1 6060 215.17 356.2   0 5922  5922 4.58

describeBy(legosets$Pieces, group = legosets$Availability, skew=FALSE, mat=TRUE)

##     item                group1 vars    n      mean        sd min  max
## X11    1        LEGO exclusive    1  659 172.74203 442.96954   1 3428
## X12    2    LEGOLAND exclusive    1    2 211.00000 154.14928 102  320
## X13    3         Not specified    1 1747 145.87178 309.19929   1 5195
## X14    4           Promotional    1  140  53.97143 108.42721   1 1000
## X15    5 Promotional (Airline)    1   12 126.16667  47.01612  10  203
## X16    6                Retail    1 3094 245.78119 294.78052   0 3803
## X17    7      Retail - limited    1  402 410.94030 652.06435   1 5922
## X18    8               Unknown    1    4  27.50000  15.96872   6   44
##     range         se
## X11  3427  17.255643
## X12   218 109.000000
## X13  5194   7.397620
## X14   999   9.163772
## X15   193  13.572384
## X16  3803   5.299546
## X17  5921  32.522014
## X18    38   7.984360
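
The se column above is consistent with the standard error of the mean, sd / sqrt(n). As a rough base R sketch of the same kind of grouped summary, the helper below (group_stats is a hypothetical name, and this is not how psych is implemented) computes n, mean, SD, and SE per group, using the built-in iris data as a stand-in.

```r
# Hypothetical helper for illustration; not part of the psych package.
group_stats <- function(x, g) {
  do.call(rbind, lapply(split(x, g), function(v) {
    v <- v[!is.na(v)]                          # drop missing values first
    data.frame(n    = length(v),
               mean = mean(v),
               sd   = sd(v),
               se   = sd(v) / sqrt(length(v))) # standard error of the mean
  }))
}

stats <- group_stats(iris$Sepal.Length, iris$Species)
stats
```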

Robust Statistics

Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,

for skewed distributions it is often more helpful to use median and IQR to describe the center and spread
for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread
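
A quick way to see this robustness is to add a single extreme value to a small sample and compare the statistics before and after. The numbers below are made up for illustration.

```r
x     <- c(10, 12, 13, 15, 16, 18, 20)   # hypothetical values, made up for illustration
x_out <- c(x, 500)                       # same data plus one extreme value

describe_it <- function(v) {
  c(mean = mean(v), sd = sd(v), median = median(v), IQR = IQR(v))
}

comparison <- rbind(original = describe_it(x), with_outlier = describe_it(x_out))
round(comparison, 2)
```

The mean and SD change dramatically with one added observation, while the median and IQR barely move.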

Dot Plot

stripchart(legosets$Pieces)

Dot Plot

par.orig <- par(mar=c(1,10,1,1))
stripchart(legosets$Pieces ~ legosets$Availability, las=1)

par(par.orig)

Histograms

hist(legosets$Pieces)

Transformations

With highly skewed distributions, it is often helpful to transform the data. The log transformation is a common approach, especially when dealing with salary or similar data.

hist(log(legosets$Pieces))

Density Plots

plot(density(legosets$Pieces, na.rm=TRUE), main='Lego Pieces per Set')

Density Plot (log transformed)

plot(density(log(legosets$Pieces), na.rm=TRUE), main='Lego Pieces per Set (log transformed)')

Box Plots

boxplot(legosets$Pieces)

boxplot(log(legosets$Pieces))

## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z
## $group == : Outlier (-Inf) in boxplot 1 is not drawn
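
The warning arises because log(0) is -Inf, and the minimum number of pieces in the data is 0. Two common ways to handle this are sketched below on a made-up vector: drop the non-positive values before transforming, or use log1p(), which computes log(1 + x). Which is appropriate depends on whether zeros are meaningful in your data.

```r
pieces <- c(0, 1, 30, 82, 256, 5922, NA)   # hypothetical values, made up for illustration

raw_log <- log(pieces)                     # log(0) is -Inf; NA stays NA

# Option 1: keep only positive, non-missing values before taking logs.
positive     <- pieces[!is.na(pieces) & pieces > 0]
log_positive <- log(positive)

# Option 2: log(1 + x) keeps the zeros and maps 0 to 0.
log_shifted <- log1p(pieces)
```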

Scatter Plots

plot(legosets$Pieces, legosets$USD_MSRP)

Examining Possible Outliers (expensive sets)

legosets[which(legosets$USD_MSRP >= 400),]

## # A tibble: 4 × 14
##   Item_Number                                   Name  Year        Theme
##         <chr>                                  <chr> <int>        <chr>
## 1     2000430             Identity and Landscape Kit  2013 Serious Play
## 2     2000431                        Connections Kit  2013 Serious Play
## 3     2000409                 Window Exploration Bag  2010 Serious Play
## 4       10179 Ultimate Collector's Millennium Falcon  2007    Star Wars
## # ... with 10 more variables: Subtheme <chr>, Pieces <int>,
## #   Minifigures <int>, Image_URL <chr>, GBP_MSRP <dbl>, USD_MSRP <dbl>,
## #   CAD_MSRP <dbl>, EUR_MSRP <dbl>, Packaging <chr>, Availability <chr>

Examining Possible Outliers (big sets)

legosets[which(legosets$Pieces >= 4000),]

## # A tibble: 4 × 14
##   Item_Number                                   Name  Year           Theme
##         <chr>                                  <chr> <int>           <chr>
## 1       10214                           Tower Bridge  2010 Advanced Models
## 2     2000409                 Window Exploration Bag  2010    Serious Play
## 3       10189                              Taj Mahal  2008 Advanced Models
## 4       10179 Ultimate Collector's Millennium Falcon  2007       Star Wars
## # ... with 10 more variables: Subtheme <chr>, Pieces <int>,
## #   Minifigures <int>, Image_URL <chr>, GBP_MSRP <dbl>, USD_MSRP <dbl>,
## #   CAD_MSRP <dbl>, EUR_MSRP <dbl>, Packaging <chr>, Availability <chr>

plot(legosets$Pieces, legosets$USD_MSRP)
bigAndExpensive <- legosets[which(legosets$Pieces >= 4000 | legosets$USD_MSRP >= 400),]
text(bigAndExpensive$Pieces, bigAndExpensive$USD_MSRP, labels=bigAndExpensive$Name)

Likert Scales

Likert scales are a type of questionnaire where respondents are asked to rate items on scales usually ranging from four to seven levels (e.g. strongly disagree to strongly agree).

library(likert)
library(reshape)
data(pisaitems)
items24 <- pisaitems[,substr(names(pisaitems), 1,5) == 'ST24Q']
items24 <- rename(items24, c(
            ST24Q01="I read only if I have to.",
            ST24Q02="Reading is one of my favorite hobbies.",
            ST24Q03="I like talking about books with other people.",
            ST24Q04="I find it hard to finish books.",
            ST24Q05="I feel happy if I receive a book as a present.",
            ST24Q06="For me, reading is a waste of time.",
            ST24Q07="I enjoy going to a bookstore or a library.",
            ST24Q08="I read only to get information that I need.",
            ST24Q09="I cannot sit still and read for more than a few minutes.",
            ST24Q10="I like to express my opinions about books I have read.",
            ST24Q11="I like to exchange books with my friends."))

`likert` R Package

l24 <- likert(items24)
summary(l24)

##                                                        Item      low
## 10   I like to express my opinions about books I have read. 41.07516
## 5            I feel happy if I receive a book as a present. 46.93475
## 8               I read only to get information that I need. 50.39874
## 7                I enjoy going to a bookstore or a library. 51.21231
## 3             I like talking about books with other people. 54.99129
## 11                I like to exchange books with my friends. 55.54115
## 2                    Reading is one of my favorite hobbies. 56.64470
## 1                                 I read only if I have to. 58.72868
## 4                           I find it hard to finish books. 65.35125
## 9  I cannot sit still and read for more than a few minutes. 76.24524
## 6                       For me, reading is a waste of time. 82.88729
##    neutral     high     mean        sd
## 10       0 58.92484 2.604913 0.9009968
## 5        0 53.06525 2.466751 0.9446590
## 8        0 49.60126 2.484616 0.9089688
## 7        0 48.78769 2.428508 0.9164136
## 3        0 45.00871 2.328049 0.9090326
## 11       0 44.45885 2.343193 0.9609234
## 2        0 43.35530 2.344530 0.9277495
## 1        0 41.27132 2.291811 0.9369023
## 4        0 34.64875 2.178299 0.8991628
## 9        0 23.75476 1.974736 0.8793028
## 6        0 17.11271 1.810093 0.8611554
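
The low and high columns appear to be the percentage of responses below and above the midpoint of the scale; with four levels there is no middle category, which would explain why neutral is 0 for every item. Under that assumption, the same quantities can be sketched by hand for a single item. The responses below are made up and are not the pisaitems data.

```r
# Hypothetical responses, made up for illustration; not the pisaitems data.
levels4 <- c("Strongly disagree", "Disagree", "Agree", "Strongly agree")
responses <- factor(c("Disagree", "Agree", "Agree", "Strongly agree",
                      "Strongly disagree", "Agree", NA, "Disagree"),
                    levels = levels4, ordered = TRUE)

codes    <- as.integer(responses)     # 1 to 4, NA for missing responses
answered <- codes[!is.na(codes)]

low  <- 100 * mean(answered <= 2)     # percent in the two lowest levels
high <- 100 * mean(answered >= 3)     # percent in the two highest levels

c(low = low, high = high, mean = mean(answered), sd = sd(answered))
```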

`likert` Plots

plot(l24)

`likert` Plots

plot(l24, type='heat')

`likert` Plots

plot(l24, type='density')

Pie Charts

There is only one pie chart in OpenIntro Statistics (Diez, Barr, & Çetinkaya-Rundel, 2015, p. 48). Consider the following three pie charts that represent the preference of five different colors. Is there a difference between the three pie charts? This is probably difficult to answer.

Pie Charts

Source: https://en.wikipedia.org/wiki/Pie_chart.

Just say NO to pie charts!

"There is no data that can be displayed in a pie chart that cannot better be displayed in some other type of chart"
John Tukey

Sampling vs. Census

A census involves collecting data for the entire population of interest. This is problematic for several reasons, including:

It can be difficult to complete a census: there always seem to be some individuals who are hard to locate or hard to measure. And these difficult-to-find people may have certain characteristics that distinguish them from the rest of the population.
Populations rarely stand still. Even if you could take a census, the population changes constantly, so it’s never possible to get a perfect measure.
Taking a census may be more complex than sampling.

Sampling involves measuring a subset of the population of interest, usually randomly.

Sampling Bias

Non-response: If only a small fraction of the randomly sampled people choose to respond to a survey, the sample may no longer be representative of the population.
Voluntary response: Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue. Such a sample will also not be representative of the population.
Convenience sample: Individuals who are easily accessible are more likely to be included in the sample.

Observational Studies vs. Experiments

Observational study: Researchers collect data in a way that does not directly interfere with how the data arise, i.e. they merely “observe”, and can only establish an association between the explanatory and response variables.
Experiment: Researchers randomly assign subjects to various treatments in order to establish causal connections between the explanatory and response variables.

Source: XKCD 552 http://xkcd.com/552/

Correlation does not imply causation!

Simple Random Sampling

Randomly select cases from the population, where there is no implied connection between the points that are selected.

Simple Random Sample

Stratified Sampling

Strata are made up of similar observations. We take a simple random sample from each stratum.

Cluster Sampling

Clusters are usually not made up of homogeneous observations, so we take a simple random sample of clusters and then randomly sample observations within each selected cluster.
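
The three sampling schemes can be sketched with base R's sample() function. The population below (600 students in 12 schools across 3 regions) is entirely made up, and the sample sizes and seed are arbitrary choices for illustration.

```r
set.seed(606)  # arbitrary seed so the draws are reproducible

# Hypothetical population: 12 schools (clusters) of 50 students, in 3 regions (strata).
pop <- data.frame(id     = 1:600,
                  school = rep(1:12, each = 50),
                  region = rep(c("North", "Central", "South"), each = 200))

# Simple random sample: every student has the same chance of selection.
srs <- pop[sample(nrow(pop), 60), ]

# Stratified sample: a simple random sample of 20 students within each region.
strat <- do.call(rbind, lapply(split(pop, pop$region),
                               function(d) d[sample(nrow(d), 20), ]))

# Cluster (multistage) sample: pick 3 schools, then 20 students within each.
schools <- sample(unique(pop$school), 3)
clust <- do.call(rbind, lapply(schools, function(s) {
  d <- pop[pop$school == s, ]
  d[sample(nrow(d), 20), ]
}))
```

All three samples contain 60 students, but the stratified sample guarantees equal representation of each region, and the cluster sample only requires visiting three schools.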

Principles of experimental design

Control: Compare treatment of interest to a control group.
Randomize: Randomly assign subjects to treatments, and randomly sample from the population whenever possible.
Replicate: Within a study, replicate by collecting a sufficiently large sample. Or replicate the entire study.
Block: If there are variables that are known or suspected to affect the response variable, first group subjects into blocks based on these variables, and then randomize cases within each block to treatment groups.

Difference between blocking and explanatory variables

Factors are conditions we can impose on the experimental units.
Blocking variables are characteristics that the experimental units come with, that we would like to control for.
Blocking is like stratifying, except used in experimental settings when randomly assigning, as opposed to when sampling.
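
As a minimal sketch of blocked random assignment, the code below randomizes subjects to treatment and control separately within each block. The subjects and the blocking variable risk are hypothetical, as is the seed.

```r
set.seed(606)  # arbitrary seed so the assignment is reproducible

# Hypothetical subjects: 'risk' is the blocking variable, made up for illustration.
subjects <- data.frame(id   = 1:20,
                       risk = rep(c("low", "high"), each = 10))

# Within each block, shuffle an equal number of treatment and control labels.
subjects$group <- NA_character_
for (b in unique(subjects$risk)) {
  rows   <- which(subjects$risk == b)
  labels <- rep(c("treatment", "control"), length.out = length(rows))
  subjects$group[rows] <- sample(labels)   # random order within the block
}

assignment <- table(subjects$risk, subjects$group)
assignment   # 5 treatment and 5 control in every block
```

Because the randomization happens inside each block, the treatment and control groups are guaranteed to be balanced on the blocking variable.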

More experimental design terminology…

Placebo: fake treatment, often used as the control group for medical studies
Placebo effect: experimental units showing improvement simply because they believe they are receiving a special treatment
Blinding: when experimental units do not know whether they are in the control or treatment group
Double-blind: when both the experimental units and the researchers who interact with the patients do not know who is in the control and who is in the treatment group

Random assignment vs. random sampling

`ggplot2`

ggplot2 is an R package that provides an alternative framework based upon Wilkinson’s (2005) Grammar of Graphics.
ggplot2 is, in general, more flexible for creating "prettier" and complex plots.
Works by creating layers of different types of objects/geometries (i.e. bars, points, lines, polygons, etc.).
ggplot2 has at least three ways of creating plots:
1. qplot
2. ggplot(...) + geom_XXX(...) + ...
3. ggplot(...) + layer(...)
We will focus only on the second.

First Example

data(diamonds)
ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point()

Parts of a `ggplot2` Statement

Data
ggplot(myDataFrame, aes(x=x, y=y))
Layers
geom_point(), geom_histogram()
Facets
facet_wrap(~ cut), facet_grid(~ cut)
Scales
scale_y_log10()
Other options
ggtitle('my title'), ylim(c(0, 10000)), xlab('x-axis label')

Lots of geoms

ls('package:ggplot2')[grep('geom_', ls('package:ggplot2'))]

##  [1] "geom_abline"          "geom_area"            "geom_bar"            
##  [4] "geom_bin2d"           "geom_blank"           "geom_boxplot"        
##  [7] "geom_col"             "geom_contour"         "geom_count"          
## [10] "geom_crossbar"        "geom_curve"           "geom_density"        
## [13] "geom_density_2d"      "geom_density2d"       "geom_dotplot"        
## [16] "geom_errorbar"        "geom_errorbarh"       "geom_freqpoly"       
## [19] "geom_hex"             "geom_histogram"       "geom_hline"          
## [22] "geom_jitter"          "geom_label"           "geom_line"           
## [25] "geom_linerange"       "geom_map"             "geom_path"           
## [28] "geom_point"           "geom_pointrange"      "geom_polygon"        
## [31] "geom_qq"              "geom_quantile"        "geom_raster"         
## [34] "geom_rect"            "geom_ribbon"          "geom_rug"            
## [37] "geom_segment"         "geom_smooth"          "geom_spoke"          
## [40] "geom_step"            "geom_text"            "geom_tile"           
## [43] "geom_violin"          "geom_vline"           "update_geom_defaults"

Scatterplot Revisited

ggplot(legosets, aes(x=Pieces, y=USD_MSRP)) + geom_point()

Scatterplot Revisited (cont.)

ggplot(legosets, aes(x=Pieces, y=USD_MSRP, color=Availability)) + geom_point()

Scatterplot Revisited (cont.)

ggplot(legosets, aes(x=Pieces, y=USD_MSRP, size=Minifigures, color=Availability)) + geom_point()

Scatterplot Revisited (cont.)

ggplot(legosets, aes(x=Pieces, y=USD_MSRP, size=Minifigures)) + geom_point() + facet_wrap(~ Availability)

Boxplots Revisited

ggplot(legosets, aes(x='Lego', y=USD_MSRP)) + geom_boxplot()

Boxplots Revisited (cont.)

ggplot(legosets, aes(x=Availability, y=USD_MSRP)) + geom_boxplot()

Boxplots Revisited (cont.)

ggplot(legosets, aes(x=Availability, y=USD_MSRP)) + geom_boxplot() + coord_flip()

Dual Scales

Some problems¹:

The designer has to make choices about scales and this can have a big impact on the viewer
"Cross-over points", where one series crosses another, are results of the design choices, not intrinsic to the data, and viewers (particularly unsophisticated viewers) may not appreciate this
They make it easier to lazily associate correlation with causation, not taking into account autocorrelation and other time-series issues
Because of the issues above, in malicious hands they make it possible to deliberately mislead

library(DATA606)
shiny_demo('DualScales', package='DATA606')

My advice:

Avoid using them. You can usually do better with other plot types.
When necessary (or compelled) to use them, rescale (using z-scores)
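
Rescaling with z-scores might look like the sketch below. The two series and their units are made up for illustration; once each is converted to have mean 0 and standard deviation 1, both can be drawn against a single axis.

```r
# Two hypothetical series on very different scales, made up for illustration.
sales  <- c(120, 135, 150, 160, 180, 210)    # e.g. thousands of dollars
visits <- c(2.1, 2.4, 2.2, 2.9, 3.3, 3.8)    # e.g. millions of page views

z <- function(x) (x - mean(x)) / sd(x)       # z-score: mean 0, SD 1
z_sales  <- z(sales)
z_visits <- z(visits)

# Both series now share one axis, so no second scale is needed.
matplot(cbind(z_sales, z_visits), type = "l", lty = 1,
        xlab = "Period", ylab = "z-score")
legend("topleft", legend = c("sales", "visits"), col = 1:2, lty = 1)
```

The plot then shows relative movement in standard deviation units rather than raw values, so the axis label should say so.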

¹ http://blog.revolutionanalytics.com/2016/08/dual-axis-time-series.html
² http://ellisp.github.io/blog/2016/08/18/dualaxes
