class: center, middle, inverse, title-slide # Exploratory Data Analysis and Why visualize ?
📊 --- layout: true <div class="my-footer"> <span> <a href="" target="_blank"> </a> </span> </div> --- --- class: center, middle # Exploratory data analysis --- ## What is EDA? - Exploratory data analysis (EDA) is an aproach to analyzing data sets to summarize its main characteristics. - Often, this is visual. That's what we're focusing on today. - But we might also calculate summary statistics and perform data wrangling/manipulation/transformation at (or before) this stage of the analysis. That's what we'll focus on next. --- --- class: center, middle # Star Wars Dataset --- ## Dataset terminology - Each row is an **observation** - Each column is a **variable** .small[ ```r starwars ``` ``` ## # A tibble: 87 x 14 ## name height mass hair_color skin_color eye_color birth_year sex gender ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 Luke… 172 77 blond fair blue 19 male mascu… ## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… ## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu… ## 4 Dart… 202 136 none white yellow 41.9 male mascu… ## 5 Leia… 150 49 brown light brown 19 fema… femin… ## 6 Owen… 178 120 brown, gr… light blue 52 male mascu… ## 7 Beru… 165 75 brown light blue 47 fema… femin… ## 8 R5-D4 97 32 <NA> white, red red NA none mascu… ## 9 Bigg… 183 84 black light brown 24 male mascu… ## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male mascu… ## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>, ## # films <list>, vehicles <list>, starships <list> ``` ] --- ## Luke Skywalker ![luke-skywalker](img/luke-skywalker.png) --- ## What's in the Star Wars data? Take a `glimpse` at the data: ```r glimpse(starwars) ``` ``` ## Rows: 87 ## Columns: 14 ## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia O… ## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, … ## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77… ## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", … ## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", … ## $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue"… ## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0,… ## $ sex <chr> "male", "none", "none", "male", "female", "male", "female"… ## $ gender <chr> "masculine", "masculine", "masculine", "masculine", "femin… ## $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "… ## $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Hum… ## $ films <list> [<"The Empire Strikes Back", "Revenge of the Sith", "Retu… ## $ vehicles <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "I… ## $ starships <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1… ``` --- ## What's in the Star Wars data? .question[ How many rows and columns does this dataset have? What does each row represent? What does each column represent? ] ```r ?starwars ``` ![](img/starwars-help.png)<!-- --> --- class: center, middle # EDA: 1. Identifying variables --- ## Number of variables involved * Univariate data analysis - distribution of single variable * Bivariate data analysis - relationship between two variables * Multivariate data analysis - relationship between many variables at once, usually focusing on the relationship between two while conditioning for others --- ## Types of variables - **Numerical variables** can be classified as **continuous** or **discrete** based on whether or not the variable can take on an infinite number of values or only non-negative whole numbers, respectively. - If the variable is **categorical**, we can determine if it is **ordinal** based on whether or not the levels have a natural ordering. --- class: center, middle # Visualizing numerical data --- ## Describing shapes of numerical distributions * shape: * skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail) * modality: unimodal, bimodal, multimodal, uniform * center: mean (`mean`), median (`median`), mode (not always useful) * spread: range (`range`), standard deviation (`sd`), inter-quartile range (`IQR`) * unusal observations --- ## Histograms ```r ggplot(data = starwars, mapping = aes(x = height)) + geom_histogram(binwidth = 10) ``` ``` ## Warning: Removed 6 rows containing non-finite values (stat_bin). ``` <img src="02-WhyVisualize_files/figure-html/unnamed-chunk-6-1.png" width="75%" /> --- ## Density plots ```r ggplot(data = starwars, mapping = aes(x = height)) + geom_density() ``` ``` ## Warning: Removed 6 rows containing non-finite values (stat_density). ``` <img src="02-WhyVisualize_files/figure-html/unnamed-chunk-7-1.png" width="75%" /> --- ## Side-by-side box plots ```r ggplot(data = starwars, mapping = aes(y = height, x = gender)) + geom_boxplot() ``` ``` ## Warning: Removed 6 rows containing non-finite values (stat_boxplot). ``` <img src="02-WhyVisualize_files/figure-html/unnamed-chunk-8-1.png" width="75%" /> --- class: center, middle # Visualizing categorical data --- ## Bar plots ```r ggplot(data = starwars, mapping = aes(x = gender)) + geom_bar() ``` <img src="02-WhyVisualize_files/figure-html/unnamed-chunk-9-1.png" width="80%" /> --- ## Segmented bar plots, counts ```r ggplot(data = starwars, mapping = aes(x = gender, fill = hair_color)) + geom_bar() ``` <img src="02-WhyVisualize_files/figure-html/unnamed-chunk-10-1.png" width="80%" /> --- ## Recode hair color Using the `fct_other()` function from the **forcats** package, which is also part of the **tidyverse**. ```r starwars <- starwars %>% mutate(hair_color2 = fct_other(hair_color, keep = c("black", "brown", "brown", "blond") ) ) ``` --- ## Segmented bar plots, counts ```r ggplot(data = starwars, mapping = aes(x = gender, fill = hair_color2)) + geom_bar() + coord_flip() ``` <img src="02-WhyVisualize_files/figure-html/unnamed-chunk-12-1.png" width="70%" /> --- ## Segmented bar plots, proportions ```r ggplot(data = starwars, mapping = aes(x = gender, fill = hair_color2)) + geom_bar(position = "fill") + coord_flip() ``` <img src="02-WhyVisualize_files/figure-html/unnamed-chunk-13-1.png" width="70%" /> ```r labs(y = "proportion") ``` ``` ## $y ## [1] "proportion" ## ## attr(,"class") ## [1] "labels" ``` --- .question[ Which bar plot is a more useful representation for visualizing the relationship between gender and hair color? ] <img src="02-WhyVisualize_files/figure-html/unnamed-chunk-14-1.png" width="50%" /> <img src="02-WhyVisualize_files/figure-html/unnamed-chunk-15-1.png" width="50%" /> --- --- class: center, middle # Why do we visualize? --- ## Data: `datasaurus_dozen` Below is an exceprt from the `datasaurus_dozen` dataset: ``` ## # A tibble: 142 x 8 ## away_x away_y bullseye_x bullseye_y circle_x circle_y dino_x dino_y ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 32.3 61.4 51.2 83.3 56.0 79.3 55.4 97.2 ## 2 53.4 26.2 59.0 85.5 50.0 79.0 51.5 96.0 ## 3 63.9 30.8 51.9 85.8 51.3 82.4 46.2 94.5 ## 4 70.3 82.5 48.2 85.0 51.2 79.2 42.8 91.4 ## 5 34.1 45.7 41.7 84.0 44.4 78.2 40.8 88.3 ## 6 67.7 37.1 37.9 82.6 45.0 77.9 38.7 84.9 ## 7 53.3 97.5 39.5 80.8 48.6 78.8 35.6 79.9 ## 8 63.5 25.1 39.6 82.7 42.1 76.9 33.1 77.6 ## 9 68.0 81.0 34.8 80.0 41.0 76.4 29.0 74.5 ## 10 67.4 29.7 27.6 72.8 34.6 72.7 26.2 71.4 ## # … with 132 more rows ``` --- ## Summary statistics .small[ ```r datasaurus_dozen %>% group_by(dataset) %>% summarise(r = cor(x, y)) ``` ``` ## `summarise()` ungrouping output (override with `.groups` argument) ``` ``` ## # A tibble: 13 x 2 ## dataset r ## <chr> <dbl> ## 1 away -0.0641 ## 2 bullseye -0.0686 ## 3 circle -0.0683 ## 4 dino -0.0645 ## 5 dots -0.0603 ## 6 h_lines -0.0617 ## 7 high_lines -0.0685 ## 8 slant_down -0.0690 ## 9 slant_up -0.0686 ## 10 star -0.0630 ## 11 v_lines -0.0694 ## 12 wide_lines -0.0666 ## 13 x_shape -0.0656 ``` ] --- ## .question[ How similar do the relationships between `x` and `y` in the thirteen datasets look? How similar are they based on summary stats? ] <img src="02-WhyVisualize_files/figure-html/datasaurus-plot-1.png" width="2100" /> --- ## Anscombe's quartet ```r library(Tmisc) quartet ``` .pull-left[ ``` ## set x y ## 1 I 10 8.04 ## 2 I 8 6.95 ## 3 I 13 7.58 ## 4 I 9 8.81 ## 5 I 11 8.33 ## 6 I 14 9.96 ## 7 I 6 7.24 ## 8 I 4 4.26 ## 9 I 12 10.84 ## 10 I 7 4.82 ## 11 I 5 5.68 ## 12 II 10 9.14 ## 13 II 8 8.14 ## 14 II 13 8.74 ## 15 II 9 8.77 ## 16 II 11 9.26 ## 17 II 14 8.10 ## 18 II 6 6.13 ## 19 II 4 3.10 ## 20 II 12 9.13 ## 21 II 7 7.26 ## 22 II 5 4.74 ``` ] .pull-right[ ``` ## set x y ## 23 III 10 7.46 ## 24 III 8 6.77 ## 25 III 13 12.74 ## 26 III 9 7.11 ## 27 III 11 7.81 ## 28 III 14 8.84 ## 29 III 6 6.08 ## 30 III 4 5.39 ## 31 III 12 8.15 ## 32 III 7 6.42 ## 33 III 5 5.73 ## 34 IV 8 6.58 ## 35 IV 8 5.76 ## 36 IV 8 7.71 ## 37 IV 8 8.84 ## 38 IV 8 8.47 ## 39 IV 8 7.04 ## 40 IV 8 5.25 ## 41 IV 19 12.50 ## 42 IV 8 5.56 ## 43 IV 8 7.91 ## 44 IV 8 6.89 ``` ] --- ## Summarising Anscombe's quartet ```r quartet %>% group_by(set) %>% summarise( mean_x = mean(x), mean_y = mean(y), sd_x = sd(x), sd_y = sd(y), r = cor(x, y) ) ``` ``` ## `summarise()` ungrouping output (override with `.groups` argument) ``` ``` ## # A tibble: 4 x 6 ## set mean_x mean_y sd_x sd_y r ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 I 9 7.50 3.32 2.03 0.816 ## 2 II 9 7.50 3.32 2.03 0.816 ## 3 III 9 7.5 3.32 2.03 0.816 ## 4 IV 9 7.50 3.32 2.03 0.817 ``` --- ## Visualizing Anscombe's quartet ```r ggplot(quartet, aes(x = x, y = y)) + geom_point() + facet_wrap(~ set, ncol = 4) ``` <img src="02-WhyVisualize_files/figure-html/quartet-plot-1.png" width="75%" /> --- ## Age at first kiss .question[ Do you see anything out of the ordinary? ] <img src="img/age_firstkiss.png" width="90%" /> --- ## Facebook visits .question[ How are people reporting lower vs. higher values of FB visits? ] <img src="img/facebook_visits.png" width="90%" />