Exploratory Data Analysis with R
With this RMarkdown document we will present some basic methods to overview given data.
The required R Packages are:
require(tidyverse) # tidyverse is the sine qua non, see https://www.tidyverse.org/ and includes important packages like tidyr, dplyr, ggplot2, readr, magrittr etc
require(skimr) # skimr creates nice overview tables and does with dplyr and piping
require(mosaic) # mosaic includes also ggformula and lattice packages
require(QuantPsyc) # QuantPsyc delivers useful statistical functions like "Make.Z" and "eda.uni"
Get the full glimpse
We will use the fancy starwars dataframe from {dplyr} package, and call it sw for the sake of convenience.
sw <- starwars # copy to "sw"
sw # just view
For plain R output there is a concise first viewing option of a dataframe with the glimpse command.
glimpse(sw)
## Rows: 87
## Columns: 14
## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia...
## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180...
## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, ...
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown"...
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light"...
## $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blu...
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57....
## $ sex <chr> "male", "none", "none", "male", "female", "male", "femal...
## $ gender <chr> "masculine", "masculine", "masculine", "masculine", "fem...
## $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan",...
## $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "H...
## $ films <list> [<"The Empire Strikes Back", "Revenge of the Sith", "Re...
## $ vehicles <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, ...
## $ starships <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced ...
There are 14 variables, of which 3 are a list type. The other variables are numbers or characters, last ones partly better to be transformed into categorical (i.e. factor or dummy) variables.
Before we transform them, we might look at their unique values. Also "NA"s, R's term for non-existent values, are printed.
# show unique values for the specified variable, here columns 9, "gender" and 4, "hair_color" - which indeed is a tiny example of a for-loop
for (i in c(9,4)) {
unique(sw[,i]) %>% print.AsIs()
}
## gender
## 1 masculine
## 2 feminine
## 3 <NA>
## hair_color
## 1 blond
## 2 <NA>
## 3 none
## 4 brown
## 5 brown, grey
## 6 black
## 7 auburn, white
## 8 auburn, grey
## 9 white
## 10 grey
## 11 auburn
## 12 blonde
## 13 unknown
We might notice that there are 2 different "blond/blonde" values in hair_color. Let's change this!
We'll use the pipe %>% formula as we take the sw dataframe, do the mutation and write it back to sw.
Second, we turn hair_color into a factor variable.
sw <- sw %>% mutate(hair_color = replace(hair_color, which(hair_color=="blonde"), "blond"))
sw$hair_color <- as.factor(sw$hair_color)
Here's one more, very concise overview option (also using the pipe formula) -- which may help best at large datasets.
skim(sw) %>% summary()
Table: Data summary
| Name | sw |
| Number of rows | 87 |
| Number of columns | 14 |
| _____ | |
| Column type frequency: | |
| character | 7 |
| factor | 1 |
| list | 3 |
| numeric | 3 |
| ______ | |
| Group variables | None |
Before continue with overview statistics we might get clear about adressing and accessing parts of the dataframe. So-called subsetting a dataframe in R ist quite easy to do, you just need to adress the relevant rows and/or columns by numbers in squared brackets, and use the vector format c(x,y,z) for multiple ones. (Of course there are more elaborate options to do this, but for now we stick to shortness and the usage in the functions beneath.)
sw[1:5, c(1,6:4)] # first 5 rows of columns 1, 6, 5, 4
sw[3, -c(5:ncol(sw))] # row 3 with columns 1 to 4 by sorting out columns 5 to the last
Let's now get a better insight into certain variables. For this we use the skim function from the package {skimr}. (Of course you can also "skim" the whole dataframe if you want to: It might deliver a large table, anyway.) skim gives us very useful basic statistics for the chosen variables according to their type, and even a mini histogram where appropriate.
Therefore, skim is very useful at checking for missing values (and their share), whitespaces and unique values, and also a first glimpse of the single variate distributions.
# skimming a character, numerical, integer, factor variable
skim(sw, sex, birth_year, height, hair_color) # alternatively use column numbers like "skim(sw, c(8, 7, 2, 4))"
Table: Data summary
| Name | sw |
| Number of rows | 87 |
| Number of columns | 14 |
| _____ | |
| Column type frequency: | |
| character | 1 |
| factor | 1 |
| numeric | 2 |
| ______ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| sex | 4 | 0.95 | 4 | 14 | 0 | 4 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| hair_color | 5 | 0.94 | FALSE | 11 | non: 37, bro: 18, bla: 13, blo: 4 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| birth_year | 44 | 0.49 | 87.57 | 154.69 | 8 | 35 | 52 | 72 | 896 | ▇▁▁▁▁ |
| height | 6 | 0.93 | 174.36 | 34.77 | 66 | 167 | 180 | 191 | 264 | ▁▁▇▅▁ |
We also turn the other appropriate variables into factors and may check with glimpse one more time.
sw[,c(5:6,8:11)] <- lapply(sw[,c(5:6,8:11)], as.factor) # turning columns 5, 6 and 8 to 11 into factors
glimpse(sw)
## Rows: 87
## Columns: 14
## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia...
## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180...
## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, ...
## $ hair_color <fct> "blond", NA, NA, "none", "brown", "brown, grey", "brown"...
## $ skin_color <fct> "fair", "gold", "white, blue", "white", "light", "light"...
## $ eye_color <fct> blue, yellow, red, yellow, brown, blue, blue, red, brown...
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57....
## $ sex <fct> male, none, none, male, female, male, female, none, male...
## $ gender <fct> masculine, masculine, masculine, masculine, feminine, ma...
## $ homeworld <fct> Tatooine, Tatooine, Naboo, Tatooine, Alderaan, Tatooine,...
## $ species <fct> Human, Droid, Droid, Human, Human, Human, Human, Droid, ...
## $ films <list> [<"The Empire Strikes Back", "Revenge of the Sith", "Re...
## $ vehicles <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, ...
## $ starships <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced ...
Inspect a subset
We might take a look at a specific subset of the whole dataset.
With base R sapply there is a single command to apply the same function to several variables or the whole dataframe.
sapply(sw[,c(4:6,8:11)], nlevels) # number of levels for the categoricals (columns 4 to 6 and 8 to 11)
## hair_color skin_color eye_color sex gender homeworld species
## 11 31 15 4 2 48 37
sapply(sw[,2:3], median, na.rm=T) # median score for the numericals (columns 2 and 3); "na.rm" for ignoring the NA's
## height mass
## 180 79
The function select from {dplyr} offers a kind way of including or excluding variables, which pays off best using it in building a data pipe. After selecting you can continue with various statistical functions like summary here.
sw %>% dplyr::select(-c(name,films, vehicles, starships, c(4:7))) %>% # exclude certain variables
summary() # statistical summaries for the left variables
## height mass sex gender
## Min. : 66.0 Min. : 15.00 female :16 feminine :17
## 1st Qu.:167.0 1st Qu.: 55.60 hermaphroditic: 1 masculine:66
## Median :180.0 Median : 79.00 male :60 NA's : 4
## Mean :174.4 Mean : 97.31 none : 6
## 3rd Qu.:191.0 3rd Qu.: 84.50 NA's : 4
## Max. :264.0 Max. :1358.00
## NA's :6 NA's :28
## homeworld species
## Naboo :11 Human :35
## Tatooine :10 Droid : 6
## Alderaan : 3 Gungan : 3
## Coruscant: 3 Kaminoan: 2
## Kamino : 3 Mirialan: 2
## (Other) :47 (Other) :35
## NA's :10 NA's : 4
Looking at individual variables
The tiny str command will generally check of which type or structure a variable or any R object is, with it's sibling summary it is also used beyond EDA.
str(sw$sex)
## Factor w/ 4 levels "female","hermaphroditic",..: 3 4 4 3 1 3 1 4 3 3 ...
str(sw$name)
## chr [1:87] "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" "Leia Organa" ...
str(sw$birth_year)
## num [1:87] 19 112 33 41.9 19 52 47 NA 24 57 ...
Numerical (continuous) variables
For numerical variables there are dedicated statistical commands like:
summary(sw$birth_year)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 8.00 35.00 52.00 87.57 72.00 896.00 44
range(sw$birth_year, na.rm=T) # note another option for dealing with NA's
## [1] 8 896
quantile(sw$height, probs=0.8, na.rm=T) # 8th quantile
## 80%
## 193
top_counts(sw$height, max_levels=5) # top 5 values with counts
## [1] "183: 7, 180: 5, 188: 5, 170: 4, 178: 4"
Also, there are options which deliver the numbers in particular output formats, like skim we already know. df_stats from the {ggformula} package creates a dataframe, and can also be adapted.
skim(sw$birth_year) # for further options with skim see above
Table: Data summary
| Name | sw$birth_year |
| Number of rows | 87 |
| Number of columns | 1 |
| _____ | |
| Column type frequency: | |
| numeric | 1 |
| ______ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 44 | 0.49 | 87.57 | 154.69 | 8 | 35 | 52 | 72 | 896 | ▇▁▁▁▁ |
df_stats(~ mass, data=sw) # standard view by df_stats
df_stats(~ mass, data=sw, quantile, iqr, mad) # adapted df_stats output
## Warning: Excluding 28 rows due to missing data [df_stats()].
Turning numerical values into standard normal values can be done in a quite a simple way using the Make.Z command, or alternatively with scale. Both can be also applied on whole dataframes.
sw$heightZ <- Make.Z(sw$height) # converting into z-scores (standard-normal)
sw$massZ <- scale(sw$mass) # same for mass, with scale()
# as.data.frame(Make.Z(x)) # back converting into dataframe, when Make.Z is applied to a whole dataframe
Categorical (factor) variables
For factor variables there are several ways for inspecting them.
summary(sw$gender) # levels with counts, with NA's
## feminine masculine NA's
## 17 66 4
mosaic::tally(~gender, data=sw) # gives a nice print out
## gender
## feminine masculine <NA>
## 17 66 4
levels(sw$gender) # only levels, without counts, without NA's
## [1] "feminine" "masculine"
skim(sw$gender) # for further options with skim see above
Table: Data summary
| Name | sw$gender |
| Number of rows | 87 |
| Number of columns | 1 |
| _____ | |
| Column type frequency: | |
| factor | 1 |
| ______ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| data | 4 | 0.95 | FALSE | 2 | mas: 66, fem: 17 |
Why it is worth knowing various of them? For instance, you can use them to address particular data.
levels(sw$gender)[1]
## [1] "feminine"
summary(sw$gender)[3]
## NA's
## 4
For combining in 2 factor variables into a cross table you can use table or the tally command from {mosaic} package -- they differ in handling the NA's.
table(sw$gender, sw$sex) # crosstable with NA's being ignored; more options are available with xtabs()
##
## female hermaphroditic male none
## feminine 16 0 0 1
## masculine 0 1 60 5
mosaic::tally(gender ~ sex, data=sw) # crosstable including NA's
## sex
## gender female hermaphroditic male none <NA>
## feminine 16 0 0 1 0
## masculine 0 1 60 5 0
## <NA> 0 0 0 0 4
There is also a way to have an insight of the distribution of factor variables using the top_levels command from the {janitor} package.
Note how we use a command from a package NOT sourced in this R/RMarkdown script -- that's a stunning feature of R.
janitor::top_levels(sw$eye_color, n=3, show_na=T) # n = number in top & bottom group
Remember skim above? Of course you can use skim for inspecting a single variable. But you can even adjust skim to your own needs. Just write a small function with the skim_with command like "my_skim" with MAD, IQR and 10th quantile here.
my_skim <- skim_with(
numeric = sfl(mad = mad, iqr = IQR, p10 = ~ quantile(., probs = .1)), append = F)
# my_skim(sw$mass) # What's the problem here?
Different R functions ignore or don't ignore NA's in calculating, as we might have seen. Often bug fixing is just to take NA's explicitly into account, for instance with the na.omit command.
my_skim(na.omit(sw$mass))
Table: Data summary
| Name | na.omit(sw$mass) |
| Number of rows | 59 |
| Number of columns | 1 |
| _____ | |
| Column type frequency: | |
| numeric | 1 |
| ______ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mad | iqr | p10 |
|---|---|---|---|---|---|
| data | 0 | 1 | 16.31 | 28.9 | 44 |
Grouped variables statistics
The {dplyr} function group_by easily splits the dataframe into groups along certain variables (factors and numerical!) and works fine with the pipe.
sw %>% dplyr::group_by(gender) %>% # building groups by gender
dplyr::select(c(2:4)) %>% # reducing the output by select only a few columns
skim() # skim overview according to the gender groups
## Adding missing grouping variables: `gender`
Table: Data summary
| Name | Piped data |
| Number of rows | 87 |
| Number of columns | 4 |
| _____ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 2 |
| ______ | |
| Group variables | gender |
Variable type: factor
| skim_variable | gender | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|---|
| hair_color | feminine | 0 | 1.00 | FALSE | 6 | bro: 6, non: 5, bla: 3, aub: 1 |
| hair_color | masculine | 5 | 0.92 | FALSE | 9 | non: 31, bro: 11, bla: 9, blo: 3 |
| hair_color | NA | 0 | 1.00 | FALSE | 4 | bla: 1, bro: 1, non: 1, unk: 1 |
Variable type: numeric
| skim_variable | gender | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| height | feminine | 1 | 0.94 | 164.69 | 23.55 | 96 | 161.50 | 165.5 | 172.0 | 213 | ▁▁▇▆▁ |
| height | masculine | 4 | 0.94 | 176.52 | 37.65 | 66 | 171.25 | 183.0 | 193.0 | 264 | ▂▁▇▇▁ |
| height | NA | 1 | 0.75 | 181.33 | 2.89 | 178 | 180.50 | 183.0 | 183.0 | 183 | ▃▁▁▁▇ |
| mass | feminine | 8 | 0.53 | 54.69 | 8.59 | 45 | 50.00 | 55.0 | 56.2 | 75 | ▇▇▁▁▂ |
| mass | masculine | 17 | 0.74 | 106.15 | 184.97 | 15 | 75.00 | 80.0 | 88.0 | 1358 | ▇▁▁▁▁ |
| mass | NA | 3 | 0.25 | 48.00 | NA | 48 | 48.00 | 48.0 | 48.0 | 48 | ▁▁▇▁▁ |
We've already heard about the df_stats for single variables. Bit it's also very useful for variables grouped by another, as here height by gender groups. Note that NA's are ignored by default.
df_stats(height ~ gender, data=sw) # standard view by df_stats
And this works for grouping along continuous variables, too -- and of course with specifying the statistical values.
df_stats(height ~ mass, data=sw, mean, range) # mean and range of height by mass "classes"
## Warning: Excluding 28 rows due to missing data [df_stats()].
Plots
Numbers are fine, but one should always take a look at plots where possible. Note that the plotting options presented here are not primarily meant for a beautiful data visualisation, but for inspecting the given data.
The easiest way to get a picture of possible correlations ist to use plot on the whole dataframe or the variables to take into account. Here, we chose to check height, mass, sex, gender and species, and dropped the 16th observation, which is "Jabba the Hutt", presumably an outlier with his exceptional body shape.
plot(sw[-16,c(2,3,8,9, 11)], pch=20) # "pch" sets the shape of the points; alternatively use "pairs"

pairs.panels from the {psych} package is a more elaborative form of it, including correlation scores and various output/layout options. These may help especially when inspecting more variables at once, for instance by starring or scaling up the font of high correlation coefs.
psych::pairs.panels(sw[-16,c(2,3,8,9, 11)], method = "pearson", density=T, stars=T, ellipses=F) # also available: spearman and kendall correlation

psych::pairs.panels(sw[-16,1:11], method = "pearson", density=T, scale=T, ellipses=F, stars=T)

The eda.uni function in {QuantPsyc} is a good start for inspecting univariate distributions, as it combines four relevant plots: histogram, density plot, normal plot and boxplot.
eda.uni(sw$height, title="Univariate Distribution of Height") # 4 plots in one :)

There are various plot options, which can be parametrised in numerous ways. plot as the simplest of them will adjust to the given variable type/types.
par(mfrow=c(2,2)) # sets visual parameter for showing the following plots in a 2x2 grid pattern
plot(sw$mass, pch=20, col="orange"); lines(smooth(na.omit(sw$mass))) # smooth = Tukey's (running median) smoothing; writing 2 lines of code into one by separating them with ";"
plot(sw$sex, col=3) # colour can also be given as integer
plot(sw$mass[sw$mass < 1000], sw$height[sw$mass < 1000], pch=15) # with subsetting the data; "pch" parameter refers to the shape of points
plot(sw$gender, sw$height, pch=20, main="Height in Star Wars characters", horizontal=T)

Certain plots can be addressed by basic commands like hist, boxplot, barplot or mosaicplot.
par(mfrow=c(2,2))
hist(sw$height, breaks=5, col=18) # with given number of bins
boxplot(sw$height[sw$sex=="female"], horizontal = T, pch=20, main="Height in female Star Wars characters") # with female sex filter
barplot(sw$height, col=7, border=7) # border sets the colour of the shape margins
mosaicplot(table(sw$sex, sw$gender), col=5:6)

histogram and bwplot from the {lattice} package offer easy plots for grouped variables.
Note that adressing a function with its full package name like package::command makes sure to get the right command if there are homonymous ones.
lattice::histogram(~height|sex, data=sw) # histogram of height by sex groups

lattice::bwplot(sex~height, data=sw) # boxplot of height by sex groups

As noted before there are far more elaborate options to plot data, first and foremost with the {ggplot2} package. We won't dive into this, but give a short view on it by using the {ggfomula} shortcuts for it. With their shortened grammar there are also a good compromise between quickness and nice view for EDA.
# further ggformula plots are for instance gf_histogram(), gf_dens(), gf_freqpoly(), gf_dotplot(), gf_bar(), gf_violin, along several layout options
gf_qq(~ height, data=sw) # normal plot for height

gf_point(mass ~ height | sex, data=sw, size=2, col=2) # height vs. mass grouped by sex

gf_density(~ height, colour=~sex, data=sw, alpha=0.5, fill=~sex) # density plot for height gouped by sex

gf_smooth(mass ~ height, color=~sex, data=sw, size = 1) # loess is standard smoothing method
