Descriptive Statistics
Descriptive statistics and data visualizations help researchers summarize and describe characteristics of their data, check for obvious irregularities, identify patterns, and gather important insight into how these data will behave during subsequent modeling efforts.
Frequency Tables
A frequency table lists each distinct value in the data alongside the number of times (frequency) that value occurs in the dataset. The rows are ordered by the magnitude of the values.
Frequency tables should include the following columns:
x = Values of the raw data, arranged from high to low. These values are discrete and typically represented as factors or discrete numbers.
freq = Frequency of x, i.e., the number of observations that take the value x.
freq_percent = Percentage of observations at each value of x.
cum_freq = Cumulative frequency of x. Shows the number of observations at or below the given value of x.
cum_percent = Cumulative percentage of observations that fall at or below the given value of x.
As an example, we will use the diamonds dataset from the {ggplot2} package to create a frequency table of the cut variable.
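library(tidyverse)  # loads ggplot2 (diamonds), dplyr, and forcats (as_factor)

data("diamonds")

diamonds %>%
  mutate(cut = as_factor(cut)) %>%
  group_by(cut) %>%
  count() %>%
  ungroup() %>%
  rename(x = cut, freq = n) %>%
  # sort ascending by x so the cumulative columns count observations at or below each value
  arrange(x) %>%
  mutate(freq_percent = scales::percent(freq / sum(freq)),
         cum_freq = cumsum(freq),
         cum_percent = scales::percent(cum_freq / sum(freq))) %>%
  # display from high to low
  arrange(desc(x))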
## # A tibble: 5 × 5
##   x          freq freq_percent cum_freq cum_percent
##   <ord>     <int> <chr>           <int> <chr>
## 1 Ideal     21551 40.0%           53940 100.0%
## 2 Premium   13791 25.6%           32389 60.0%
## 3 Very Good 12082 22.4%           18598 34.5%
## 4 Good       4906 9.1%             6516 12.1%
## 5 Fair       1610 3.0%             1610 3.0%
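The same five columns can also be built without any packages. The following is a minimal base R sketch, assuming a small made-up vector of ratings (the `ratings` data and the `freq_table` name are hypothetical and are not part of the diamonds example). Percentages are kept as numbers rather than formatted strings.

```r
# Hypothetical toy data: ten ratings on a 1-5 scale (made up for illustration)
ratings <- c(3, 5, 4, 3, 2, 5, 3, 1, 4, 3)

# table() counts each distinct value and returns them in ascending order
counts <- table(ratings)

freq_table <- data.frame(
  x            = as.integer(names(counts)),
  freq         = as.vector(counts),
  freq_percent = 100 * as.vector(counts) / length(ratings),
  cum_freq     = cumsum(as.vector(counts))  # observations at or below x
)
freq_table$cum_percent <- 100 * freq_table$cum_freq / length(ratings)

# Display from high to low, matching the layout used above
freq_table <- freq_table[order(-freq_table$x), ]
print(freq_table, row.names = FALSE)
```

The cumulative columns are computed while the rows are still in ascending order and only then re-sorted for display. Reversing those two steps would make `cum_freq` count observations at or above each value instead.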
Data Visualization
Three methods are typically used for frequency visualization:
Bar charts (nominal level data)
Histograms (ordinal level data, or discrete interval or ratio level data)
Frequency polygons (continuous interval or ratio level data)
Each of these visualizations can be created using the {ggplot2} package. We will use the diamonds dataset to explore each of these visualizations.
diamonds %>%
  ggplot(aes(x = cut)) +
  geom_bar() +
  labs(x = "cut", y = "frequency", title = "Bar Chart of Nominal Level Data")
diamonds %>%
  ggplot(aes(x = carat)) +
  geom_histogram() +
  labs(x = "carat", y = "frequency", title = "Histogram of Discrete, Ratio Level Data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
diamonds %>%
  ggplot(aes(x = price)) +
  geom_freqpoly() +
  labs(x = "price", y = "frequency", title = "Frequency Polygon of Continuous, Ratio Level Data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
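The `stat_bin()` messages above indicate that {ggplot2} fell back to its default of 30 bins. One way to respond is to set `binwidth` explicitly, as in the sketch below. The widths chosen here (0.1 carat and 500 US dollars) are illustrative assumptions rather than recommended values, and the object names `p_hist` and `p_poly` are hypothetical.

```r
library(ggplot2)

# binwidth is given in the units of the x variable;
# the values below are illustrative assumptions, not recommendations
p_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1) +
  labs(x = "carat", y = "frequency", title = "Histogram with binwidth = 0.1")

p_poly <- ggplot(diamonds, aes(x = price)) +
  geom_freqpoly(binwidth = 500) +
  labs(x = "price", y = "frequency", title = "Frequency Polygon with binwidth = 500")

# Printing a plot object draws it
p_hist
p_poly
```

With `binwidth` supplied, the message is no longer printed. It is usually worth trying a few widths, since narrow bins expose detail and noise while wide bins smooth both away.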