Exploring Variation in Categorical Variables
So far we have focused on examining the distributions of quantitative variables. Our methods for examining distributions differ, however, depending on whether a variable is quantitative or categorical.
Suppose we plotted a categorical variable such as RaceEthnic the way we would plot a quantitative one. Although we might be tempted to call the result a roughly normal or bell-shaped distribution, isn’t it a little strange to say the center of this distribution is Asian? What is the range of this distribution? Is it Other minus African American? Something about these descriptions seems off!
We have thus far used histograms to examine the distribution of a variable. But histograms aren’t appropriate if the variable is categorical. And if R knows it’s categorical (if, for example, you have specified it as a factor), it won’t even run the histogram and will give you an error message instead.
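The refusal to draw a histogram of a factor can be seen in a small sketch. It uses base R's hist() as a stand-in for gf_histogram() so that it runs without extra packages, and the race vector is made-up data, not the Fingers data frame. The exact wording of the error may differ between functions.

```r
# Hypothetical categorical data (not the Fingers data frame)
race <- factor(c("White", "Asian", "Asian", "Latino", "White", "Other"))

# hist() expects numbers; tryCatch() captures the error message
# instead of stopping the script
result <- tryCatch(
  {
    hist(race)
    "histogram drawn"
  },
  error = function(e) conditionMessage(e)
)

# Prints the error message, because no histogram could be drawn
print(result)
```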
Bar Graphs
When a variable is categorical you can visualize the distribution with a bar graph. It looks like a histogram, but it’s not. There is no such thing as bins, for example, in a bar graph. The number of bars in a bar graph will always equal the number of categories in your variable.
So let’s take a look at variables like Sex and RaceEthnic from the data frame Fingers. These have been specified as factors and the levels have been labeled already.
Here’s the code for making a bar graph in R:
gf_bar( ~ Sex, data = Fingers)
Now create a bar graph of RaceEthnic. (Again, use the gf_bar() function.)
# Load packages
require(mosaic)
require(tidyverse)
require(Lock5withR)
require(supernova)
# Create a bar graph of RaceEthnic in the Fingers data frame. Use the gf_bar() function
gf_bar(~RaceEthnic, data=Fingers)
You can change the width of these bars by adding the argument width and setting it to some number between 0 and 1.
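As a minimal sketch of the width argument, the code below draws the same bar graph with two different widths. The survey data frame is hypothetical (it is not the Fingers data), and the sketch assumes the ggformula package, which provides gf_bar(), is installed.

```r
library(ggformula)

# Hypothetical data frame (not the Fingers data)
survey <- data.frame(
  Sex = factor(c("female", "male", "female", "female", "male"))
)

# Smaller width values give narrower bars with larger gaps between them
narrow_bars <- gf_bar(~ Sex, data = survey, width = 0.5)
wide_bars <- gf_bar(~ Sex, data = survey, width = 0.9)

# Display one of the two graphs
print(narrow_bars)
```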
You learned about the arguments color and fill for gf_histogram(). It turns out that these arguments work the same way for gf_bar(). Try playing with the colors here.
# add arguments color and fill to these bar graphs
gf_bar(~ Sex, data = Fingers)
gf_bar(~ RaceEthnic, data = Fingers)
# One possible solution. The colors below are just examples; any colors are acceptable as long as you use both the color and fill arguments
gf_bar(~ Sex, data = Fingers, color="darkgray", fill="mistyrose" )
gf_bar(~ RaceEthnic, data = Fingers, color="darkgray", fill="mistyrose")
Shape, Center, and Spread
Visualizing the distributions of categorical variables is just as important as visualizing the distributions of quantitative variables. However, the features to look at need to be adjusted a little.
Shape doesn’t really make sense for the distribution of a categorical variable. Just by reordering the bars, you can alter the shape, so we don’t want to make much of the shape of such a distribution.
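To see why shape is arbitrary here, consider the sketch below. It uses base R's factor() and table() on made-up responses (not the Fingers data) and an arbitrary reordering of the categories: the counts stay the same, but the left-to-right order, and therefore the apparent shape, changes.

```r
# Hypothetical responses (not the Fingers data frame)
race <- factor(c("White", "Asian", "White", "Latino", "Asian", "White", "Other"))

# Default order: factor levels are sorted alphabetically
counts_default <- table(race)

# Same data, with the levels listed in a different (arbitrary) order
race_reordered <- factor(race, levels = c("White", "Other", "Asian", "Latino"))
counts_reordered <- table(race_reordered)

# Same counts per category, different order across the axis
print(counts_default)
print(counts_reordered)
```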
Both center and spread are still worth noting. In some ways, center is easier to determine for a categorical variable than for a quantitative one. The category with the most cases is the center; it is where observations are most concentrated. This category is also called the mode of the distribution: the most frequent value.
Spread is a way of characterizing how well distributed the cases are across the categories. Do most observations fall in one category, or is there an even distribution across all the categories?
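Center and spread can be read off a table of counts. The sketch below uses base R and two made-up variables (the names concentrated and even are hypothetical): one with most cases in a single category and one with cases spread fairly evenly.

```r
# Two hypothetical categorical variables with 10 cases each
concentrated <- factor(c("A", "A", "A", "A", "A", "A", "A", "B", "C", "D"))
even <- factor(c("A", "A", "A", "B", "B", "B", "C", "C", "D", "D"))

# Center: the mode is the category with the largest count.
# Note that which.max() returns only the first category if there is a tie.
mode_concentrated <- names(which.max(table(concentrated)))
print(mode_concentrated)

# Spread: the share of cases that falls in each category
share_concentrated <- prop.table(table(concentrated))
share_even <- prop.table(table(even))
print(share_concentrated)
print(share_even)
```

In the first variable one category holds most of the cases; in the second, no category holds more than a modest share.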
Frequency Tables
Categorical variables can also be summarized with frequency tables (also called tallies). We’ve used the tally() function before. Use tally() to look at the distributions of Sex and RaceEthnic from the data frame Fingers.
# This creates a frequency table of Sex
tally(~ Sex, data = Fingers)
# Write code to create a frequency table of RaceEthnic
tally(~ RaceEthnic, data = Fingers)
Sometimes you may also want the total across all levels of the variable. Because this total is presented outside the main breakdown in the tally table, we say it is in the “margin.” You can get totals by adding margins as an argument to tally().
tally(~ Sex, data = Fingers, margins = TRUE)
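What the margin contains can be checked with a base R sketch. It uses table() and addmargins() rather than tally(), on a made-up sex vector (not the Fingers data); base R labels the total Sum.

```r
# Hypothetical data (not the Fingers data frame)
sex <- factor(c("female", "male", "female", "female", "male"))

# Counts per category
counts <- table(sex)

# Append the total across all categories as an extra entry
with_total <- addmargins(counts)
print(with_total)

# The margin is just the number of cases
print(length(sex))
```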
Often it is more useful to look at relative frequencies, or proportions, than at raw counts. We can add the format argument to tally() to get the same data in that form.
tally(~ Sex, data = Fingers, format = "proportion")
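A proportion is simply a count divided by the total number of cases. The sketch below shows this with base R's table() and prop.table() on a made-up sex vector (not the Fingers data).

```r
# Hypothetical data (not the Fingers data frame)
sex <- factor(c("female", "male", "female", "female", "male"))

# Counts per category, then the same table as proportions
counts <- table(sex)
props <- prop.table(counts)
print(props)

# Dividing each count by the number of cases gives the same values
by_hand <- as.vector(counts) / length(sex)
print(all.equal(as.vector(props), by_hand))
```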
Try including both format and margins as arguments for a tally of RaceEthnic. What do you predict the total proportion (in the margin) will be?
# Starting point: a frequency table of RaceEthnic with counts only
tally(~ RaceEthnic, data = Fingers)
# Add the margins and format arguments to tally(): set margins to TRUE and format to "proportion"
tally(~ RaceEthnic, data = Fingers, margins = TRUE, format = "proportion")
Do you think we could also use tally() for quantitative variables like Thumb? Let’s try it here. Write code for creating a frequency table of Thumb.
# write code to create a frequency table of Thumb
tally(~Thumb, data=Fingers)
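A frequency table of a quantitative variable tends to have one entry for nearly every distinct value. The sketch below illustrates this with base R's table() on made-up thumb lengths (these are not the Fingers data), and then groups the same values into bins with cut(), as a histogram would.

```r
# Hypothetical thumb lengths in mm (made up; not the Fingers data)
thumb <- c(52.0, 55.5, 58.4, 60.0, 60.3, 61.7, 63.0, 64.2, 66.0, 69.5)

# A frequency table makes one entry per distinct value
freq <- table(thumb)
print(length(freq))  # number of distinct values
print(max(freq))     # largest count for any single value

# Grouping values into 5 mm bins gives a more readable summary
binned <- table(cut(thumb, breaks = seq(50, 70, by = 5)))
print(binned)
```

With measurements like these, the raw table is as long as the data itself, while the binned version summarizes the distribution in a few rows.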
Distributions of categorical variables are best explored with frequency tables (tallies) and bar graphs; distributions of quantitative variables are better explored with histograms and box plots. For both kinds of variables, one might choose to use raw frequencies or one might choose relative frequencies (such as density or proportion), depending on the purpose. It is important to have a comprehensive toolbox for examining all kinds of variables.