Chapter 10 dplyr - data manipulation
While base R
has many tools that can do the job, dplyr
and other Tidyverse packages can easily work together and allow the easy creation of pipelines to accomplish a task as was demonstration in the previous chapter 8.3.1.
In this chapter we’ll explore a few dplyr
verbs
- select to choose columns
- filter to check data on rows
- arrange to order data
- mutate to compute new values
- summarise to create condensed data
- group_by to select specific data
In addition we’ll learn about some conditional selection within the functions described by these verbs.
We’ll use the Master4
large data file that was created previously. (See all download and merge code in Appendix D if you need to recreated it.) Master4
is a merge, in order of the following datasets (links are to the documentation):
10.1 selecting columns
To limit the output length most examples will be piped in the head()
function to only print the column headers followed by 6 data lines. %>% head()
may terminate the command but the pipeline can be extended further.
The selection of columns is easy and just requires the name of the data frame containing the data and specifying the column names that we want to keep for further use in an analysis. The total length (number of rows) would be printed, here limited by piping into head()
.
SEQN LBXMFOS URXUCR BMXBMI
1 83732 NA 41 27.8
2 83733 NA 181 30.8
3 83734 NA 70 28.8
4 83735 NA 102 42.4
5 83736 0.6 315 20.3
6 83737 NA 64 28.6
We can note that in column LBXMFOS
there are 5 NA
, in the next section we’ll see how to get rid of them.
Note that the square bracket subsetting also works, for example to select the first 7 columns we could write select(Master4, 1:7)
which would be equivallent to Master4[, 1:7]
in this case.
10.2 Filtering rows
The filtering of rows depends on the desired outcome. One step that is useful and often necessary is to remove the NA
values. If we continue with the selected columns we can remove those with:
SEQN LBXMFOS URXUCR BMXBMI
1 83736 0.6 315 20.3
Since the head()
function was first, we are only filtering the first 6 rows, and 5 are therefore eliminated. If we move the head()
function after, we’ll select the first 6 rows that do not have any NA
.
SEQN LBXMFOS URXUCR BMXBMI
1 83736 0.6 315 20.3
2 83745 0.8 178 25.0
3 83750 1.9 81 24.1
4 83754 5.4 148 43.7
5 83762 0.4 317 38.0
6 83767 1.0 65 26.3
We can use other operator (Appendix B) to filter the rows with conditional statements. For example we could ask to keep the BMI values below or equal to 25.0 and more, and at the same time removing NA
values from chosen columns:
SEQN LBXMFOS URXUCR BMXBMI
1 83736 0.6 315 20.3
2 83738 NA 100 18.1
3 83739 NA 25 15.7
4 83745 0.8 178 25.0
5 83746 NA 34 16.1
6 83748 NA 14 16.1
We could have the filter()
function more than once within the pipeline, but we can also have multiple statements at the same time:
select(Master4, SEQN, LBXMFOS, URXUCR, BMXBMI) %>%
filter(!is.na(BMXBMI),
BMXBMI <= 25.0,
!is.na(LBXMFOS),
URXUCR == 25.0) %>%
head()
SEQN LBXMFOS URXUCR BMXBMI
1 85253 2.8 25 24.2
2 86086 1.1 25 20.3
3 88829 0.5 25 20.1
4 89326 0.7 25 22.9
5 89399 1.1 25 24.0
6 90178 0.6 25 20.4
Note on style: writing each conditional statement on a separate line makes it easier to read and understand the code.
The dyplr
function drop_na()
would also remove NA
from all rows, or specified rows.
10.3 Arrange data
For dplyr
to arrange data is to sort data in order. The following example accomplishes many things at the same time and the “seed” makes the results always the same. To see just 6 lines of ordered data based on age, the data is first randomly sampled to keep only 100 rows with a dyplr function sample_n()
. All rows in the age column RIDAGEYR
without value are removed with drop_na()
. Four columns are selected, and the age is filtered to keep only ages above 12. The data is then arranged (ordered) by the first (age) and then second column (creatinine) as designated. The data is first ordered by age and then by the second parameter.
set.seed(18) ;
Master4 %>%
sample_n(100) %>%
drop_na(RIDAGEYR) %>%
select(SEQN, URXUCR, BMXBMI, RIDAGEYR) %>%
filter(RIDAGEYR >= 12) %>%
arrange(RIDAGEYR, URXUCR) %>%
head()
SEQN URXUCR BMXBMI RIDAGEYR
1 89664 114 16.1 12
2 91777 148 20.3 13
3 90183 330 26.2 13
4 92077 116 17.8 14
5 84253 144 20.6 14
6 92721 161 25.2 14
Note that the order in the pipe may be important, but the modularity of the code makes it easy to try if there is an effect. For example, since we are only looking at 6 values here, it makes no difference if the sampling of 100 rows occurs before or after the dropping of NA
in the age column.
10.4 mutating data
As we explored in the previous demonstration (8.3.1) we can create a new column when computing data with the mutate()
function. As an example we can compute a ratio of \(\frac{height}{weight}\) as it is often a useful measure. In the BMX_I
data we can find the relevant column names:
BMXWT
- Weight (kg)BMXHT
- Standing Height (cm)BMXBMI
- Body Mass Index (kg/m**2)
We can then make a selection of columns and compute the ratio which will be saved in a new column with the name that we define for example HWRATIO
. As an example the data is then rounded to just 1 decimal point as most of the rest of the data:
Master4 %>%
select(SEQN, BMXWT, BMXHT, BMXBMI) %>%
mutate(HWRATIO = BMXHT / BMXWT,
HWRATIO2 = round(HWRATIO, digit=1)) %>%
head()
SEQN BMXWT BMXHT BMXBMI HWRATIO HWRATIO2
1 83732 94.8 184.5 27.8 1.946203 1.9
2 83733 90.4 171.4 30.8 1.896018 1.9
3 83734 83.4 170.1 28.8 2.039568 2.0
4 83735 109.8 160.9 42.4 1.465392 1.5
5 83736 55.2 164.9 20.3 2.987319 3.0
6 83737 64.4 150.0 28.6 2.329193 2.3
10.4.1 mutate with conditional statement
When analyzing tabular data we may have the choice will be made for each row. Here is an example found online: Creating New Variables in R with mutate() and ifelse()32.
# Create 3 vectors
section <- c("MATH111", "MATH111", "ENG111")
grade <- c(78, 93, 56)
student <- c("David", "Kristina", "Mycroft")
# Combine vectors into a dataframe
gradebook <- data.frame(section, grade, student)
# Test of grade level and assign an outcome
mutate(gradebook, Pass.Fail = ifelse(grade > 60, "Pass", "Fail"))
section grade student Pass.Fail
1 MATH111 78 David Pass
2 MATH111 93 Kristina Pass
3 ENG111 56 Mycroft Fail
By “nesting” multiple ifelse()
statements we can manage to provide multiple choices: the second argument is replaced by another ifelse()
statements in succession. The final iteration has a single option that is not an ifelse()
statements. In the example below, from the same source, the gradebook
is read one row at a time, and then we need to provide a grade based on the letter system: A through D or F.
mutate(gradebook,
letter =
ifelse(grade %in% 60:69, "D",
ifelse(grade %in% 70:79, "C",
ifelse(grade %in% 80:89, "B",
ifelse(grade %in% 90:99, "A",
"F")))))
section grade student letter
1 MATH111 78 David C
2 MATH111 93 Kristina A
3 ENG111 56 Mycroft F
This makes use of the %in%
operator that can easily be understood as being able to test if a value is within the proposed range. On the first line we ask if the test of column grade
for each row is between 60
and 69
. If the answer is “yes” which means the test is TRUE
, then the value will be “D”. If the test is FALSE
we’ll go to the next question and so on until the last line where the final choice is F
. (Note that for readability the code is indented, and “F” is the alternate option for the test grade %in% 90:99
.)
We’ll use this statement is a very useful way in a future section (10.7.)
10.5 Summarising and grouping data
The following data can be found in the DEMO_I demographic data file:
RIAGENDR
: 1-male, 2-female (none missing)DMDMARTL
: 1 Married, 2 Widowed, etc.DMDEDUC2
- Education level - Adults 20+INDHHIN2
- Annual household income
We want to get some information by group, in this case we’ll group by marital status and then count the total number of observations for each case and we’ll store this in column Counts
. For each marital status code, we’ll then count how many women and men are in each category that we’ll report in columns Mnum
and Wnum
(the sum of these 2 on each line should be equal to the reported Counts
column.)
For simplicity with income and education we’ll simply compute the mean of the codes, which should still give us an indication of the level of income and education for each category of marital status.
We’ll place the results in a user-defined R
object so that we can reuse it in the next section without re-writing the complete pipeline. Let’s call it Xsum
for example
Xsum <- Master4 %>%
select(SEQN, RIAGENDR, DMDMARTL, DMDEDUC2, INDHHIN2) %>%
drop_na() %>%
group_by(DMDMARTL) %>%
summarise(Counts = n() ,
Mnum = sum(RIAGENDR == 1),
Wnum = sum(RIAGENDR == 2),
MeanIncCode = mean(INDHHIN2),
MeanEducCode = mean(DMDEDUC2))
Xsum # print output
# A tibble: 8 × 6
DMDMARTL Counts Mnum Wnum MeanIncCode MeanEducCode
<dbl> <int> <int> <int> <dbl> <dbl>
1 1 2792 1481 1311 12.5 3.52
2 2 401 106 295 11.1 3.00
3 3 597 239 358 10.1 3.49
4 4 184 66 118 9.48 2.80
5 5 999 475 524 11.6 3.60
6 6 533 270 263 10.7 3.25
7 77 2 1 1 42 4
8 99 1 0 1 99 9
10.6 Recoding: string replacement
The above results are OK but it would be nice to be able to change some of the code numbers to actual English word such as “married” or “single” in text. This can be accomplished in many ways in dplyr
but one of the simplest is to use recode()
as a special case within mutate()
to overwrite a column.
Code or Value | Value Description | Count | Cumulative |
---|---|---|---|
1 | Married | 2886 | 2886 |
2 | Widowed | 421 | 3307 |
3 | Divorced | 614 | 3921 |
4 | Separated | 192 | 4113 |
5 | Never married | 1048 | 5161 |
6 | Living with partner | 555 | 5716 |
77 | Refused | 2 | 5718 |
99 | Don’t Know | 1 | 5719 |
. | Missing | 4252 | 9971 |
We can then recode the DMDMARTL
column of Xsum
with the following pipeline:
Xsum2 <- Xsum %>% mutate(DMDMARTL = recode(DMDMARTL,
`1` = "Married",
`2` = "Widowed",
`3` = "Divorced",
`4` = "Separated",
`5` = "Nevermarried",
`6` = "Living with partner",
`77` = "Refused",
`99` = "Don' Know")
)
Xsum2 # print out
# A tibble: 8 × 6
DMDMARTL Counts Mnum Wnum MeanIncCode MeanEducCode
<chr> <int> <int> <int> <dbl> <dbl>
1 Married 2792 1481 1311 12.5 3.52
2 Widowed 401 106 295 11.1 3.00
3 Divorced 597 239 358 10.1 3.49
4 Separated 184 66 118 9.48 2.80
5 Nevermarried 999 475 524 11.6 3.60
6 Living with partner 533 270 263 10.7 3.25
7 Refused 2 1 1 42 4
8 Don' Know 1 0 1 99 9
(Credit: this example inspired by How to Recode a Column with dplyr in R33)
In fact 2 of the columns are not well named and we should rename Mnum
and Wnum
to simply Men
and Women
in the table. This is done with the dplyr
rename()
function:
# A tibble: 8 × 6
DMDMARTL Counts Men Women MeanIncCode MeanEducCode
<chr> <int> <int> <int> <dbl> <dbl>
1 Married 2792 1481 1311 12.5 3.52
2 Widowed 401 106 295 11.1 3.00
3 Divorced 597 239 358 10.1 3.49
4 Separated 184 66 118 9.48 2.80
5 Nevermarried 999 475 524 11.6 3.60
6 Living with partner 533 270 263 10.7 3.25
7 Refused 2 1 1 42 4
8 Don' Know 1 0 1 99 9
10.7 Getting it all together
We now know enough that we should be able to get it all together in an annotated pipeline starting with Master4
.
10.7.1 Example 1: by gender
Let’s try to answer the question: “What is the average level of total cholesterol in men and women from the Master4
dataset.?”
To answer the question we’ll need the following:
- create an object to contain the results and redirect to it
- Start with
Master4
(inject data in the pipeline) - Select relevant columns
- Remove
NA
rows - Group by gender
- Summarize.
We now know how to do this in a data stream with a pipeline. Let’s write it and add comment lines so we can remember later the purpose of the code:
TcholGender <- Master4 %>%
# select columns
select(SEQN, RIAGENDR, LBXTC ) %>%
# fitler all rows to remove NAs
drop_na() %>%
# Group by gender
group_by(RIAGENDR) %>%
summarise(
Men = sum(RIAGENDR == 1),
Women = sum(RIAGENDR == 2),
MeanTChol = mean(LBXTC))
# Print results:
TcholGender
# A tibble: 2 × 4
RIAGENDR Men Women MeanTChol
<dbl> <int> <int> <dbl>
1 1 3545 0 178.
2 2 0 3711 183.
The final table will be short but that is exactly what a summary should be.
The value for men is 177.75 based on a total of 3545 observations.
For women the we see is 182.65 based on a total of 3711 observations.
(See section 13.2.4 later to learn how these numbers were embedded within the text automatically without copy/paste!)
10.7.2 Example 2: by gender and age
Let’s make things a bit more interesting with the question:
“Base on gender and age group, what is the mean and standard deviation of PFAS compounds, as well as the mean, standard deviation, minimum, and maximum values of total cholesterol in men and women?”
One of the key word is “age group” as in the NHANES data age is a single integer number with a range from 1
to 80
in RIDAGEYR
column and therefore it is “up to us” to create the age groups! This can be done rather easily within a combination of mutate()
and with nested ifelse()
statements within. Nesting the ifelse()
function within itself allows making multiple choices.
(For a refresher on ifelse()
see section 10.4.1.)
We’ll have to use the appropriate columns of data, compute the creatinine adjustment and summarize these values by age group for men and women.
A new base R
function is introduced here formatC()
which will force the numbers to be printed in scientific exponent for better clarity. (Try without and see the difference! - this option was suggested on Stack Overflow.)
The pipeline below is annotated to specify the function of each line and the results are saved within a user-defined R
object Example2
.
Example2 <- Master4 %>%
# select columns
select(SEQN, RIAGENDR, RIDAGEYR, LBXMFOS, URXUCR, LBXTC ) %>%
# fitler all rows to remove NAs
drop_na() %>%
# Creatinine adjustment
mutate (RATIO = (LBXMFOS/URXUCR)*10^-4) %>%
# categorize ages in 5 groups:
# Children: G0TO18, younger adults: G19TO35,
# and older adults: G36TO65, seniors: G66TO79,
# and 80 and older: G80.
mutate(AGEGROUP = ifelse(RIDAGEYR %in% 0:18, "G0TO18",
ifelse(RIDAGEYR %in% 19:35, "G19TO35",
ifelse(RIDAGEYR %in% 36:65, "G36TO65",
ifelse(RIDAGEYR %in% 66:79,
"G66TO79", "G80"))))) %>%
#
# NOTE: this would be a place to split the pipeline in 2 sections
#
# Group by gender first, age second
group_by(RIAGENDR, AGEGROUP) %>%
# Summarize: count totals,
summarise(
Men = sum(RIAGENDR == 1),
Women = sum(RIAGENDR == 2),
MeanPFASR = formatC(mean(RATIO), format = "e", digits = 2),
sdPFASR = formatC(sd(RATIO), format = "e", digits = 2),
MeanTChol = mean(LBXTC),
sdTChol = sd(LBXTC),
minTChol = min(LBXTC),
maxTChol = max(LBXTC)
)
`summarise()` has grouped output by 'RIAGENDR'. You can override using the
`.groups` argument.
# A tibble: 10 × 10
# Groups: RIAGENDR [2]
RIAGENDR AGEGROUP Men Women MeanPFASR sdPFASR MeanTChol sdTChol minTChol
<dbl> <chr> <int> <int> <chr> <chr> <dbl> <dbl> <dbl>
1 1 G0TO18 175 0 8.47e-07 8.68e-07 152. 28.6 94
2 1 G19TO35 235 0 1.68e-06 1.88e-06 180. 35.6 98
3 1 G36TO65 382 0 3.24e-06 5.99e-06 198. 46.7 111
4 1 G66TO79 117 0 3.95e-06 3.92e-06 169. 35.1 90
5 1 G80 44 0 5.12e-06 6.42e-06 165. 35.0 101
6 2 G0TO18 0 141 7.70e-07 7.88e-07 158. 29.5 103
7 2 G19TO35 0 238 9.68e-07 1.31e-06 176. 38.5 100
8 2 G36TO65 0 459 2.35e-06 3.80e-06 201. 36.3 106
9 2 G66TO79 0 124 4.58e-06 6.32e-06 202. 48.5 84
10 2 G80 0 53 5.86e-06 6.29e-06 187. 34.1 126
# ℹ 1 more variable: maxTChol <dbl>
The print-out might be truncated (depending on the format of this document.)
As expected the output is a tibble
.
A subtle but important note is in the second line of the output:
# Groups: RIAGENDR [2]
.
It may be important at a later stage to use the dplyr
function ungroup()
to modify e.g. the column name or the values within that column.
The highest cholesterol value is 545 and is for a person of gender code 1 in age group G36TO65.
(See section 13.2.4 later to learn how these numbers were embedded within the text automatically without copy/paste!)
10.7.2.1 Sorting / arranging
The Example2
is a summary table (in tibble / data frame format) and can be further sorted with the arrange()
function. For example to see a table with the cholesterol levels:
# A tibble: 2 × 3
# Groups: RIAGENDR [2]
RIAGENDR AGEGROUP maxTChol
<dbl> <chr> <dbl>
1 1 G36TO65 545
2 2 G66TO79 358
10.7.3 Base R
Bar plot
This summary table can be used with the base R
function barplot()
to create a representation of the mean cholesterol data. The default plot yields just gray bars and not horizontal label. Here are examples plotted together on a single graph that show the effect of some options.
As a reminder:
las=2
is rotating the labels (seen in section 6.3.)
col = RIAGENDR + 1
selects two of the nine colors in the baseR
color palette (see colors in section 5.5.) SinceRIAGENDR
values are1
and2
the resulting colors would be black and red which is not very appealing. Adding+ 1
will choose red and green. Other numbers will select the next colors in the list.names.arg=
specifies the source of the horizontal axis names displayed.
These commands use the with()
function but could also be written with the $
method, for example with Example2$AGEGROUP
.
par(mfrow = c(1,2))
# Plain version
with(Example2,barplot(MeanTChol))
# Add white/gray alternate colors, rotated horiz labels
with(Example2,barplot(MeanTChol,
col = RIAGENDR + 1,
names.arg=AGEGROUP, las =2 ))
Error bars? There are methods in base R
to add error bars and various examples can be found online. However, the methods are all quirky and the best is now to use ggplot
to create such graphs.
10.7.4 ggplot2 versions
Example plots have been moved to section 11.2.