6.3 group_by() and ungroup() | R for graduate students (2023)

Take existing data and group specific variables together for future operations. Many operations are performed on groups.

Example: Grouping by age and sex (male/female) can be useful in a dataset if we are interested in how females of a certain age scored compared to males of a certain age (or in comparing ages within males or within females).

Let's create a sample dataset to reflect this example (to avoid input errors, copy and paste this into your script):

library(tidyverse) # provides tibble() and the dplyr functions used below

## Create an identification number to represent 50 individual people
ID <- c(1:50)

## Create a sex variable (25 males / 25 females)
Sex <- rep(c("male", "female"), 25) # rep stands for replicate

## Create an age variable (20-39 years old)
Age <- c(26, 25, 39, 37, 31, 34, 34, 30, 26, 33,
         39, 28, 26, 29, 33, 22, 35, 23, 26, 36,
         21, 20, 31, 21, 35, 39, 36, 22, 22, 25,
         27, 30, 26, 34, 38, 39, 30, 29, 26, 25,
         26, 36, 23, 21, 21, 39, 26, 26, 27, 21)

## Create a dependent variable called Score
Score <- c(0.010, 0.418, 0.014, 0.090, 0.061, 0.328, 0.656, 0.002, 0.639, 0.173,
           0.076, 0.152, 0.467, 0.186, 0.520, 0.493, 0.388, 0.501, 0.800, 0.482,
           0.384, 0.046, 0.920, 0.865, 0.625, 0.035, 0.501, 0.851, 0.285, 0.752,
           0.686, 0.339, 0.710, 0.665, 0.214, 0.560, 0.287, 0.665, 0.630, 0.567,
           0.812, 0.637, 0.772, 0.905, 0.405, 0.363, 0.773, 0.410, 0.535, 0.449)

## Create a unified dataset that puts together all variables
data <- tibble(ID, Sex, Age, Score)

6.3.1 summarise() and group_by()

Let's say I want to calculate/compare the mean Score (and other measures) for males and females separately:

data %>%
  group_by(Sex) %>%
  summarise(m = mean(Score), # calculates the mean
            s = sd(Score),   # calculates the standard deviation
            n = n()) %>%     # calculates the total number of observations
  ungroup()

## # A tibble: 2 x 4
##   Sex        m     s     n
##   <chr>  <dbl> <dbl> <int>
## 1 female 0.437 0.268    25
## 2 male   0.487 0.268    25

In the code above, we group by Sex, which means that calculations on our data are performed for males and females separately. After running the code, the console displays the mean Score (column m), the standard deviation (sd(), column s), and the total number of participants (n(), column n) for females and for males (group_by(Sex)). That is, the mean Score for females is 0.437 and the mean Score for males is 0.487.
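To see what group_by() contributes, it can help to run the same summarise() call with and without it. The following is a minimal sketch on a small made-up tibble (toy, overall, and by_sex are hypothetical names, not part of the chapter's dataset):

```r
library(dplyr)

# Hypothetical toy data (not the chapter's dataset): 3 males, 3 females
toy <- tibble(
  Sex   = c("male", "female", "male", "female", "male", "female"),
  Score = c(0.2, 0.4, 0.4, 0.6, 0.6, 0.8)
)

# Without group_by(): a single row summarising all six observations
overall <- toy %>%
  summarise(m = mean(Score), n = n())

# With group_by(Sex): one row per level of Sex
by_sex <- toy %>%
  group_by(Sex) %>%
  summarise(m = mean(Score), n = n()) %>%
  ungroup()

overall # 1 row: m = 0.5, n = 6
by_sex  # 2 rows: female (m = 0.6) and male (m = 0.4), n = 3 each
```

The summary functions are identical in both calls; only the grouping changes how many rows come back.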

Let's group by Sex and Age next (the order in which the variables appear within group_by() does not change the calculated values):

data %>%
  group_by(Sex, Age) %>% # grouped by sex and age
  summarise(m = mean(Score),
            s = sd(Score),
            n = n()) %>%
  ungroup()

## # A tibble: 27 x 5
##    Sex      Age     m       s     n
##    <chr>  <dbl> <dbl>   <dbl> <int>
##  1 female    20 0.046 NaN         1
##  2 female    21 0.740   0.253     3
##  3 female    22 0.672   0.253     2
##  4 female    23 0.501 NaN         1
##  5 female    25 0.579   0.167     3
##  6 female    26 0.41  NaN         1
##  7 female    28 0.152 NaN         1
##  8 female    29 0.426   0.339     2
##  9 female    30 0.170   0.238     2
## 10 female    33 0.173 NaN         1
## # ... with 17 more rows

There are now considerably more rows (27 rows) in this output. When performing calculations, R now considers every combination of Age and Sex. For example, 25-year-old females scored 0.579 on average.

We also see that some standard deviation values are missing (NaN). This is because calculating the standard deviation requires more than one observation per group. Depending on your versions of R and dplyr, these missing values may print as NA instead.
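One way to handle single-observation groups is to keep the n column and filter on it afterwards. This is only a sketch on made-up data (toy and by_group are hypothetical names), and whether dropping such groups is appropriate depends on your analysis:

```r
library(dplyr)

# Hypothetical toy data: two groups have a single observation
toy <- tibble(
  Sex   = c("female", "female", "female", "male"),
  Age   = c(21, 21, 20, 26),
  Score = c(0.9, 0.5, 0.1, 0.3)
)

by_group <- toy %>%
  group_by(Sex, Age) %>%
  summarise(m = mean(Score),
            s = sd(Score), # missing when the group has only one row
            n = n()) %>%
  ungroup()

# Keep only the groups where a standard deviation can be computed
by_group %>%
  filter(n > 1)
```

Here only the group of 21-year-old females has two observations, so it is the only row that survives the filter.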

6.3.2 mutate() and group_by()

We could also use mutate() after group_by() to add a new column based on the group.

data %>%
  group_by(Sex) %>%
  mutate(m = mean(Score)) %>% # calculates the mean score by Sex
  ungroup()

## # A tibble: 50 x 5
##       ID Sex      Age Score     m
##    <int> <chr>  <dbl> <dbl> <dbl>
##  1     1 male      26 0.01  0.487
##  2     2 female    25 0.418 0.437
##  3     3 male      39 0.014 0.487
##  4     4 female    37 0.09  0.437
##  5     5 male      31 0.061 0.487
##  6     6 female    34 0.328 0.437
##  7     7 male      34 0.656 0.487
##  8     8 female    30 0.002 0.437
##  9     9 male      26 0.639 0.487
## 10    10 female    33 0.173 0.437
## # ... with 40 more rows

Instead of collapsing all the rows into a summary value, mutate() adds a new column (m) containing the mean Score for males and for females. The mean score in column m corresponds to the value in the Sex column (0.487 for males and 0.437 for females).
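Because mutate() keeps every row, the group mean can be used directly in further row-level calculations. As an illustration (the toy data and the diff column are assumptions made for this sketch, not part of the chapter), each participant's distance from their own group's mean could be computed like this:

```r
library(dplyr)

# Hypothetical toy data (not the chapter's dataset)
toy <- tibble(
  ID    = 1:4,
  Sex   = c("male", "female", "male", "female"),
  Score = c(0.2, 0.5, 0.6, 0.7)
)

centered <- toy %>%
  group_by(Sex) %>%
  mutate(m    = mean(Score), # group mean, repeated on every row of the group
         diff = Score - m) %>% # distance of each row from its group mean
  ungroup()

centered # still 4 rows, with two new columns
```

The same pipeline with summarise() would return only two rows, so the per-participant comparison would no longer be possible.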


Notice that ungroup() is always used after group_by() once the calculations are done. If you forget to ungroup() your data, later data management steps are likely to produce errors or unexpected results. Always ungroup() when you have finished your calculations.

Let's see an example of when it's important to ungroup:

## Example 1
data %>%
  group_by(Sex) %>%
  mutate(m = mean(Age)) %>%   # calculates the mean age of males and females
  mutate(x = mean(Score)) %>% # calculates the mean score of males and females
  ungroup()                   # ungroup() at the end

Compare this to code that places ungroup() between the two mutate() functions:

## Example 2
data %>%
  group_by(Sex) %>%
  mutate(m = mean(Age)) %>% # calculates the mean age of males and females
  ungroup() %>%             # ungroup() placed between the two mutate() calls
  mutate(x = mean(Score))   # calculates the mean score of all participants

In the first example, m, which averages Age, is 29.2 if the participant is male or 28.96 if the participant is female. x, which averages Score, is 0.487 for males and 0.437 for females. For both calculations, the data is grouped by Sex.

In the second example, m still averages Age for males separately from females, as in the first example. However, x equals a Score of 0.462 for every row/observation. This is because group_by(Sex) is removed via ungroup() after the first mutate() function. Here, x averages Score for all participants together.
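A quick way to confirm which of the two behaviours you got is to count the distinct values in the new column. The sketch below repeats both patterns on a small made-up tibble (toy, grouped_x, and overall_x are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data (not the chapter's dataset)
toy <- tibble(
  Sex   = c("male", "female", "male", "female"),
  Age   = c(30, 20, 40, 24),
  Score = c(0.2, 0.5, 0.6, 0.7)
)

# ungroup() at the end: both m and x are computed within Sex
grouped_x <- toy %>%
  group_by(Sex) %>%
  mutate(m = mean(Age)) %>%
  mutate(x = mean(Score)) %>%
  ungroup()

# ungroup() in the middle: m is computed within Sex, x over all rows
overall_x <- toy %>%
  group_by(Sex) %>%
  mutate(m = mean(Age)) %>%
  ungroup() %>%
  mutate(x = mean(Score))

n_distinct(grouped_x$x) # 2: one value per level of Sex
n_distinct(overall_x$x) # 1: a single value shared by every row
```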

Neither method is right or wrong; it depends on what you are trying to achieve. When deciding where to place the ungroup() function, ask yourself: does it make sense to compute a different value for each level of this Variable? If so, group_by(Variable) must be written before the calculation function (mutate/summarise).

If you use group_by(), you should have a matching ungroup() somewhere. Even if you don't plan on doing additional calculations, it's a good habit to keep. Making sure to ungroup() is especially important when creating objects!

## Creating/saving an object called "data1"
data1 <- data %>%
  group_by(Sex) %>%
  mutate(m = mean(Age))

## Using the data1 object after saving it previously (WITHOUT ungrouping)
data1 %>%
  mutate(x = mean(Score))

Every time you use data1, the saved object automatically carries group_by(Sex) as part of its definition, and further calculations will take that grouping variable into account.
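If you are unsure whether a saved object still carries a grouping, dplyr lets you inspect it. The following sketch uses made-up data and hypothetical object names (saved_grouped, saved_clean) to show the difference:

```r
library(dplyr)

# Hypothetical toy data (not the chapter's dataset)
toy <- tibble(
  Sex = c("male", "female", "male", "female"),
  Age = c(30, 20, 40, 24)
)

# Saved WITHOUT ungrouping: the grouping travels with the object
saved_grouped <- toy %>%
  group_by(Sex) %>%
  mutate(m = mean(Age))

# Saved WITH ungroup(): a plain tibble again
saved_clean <- saved_grouped %>%
  ungroup()

is_grouped_df(saved_grouped) # TRUE
group_vars(saved_grouped)    # "Sex"
is_grouped_df(saved_clean)   # FALSE
```

Printing a grouped tibble also shows a "Groups:" line in its header, which is another hint that an ungroup() is missing.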

Forgetting to ungroup() can get even more complicated!

## Creating/saving an object called "data1"
data1 <- data %>%
  group_by(Sex) %>%
  mutate(m = mean(Age))

## Using the data1 object after saving it previously (WITHOUT ungrouping)
data1 %>%
  group_by(Age) %>%
  mutate(x = mean(Score)) %>%
  ungroup()

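You might expect the second piece of code to be grouped by Sex and Age. By default, however, group_by() replaces any grouping already stored in the object, so here x averages Score for each Age only; the Sex grouping carried by data1 is kept only if you write group_by(Age, .add = TRUE). Either way, the result depends on a detail hidden in the object's definition, so it is better to state the grouping explicitly every time. As your scripts get longer, it can be difficult to remember how each object was defined, especially whether it carries a grouping variable. Keep your objects as simple as possible!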
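Since it is easy to lose track of which grouping is active, it can be worth checking it directly with group_vars(). Note that in current dplyr, calling group_by() on an already grouped object replaces the existing grouping unless .add = TRUE is given (this argument assumes dplyr 1.0.0 or later). A small sketch on made-up data, with hypothetical object names:

```r
library(dplyr)

# Hypothetical toy data (not the chapter's dataset)
toy <- tibble(
  Sex   = c("male", "female", "male", "female"),
  Age   = c(30, 20, 30, 20),
  Score = c(0.2, 0.5, 0.6, 0.7)
)

# Saved WITHOUT ungrouping, as in the example above
saved <- toy %>%
  group_by(Sex) %>%
  mutate(m = mean(Age))

# Default: the new grouping replaces the stored one
replaced <- saved %>% group_by(Age)
group_vars(replaced) # "Age"

# .add = TRUE: the new grouping is added to the stored one
added <- saved %>% group_by(Age, .add = TRUE)
group_vars(added) # "Sex" "Age"

# Explicit and independent of how `saved` was defined
explicit <- saved %>% ungroup() %>% group_by(Sex, Age)
group_vars(explicit) # "Sex" "Age"
```

The last form does not rely on remembering what grouping the saved object carries, which is the habit this section recommends.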

This is the proper method for saving and using the data1 object:

data1 <- data %>%
  group_by(Sex) %>%
  mutate(m = mean(Age)) %>%
  ungroup() # ungroup at the end of a definition!

data1 %>%
  group_by(Sex, Age) %>% # group by the relevant variables here
  mutate(x = mean(Score)) %>%
  ungroup()


Article information

Author: Carmelo Roob

Last Updated: 02/16/2023
