Grouping takes existing data and groups it by specific variables for future operations. Many operations are performed on groups.
Example: Grouping by age and sex (male/female) can be useful in a dataset if we are interested in comparing the scores of women of a certain age with those of men of a certain age (or in comparing ages between men and women).
Let's create a sample dataset to reflect this example (to avoid input errors, copy and paste this into your script):
library(tidyverse)  # provides tibble(), %>% and the dplyr verbs used below

## Creating an identification number to represent 50 individual people
ID <- c(1:50)

## Creating the sex variable (25 males/25 females)
Sex <- rep(c("male", "female"), 25)  # rep stands for replicate

## Creating the age variable (20-39 years old)
Age <- c(26, 25, 39, 37, 31, 34, 34, 30, 26, 33,
         39, 28, 26, 29, 33, 22, 35, 23, 26, 36,
         21, 20, 31, 21, 35, 39, 36, 22, 22, 25,
         27, 30, 26, 34, 38, 39, 30, 29, 26, 25,
         26, 36, 23, 21, 21, 39, 26, 26, 27, 21)

## Creating a dependent variable named Score
Score <- c(0.010, 0.418, 0.014, 0.090, 0.061, 0.328, 0.656, 0.002, 0.639, 0.173,
           0.076, 0.152, 0.467, 0.186, 0.520, 0.493, 0.388, 0.501, 0.800, 0.482,
           0.384, 0.046, 0.920, 0.865, 0.625, 0.035, 0.501, 0.851, 0.285, 0.752,
           0.686, 0.339, 0.710, 0.665, 0.214, 0.560, 0.287, 0.665, 0.630, 0.567,
           0.812, 0.637, 0.772, 0.905, 0.405, 0.363, 0.773, 0.410, 0.535, 0.449)

## Creating a unified dataset that brings together all variables
data <- tibble(ID, Sex, Age, Score)
6.3.1 summarise() and group_by()
Let's say I want to calculate/compare the mean Score (and other measures) for males and females separately:
data %>%
  group_by(Sex) %>%
  summarise(m = mean(Score),  # calculate the mean
            s = sd(Score),    # calculate the standard deviation
            n = n()) %>%      # calculate the total number of observations
  ungroup()
## # A tibble: 2 x 4
##   Sex        m     s     n
##   <chr>  <dbl> <dbl> <int>
## 1 female 0.437 0.268    25
## 2 male   0.487 0.268    25
In the above code, we group by Sex, which means that the calculations made on our data treat men and women separately. After running the code, the console displays the mean Score, the standard deviation (sd) and the total number of participants (n()) for women and for men (group_by(Sex)). That is, the mean Score is 0.437 for women and 0.487 for men.
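To make the idea of "treating the groups separately" concrete, here is a minimal sketch on a small, hypothetical toy dataset (toy is my own illustration, not the chapter's data). It checks that the grouped mean for females is the same number you would get by filtering the female rows by hand.

```r
library(dplyr)

# Hypothetical toy data, not the dataset from this chapter
toy <- tibble(
  Sex   = c("male", "female", "male", "female"),
  Score = c(0.2, 0.4, 0.6, 0.8)
)

# Grouped summary: one row per level of Sex
by_sex <- toy %>%
  group_by(Sex) %>%
  summarise(m = mean(Score), n = n()) %>%
  ungroup()

# The same mean for females, computed by hand on the filtered rows
female_mean <- mean(toy$Score[toy$Sex == "female"])
```

If the two approaches agree, by_sex$m for the female row equals female_mean. The grouped version simply does this for every level of Sex in one step.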
Let's group by Sex and Age next (the order in which the variables appear within group_by() does not change the calculated values):
data %>%
  group_by(Sex, Age) %>%  # grouped by sex and age
  summarise(m = mean(Score),
            s = sd(Score),
            n = n()) %>%
  ungroup()
## # A tibble: 27 x 5
##    Sex      Age     m     s     n
##    <chr>  <dbl> <dbl> <dbl> <int>
##  1 female    20 0.046 NaN       1
##  2 female    21 0.740   0.253   3
##  3 female    22 0.672   0.253   2
##  4 female    23 0.501 NaN       1
##  5 female    25 0.579   0.167   3
##  6 female    26 0.41  NaN       1
##  7 female    28 0.152 NaN       1
##  8 female    29 0.426   0.339   2
##  9 female    30 0.170   0.238   2
## 10 female    33 0.173 NaN       1
## # ... with 17 more rows
There are now considerably more rows (27) in this output. When performing calculations, R now considers every combination of Age and Sex. For example, the 25-year-old women scored 0.579 on average.
We also see that some standard deviation values are missing (NaN). This is because calculating the standard deviation requires more than one observation.
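The missing standard deviations can be reproduced on a tiny example. This is a sketch on hypothetical toy data: the sample standard deviation divides by n - 1, so a group with a single observation has nothing to divide by. Base R's sd() returns NA in that case, and is.na() is TRUE for both NA and NaN, so the check below works whichever of the two your setup prints.

```r
library(dplyr)

# Hypothetical toy data: one 20-year-old, two 21-year-olds
toy <- tibble(
  Age   = c(20, 21, 21),
  Score = c(0.1, 0.3, 0.5)
)

by_age <- toy %>%
  group_by(Age) %>%
  summarise(s = sd(Score), n = n()) %>%
  ungroup()

# Flag the groups whose standard deviation is undefined
by_age$missing_sd <- is.na(by_age$s)
```

Comparing the missing_sd and n columns shows which groups lack a standard deviation and how many observations each group has.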
6.3.2 mutate() and group_by()
We could also use mutate() after group_by() to add a new column based on the group.
data %>%
  group_by(Sex) %>%
  mutate(m = mean(Score)) %>%  # calculate the mean score by sex
  ungroup()
## # A tibble: 50 x 5
##       ID Sex      Age Score     m
##    <int> <chr>  <dbl> <dbl> <dbl>
##  1     1 male      26 0.01  0.487
##  2     2 female    25 0.418 0.437
##  3     3 male      39 0.014 0.487
##  4     4 female    37 0.09  0.437
##  5     5 male      31 0.061 0.487
##  6     6 female    34 0.328 0.437
##  7     7 male      34 0.656 0.487
##  8     8 female    30 0.002 0.437
##  9     9 male      26 0.639 0.487
## 10    10 female    33 0.173 0.437
## # ... with 40 more rows
Instead of collapsing all the rows into a summary value, mutate() adds a new column (m) containing the mean score for males and for females. The mean score in column m corresponds to the value in the Sex column (0.487 for men and 0.437 for women).
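The difference between collapsing and adding a column is easiest to see by counting rows. The sketch below uses a hypothetical four-row toy dataset and runs the same grouped mean through summarise() and through mutate().

```r
library(dplyr)

# Hypothetical toy data, not the dataset from this chapter
toy <- tibble(
  Sex   = c("male", "female", "male", "female"),
  Score = c(0.2, 0.4, 0.6, 0.8)
)

# summarise() collapses each group to a single row
collapsed <- toy %>%
  group_by(Sex) %>%
  summarise(m = mean(Score)) %>%
  ungroup()

# mutate() keeps every row and repeats the group mean on each one
expanded <- toy %>%
  group_by(Sex) %>%
  mutate(m = mean(Score)) %>%
  ungroup()
```

Here collapsed has one row per level of Sex, while expanded keeps all four original rows, each carrying the mean of its own group.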
6.3.3 ungroup()
Notice that ungroup() is always used after group_by(), once the calculations are done. If you forget to ungroup() your data, future data management is likely to produce errors. Always ungroup() when you finish your calculations.
Let's see an example of when it's important to ungroup:
## Example 1
data %>%
  group_by(Sex) %>%
  mutate(m = mean(Age)) %>%    # calculate the mean age of men and women
  mutate(x = mean(Score)) %>%  # calculate the mean score of men and women
  ungroup()                    # final ungroup()
Compare this with code that includes ungroup() nested between the two mutate() functions:
## Example 2
data %>%
  group_by(Sex) %>%
  mutate(m = mean(Age)) %>%  # calculate the mean age of men and women
  ungroup() %>%              # nested ungroup()
  mutate(x = mean(Score))    # calculate the mean score of all participants
In the first example, m, the mean Age, is 29.2 if the participant is a man or 28.96 if she is a woman. x, the mean Score, is 0.487 for men and 0.437 for women. For both calculations, the data is grouped by Sex.
In the second example, m is still the mean Age for males separately from females, as in the first example. However, x equals a single Score of 0.462 for every row/observation. This is because group_by(Sex) is removed via ungroup() after the first mutate() function. Here, x is the mean Score for all participants together.
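The effect of where ungroup() sits can be verified on a small scale. This is a sketch with hypothetical toy data. The only difference between the two pipelines is whether ungroup() comes before or after the calculation of x.

```r
library(dplyr)

# Hypothetical toy data, not the dataset from this chapter
toy <- tibble(
  Sex   = c("male", "female", "male", "female"),
  Score = c(0.2, 0.4, 0.6, 0.8)
)

# ungroup() last: x is computed within each level of Sex
grouped_x <- toy %>%
  group_by(Sex) %>%
  mutate(x = mean(Score)) %>%
  ungroup()

# ungroup() first: x is computed over all rows together
overall_x <- toy %>%
  group_by(Sex) %>%
  ungroup() %>%
  mutate(x = mean(Score))
```

In grouped_x the x column takes one value per sex, whereas in overall_x every row carries the same overall mean.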
Neither method is right or wrong; it depends on what you are trying to achieve. When deciding where to place the ungroup() function, ask yourself: does it make sense to compute different values for each level of this Variable? If so, group_by(Variable) must be written before the calculation function (mutate/summarise).
If you use group_by(), you must have a matching ungroup() somewhere. Even if you don't plan on doing additional calculations, it's a good habit to keep. Making sure that you ungroup() is especially important when creating objects!
## Creating/saving the object named "data1"
data1 <- data %>%
  group_by(Sex) %>%
  mutate(m = mean(Age))

## Using the data1 object after previously saving it (WITHOUT ungrouping)
data1 %>%
  mutate(x = mean(Score))
Every time you use data1, the saved object automatically has group_by(Sex) as part of its definition, and later calculations will take this grouping variable into account.
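If you are unsure whether a saved object still carries a grouping, you can ask dplyr directly. The sketch below uses hypothetical toy data and two dplyr helpers, group_vars() and is_grouped_df(), to inspect an object saved with and without a final ungroup(). The object names are my own.

```r
library(dplyr)

# Hypothetical toy data, not the dataset from this chapter
toy <- tibble(
  Sex   = c("male", "female", "male", "female"),
  Score = c(0.2, 0.4, 0.6, 0.8)
)

# Saved WITHOUT ungrouping: the grouping travels with the object
saved_grouped <- toy %>%
  group_by(Sex) %>%
  mutate(m = mean(Score))

# Saved WITH a final ungroup(): a plain tibble again
saved_clean <- saved_grouped %>%
  ungroup()

vars_grouped <- group_vars(saved_grouped)  # names of the grouping variables
vars_clean   <- group_vars(saved_clean)    # empty when nothing is grouped
```

Printing a grouped tibble also shows a "Groups:" line in its header, which is a quick visual check that does not need any extra code.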
Forgetting to ungroup() can get even more complex!
## Creating/saving the object named "data1"
data1 <- data %>%
  group_by(Sex) %>%
  mutate(m = mean(Age))

## Using the data1 object after previously saving it (WITHOUT ungrouping)
data1 %>%
  group_by(Age) %>%
  mutate(x = mean(Score)) %>%
  ungroup()
It is tempting to read the second piece of code as grouped by Sex and Age, so that x is the mean Score for each combination of Sex and Age. By default, though, a new group_by() replaces the existing grouping rather than adding to it, so here x is the mean Score for each Age only. Whichever result you wanted, it's better to state the grouping explicitly where it is used. As your scripts get larger, it can be difficult to remember details about object definitions, especially whether an object carries a grouping variable. Keep your objects as simple as possible!
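One way to see which grouping is actually in effect after a second group_by() is to inspect it with group_vars(). The sketch below uses hypothetical toy data. To my knowledge, group_by() replaces any existing groups by default, and the .add = TRUE argument (available in dplyr 1.0.0 and later) is what adds to them instead, so treat the version detail as an assumption to check against your installation.

```r
library(dplyr)

# Hypothetical toy data, not the dataset from this chapter
toy <- tibble(
  Sex   = c("male", "female", "male", "female"),
  Age   = c(20, 20, 21, 21),
  Score = c(0.2, 0.4, 0.6, 0.8)
)

# An object that was saved while still grouped by Sex
grouped <- toy %>% group_by(Sex)

# Default behaviour: the new grouping replaces the old one
replaced <- grouped %>% group_by(Age)

# Explicitly adding to the existing grouping
added <- grouped %>% group_by(Age, .add = TRUE)

vars_replaced <- group_vars(replaced)
vars_added    <- group_vars(added)
```

Comparing vars_replaced with vars_added shows why it is safer to ungroup saved objects and then name every grouping variable in the group_by() call where it is needed.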
This is the proper method for saving and using the data1 object:
data1 <- data %>%
  group_by(Sex) %>%
  mutate(m = mean(Age)) %>%
  ungroup()  # Ungroup at the end of a definition!!!

data1 %>%
  group_by(Sex, Age) %>%  # group by the relevant variables here
  mutate(x = mean(Score)) %>%
  ungroup()
To conclude, ALWAYS UNGROUP AFTER GROUPING.