Too many categories
By using the same fun analysis as in my previous article, I would like too highlight a problem that often occurs, dealing with too much categories (or factors as they are called in R) avoid to see clearly the big pictures.
To stick to my previous example, I want this time to check the Body Mass Index (BMI) by species. But in the
starwars dataset there is a bunch of species, and it’s not easy to visualize the result – I have removed Jabba the Hutt since he crushes all the other categories that cannot be viewed at all, see my previous article for the full story.
The solution is to lump together the less representative categories (in this case the species) in a big category often called Other. R , in the package forcats, provides a very convenient functions to perform this operation in different flavors
fct_lump. I will use the
lumps all levels except for the
nmost frequent (or least frequent if
n < 0)
So this means if I use
mutate(species = forcats::fct_lump_n(species, n = 5)) I will end with the five most frequent species (species having the most individuals) plus the other species lumped together in the Other category.
Let’s check the result.
So it makes sense
- the human specie fall in the standard BMI range for humans,
- droids are heavier, it seems logical since they are made of metal as I far as I know.