--- title: "tabyls: a tidy, fully-featured approach to counting things" date: '`r Sys.Date()`' output: rmarkdown::github_document vignette: > %\VignetteIndexEntry{tabyls} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r chunk_options, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` ## Motivation: why tabyl? Analysts do a lot of counting. Indeed, it's been said that "data science is mostly counting things." But the base R function for counting, `table()`, leaves much to be desired: - It doesn't accept data.frame inputs (and thus doesn't play nicely with the `%>%` pipe) - It doesn't output data.frames - Its results are hard to format. Compare the look and formatting choices of an R table to a Microsoft Excel PivotTable or even the table formatting provided by SPSS. `tabyl()` is an approach to tabulating variables that addresses these shortcomings. It's part of the janitor package because counting is such a fundamental part of data cleaning and exploration. `tabyl()` is tidyverse-aligned and is primarily built upon the dplyr and tidyr packages. ## How it works On its surface, `tabyl()` produces frequency tables using 1, 2, or 3 variables. Under the hood, `tabyl()` also attaches a copy of these counts as an attribute of the resulting data.frame. The result looks like a basic data.frame of counts, but because it's also a `tabyl` containing this metadata, you can use `adorn_` functions to add additional information and pretty formatting. The `adorn_` functions are built to work on `tabyls`, but have been adapted to work with similar, non-tabyl data.frames that need formatting. # Examples This vignette demonstrates `tabyl` in the context of studying humans in the `starwars` dataset from dplyr: ```{r clean_starwars, warning = FALSE, message = FALSE} library(dplyr) humans <- starwars %>% filter(species == "Human") ``` ## One-way tabyl Tabulating a single variable is the simplest kind of tabyl: ```{r one_way, message=FALSE} library(janitor) t1 <- humans %>% tabyl(eye_color) t1 ``` When `NA` values are present, `tabyl()` also displays "valid" percentages, i.e., with missing values removed from the denominator. And while `tabyl()` is built to take a data.frame and column names, you can also produce a one-way tabyl by calling it directly on a vector: ```{r one_way_vector} x <- c("big", "big", "small", "small", "small", NA) tabyl(x) ``` Most `adorn_` helper functions are built for 2-way tabyls, but those that make sense for a 1-way tabyl do work: ```{r one_way_adorns} t1 %>% adorn_totals("row") %>% adorn_pct_formatting() ``` ## Two-way tabyl This is often called a "crosstab" or "contingency" table. Calling `tabyl` on two columns of a data.frame produces the same result as the common combination of `dplyr::count()`, followed by `tidyr::pivot_wider()` to wide form: ```{r two_way} t2 <- humans %>% tabyl(gender, eye_color) t2 ``` Since it's a `tabyl`, we can enhance it with `adorn_` helper functions. For instance: ```{r two_way_adorns} t2 %>% adorn_percentages("row") %>% adorn_pct_formatting(digits = 2) %>% adorn_ns() ``` Adornments have options to control axes, rounding, and other relevant formatting choices (more on that below). ## Three-way tabyl Just as `table()` accepts three variables, so does `tabyl()`, producing a list of tabyls: ```{r three_Way} t3 <- humans %>% tabyl(eye_color, skin_color, gender) # the result is a tabyl of eye color x skin color, split into a list by gender t3 ``` If the `adorn_` helper functions are called on a list of data.frames - like the output of a three-way `tabyl` call - they will call `purrr::map()` to apply themselves to each data.frame in the list: ```{r three_way_adorns, warning = FALSE, message = FALSE} library(purrr) humans %>% tabyl(eye_color, skin_color, gender, show_missing_levels = FALSE) %>% adorn_totals("row") %>% adorn_percentages("all") %>% adorn_pct_formatting(digits = 1) %>% adorn_ns() %>% adorn_title() ``` This automatic mapping supports interactive data analysis that switches between combinations of 2 and 3 variables. That way, if a user starts with `humans %>% tabyl(eye_color, skin_color)`, adds some `adorn_` calls, then decides to split the tabulation by gender and modifies their first line to `humans %>% tabyl(eye_color, skin_color, gender`), they don't have to rewrite the subsequent adornment calls to use `map()`. However, if feels more natural to call these with `map()` or `lapply()`, that is still supported. For instance, `t3 %>% lapply(adorn_percentages)` would produce the same result as `t3 %>% adorn_percentages`. ### Other features of tabyls + When called on a factor, `tabyl` will show missing levels (levels not present in the data) in the result + This can be suppressed if not desired + `NA` values can be displayed or suppressed + `tabyls` print without displaying row numbers You can call `chisq.test()` and `fisher.test()` on a two-way tabyl to perform those statistical tests, just like on a base R `table()` object. ## The `adorn_*` functions These modular functions build on a `tabyl` to approximate the functionality of a PivotTable in Microsoft Excel. They print elegant results for interactive analysis or for sharing in a report, e.g., with `knitr::kable()`. For example: ```{r} humans %>% tabyl(gender, eye_color) %>% adorn_totals(c("row", "col")) %>% adorn_percentages("row") %>% adorn_pct_formatting(rounding = "half up", digits = 0) %>% adorn_ns() %>% adorn_title("combined") %>% knitr::kable() ``` ### The adorn functions are: + **`adorn_totals()`**: Add totals row, column, or both. + **`adorn_percentages()`**: Calculate percentages along either axis or over the entire tabyl + **`adorn_pct_formatting()`**: Format percentage columns, controlling the number of digits to display and whether to append the `%` symbol + **`adorn_rounding()`**: Round a data.frame of numbers (usually the result of `adorn_percentages`), either using the base R `round()` function or using janitor's `round_half_up()` to round all ties up ([thanks, StackOverflow](https://stackoverflow.com/a/12688836/4470365)). + e.g., round 10.5 up to 11, consistent with Excel's tie-breaking behavior. + This contrasts with rounding 10.5 down to 10 as in base R's `round(10.5)`. + `adorn_rounding()` returns columns of class `numeric`, allowing for graphing, sorting, etc. It's a less-aggressive substitute for `adorn_pct_formatting()`; these two functions should not be called together. + **`adorn_ns()`**: add Ns to a tabyl. These can be drawn from the tabyl's underlying counts, which are attached to the tabyl as metadata, or they can be supplied by the user. + **`adorn_title()`**: add a title to a tabyl (or other data.frame). Options include putting the column title in a new row on top of the data.frame or combining the row and column titles in the data.frame's first name slot. These adornments should be called in a logical order, e.g., you probably want to add totals before percentages are calculated. In general, call them in the order they appear above. ## BYOt (Bring Your Own tabyl) You can also call `adorn_` functions on other data.frames, not only the results of calls to `tabyl()`. E.g., `mtcars %>% adorn_totals("col") %>% adorn_percentages("col")` performs as expected, despite `mtcars` not being a `tabyl`. This can be handy when you have a data.frame that is not a simple tabulation generated by `tabyl` but would still benefit from the `adorn_` formatting functions. A simple example: calculate the proportion of records meeting a certain condition, then format the results. ```{r first_non_tabyl} percent_above_165_cm <- humans %>% group_by(gender) %>% summarise(pct_above_165_cm = mean(height > 165, na.rm = TRUE), .groups = "drop") percent_above_165_cm %>% adorn_pct_formatting() ``` You can control which columns are adorned by using the `...` argument. It accepts the [tidyselect helpers](https://r4ds.had.co.nz/transform.html#select). That is, you can specify columns the same way you would using `dplyr::select()`. For instance, say you have a numeric column that should not be included in percentage formatting and you wish to exempt it. Here, only the `proportion` column is adorned: ```{r tidyselect, warning = FALSE, message = FALSE} mtcars %>% count(cyl, gear) %>% rename(proportion = n) %>% adorn_percentages("col", na.rm = TRUE, proportion) %>% adorn_pct_formatting(, , , proportion) # the commas say to use the default values of the other arguments ``` Here we specify that only two consecutive numeric columns should be totaled (`year` is numeric but should not be included): ```{r dont_total, warning = FALSE, message = FALSE} cases <- data.frame( region = c("East", "West"), year = 2015, recovered = c(125, 87), died = c(13, 12) ) cases %>% adorn_totals(c("col", "row"), fill = "-", na.rm = TRUE, name = "Total Cases", recovered:died) ``` Here's a more complex example that uses a data.frame of means, not counts. We create a table containing the mean of a 3rd variable when grouped by two other variables, then use `adorn_` functions to round the values and append Ns. The first part is pretty straightforward: ```{r more_non_tabyls, warning = FALSE, message = FALSE} library(tidyr) # for pivot_wider() mpg_by_cyl_and_am <- mtcars %>% group_by(cyl, am) %>% summarise(mpg = mean(mpg), .groups = "drop") %>% pivot_wider(names_from = am, values_from = mpg) mpg_by_cyl_and_am ``` Now to `adorn_` it. Since this is not the result of a `tabyl()` call, it doesn't have the underlying Ns stored in the `core` attribute, so we'll have to supply them: ```{r add_the_Ns} mpg_by_cyl_and_am %>% adorn_rounding() %>% adorn_ns( ns = mtcars %>% # calculate the Ns on the fly by calling tabyl on the original data tabyl(cyl, am) ) %>% adorn_title("combined", row_name = "Cylinders", col_name = "Is Automatic") ``` If needed, Ns can be manipulated in their own data.frame before they are appended. Here a tabyl with values in the thousands has its Ns formatted to include the separating character `,` as typically seen in American numbers, e.g., `3,000`. First we create the tabyl to adorn: ```{r formatted_Ns_thousands_prep} set.seed(1) raw_data <- data.frame( sex = rep(c("m", "f"), 3000), age = round(runif(3000, 1, 102), 0) ) raw_data$agegroup <- cut(raw_data$age, quantile(raw_data$age, c(0, 1 / 3, 2 / 3, 1))) comparison <- raw_data %>% tabyl(agegroup, sex, show_missing_levels = FALSE) %>% adorn_totals(c("row", "col")) %>% adorn_percentages("col") %>% adorn_pct_formatting(digits = 1) comparison ``` At this point, the Ns are unformatted: ```{r adorn_ns_unformatted} comparison %>% adorn_ns() ``` Now we format them to insert the thousands commas. A tabyl's raw Ns are stored in its `"core"` attribute. Here we retrieve those with `attr()`, then apply the base R function `format()` to all numeric columns. Lastly, we append these Ns using `adorn_ns()`. ```{r formatted_Ns_thousands} formatted_ns <- attr(comparison, "core") %>% # extract the tabyl's underlying Ns adorn_totals(c("row", "col")) %>% # to match the data.frame we're appending to dplyr::mutate(across(where(is.numeric), ~ format(.x, big.mark = ","))) comparison %>% adorn_ns(position = "rear", ns = formatted_ns) ``` ### Questions? Comments? File [an issue on GitHub](https://github.com/sfirke/janitor/issues) if you have suggestions related to `tabyl()` and its `adorn_` helpers or encounter problems while using them.