Title: | Simple Tools for Examining and Cleaning Dirty Data |
---|---|
Description: | The main janitor functions can: perfectly format data.frame column names; provide quick counts of variable combinations (i.e., frequency tables and crosstabs); and explore duplicate records. Other janitor functions nicely format the tabulation results. These tabulate-and-report functions approximate popular features of SPSS and Microsoft Excel. This package follows the principles of the "tidyverse" and works well with the pipe function %>%. janitor was built with beginning-to-intermediate R users in mind and is optimized for user-friendliness. |
Authors: | Sam Firke [aut, cre], Bill Denney [ctb], Chris Haid [ctb], Ryan Knight [ctb], Malte Grosser [ctb], Jonathan Zadra [ctb], Olivier Roy [ctb] |
Maintainer: | Sam Firke <[email protected]> |
License: | MIT + file LICENSE |
Version: | 2.2.0.9000 |
Built: | 2024-11-19 20:40:09 UTC |
Source: | https://github.com/sfirke/janitor |
This function adds back the underlying Ns to a tabyl whose percentages were calculated using adorn_percentages(), to display the Ns and percentages together. You can also call it on a non-tabyl data.frame to which you wish to append Ns.
adorn_ns(
  dat,
  position = "rear",
  ns = attr(dat, "core"),
  format_func = function(x) {
    format(x, big.mark = ",")
  },
  ...
)
dat |
A data.frame of class |
position |
Should the N go in the front, or in the rear, of the percentage? |
ns |
The Ns to append. The default is the "core" attribute of the input tabyl
|
format_func |
A formatting function to run on the Ns. Consider defining
with |
... |
Columns to adorn. This takes a tidyselect specification. By default,
all columns are adorned except for the first column and columns not of class
|
A data.frame with Ns appended.
mtcars %>%
  tabyl(am, cyl) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting() %>%
  adorn_ns(position = "front")

# Format the Ns with a custom format_func:
set.seed(1)
bigger_dat <- data.frame(
  sex = rep(c("m", "f"), 3000),
  age = round(runif(3000, 1, 102), 0)
)
bigger_dat$age_group <- cut(
  bigger_dat$age,
  quantile(bigger_dat$age, c(0, 1 / 3, 2 / 3, 1))
)

bigger_dat %>%
  tabyl(age_group, sex, show_missing_levels = FALSE) %>%
  adorn_totals(c("row", "col")) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns(format_func = function(x) format(x, big.mark = ".", decimal.mark = ","))

# Control the columns to be adorned with the ... variable selection argument.
# If using only the ... argument, you can use empty commas as shorthand
# to supply the default values to the preceding arguments
# (one empty slot per preceding argument after dat):
cases <- data.frame(
  region = c("East", "West"),
  year = 2015,
  recovered = c(125, 87),
  died = c(13, 12)
)

cases %>%
  adorn_percentages("col", , recovered:died) %>%
  adorn_pct_formatting(, , , recovered:died) %>%
  adorn_ns(, , , recovered:died)
Format a data.frame of decimals as percentages.
Numeric columns get multiplied by 100 and formatted as percentages according to user specifications. This function defaults to excluding the first column of the input data.frame, assuming that it contains a descriptive variable, but this can be overridden by specifying the columns to adorn in the ... argument. Non-numeric columns are always excluded.
The decimal separator character is the result of getOption("OutDec"), which is based on the user's locale. If the default behavior is undesirable, change this value ahead of calling the function, either by changing locale or with options(OutDec = ","). This aligns the decimal separator character with that used in base::print().
adorn_pct_formatting(
  dat,
  digits = 1,
  rounding = "half to even",
  affix_sign = TRUE,
  ...
)
dat |
a data.frame with decimal values, typically the result of a call
to |
digits |
how many digits should be displayed after the decimal point? |
rounding |
method to use for rounding - either "half to even", the base R default method, or "half up", where 14.5 rounds up to 15. |
affix_sign |
should the % sign be affixed to the end? |
... |
columns to adorn. This takes a tidyselect specification. By
default, all numeric columns (besides the initial column, if numeric) are
adorned, but this allows you to manually specify which columns should be
adorned, for use on a data.frame that does not result from a call to
|
A data.frame with formatted percentages.
mtcars %>%
  tabyl(am, cyl) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting()

# Control the columns to be adorned with the ... variable selection argument.
# If using only the ... argument, you can use empty commas as shorthand
# to supply the default values to the preceding arguments:
cases <- data.frame(
  region = c("East", "West"),
  year = 2015,
  recovered = c(125, 87),
  died = c(13, 12)
)

cases %>%
  adorn_percentages("col", , recovered:died) %>%
  adorn_pct_formatting(, , , recovered:died)
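The difference between the two rounding options can be seen with base R alone. The sketch below is illustrative only: half_up() is a hypothetical helper written for this example, not janitor's internal implementation.

```r
# Base R's round() follows the "half to even" rule.
values <- c(0.5, 1.5, 2.5, 14.5)
half_even_result <- round(values)
half_even_result

# Hypothetical helper (an assumption for illustration):
# round exact halves away from zero, i.e. "half up" for positive numbers.
half_up <- function(x, digits = 0) {
  scale <- 10^digits
  sign(x) * floor(abs(x) * scale + 0.5) / scale
}
half_up_result <- half_up(values)
half_up_result
```

Under "half to even", 14.5 becomes 14; under "half up", it becomes 15, which is the behavior many spreadsheet users expect.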
This function defaults to excluding the first column of the input data.frame, assuming that it contains a descriptive variable, but this can be overridden by specifying the columns to adorn in the ... argument.
adorn_percentages(dat, denominator = "row", na.rm = TRUE, ...)
dat |
A |
denominator |
The direction to use for calculating percentages. One of "row", "col", or "all". |
na.rm |
should missing values (including |
... |
columns to adorn. This takes a < |
A data.frame of percentages, expressed as numeric values between 0 and 1.
mtcars %>%
  tabyl(am, cyl) %>%
  adorn_percentages("col")

# calculates correctly even with totals column and/or row:
mtcars %>%
  tabyl(am, cyl) %>%
  adorn_totals("row") %>%
  adorn_percentages()

# Control the columns to be adorned with the ... variable selection argument.
# If using only the ... argument, you can use empty commas as shorthand
# to supply the default values to the preceding arguments:
cases <- data.frame(
  region = c("East", "West"),
  year = 2015,
  recovered = c(125, 87),
  died = c(13, 12)
)

cases %>%
  adorn_percentages(, , recovered:died)
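The three denominator choices correspond to the familiar margins of a contingency table. As a rough base R analogy (not janitor's own code), prop.table() produces the same kinds of proportions from a plain table object:

```r
# Counts of transmission type (am) by number of cylinders (cyl).
counts <- table(am = mtcars$am, cyl = mtcars$cyl)

row_pct <- prop.table(counts, margin = 1)  # analogous to denominator = "row"
col_pct <- prop.table(counts, margin = 2)  # analogous to denominator = "col"
all_pct <- prop.table(counts)              # analogous to denominator = "all"

row_pct
col_pct
all_pct
```

With "row", each row sums to 1; with "col", each column sums to 1; with "all", the whole table sums to 1.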
Can run on any data.frame with at least one numeric column. This function defaults to excluding the first column of the input data.frame, assuming that it contains a descriptive variable, but this can be overridden by specifying the columns to round in the ... argument.
If you're formatting percentages, e.g., the result of adorn_percentages(), use adorn_pct_formatting() instead. This is a more flexible variant for ad-hoc usage. Compared to adorn_pct_formatting(), it does not multiply by 100 or pad the numbers with spaces for alignment in the results data.frame. This function retains the class of numeric input columns.
adorn_rounding(dat, digits = 1, rounding = "half to even", ...)
dat |
A |
digits |
How many digits should be displayed after the decimal point? |
rounding |
Method to use for rounding - either "half to even" (the base R default method), or "half up", where 14.5 rounds up to 15. |
... |
Columns to adorn. This takes a tidyselect specification.
By default, all numeric columns (besides the initial column, if numeric)
are adorned, but this allows you to manually specify which columns should
be adorned, for use on a data.frame that does not result from a call to |
The data.frame with rounded numeric columns.
mtcars %>%
  tabyl(am, cyl) %>%
  adorn_percentages() %>%
  adorn_rounding(digits = 2, rounding = "half up")

# tolerates non-numeric columns:
library(dplyr)
mtcars %>%
  tabyl(am, cyl) %>%
  adorn_percentages("all") %>%
  mutate(dummy = "a") %>%
  adorn_rounding()

# Control the columns to be adorned with the ... variable selection argument.
# If using only the ... argument, you can use empty commas as shorthand
# to supply the default values to the preceding arguments:
cases <- data.frame(
  region = c("East", "West"),
  year = 2015,
  recovered = c(125, 87),
  died = c(13, 12)
)

cases %>%
  adorn_percentages(, , ends_with("ed")) %>%
  adorn_rounding(, , all_of(c("recovered", "died")))
This function adds the column variable name to the top of a tabyl for a complete display of information. This makes the tabyl prettier, but renders the data.frame less useful for further manipulation.
adorn_title(dat, placement = "top", row_name, col_name)
dat |
A |
placement |
The title placement, one of |
row_name |
(optional) default behavior is to pull the row name from the
attributes of the input |
col_name |
(optional) default behavior is to pull the column_name from
the attributes of the input |
The placement argument indicates whether the column name should be added to the top of the tabyl in an otherwise-empty row ("top") or appended to the already-present row name variable ("combined"). The formatting in the "top" option has the look of base R's table(); it also wipes out the other column names, making it hard to further use the data.frame besides formatting it for reporting. The "combined" option is more conservative in this regard.
The input tabyl, augmented with the column title. Non-tabyl inputs that are of class tbl_df are downgraded to basic data.frames so that the title row prints correctly.
mtcars %>%
  tabyl(am, cyl) %>%
  adorn_title(placement = "top")

# Adding a title to a non-tabyl
library(tidyr)
library(dplyr)
mtcars %>%
  group_by(gear, am) %>%
  summarise(avg_mpg = mean(mpg), .groups = "drop") %>%
  pivot_wider(names_from = am, values_from = avg_mpg) %>%
  adorn_rounding() %>%
  # the columns come from am (transmission), so label them accordingly
  adorn_title("top", row_name = "Gears", col_name = "Transmission")
This function defaults to excluding the first column of the input data.frame, assuming that it contains a descriptive variable, but this can be overridden by specifying the columns to be totaled in the ... argument. Non-numeric columns are converted to character class and have a user-specified fill character inserted in the totals row.
adorn_totals(dat, where = "row", fill = "-", na.rm = TRUE, name = "Total", ...)
dat |
An input |
where |
One of "row", "col", or |
fill |
If there are non-numeric columns, what should fill the bottom row
of those columns? If a string, relevant columns will be coerced to character.
If |
na.rm |
Should missing values (including |
name |
Name of the totals row and/or column. If both are created, and
|
... |
Columns to total. This takes a tidyselect specification. By default,
all numeric columns (besides the initial column, if numeric) are included in
the totals, but this allows you to manually specify which columns should be
included, for use on a data.frame that does not result from a call to |
A data.frame augmented with a totals row, column, or both. The data.frame is now also of class tabyl and stores information about the attached totals and underlying data in the tabyl attributes.
mtcars %>%
  tabyl(am, cyl) %>%
  adorn_totals()
Add tabyl attributes to a data.frame.
A tabyl is a data.frame containing counts of a variable or co-occurrences of two variables (a.k.a., a contingency table or crosstab). This specialized kind of data.frame has attributes that enable adorn_ functions to be called for precise formatting and presentation of results, e.g., displaying results as a mix of percentages and Ns, adding totals rows or columns, or applying rounding options, in the style of a Microsoft Excel PivotTable.
A tabyl can be the result of a call to janitor::tabyl(), in which case these attributes are added automatically. This function adds tabyl class attributes to a data.frame that isn't the result of a call to tabyl but meets the requirements of a two-way tabyl:
1) The first column contains values of variable 1.
2) Column names 2:n are the values of variable 2.
3) Numeric values in columns 2:n are counts of the co-occurrences of the two variables.*
* This is the ideal form of a tabyl, but janitor's adorn_ functions tolerate and ignore non-numeric columns in positions 2:n.
For instance, the result of dplyr::count() followed by tidyr::pivot_wider() can be treated as a tabyl.
The result of calling tabyl() on a single variable is a special class of one-way tabyl; this function only pertains to the two-way tabyl.
as_tabyl(dat, axes = 2, row_var_name = NULL, col_var_name = NULL)
dat |
a data.frame with variable values in the first column and numeric values in all other columns. |
axes |
is this a two_way tabyl or a one_way tabyl? If this function is
being called by a user, this should probably be "2". One-way tabyls are
created by |
row_var_name |
(optional) the name of the variable in the row dimension;
used by |
col_var_name |
(optional) the name of the variable in the column
dimension; used by |
Returns the same data.frame, but with the additional class of "tabyl" and the attribute "core".
as_tabyl(mtcars)
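A more typical use is promoting a hand-built crosstab to a tabyl so that the adorn_ functions can work on it. The sketch below assumes janitor is installed; the wide data.frame and its values (counts of mtcars transmission by cylinders) are made up for this illustration.

```r
library(janitor)

# A data.frame that meets the two-way tabyl requirements:
# first column = values of variable 1, remaining column names = values of variable 2.
wide <- data.frame(
  am = c("0", "1"),
  `4` = c(3, 8),
  `6` = c(4, 3),
  `8` = c(12, 2),
  check.names = FALSE
)

# Add the tabyl class and attributes, recording the variable names.
tab <- as_tabyl(wide, axes = 2, row_var_name = "am", col_var_name = "cyl")
class(tab)

# The result can now be passed to adorn_ functions.
with_totals <- adorn_totals(tab, where = "row")
with_totals
```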
Apply stats::chisq.test() to a two-way tabyl.
This generic function overrides stats::chisq.test(). If the passed table is a two-way tabyl, it runs it through janitor::chisq.test.tabyl(); otherwise it just calls stats::chisq.test().
chisq.test(x, ...)

## Default S3 method:
chisq.test(x, y = NULL, ...)

## S3 method for class 'tabyl'
chisq.test(x, tabyl_results = TRUE, ...)
x |
a two-way tabyl, a numeric vector or a factor |
... |
other parameters passed to |
y |
if x is a vector, must be another vector or factor of the same length |
tabyl_results |
If |
The result is the same as that of stats::chisq.test(). If tabyl_results is TRUE, the returned tables observed, expected, residuals and stdres are converted to tabyls.
tab <- tabyl(mtcars, gear, cyl)
chisq.test(tab)
chisq.test(tab)$residuals
Resulting names are unique and consist only of the _ character, numbers, and letters. Capitalization preferences can be specified using the case parameter.
Accented characters are transliterated to ASCII. For example, an "o" with a German umlaut over it becomes "o", and the Spanish character "enye" becomes "n".
This function takes and returns a data.frame, for ease of piping with %>%. For the underlying function that works on a character vector of names, see make_clean_names(). clean_names relies on the versatile function snakecase::to_any_case(), which accepts many arguments. See that function's documentation for ideas on getting the most out of clean_names. A few examples are included below.
A common issue is that the micro/mu symbol is replaced by "m" instead of "u". The replacement with "m" is more correct when doing Greek-to-ASCII transliteration but less correct when doing scientific data-to-ASCII transliteration. A warning will be generated if the "m" replacement occurs. To replace with "u", please add the argument replace = janitor:::mu_to_u, which is a character vector mapping all known mu or micro Unicode code points (characters) to "u".
clean_names(dat, ...)

## Default S3 method:
clean_names(dat, ...)

## S3 method for class 'sf'
clean_names(dat, ...)

## S3 method for class 'tbl_graph'
clean_names(dat, ...)

## S3 method for class 'tbl_lazy'
clean_names(dat, ...)
dat |
The input |
... |
Arguments passed on to
|
clean_names() is intended to be used on data.frames and data.frame-like objects. For this reason there are methods to support using clean_names() on sf and tbl_graph (from tidygraph) objects as well as on database connections through dbplyr. For cleaning other named objects like named lists and vectors, use make_clean_names().
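For a named vector, one way to apply the same cleaning is to run make_clean_names() on the names directly. This is a small sketch that assumes janitor is installed; the scores vector is invented for the example.

```r
library(janitor)

# A named vector with messy names (illustrative data).
scores <- c("First Name" = 1, "caseID" = 2, "DOB" = 3)

# make_clean_names() works on a character vector, so clean the names in place.
names(scores) <- make_clean_names(names(scores))
names(scores)
```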
A data.frame with clean names.
Other Set names: find_header(), mu_to_u, row_to_names()
# --- Simple Usage ---
x <- data.frame(caseID = 1, DOB = 2, Other = 3)
clean_names(x)

# or pipe in the input data.frame:
x %>%
  clean_names()

# if you prefer camelCase variable names:
x %>%
  clean_names(., "lower_camel")

# (not run) run clean_names after reading in a spreadsheet:
# library(readxl)
# read_excel("messy_excel_file.xlsx") %>%
#   clean_names()

# --- Taking advantage of the underlying snakecase::to_any_case arguments ---

# Restore column names to Title Case, e.g., for plotting
mtcars %>%
  clean_names(case = "title")

# Tell clean_names to leave certain abbreviations untouched:
x %>%
  clean_names(case = "upper_camel", abbreviations = c("ID", "DOB"))
Generate a comparison of data.frames (or similar objects) that indicates if they will successfully bind together by rows.
compare_df_cols(
  ...,
  return = c("all", "match", "mismatch"),
  bind_method = c("bind_rows", "rbind"),
  strict_description = FALSE
)
... |
A combination of data.frames, tibbles, and lists of data.frames/tibbles. The values may optionally be named arguments; if named, the output column will be the name; if not named, the output column will be the data.frame name (see examples section). |
return |
Should a summary of "all" columns be returned, only return "match"ing columns, or only "mismatch"ing columns? |
bind_method |
What method of binding should be used to determine
matches? With "bind_rows", columns missing from a data.frame would be
considered a match (as in |
strict_description |
Passed to |
Due to the returned "column_name" column, no input data.frame may be named "column_name".
The strict_description argument is most typically used to understand if factor levels match or are bindable. Factors are typically bindable, but the behavior of what happens when they bind differs based on the binding method ("bind_rows" or "rbind"). Even when strict_description is FALSE, data.frames may still bind because some classes (like factors and characters) can bind even if they appear to differ.
A data.frame with a column named "column_name" with a value named after the input data.frames' column names, and then one column per data.frame (named after the input data.frame). If more than one input has the same column name, the column naming will have suffixes defined by sequential use of base::merge() and may differ from expected naming. The rows within the data.frame-named columns are descriptions of the classes of the data within the columns (generated by describe_class()).
Other data frame type comparison: compare_df_cols_same(), describe_class()
compare_df_cols(data.frame(A = 1), data.frame(B = 2))

# user-defined names
compare_df_cols(dfA = data.frame(A = 1), dfB = data.frame(B = 2))

# a combination of list and data.frame input
compare_df_cols(
  listA = list(dfA = data.frame(A = 1), dfB = data.frame(B = 2)),
  data.frame(A = 3)
)
Check whether a set of data.frames are row-bindable. Calls compare_df_cols() and returns TRUE if there are no mismatching rows.
compare_df_cols_same(
  ...,
  bind_method = c("bind_rows", "rbind"),
  verbose = TRUE
)
... |
A combination of data.frames, tibbles, and lists of data.frames/tibbles. The values may optionally be named arguments; if named, the output column will be the name; if not named, the output column will be the data.frame name (see examples section). |
bind_method |
What method of binding should be used to determine
matches? With "bind_rows", columns missing from a data.frame would be
considered a match (as in |
verbose |
Print the mismatching columns if binding will fail. |
TRUE if row binding will succeed or FALSE if it will fail.
Other data frame type comparison: compare_df_cols(), describe_class()
compare_df_cols_same(data.frame(A = 1), data.frame(A = 2))
compare_df_cols_same(data.frame(A = 1), data.frame(B = 2))
compare_df_cols_same(data.frame(A = 1), data.frame(B = 2), verbose = FALSE)
compare_df_cols_same(data.frame(A = 1), data.frame(B = 2), bind_method = "rbind")
Convert many date and date-time (POSIXct) formats as may be received from Microsoft Excel.
convert_to_date(
  x,
  ...,
  character_fun = lubridate::ymd,
  string_conversion_failure = c("error", "warning")
)

convert_to_datetime(
  x,
  ...,
  tz = "UTC",
  character_fun = lubridate::ymd_hms,
  string_conversion_failure = c("error", "warning")
)
x |
The object to convert |
... |
Passed to further methods. Eventually may be passed to
|
character_fun |
A function to convert non-numeric-looking, non- |
string_conversion_failure |
If a character value fails to parse into the
desired class and instead returns |
tz |
The timezone for POSIXct output, unless an object is POSIXt already. Ignored for Date output. |
Character conversion checks if it matches something that looks like a Microsoft Excel numeric date, converts those to numeric, and then runs convert_to_datetime_helper() on those numbers. Then, character to Date or POSIXct conversion occurs via character_fun(x, ...) or character_fun(x, tz=tz, ...), respectively.
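The two-step character handling described above can be approximated in base R. This is a simplified sketch, not janitor's implementation: sketch_convert_to_date() is a hypothetical name, the regular expression used to spot "numeric-looking" strings is an assumption, and as.Date() with a fixed format stands in for character_fun.

```r
# Hypothetical helper illustrating the idea; not part of janitor.
sketch_convert_to_date <- function(x) {
  out <- rep(as.Date(NA), length(x))

  # Step 1: strings that look like Excel serial numbers are converted numerically
  # (day 0 of the modern Excel date system is 1899-12-30; the time fraction is dropped).
  looks_numeric <- grepl("^[0-9]+(\\.[0-9]*)?$", x)
  out[looks_numeric] <- as.Date(
    floor(as.numeric(x[looks_numeric])),
    origin = "1899-12-30"
  )

  # Step 2: everything else is parsed as a year-month-day string.
  out[!looks_numeric] <- as.Date(x[!looks_numeric], format = "%Y-%m-%d")
  out
}

mixed_result <- sketch_convert_to_date(c("2020-02-29", "40000.1"))
mixed_result
```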
POSIXct objects for convert_to_datetime() or Date objects for convert_to_date().
Other date-time cleaning: excel_numeric_to_date(), excel_time_to_numeric(), sas_numeric_to_date()
convert_to_date("2009-07-06")
convert_to_date(40000)
convert_to_date("40000.1")

# Mixed date source data can be provided.
convert_to_date(c("2020-02-29", "40000.1"))

convert_to_datetime(
  c("2009-07-06", "40000.1", "40000", NA),
  character_fun = lubridate::ymd_h,
  truncated = 1,
  tz = "UTC"
)
Describe the class(es) of an object
describe_class(x, strict_description = TRUE)

## S3 method for class 'factor'
describe_class(x, strict_description = TRUE)

## Default S3 method:
describe_class(x, strict_description = TRUE)
x |
The object to describe |
strict_description |
Should differing factor levels be treated
as differences for the purposes of identifying mismatches?
|
For package developers, an S3 generic method can be written for describe_class() for custom classes that may need more definition than the default method. This function is called by compare_df_cols().
A character scalar describing the class(es) of an object. If the scalars for two columns match, those columns should bind together in a data.frame (or similar object) without issue.
describe_class(factor): Describe factors with their levels and if they are ordered.
describe_class(default): List all classes of an object.
Other data frame type comparison: compare_df_cols(), compare_df_cols_same()
describe_class(1)
describe_class(factor("A"))
describe_class(ordered(c("A", "B")))
describe_class(ordered(c("A", "B")), strict_description = FALSE)
Converts numbers like 42370 into date values like 2016-01-01.
Defaults to the modern Excel date encoding system. However, Excel for Mac 2008 and earlier Mac versions of Excel used a different date system. To determine what platform to specify: if the date 2016-01-01 is represented by the number 42370 in your spreadsheet, it's the modern system. If it's 40908, it's the old Mac system. More on date encoding systems at http://support.office.com/en-us/article/Date-calculations-in-Excel-e7fe7167-48a9-4b96-bb53-5612a800b487.
A list of all timezones is available from base::OlsonNames(), and the current timezone is available from base::Sys.timezone().
If your input data has a mix of Excel numeric dates and actual dates, see the more powerful functions convert_to_date() and convert_to_datetime().
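The relationship between the two date systems can be checked with base R date arithmetic. This sketch does not call janitor; the origins used (1899-12-30 for the modern system, 1904-01-01 for the old Mac system) are the conventional day-zero dates for each system and are stated here as assumptions about how the serial numbers map to dates.

```r
# Modern system: serial numbers count days from 1899-12-30.
modern_date <- as.Date(42370, origin = "1899-12-30")

# Old Mac system: serial numbers count days from 1904-01-01.
mac_date <- as.Date(40908, origin = "1904-01-01")

modern_date
mac_date

# The two systems differ by a constant offset in days.
offset_days <- 42370 - 40908
offset_days
```

Both serial numbers resolve to the same calendar date, which is why checking a known date in the spreadsheet is enough to tell the systems apart.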
excel_numeric_to_date(
  date_num,
  date_system = "modern",
  include_time = FALSE,
  round_seconds = TRUE,
  tz = Sys.timezone()
)
date_num |
numeric vector of serial numbers to convert. |
date_system |
the date system, either |
include_time |
Include the time (hours, minutes, seconds) in the output? (See details) |
round_seconds |
Round the seconds to an integer (only has an effect when
|
tz |
Time zone, used when |
When using include_time=TRUE, days with leap seconds will not be accurately handled as they do not appear to be accurately handled by Windows (as described in https://support.microsoft.com/en-us/help/2722715/support-for-the-leap-second).
Returns a vector of class Date if include_time is FALSE. Returns a vector of class POSIXlt if include_time is TRUE.
Other date-time cleaning: convert_to_date(), excel_time_to_numeric(), sas_numeric_to_date()
excel_numeric_to_date(40000)
excel_numeric_to_date(40000.5) # No time is included
excel_numeric_to_date(40000.5, include_time = TRUE) # Time is included
excel_numeric_to_date(40000.521, include_time = TRUE) # Time is included
excel_numeric_to_date(40000.521,
  include_time = TRUE,
  round_seconds = FALSE
) # Time with fractional seconds is included
Convert a time that may be inconsistently or inconveniently formatted from Microsoft Excel to a numeric number of seconds between 0 and 86400.
excel_time_to_numeric(time_value, round_seconds = TRUE)
time_value |
A vector of values to convert (see Details) |
round_seconds |
Should the output number of seconds be rounded to an integer? |
time_value may be one of the following formats:
numeric: The input must be a value from 0 to 1 (exclusive of 1); this value is returned as-is.
POSIXlt or POSIXct: The input must be on the day 1899-12-31 (any other day will cause an error). The time of day is extracted and converted to a fraction of a day.
character: Any of the following (or a mixture of the choices):
- A character string that is a number between 0 and 1 (exclusive of 1). This value will be converted like a numeric value.
- A character string that looks like a date on 1899-12-31 (specifically, it must start with "1899-12-31 "), converted like a POSIXct object as described above.
- A character string that looks like a time. Choices are 12-hour time as hour, minute, and optionally second followed by "am" or "pm" (case insensitive) or 24-hour time when hour, minute, optionally second, and no "am" or "pm" is included.
A vector of numbers >= 0 and <86400.
Other date-time cleaning: convert_to_date(), excel_numeric_to_date(), sas_numeric_to_date()
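To make the "seconds since midnight" idea concrete, here is a base R sketch of two of the input shapes listed above. It does not call janitor and is not its implementation; the variable names and sample times are invented for illustration.

```r
# A date-time on 1899-12-31, as Excel times are sometimes imported.
stamp <- as.POSIXct("1899-12-31 13:30:00", tz = "UTC")
midnight <- as.POSIXct("1899-12-31 00:00:00", tz = "UTC")
seconds_from_stamp <- as.numeric(difftime(stamp, midnight, units = "secs"))
seconds_from_stamp

# A 24-hour time string: hours, minutes, seconds weighted into seconds.
parts <- as.numeric(strsplit("13:30:15", ":", fixed = TRUE)[[1]])
seconds_from_string <- sum(parts * c(3600, 60, 1))
seconds_from_string
```

Both results fall in the range 0 to 86400 (exclusive), matching the documented return value.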
Find the header row in a data.frame
find_header(dat, ...)
dat |
The input data.frame |
... |
See details |
If ... is missing, then the first row with no missing values is used.
When searching for a specified value or value within a column, the first row with a match will be returned, regardless of the completeness of the rest of that row. If ... has a single character argument, then the first column is searched for that value. If ... has a named numeric argument, then the column whose position number matches the value of that argument is searched for the name (see the last example below). If more than one row is found matching a value that is searched for, the number of the first matching row will be returned (with a warning).
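The two main search modes can be approximated in base R. This is a sketch of the logic as described, not janitor's code; the variable names `first_complete_row` and `first_match_row` are illustrative.

```r
# Base R sketch of the two search modes (not janitor's implementation)
dat <- data.frame(
  A = c(NA, "B", "C", "D"),
  B = c("C", "D", "E", "F")
)

# No `...` supplied: the first row with no missing values
first_complete_row <- which(stats::complete.cases(dat))[1]

# Named numeric argument such as "E" = 2: search column 2 for the text "E"
first_match_row <- which(dat[[2]] == "E")[1]

first_complete_row
first_match_row
```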
The row number for the header row
Other Set names: clean_names(), mu_to_u, row_to_names()
# the first row
find_header(data.frame(A = "B"))
# the second row
find_header(data.frame(A = c(NA, "B")))
# the second row since the first has an empty value
find_header(data.frame(A = c(NA, "B"), B = c("C", "D")))
# The third row because the second column was searched for the text "E"
find_header(data.frame(A = c(NA, "B", "C", "D"), B = c("C", "D", "E", "F")), "E" = 2)
Apply stats::fisher.test() to a two-way tabyl
This generic function overrides stats::fisher.test(). If the passed table is a two-way tabyl, it runs it through janitor::fisher.test.tabyl; otherwise, it just calls stats::fisher.test().
fisher.test(x, ...)
## Default S3 method:
fisher.test(x, y = NULL, ...)
## S3 method for class 'tabyl'
fisher.test(x, ...)
x |
A two-way tabyl, a numeric vector or a factor |
... |
Parameters passed to |
y |
if x is a vector, must be another vector or factor of the same length |
The same as the return value of stats::fisher.test().
tab <- tabyl(mtcars, gear, cyl)
fisher.test(tab)
Get rows of a data.frame with identical values for the specified variables.
For hunting duplicate records during data cleaning. Specify the data.frame and the variable combination to search for duplicates and get back the duplicated rows.
get_dupes(dat, ...)
dat |
The input |
... |
Unquoted variable names to search for duplicates. This takes a tidyselect specification. |
A data.frame with the full records where the specified variables have duplicated values, as well as a variable dupe_count showing the number of rows sharing that combination of duplicated values. If the input data.frame was of class tbl_df, the output is as well.
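The shape of the result can be reproduced in base R for a single variable. This is a minimal sketch on a hypothetical toy data.frame, not janitor's implementation; the real function's column ordering and sorting may differ.

```r
# Hypothetical toy data; `id` is the variable checked for duplicates
dat <- data.frame(
  id = c(1, 2, 2, 3, 3, 3),
  value = letters[1:6]
)

# Count how many rows share each id, then keep rows whose count exceeds 1
dupe_count <- ave(seq_along(dat$id), dat$id, FUN = length)
dupes <- cbind(dat["id"], dupe_count = dupe_count, dat["value"])[dupe_count > 1, ]
dupes
```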
get_dupes(mtcars, mpg, hp)
# or called with the magrittr pipe %>% :
mtcars %>% get_dupes(wt)
# You can use tidyselect helpers to specify variables:
mtcars %>% get_dupes(-c(wt, qsec))
mtcars %>% get_dupes(starts_with("cy"))
Find the list of columns that have a 1:1 mapping to each other
get_one_to_one(dat)
dat |
A |
A list with one element for each group of columns that map identically to each other.
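One way to think about a 1:1 mapping is that two columns change value at exactly the same rows. The base R sketch below recodes each column by order of first appearance and groups columns with identical codes. It illustrates the idea only and is not taken from janitor's source; `codes`, `sig`, and `groups` are illustrative names.

```r
foo <- data.frame(
  Lab_Test_Long = c("Cholesterol, LDL", "Cholesterol, LDL", "Glucose"),
  Lab_Test_Short = c("CLDL", "CLDL", "GLUC"),
  LOINC = c(12345, 12345, 54321),
  Person = c("Sam", "Bill", "Sam"),
  stringsAsFactors = FALSE
)

# Recode each column by order of first appearance, e.g. c(1, 1, 2)
codes <- lapply(foo, function(col) match(col, unique(col)))
sig <- vapply(codes, paste, character(1), collapse = "-")

# Columns sharing a code signature map 1:1; keep groups of two or more columns
groups <- split(names(foo), sig)
groups <- unname(groups[lengths(groups) > 1])
groups
```

Here the three lab-test columns form one group, while Person maps to none of them.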
foo <- data.frame(
  Lab_Test_Long = c("Cholesterol, LDL", "Cholesterol, LDL", "Glucose"),
  Lab_Test_Short = c("CLDL", "CLDL", "GLUC"),
  LOINC = c(12345, 12345, 54321),
  Person = c("Sam", "Bill", "Sam"),
  stringsAsFactors = FALSE
)
get_one_to_one(foo)
Resulting strings are unique and consist only of the _ character, numbers, and letters. By default, the resulting strings will only consist of ASCII characters, but non-ASCII (e.g. Unicode) may be allowed by setting ascii = FALSE. Capitalization preferences can be specified using the case parameter.
For use on the names of a data.frame, e.g., in a %>% pipeline, call the convenience function clean_names().
When ascii = TRUE (the default), accented characters are transliterated to ASCII. For example, an "o" with a German umlaut over it becomes "o", and the Spanish character "enye" becomes "n".
The order of operations is: make replacements, (optional) ASCII conversion, remove initial spaces and punctuation, apply base::make.names(), apply snakecase::to_any_case(), and add numeric suffixes to resolve any duplicated names.
This function relies on snakecase::to_any_case() and can take advantage of its versatility. For instance, an abbreviation like "ID" can have its capitalization preserved by passing the argument abbreviations = "ID". See the documentation for snakecase::to_any_case() for more about how to use its features.
On some systems, not all transliterators to ASCII are available. If this is the case on your system, all available transliterators will be used, and a warning will be issued once per session indicating that results may be different when run on a different system. That warning can be disabled with options(janitor_warn_transliterators=FALSE).
If the objective of your call to make_clean_names() is only to translate to ASCII, try the following instead: stringi::stri_trans_general(x, id="Any-Latin;Greek-Latin;Latin-ASCII").
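The order of operations can be followed in a heavily simplified base R sketch. `sketch_clean_names()` is a hypothetical stand-in: it skips transliteration, handles only the `%` and `#` replacements, and replaces snakecase::to_any_case() with a crude regular expression, so its output will not always match make_clean_names().

```r
# Hypothetical, simplified stand-in for the cleaning pipeline (base R only)
sketch_clean_names <- function(x) {
  # 1. make replacements
  x <- gsub("%", "_percent_", x, fixed = TRUE)
  x <- gsub("#", "_number_", x, fixed = TRUE)
  # 2. remove initial spaces and punctuation
  x <- gsub("^[[:space:][:punct:]]+", "", x)
  # 3. make syntactically valid names
  x <- make.names(x)
  # 4. crude snake_case: split lower/upper boundaries, collapse separators
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)
  x <- tolower(gsub("[._]+", "_", x))
  x <- gsub("^_+|_+$", "", x)
  # 5. add numeric suffixes to the second and later duplicates
  dup_index <- ave(seq_along(x), x, FUN = seq_along)
  ifelse(dup_index > 1, paste0(x, "_", dup_index), x)
}

sketch_clean_names(
  c("name with space", "TwoWords", "total $ (2009)", "% growth", "ID", "id")
)
```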
make_clean_names( string, case = "snake", replace = c(`'` = "", `"` = "", `%` = "_percent_", `#` = "_number_"), ascii = TRUE, use_make_names = TRUE, allow_dupes = FALSE, sep_in = "\\.", transliterations = "Latin-ASCII", parsing_option = 1, numerals = "asis", ... )
string |
A character vector of names to clean. |
case |
The desired target case (default is |
replace |
A named character vector where the name is replaced by the value. |
ascii |
Convert the names to ASCII ( |
use_make_names |
Should |
allow_dupes |
Allow duplicates in the returned names ( |
sep_in |
(short for separator input) if character, is interpreted as a
regular expression (wrapped internally into |
transliterations |
A character vector (if not |
parsing_option |
An integer that will determine the parsing_option.
|
numerals |
A character specifying the alignment of numerals ( |
... |
Arguments passed on to
|
Returns the "cleaned" character vector.
# cleaning the names of a vector:
x <- structure(1:3, names = c("name with space", "TwoWords", "total $ (2009)"))
x
names(x) <- make_clean_names(names(x))
x # now has cleaned names
# if you prefer camelCase variable names:
make_clean_names(names(x), "small_camel")
# similar to janitor::clean_names(poorly_named_df):
# not run:
# make_clean_names(names(poorly_named_df))
This is a character vector with names of all known Unicode code points that look like the Greek mu or the micro symbol and values of "u". This is intended to simplify mapping from mu or micro in Unicode to the character "u" with clean_names() and make_clean_names().
mu_to_u
An object of class character of length 10.
See the help in clean_names() for how to use this.
Other Set names: clean_names(), find_header(), row_to_names()
Like paste(), but missing values are omitted
paste_skip_na(..., sep = " ", collapse = NULL)
... , sep , collapse
|
See |
If all values are missing, the value from the first argument is preserved.
A character vector of pasted values.
paste_skip_na(NA) # NA_character_
paste_skip_na("A", NA) # "A"
paste_skip_na("A", NA, c(NA, "B"), sep = ",") # c("A", "A,B")
Remove constant columns from a data.frame or matrix.
remove_constant(dat, na.rm = FALSE, quiet = TRUE)
dat |
the input data.frame or matrix. |
na.rm |
should |
quiet |
Should messages be suppressed ( |
remove_empty() for removing empty columns or rows.
Other remove functions: remove_empty()
remove_constant(data.frame(A = 1, B = 1:3))
# To find the columns that are constant
data.frame(A = 1, B = 1:3) %>%
  dplyr::select(!dplyr::all_of(names(remove_constant(.)))) %>%
  unique()
Removes all rows and/or columns from a data.frame or matrix that are composed entirely of NA values.
remove_empty(dat, which = c("rows", "cols"), cutoff = 1, quiet = TRUE)
dat |
the input data.frame or matrix. |
which |
one of "rows", "cols", or |
cutoff |
What fraction (>0 to <=1) of rows or columns must be empty to be removed? |
quiet |
Should messages be suppressed ( |
Returns the object without its missing rows or columns.
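The effect of the cutoff argument can be pictured as a threshold on the fraction of missing values. The base R sketch below assumes that rows and columns are dropped when their NA fraction reaches the cutoff, and it evaluates rows and columns on the original data at once; janitor's own handling (for example, the order in which rows and columns are processed) may differ.

```r
dat <- data.frame(
  x = c("a", NA, NA),
  y = c(1, NA, 3),
  z = c(NA, NA, NA)
)

# Fraction of missing values in each row and each column
row_na_frac <- rowMeans(is.na(dat))
col_na_frac <- colMeans(is.na(dat))

# With cutoff = 1, only rows/columns that are entirely NA are dropped
cutoff <- 1
kept <- dat[row_na_frac < cutoff, col_na_frac < cutoff, drop = FALSE]
kept
```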
remove_constant() for removing constant columns.
Other remove functions: remove_constant()
# not run:
# dat %>% remove_empty("rows")
# addressing a common untidy-data scenario where we have a mixture of
# blank values in some (character) columns and NAs in others:
library(dplyr)
dd <- tibble(
  x = c(LETTERS[1:5], NA, rep("", 2)),
  y = c(1:5, rep(NA, 3))
)
# remove_empty() drops row 6 (all NA) but not rows 7 and 8 (blanks + NAs)
dd %>% remove_empty("rows")
# solution: preprocess to convert whitespace/empty strings to NA,
# _then_ remove empty (all-NA) rows
dd %>%
  mutate(across(where(is.character), ~ na_if(trimws(.), ""))) %>%
  remove_empty("rows")
In base R round(), halves are rounded to even, e.g., 12.5 and 11.5 are both rounded to 12. This function rounds 12.5 to 13 (assuming digits = 0). Negative halves are rounded away from zero, e.g., -0.5 is rounded to -1.
This may skew subsequent statistical analysis of the data, but may be desirable in certain contexts. This function is implemented exactly from https://stackoverflow.com/a/12688836; see that question and comments for discussion of this issue.
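The difference from base R can be seen with a short sketch that follows the general approach of the Stack Overflow answer cited above: scale, add 0.5 plus a tiny tolerance, truncate, and restore the sign. `half_up_sketch()` is a hypothetical name used here for illustration.

```r
# Sketch of half-up rounding (hypothetical name, base R only)
half_up_sketch <- function(x, digits = 0) {
  posneg <- sign(x)
  z <- abs(x) * 10^digits
  # the small tolerance guards against floating point representation error
  z <- z + 0.5 + sqrt(.Machine$double.eps)
  z <- trunc(z)
  z <- z / 10^digits
  z * posneg
}

round(12.5) # base R rounds the half to even
half_up_sketch(12.5)
half_up_sketch(-0.5) # away from zero
half_up_sketch(1.125, 2)
```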
round_half_up(x, digits = 0)
x |
a numeric vector to round. |
digits |
how many digits should be displayed after the decimal point? |
A vector with the same length as x
round_half_up(12.5)
round_half_up(1.125, 2)
round_half_up(1.125, 1)
round_half_up(-0.5, 0) # negatives get rounded away from zero
Round a decimal to the precise decimal value of a specified fractional denominator. Common use cases include addressing floating point imprecision and enforcing that data values fall into a certain set.
E.g., if a decimal represents hours and values should be logged to the nearest minute, round_to_fraction(x, 60) would enforce that distribution and 0.57 would be rounded to 0.566667, the equivalent of 34/60. 0.56 would also be rounded to 34/60.
Set denominator = 1 to round to whole numbers.
The digits argument allows for rounding of the subsequent result.
round_to_fraction(x, denominator, digits = Inf)
x |
A numeric vector |
denominator |
The denominator of the fraction for rounding (a scalar or vector positive integer). |
digits |
Integer indicating the number of decimal places to be used
after rounding to the fraction. This is passed to |
If digits is Inf, x is rounded to the fraction and then kept at full precision. If digits is "auto", the number of digits is automatically selected as ceiling(log10(denominator)) + 1.
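The arithmetic behind the examples in the description can be checked with a one-line base R sketch. `fraction_sketch()` is a hypothetical helper that uses base round(), so it may treat exact halves differently from round_to_fraction().

```r
# Hypothetical helper: round to the nearest multiple of 1/denominator
fraction_sketch <- function(x, denominator) {
  round(x * denominator) / denominator
}

fraction_sketch(0.57, 60) # 0.57 * 60 = 34.2, so the numerator is 34
fraction_sketch(0.56, 60) # 0.56 * 60 = 33.6, which also rounds to 34
fraction_sketch(pi, 7)    # numerator 22

# digits = "auto" for a denominator of 60: ceiling(log10(60)) + 1 decimal places
auto_digits <- ceiling(log10(60)) + 1
round(fraction_sketch(0.57, 60), auto_digits)
```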
The input x rounded to a decimal value that has an integer numerator relative to denominator (possibly subsequently rounded to a number of decimal digits).
round_to_fraction(1.6, denominator = 2)
round_to_fraction(pi, denominator = 7) # 22/7
round_to_fraction(c(8.1, 9.2), denominator = c(7, 8))
round_to_fraction(c(8.1, 9.2), denominator = c(7, 8), digits = 3)
round_to_fraction(c(8.1, 9.2, 10.3), denominator = c(7, 8, 1001), digits = "auto")
Elevate a row to be the column names of a data.frame.
row_to_names( dat, row_number, ..., remove_row = TRUE, remove_rows_above = TRUE, sep = "_" )
dat |
The input data.frame |
row_number |
The row(s) of |
... |
Sent to |
remove_row |
Should the row |
remove_rows_above |
If |
sep |
A character string to separate the values in the case of multiple
rows input to |
A data.frame with new names (and some rows removed, if specified)
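For a single header row, the default behaviour (remove the header row and the rows above it) corresponds roughly to the following base R steps. This is a sketch, not janitor's implementation; `header_row` and `promoted` are illustrative names.

```r
x <- data.frame(
  X_1 = c(NA, "Title", 1:3),
  X_2 = c(NA, "Title2", 4:6),
  stringsAsFactors = FALSE
)

header_row <- 2
promoted <- x
# Use the values in the header row as the new column names
names(promoted) <- as.character(unlist(x[header_row, ]))
# Drop the header row and everything above it
promoted <- promoted[-seq_len(header_row), , drop = FALSE]
promoted
```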
Other Set names: clean_names(), find_header(), mu_to_u
x <- data.frame(
  X_1 = c(NA, "Title", 1:3),
  X_2 = c(NA, "Title2", 4:6)
)
x %>%
  row_to_names(row_number = 2)
x %>%
  row_to_names(row_number = "find_header")
Convert a SAS date, time or date/time to an R object
sas_numeric_to_date(date_num, datetime_num, time_num, tz = "UTC")
date_num |
numeric vector of serial numbers to convert. |
datetime_num |
numeric vector of date/time numbers (seconds since midnight 1960-01-01) to convert |
time_num |
numeric vector of time numbers (seconds since midnight on the current day) to convert |
tz |
Time zone, used when |
If a date and time or datetime are provided, a POSIXct object. If a date is provided, a Date object. If a time is provided, an hms::hms object
SAS Date, Time, and Datetime Values reference (retrieved on 2022-03-08): https://v8doc.sas.com/sashtml/lrcon/zenid-63.htm
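SAS counts dates in days and datetimes in seconds from 1960-01-01, so the conversions can be cross-checked in base R. The sketch below only illustrates the origin arithmetic; the variable names are illustrative.

```r
sas_origin <- "1960-01-01"

# Days since the SAS origin
sas_date <- as.Date(15639, origin = sas_origin)

# Seconds since midnight at the SAS origin
sas_datetime <- as.POSIXct(1217083532, origin = sas_origin, tz = "UTC")

format(sas_date)
format(sas_datetime, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```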
Other date-time cleaning: convert_to_date(), excel_numeric_to_date(), excel_time_to_numeric()
sas_numeric_to_date(date_num = 15639) # 2002-10-26
sas_numeric_to_date(datetime_num = 1217083532, tz = "UTC") # 1998-07-26T14:45:32Z
sas_numeric_to_date(date_num = 15639, time_num = 3600, tz = "UTC") # 2002-10-26T01:00:00Z
sas_numeric_to_date(time_num = 3600) # 01:00:00
In base R signif(), halves are rounded to even, e.g., signif(11.5, 2) and signif(12.5, 2) are both rounded to 12. This function rounds 12.5 to 13 (assuming digits = 2). Negative halves are rounded away from zero, e.g., signif_half_up(-2.5, 1) returns -3.
This may skew subsequent statistical analysis of the data, but may be desirable in certain contexts. This function is implemented from https://stackoverflow.com/a/1581007/; see that question and comments for discussion of this issue.
signif_half_up(x, digits = 6)
x |
a numeric vector to round. |
digits |
integer indicating the number of significant digits to be used. |
signif_half_up(12.5, 2)
signif_half_up(1.125, 3)
signif_half_up(-2.5, 1) # negatives get rounded away from zero
Missing values are replaced with the single value, and if all values are missing, the first value in missing is used throughout.
single_value(x, missing = NA, warn_if_all_missing = FALSE, info = NULL)
x |
The vector which should have a single value |
missing |
The vector of values to consider missing in |
warn_if_all_missing |
Generate a warning if all values are missing? |
info |
If more than one value is found, append this to the warning or error to assist with determining the location of the issue. |
x as the scalar single value found throughout (or an error if more than one value is found).
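The core rule (one non-missing value allowed, otherwise an error) can be sketched in a few lines of base R. `single_value_sketch()` is a hypothetical helper that omits the warn_if_all_missing and info arguments; its error message is not janitor's.

```r
# Hypothetical helper illustrating the single-value rule
single_value_sketch <- function(x, missing = NA) {
  found <- unique(x[!(x %in% missing)])
  if (length(found) == 0) {
    # every value is missing: fall back to the first `missing` value
    return(missing[1])
  }
  if (length(found) > 1) {
    stop("More than one (", length(found), ") value found")
  }
  found
}

single_value_sketch(c(NA, 1))
single_value_sketch(c(NA, "a"), missing = c(NA, "a"))
try(single_value_sketch(c(1, 2)), silent = TRUE)
```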
# A simple use case with vectors of input
single_value(c(NA, 1))
# Multiple, different values of missing can be given
single_value(c(NA, "a"), missing = c(NA, "a"))
# A typical use case with a grouped data.frame used for input and the output
# (`B` is guaranteed to have a single value and only one row, in this case)
data.frame(
  A = rep(1:3, each = 2),
  B = c(rep(4:6, each = 2))
) %>%
  dplyr::group_by(A) %>%
  dplyr::summarize(
    B = single_value(B)
  )
try(
  # info is useful to give when multiple values may be found to see what
  # grouping variable or what calculation is causing the error
  data.frame(
    A = rep(1:3, each = 2),
    B = c(rep(1:2, each = 2), 1:2)
  ) %>%
    dplyr::group_by(A) %>%
    dplyr::mutate(
      C = single_value(B, info = paste("Calculating C for group A=", A))
    )
)
A fully-featured alternative to table(). Results are data.frames and can be formatted and enhanced with janitor's family of adorn_ functions.
Specify a data.frame and the one, two, or three unquoted column names you want to tabulate. Three variables generate a list of 2-way tabyls, split by the third variable.
Alternatively, you can tabulate a single variable that isn't in a data.frame by calling tabyl() on a vector, e.g., tabyl(mtcars$gear).
tabyl(dat, ...)
## Default S3 method:
tabyl(dat, show_na = TRUE, show_missing_levels = TRUE, ...)
## S3 method for class 'data.frame'
tabyl(dat, var1, var2, var3, show_na = TRUE, show_missing_levels = TRUE, ...)
dat |
A |
... |
Additional arguments passed to methods. |
show_na |
Should counts of |
show_missing_levels |
Should counts of missing levels of factors be displayed? These will be rows and/or columns of zeroes. Useful for keeping consistent output dimensions even when certain factor levels may not be present in the data. |
var1 |
The column name of the first variable. |
var2 |
(optional) the column name of the second variable (the columns in a 2-way tabulation). |
var3 |
(optional) the column name of the third variable (the list in a 3-way tabulation). |
A data.frame with frequencies and percentages of the tabulated variable(s). A 3-way tabulation returns a list of data frames.
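For a single vector, the counts and percentages in the result correspond to what the base R sketch below computes by hand. It only illustrates the arithmetic; the column names `n` and `percent` are chosen here for illustration, and the object is a plain data.frame without the tabyl class.

```r
val <- c("hi", "med", "med", "lo")

# Counts per distinct value, then each count as a share of the total
counts <- table(val)
freq <- data.frame(
  val = names(counts),
  n = as.vector(counts),
  percent = as.vector(counts) / sum(counts),
  stringsAsFactors = FALSE
)
freq
```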
tabyl(mtcars, cyl)
tabyl(mtcars, cyl, gear)
tabyl(mtcars, cyl, gear, am)
# or using the %>% pipe
mtcars %>%
  tabyl(cyl, gear)
# illustrating show_na functionality:
my_cars <- rbind(mtcars, rep(NA, 11))
my_cars %>% tabyl(cyl)
my_cars %>% tabyl(cyl, show_na = FALSE)
# Calling on a single vector not in a data.frame:
val <- c("hi", "med", "med", "lo")
tabyl(val)
Get a frequency table of a factor variable, grouped into categories by level.
top_levels(input_vec, n = 2, show_na = FALSE)
input_vec |
The factor variable to tabulate. |
n |
Number of levels to include in top and bottom groups |
show_na |
Should cases where the variable is |
A data.frame (actually a tbl_df) with the frequencies of the grouped, tabulated variable. Includes counts and percentages, and valid percentages (calculated omitting NA values, if present in the vector and show_na = TRUE).
top_levels(as.factor(mtcars$hp), 2)
Remove tabyl attributes from a data.frame.
Strips away all tabyl-related attributes from a data.frame.
untabyl(dat)
dat |
a |
Returns the same data.frame, but without the tabyl class and attributes.
mtcars %>%
  tabyl(am) %>%
  untabyl() %>%
  attributes() # tabyl-specific attributes are gone