Examples of computationally sensible and user-friendly function writing.
Quarto
R
Author
Matt Babb
Published
June 7, 2025
This assignment will challenge your function writing abilities. I’m not going to lie, these functions are difficult but well within your reach. I do, however, want to recognize that not everyone is interested in being a “virtuoso” with their function writing. So, there are two options for this week’s lab:
Option 1: Complete this lab assignment in search of virtuoso status with your function writing
Option 2: Complete one of the difficult functions (Exercise 1 or Exercise
and complete the “Alternative Lab 6”.
Setting the Stage
My number one use case for writing functions and iteration / looping is to perform some exploration or modeling repeatedly for different “tweaked” versions. For example, our broad goal might be to fit a linear regression model to our data. However, there are often multiple choices that we have to make in practice:
Keep missing values or fill them in (imputation)?
Filter out outliers in one or more variables?
We can map these choices to arguments in a custom model-fitting function:
impute: TRUE or FALSE
remove_outliers: TRUE or FALSE
A function that implements the analysis and allows for variation in these choices:
Exercise 1: Write a function that removes outliers in a dataset. The user should be able to supply the dataset, the variables to remove outliers from, and a threshold on the number of SDs away from the mean used to define outliers. Hint 1: You will need to calculate a z-score to filter the values!Hint 2: You might want to consider specifying a default value (e.g., 3) for sd_thresh.
Code
library(tidyverse)remove_outliers <-function(df, ..., sd_thresh =3) { vars <- rlang::enquos(...)# Check all selected variables are numeric all_numeric <- purrr::map_lgl(vars, ~is.numeric(dplyr::pull(df, !!.x)))if (!all(all_numeric)) {stop("All selected variables must be numeric.") }# Filter rows using if_all and tidy eval df %>%filter(if_all(all_of(tidyselect::vars_select(names(df), !!!vars)),~abs((.x -mean(.x, na.rm =TRUE)) /sd(.x, na.rm =TRUE)) < sd_thresh))}
Testing Your Function!
Code
## Testing how your function handles multiple input variablesremove_outliers(diamonds, price, x, y, z)
# A tibble: 52,689 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 52,679 more rows
Code
## Testing how your function handles an input that isn't numericremove_outliers(diamonds, price, color)
Error in remove_outliers(diamonds, price, color): All selected variables must be numeric.
Code
## Testing how your function handles a non-default sd_threshremove_outliers(diamonds, price, x, y, z, sd_thresh =2)
# A tibble: 50,099 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 50,089 more rows
Exercise 2: Write a function that imputes missing values for numeric variables in a dataset. The user should be able to supply the dataset, the variables to impute values for, and a function to use when imputing. Hint 1: You will need to use across() to apply your function, since the user can input multiple variables.Hint 2: The replace_na() function is helpful here!
Code
library(nycflights13)impute_missing <-function(df, ..., impute_fun = mean) {# Capture the selected variables vars <- rlang::enquos(...)# Check all selected variables are numeric all_numeric <- purrr::map_lgl(vars, ~is.numeric(dplyr::pull(df, !!.x)))if (!all(all_numeric)) {stop("All selected variables must be numeric.") }df %>%mutate(across(all_of(tidyselect::vars_select(names(df), !!!vars)),~replace_na(.x, impute_fun(.x, na.rm =TRUE)) ))}
Testing Your Function!
Code
## Testing how your function handles multiple input variablesimpute_missing(nycflights13::flights, arr_delay, dep_delay)
# Testing that there are no NAs in the imputed datasetflights_clean <-impute_missing(nycflights13::flights, arr_delay, dep_delay)# Check for remaining NAsflights_clean %>%summarize(arr_delay_missing =sum(is.na(arr_delay)),dep_delay_missing =sum(is.na(dep_delay)) )
# Checking to see if means and medians return different datasetsflights_mean <-impute_missing(nycflights13::flights, arr_delay, dep_delay)flights_median <-impute_missing(nycflights13::flights, arr_delay, dep_delay, impute_fun = median)all.equal(flights_mean, flights_median)
[1] "Component \"dep_delay\": Mean relative difference: 1.158239"
[2] "Component \"arr_delay\": Mean relative difference: 1.725124"
Primary Function
Exercise 3: Write a fit_model() function that fits a specified linear regression model for a specified dataset. The function should:
allow the user to specify if outliers should be removed (TRUE or FALSE)
allow the user to specify if missing observations should be imputed (TRUE or FALSE)
If either option is TRUE, your function should call your remove_outliers() or impute_missing() functions to modify the data before the regression model is fit.
In the diamonds dataset, we want to understand the relationship between price and size (carat). We want to explore variation along two choices:
The variables included in the model. We’ll explore 3 sets of variables:
No further variables (just price and carat)
Adjusting for cut
Adjusting for cut and clarity
Adjusting for cut, clarity, and color
Whether or not to impute missing values
Whether or not to remove outliers in the carat variable (we’ll define outliers as cases whose carat is over 3 SDs away from the mean).
Parameters
First, we need to define the set of parameters we want to iterate the fit_model() function over. The tidyr package has a useful function called crossing() that is useful for generating argument combinations. For each argument, we specify all possible values for that argument and crossing() generates all combinations. Note that you can create a list of formula objects in R with c(y ~ x1, y ~ x1 + x2).
Code
df_arg_combos <-crossing(impute =c(TRUE, FALSE),remove_outliers =c(TRUE, FALSE), mod =c(y ~ x1, y ~ x1 + x2))df_arg_combos
Exercise 4: Use crossing() to create the data frame of argument combinations for our analyses.
---title: "Writing Efficient Functions"description: "Examples of computationally sensible and user-friendly function writing."author: - name: Matt Babbdate: 06-07-2025categories: [Quarto, R] # self-defined categoriesimage: function_thumbnail.jpgdraft: false # setting this to `true` will prevent your post from appearing on your listing page until you're ready!editor: sourceembed-resources: trueecho: truewarning: falseerror: falsecode-fold: truecode-tools: true---This assignment will challenge your function writing abilities. I'm not going to lie, these functions are difficult but well within your reach. I do, however,want to recognize that not everyone is interested in being a "virtuoso" with their function writing. So, there are two options for this week's lab:- **Option 1:** Complete this lab assignment in search of virtuoso status withyour function writing- **Option 2:** Complete one of the difficult functions (Exercise 1 or Exercise2) and complete the "Alternative Lab 6". # Setting the StageMy number one use case for writing functions and iteration / looping is toperform some exploration or modeling repeatedly for different "tweaked"versions. For example, our broad goal might be to fit a linear regression modelto our data. However, there are often multiple choices that we have to make inpractice:- Keep missing values or fill them in (imputation)? - Filter out outliers in one or more variables?We can map these choices to **arguments** in a custom model-fitting function:- `impute`: TRUE or FALSE- `remove_outliers`: TRUE or FALSEA function that implements the analysis and allows for variation in these choices:```{r}#| echo: true#| eval: false#| label: example-code-to-motivate-functionfit_model <-function(df, impute, remove_outliers, mod) {if (impute) { df <-some_imputation_function(df) }if (remove_outliers) { df <-function_for_removing_outliers(df) }lm(mod, data = df)}```# Helper Functions**Exercise 1:** Write a function that removes outliers in a dataset. The usershould be able to supply the dataset, the variables to remove outliers from, anda threshold on the number of SDs away from the mean used to define outliers. *Hint 1: You will need to calculate a z-score to filter the values!**Hint 2: You might want to consider specifying a default value (e.g., 3) for `sd_thresh`.*```{r}#| label: exercise-1library(tidyverse)remove_outliers <-function(df, ..., sd_thresh =3) { vars <- rlang::enquos(...)# Check all selected variables are numeric all_numeric <- purrr::map_lgl(vars, ~is.numeric(dplyr::pull(df, !!.x)))if (!all(all_numeric)) {stop("All selected variables must be numeric.") }# Filter rows using if_all and tidy eval df %>%filter(if_all(all_of(tidyselect::vars_select(names(df), !!!vars)),~abs((.x -mean(.x, na.rm =TRUE)) /sd(.x, na.rm =TRUE)) < sd_thresh))}```## Testing Your Function! ```{r}#| label: exercise-1-test#| error: true## Testing how your function handles multiple input variablesremove_outliers(diamonds, price, x, y, z)## Testing how your function handles an input that isn't numericremove_outliers(diamonds, price, color)## Testing how your function handles a non-default sd_threshremove_outliers(diamonds, price, x, y, z, sd_thresh =2)```**Exercise 2:** Write a function that imputes missing values for numeric variables in a dataset. The user should be able to supply the dataset, the variables to impute values for, and a function to use when imputing. *Hint 1: You will need to use `across()` to apply your function, since the user can input multiple variables.**Hint 2: The `replace_na()` function is helpful here!*```{r}#| label: exercise-2library(nycflights13)impute_missing <-function(df, ..., impute_fun = mean) {# Capture the selected variables vars <- rlang::enquos(...)# Check all selected variables are numeric all_numeric <- purrr::map_lgl(vars, ~is.numeric(dplyr::pull(df, !!.x)))if (!all(all_numeric)) {stop("All selected variables must be numeric.") }df %>%mutate(across(all_of(tidyselect::vars_select(names(df), !!!vars)),~replace_na(.x, impute_fun(.x, na.rm =TRUE)) ))}```## Testing Your Function! ```{r}#| label: exercise-2-test#| error: true## Testing how your function handles multiple input variablesimpute_missing(nycflights13::flights, arr_delay, dep_delay) # Testing that there are no NAs in the imputed datasetflights_clean <-impute_missing(nycflights13::flights, arr_delay, dep_delay)# Check for remaining NAsflights_clean %>%summarize(arr_delay_missing =sum(is.na(arr_delay)),dep_delay_missing =sum(is.na(dep_delay)) )## Testing how your function handles an input that isn't numericimpute_missing(nycflights13::flights, arr_delay, carrier)## Testing how your function handles a non-default impute_funimpute_missing(nycflights13::flights, arr_delay, dep_delay, impute_fun = median)# Checking to see if means and medians return different datasetsflights_mean <-impute_missing(nycflights13::flights, arr_delay, dep_delay)flights_median <-impute_missing(nycflights13::flights, arr_delay, dep_delay, impute_fun = median)all.equal(flights_mean, flights_median)```# Primary Function**Exercise 3:** Write a `fit_model()` function that fits a specified linearregression model for a specified dataset. The function should:- allow the user to specify if outliers should be removed (`TRUE` or `FALSE`)- allow the user to specify if missing observations should be imputed(`TRUE` or `FALSE`)If either option is `TRUE`, your function should call your `remove_outliers()`or `impute_missing()` functions to modify the data **before** the regression model is fit. ```{r}#| label: exercise-3fit_model <-function(df, mod_formula,remove_outliers =FALSE,impute_missing =FALSE,impute_fun = mean,sd_thresh =3, ...) {if (impute_missing) { df <-impute_missing(df, ..., impute_fun = mean) }if (remove_outliers) { df <-remove_outliers(df, ..., sd_thresh =3) } model <-lm(mod_formula, data = df)return(model)}```## Testing Your Function! ```{r}#| label: exercise-3-testfit_model( diamonds,mod_formula = price ~ carat + cut,remove_outliers =TRUE,impute_missing =TRUE, price, carat)```# IterationIn the `diamonds` dataset, we want to understand the relationship between`price` and size (`carat`). We want to explore variation along two choices:1. The variables included in the model. We'll explore 3 sets of variables: - No further variables (just `price` and `carat`) - Adjusting for `cut` - Adjusting for `cut` and `clarity` - Adjusting for `cut`, `clarity`, and `color`2. Whether or not to impute missing values3. Whether or not to remove outliers in the `carat` variable (we'll defineoutliers as cases whose `carat` is over 3 SDs away from the mean).## ParametersFirst, we need to define the set of parameters we want to iterate the`fit_model()` function over. The `tidyr` package has a useful function called`crossing()` that is useful for generating argument combinations. For eachargument, we specify all possible values for that argument and `crossing()`generates all combinations. *Note that you can create a list of formula objects in R with `c(y ~ x1, y ~ x1 + x2)`.*```{r}#| label: example-crossing-arguments#| eval: falsedf_arg_combos <-crossing(impute =c(TRUE, FALSE),remove_outliers =c(TRUE, FALSE), mod =c(y ~ x1, y ~ x1 + x2))df_arg_combos```**Exercise 4:** Use `crossing()` to create the data frame of argumentcombinations for our analyses. ```{r}#| label: exercise-4formulas <-list( price ~ carat, price ~ carat + cut, price ~ carat + cut + clarity, price ~ carat + cut + clarity + color)df_arg_combos <- tidyr::crossing(impute =c(TRUE, FALSE),remove_outliers =c(TRUE, FALSE),mod_formula = formulas)df_arg_combos```## Iterating Over the ParametersWe've arrived at the final step! **Exercise 5:** Use `pmap()` from `purrr` to apply the `fit_model()` function toevery combination of arguments from `diamonds.```{r}#| label: exercise-5library(purrr)model_results <- df_arg_combos %>%mutate(model =pmap(list(impute, remove_outliers, mod_formula),function(impute, remove_outliers, mod_formula) {fit_model(df = diamonds,mod_formula = mod_formula,impute_missing = impute,remove_outliers = remove_outliers,# vars for cleaning: price, carat ) } ))print(model_results, n =50)```