This assignment will challenge your function writing abilities. I’m not going to lie, these functions are difficult but well within your reach. I do, however, want to recognize that not everyone is interested in being a “virtuoso” with their function writing. So, there are two options for this week’s lab:

Option 1: Complete this lab assignment in search of virtuoso status with your function writing
Option 2: Complete one of the difficult functions (Exercise 1 or Exercise

and complete the “Alternative Lab 6”.

Setting the Stage

My number one use case for writing functions and iteration / looping is to perform some exploration or modeling repeatedly for different “tweaked” versions. For example, our broad goal might be to fit a linear regression model to our data. However, there are often multiple choices that we have to make in practice:

Keep missing values or fill them in (imputation)?
Filter out outliers in one or more variables?

We can map these choices to arguments in a custom model-fitting function:

impute: TRUE or FALSE
remove_outliers: TRUE or FALSE

A function that implements the analysis and allows for variation in these choices:

Code

fit_model <- function(df, impute, remove_outliers, mod) {
    if (impute) {
        df <- some_imputation_function(df)
    }
    
    if (remove_outliers) {
        df <- function_for_removing_outliers(df)
    }
    
    lm(mod, data = df)
}

Helper Functions

Exercise 1: Write a function that removes outliers in a dataset. The user should be able to supply the dataset, the variables to remove outliers from, and a threshold on the number of SDs away from the mean used to define outliers. Hint 1: You will need to calculate a z-score to filter the values! Hint 2: You might want to consider specifying a default value (e.g., 3) for sd_thresh.

Code

library(tidyverse)

remove_outliers <- function(df, ..., sd_thresh = 3) {
  vars <- rlang::enquos(...)

  # Check all selected variables are numeric
  all_numeric <- purrr::map_lgl(vars, ~ is.numeric(dplyr::pull(df, !!.x)))
  if (!all(all_numeric)) {
    stop("All selected variables must be numeric.")
  }

  # Filter rows using if_all and tidy eval
  df %>%
    filter(if_all(all_of(tidyselect::vars_select(names(df), !!!vars)),
                  ~ abs((.x - mean(.x, na.rm = TRUE)) / sd(.x, na.rm = TRUE)) < sd_thresh))
}

Testing Your Function!

Code

## Testing how your function handles multiple input variables
remove_outliers(diamonds, 
                price, 
                x, 
                y, 
                z)

# A tibble: 52,689 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 52,679 more rows

Code

## Testing how your function handles an input that isn't numeric
remove_outliers(diamonds, 
                price, 
                color)

Error in remove_outliers(diamonds, price, color): All selected variables must be numeric.

Code

## Testing how your function handles a non-default sd_thresh
remove_outliers(diamonds, 
                price,
                x, 
                y, 
                z, 
                sd_thresh = 2)

# A tibble: 50,099 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 50,089 more rows

Exercise 2: Write a function that imputes missing values for numeric variables in a dataset. The user should be able to supply the dataset, the variables to impute values for, and a function to use when imputing. Hint 1: You will need to use across() to apply your function, since the user can input multiple variables. Hint 2: The replace_na() function is helpful here!

Code

library(nycflights13)

impute_missing <- function(df, ..., impute_fun = mean) {
  # Capture the selected variables
  vars <- rlang::enquos(...)
  
  # Check all selected variables are numeric
  all_numeric <- purrr::map_lgl(vars, ~ is.numeric(dplyr::pull(df, !!.x)))
  if (!all(all_numeric)) {
    stop("All selected variables must be numeric.")
  }

df %>%
  mutate(across(
    all_of(tidyselect::vars_select(names(df), !!!vars)),
    ~ replace_na(.x, impute_fun(.x, na.rm = TRUE))
  ))

}

Testing Your Function!

Code

## Testing how your function handles multiple input variables
impute_missing(nycflights13::flights, 
               arr_delay, 
               dep_delay)

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Code

# Testing that there are no NAs in the imputed dataset

flights_clean <- impute_missing(nycflights13::flights, arr_delay, dep_delay)

# Check for remaining NAs
flights_clean %>%
  summarize(
    arr_delay_missing = sum(is.na(arr_delay)),
    dep_delay_missing = sum(is.na(dep_delay))
  )

# A tibble: 1 × 2
  arr_delay_missing dep_delay_missing
              <int>             <int>
1                 0                 0

Code

## Testing how your function handles an input that isn't numeric
impute_missing(nycflights13::flights, 
               arr_delay, 
               carrier)

Error in impute_missing(nycflights13::flights, arr_delay, carrier): All selected variables must be numeric.

Code

## Testing how your function handles a non-default impute_fun
impute_missing(nycflights13::flights, 
               arr_delay, 
               dep_delay, 
               impute_fun = median)

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Code

# Checking to see if means and medians return different datasets
flights_mean <- impute_missing(nycflights13::flights, arr_delay, dep_delay)
flights_median <- impute_missing(nycflights13::flights, arr_delay, dep_delay, impute_fun = median)

all.equal(flights_mean, flights_median)

[1] "Component \"dep_delay\": Mean relative difference: 1.158239"
[2] "Component \"arr_delay\": Mean relative difference: 1.725124"

Primary Function

Exercise 3: Write a fit_model() function that fits a specified linear regression model for a specified dataset. The function should:

allow the user to specify if outliers should be removed (TRUE or FALSE)
allow the user to specify if missing observations should be imputed (TRUE or FALSE)

If either option is TRUE, your function should call your remove_outliers() or impute_missing() functions to modify the data before the regression model is fit.

Code

fit_model <- function(df,
                      mod_formula,
                      remove_outliers = FALSE,
                      impute_missing = FALSE,
                      impute_fun = mean,
                      sd_thresh = 3,
                      ...) {

  if (impute_missing) {
    df <- impute_missing(df, ..., impute_fun = mean)
  }

  if (remove_outliers) {
    df <- remove_outliers(df, ..., sd_thresh = 3)
  }

  model <- lm(mod_formula, data = df)

  return(model)
}

Testing Your Function!

Code

fit_model(
  diamonds,
  mod_formula = price ~ carat + cut,
  remove_outliers = TRUE,
  impute_missing = TRUE,
  price, 
  carat
)


Call:
lm(formula = mod_formula, data = df)

Coefficients:
(Intercept)        carat        cut.L        cut.Q        cut.C        cut^4  
   -2701.38      7871.08      1239.80      -528.60       367.91        74.59

Iteration

In the diamonds dataset, we want to understand the relationship between price and size (carat). We want to explore variation along two choices:

The variables included in the model. We’ll explore 3 sets of variables:
- No further variables (just price and carat)
- Adjusting for cut
- Adjusting for cut and clarity
- Adjusting for cut, clarity, and color
Whether or not to impute missing values
Whether or not to remove outliers in the carat variable (we’ll define outliers as cases whose carat is over 3 SDs away from the mean).

Parameters

First, we need to define the set of parameters we want to iterate the fit_model() function over. The tidyr package has a useful function called crossing() that is useful for generating argument combinations. For each argument, we specify all possible values for that argument and crossing() generates all combinations. Note that you can create a list of formula objects in R with c(y ~ x1, y ~ x1 + x2).

Code

df_arg_combos <- crossing(
    impute = c(TRUE, FALSE),
    remove_outliers = c(TRUE, FALSE), 
    mod = c(y ~ x1, 
            y ~ x1 + x2)
)
df_arg_combos

Exercise 4: Use crossing() to create the data frame of argument combinations for our analyses.

Code

formulas <- list(
  price ~ carat,
  price ~ carat + cut,
  price ~ carat + cut + clarity,
  price ~ carat + cut + clarity + color
)

df_arg_combos <- tidyr::crossing(
  impute = c(TRUE, FALSE),
  remove_outliers = c(TRUE, FALSE),
  mod_formula = formulas
)

df_arg_combos

# A tibble: 16 × 3
   impute remove_outliers mod_formula
   <lgl>  <lgl>           <list>     
 1 FALSE  FALSE           <formula>  
 2 FALSE  FALSE           <formula>  
 3 FALSE  FALSE           <formula>  
 4 FALSE  FALSE           <formula>  
 5 FALSE  TRUE            <formula>  
 6 FALSE  TRUE            <formula>  
 7 FALSE  TRUE            <formula>  
 8 FALSE  TRUE            <formula>  
 9 TRUE   FALSE           <formula>  
10 TRUE   FALSE           <formula>  
11 TRUE   FALSE           <formula>  
12 TRUE   FALSE           <formula>  
13 TRUE   TRUE            <formula>  
14 TRUE   TRUE            <formula>  
15 TRUE   TRUE            <formula>  
16 TRUE   TRUE            <formula>

Iterating Over the Parameters

We’ve arrived at the final step!

Exercise 5: Use pmap() from purrr to apply the fit_model() function to every combination of arguments from `diamonds.

Code

library(purrr)

model_results <- df_arg_combos %>%
  mutate(model = pmap(
    list(impute, remove_outliers, mod_formula),
    function(impute, remove_outliers, mod_formula) {
      fit_model(
        df = diamonds,
        mod_formula = mod_formula,
        impute_missing = impute,
        remove_outliers = remove_outliers,
        # vars for cleaning:
        price, carat
      )
    }
  ))

print(model_results, n = 50)

# A tibble: 16 × 4
   impute remove_outliers mod_formula model 
   <lgl>  <lgl>           <list>      <list>
 1 FALSE  FALSE           <formula>   <lm>  
 2 FALSE  FALSE           <formula>   <lm>  
 3 FALSE  FALSE           <formula>   <lm>  
 4 FALSE  FALSE           <formula>   <lm>  
 5 FALSE  TRUE            <formula>   <lm>  
 6 FALSE  TRUE            <formula>   <lm>  
 7 FALSE  TRUE            <formula>   <lm>  
 8 FALSE  TRUE            <formula>   <lm>  
 9 TRUE   FALSE           <formula>   <lm>  
10 TRUE   FALSE           <formula>   <lm>  
11 TRUE   FALSE           <formula>   <lm>  
12 TRUE   FALSE           <formula>   <lm>  
13 TRUE   TRUE            <formula>   <lm>  
14 TRUE   TRUE            <formula>   <lm>  
15 TRUE   TRUE            <formula>   <lm>  
16 TRUE   TRUE            <formula>   <lm>