Clustering: Income with R

Using tidymodels

1 Introduction

This document demonstrates how to perform clustering in R using the tidymodels framework. Clustering is an unsupervised learning technique that groups similar data points together based on their inherent characteristics. We will use the adult_income_dataset.csv for this demonstration.

2 Load Data

First, we load the necessary libraries and the income dataset.

Code
library(tidyverse)
library(tidymodels)
library(factoextra)

income_data <- read_csv("../data/adult_income_dataset.csv")
Code
glimpse(income_data)
Rows: 32,561
Columns: 15
$ age              <dbl> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30, 23, 3…
$ workclass        <chr> "State-gov", "Self-emp-not-inc", "Private", "Private"…
$ fnlwgt           <dbl> 77516, 83311, 215646, 234721, 338409, 284582, 160187,…
$ education        <chr> "Bachelors", "Bachelors", "HS-grad", "11th", "Bachelo…
$ `education-num`  <dbl> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13, 12, 1…
$ `marital-status` <chr> "Never-married", "Married-civ-spouse", "Divorced", "M…
$ occupation       <chr> "Adm-clerical", "Exec-managerial", "Handlers-cleaners…
$ relationship     <chr> "Not-in-family", "Husband", "Not-in-family", "Husband…
$ race             <chr> "White", "White", "White", "Black", "Black", "White",…
$ sex              <chr> "Male", "Male", "Male", "Male", "Female", "Female", "…
$ `capital-gain`   <dbl> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0…
$ `capital-loss`   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `hours-per-week` <dbl> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40, 30, 5…
$ `native-country` <chr> "United-States", "United-States", "United-States", "U…
$ income           <chr> "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K",…
Code
income_data |> count(race)
# A tibble: 5 × 2
  race                   n
  <chr>              <int>
1 Amer-Indian-Eskimo   311
2 Asian-Pac-Islander  1039
3 Black               3124
4 Other                271
5 White              27816
Code
# For simplicity, we'll remove rows with any missing values
income_data_clean <- income_data %>%
  select(-race) %>%
  na.omit() %>%
  sample_n(1000) # Randomly sample 1000 rows
Code
glimpse(income_data_clean)
Rows: 1,000
Columns: 14
$ age              <dbl> 25, 41, 27, 25, 41, 38, 28, 29, 51, 42, 35, 69, 35, 3…
$ workclass        <chr> "Private", "Private", "Self-emp-inc", "Private", "Pri…
$ fnlwgt           <dbl> 161478, 193882, 120126, 71351, 174201, 203169, 203260…
$ education        <chr> "Bachelors", "Bachelors", "Bachelors", "1st-4th", "HS…
$ `education-num`  <dbl> 13, 13, 13, 2, 9, 11, 13, 10, 13, 13, 13, 10, 9, 13, …
$ `marital-status` <chr> "Never-married", "Married-civ-spouse", "Married-civ-s…
$ occupation       <chr> "Prof-specialty", "Farming-fishing", "Exec-managerial…
$ relationship     <chr> "Not-in-family", "Husband", "Husband", "Other-relativ…
$ sex              <chr> "Female", "Male", "Male", "Male", "Male", "Male", "Ma…
$ `capital-gain`   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `capital-loss`   <dbl> 0, 0, 1887, 0, 0, 0, 0, 0, 2415, 0, 0, 0, 0, 0, 1977,…
$ `hours-per-week` <dbl> 40, 40, 45, 25, 40, 50, 8, 40, 40, 48, 60, 32, 38, 39…
$ `native-country` <chr> "United-States", "United-States", "United-States", "E…
$ income           <chr> "<=50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K", "…
Code
# Preprocessing using recipes
library(recipes)

income_recipe <- recipe(~ ., data = income_data_clean) %>%
  step_dummy(all_nominal_predictors()) %>% # One-hot encode all nominal (categorical) predictors
  step_normalize(all_numeric_predictors()) %>% # Normalize all numerical predictors
  prep(training = income_data_clean)

income_data_processed <- bake(income_recipe, new_data = income_data_clean)

# Remove any columns that might have resulted in all zeros after one-hot encoding if they were constant
income_data_processed <- income_data_processed[, colSums(income_data_processed) != 0]

3 Elbow Method

The Elbow Method is a heuristic used to determine the optimal number of clusters in a dataset. We can visualize the total within-cluster sum of squares as a function of the number of clusters.

Code
fviz_nbclust(income_data_processed, kmeans, method = "wss") +
  labs(subtitle = "Elbow Method")

4 K-Means Clustering

K-Means is a popular clustering algorithm. We will use it to group the income data into clusters. The optimal number of clusters can be determined from the Elbow Method plot.

Code
set.seed(123)
kmeans_model <- kmeans(income_data_processed, centers = 5, nstart = 25) # Assuming 2 clusters from Elbow Method

# Visualize the clusters (using first two principal components for visualization)
fviz_cluster(kmeans_model, data = income_data_processed)

5 Hierarchical Clustering

Hierarchical clustering is another common clustering method.

Code
# Calculate the distance matrix
dist_matrix <- dist(income_data_processed, method = "euclidean")

# Perform hierarchical clustering
hclust_model <- hclust(dist_matrix, method = "ward.D2")

# Visualize the dendrogram
fviz_dend(hclust_model, k = 5, 
          cex = 0.5, # label size
          k_colors = c("#2E9FDF", "#00AFBB", "#E7B800"),
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE # Add rectangle around groups
          )

6 Comparison of K-Means and Hierarchical Clustering

Here’s a comparison of K-Means and Hierarchical Clustering:

Feature | K-Means Clustering | Hierarchical Clustering |

|:——————–|:————————————————-|:—————-:————————————-| | Approach | Partitioning (divides data into k clusters) | Agglomerative (bottom-up) or Divisive (top-down) | | Number of Clusters | Requires pre-specification (k) | Does not require pre-specification; dendrogram helps | | Computational Cost | Faster for large datasets | Slower for large datasets (O(n^3) or O(n^2)) | | Cluster Shape | Tends to form spherical clusters | Can discover arbitrarily shaped clusters | | Sensitivity to Outliers | Sensitive to outliers | Less sensitive to outliers | | Interpretability | Easy to interpret | Dendrogram can be complex for large datasets | | Reproducibility | Can vary with initial centroids (unless fixed) | Reproducible |

7 Conclusion

This document provided a brief overview of clustering in R using tidymodels. We demonstrated both K-Means and Hierarchical clustering on the income dataset.