Review:
Cluster Centroids Undersampling
Overall review score: 3.8 out of 5
⭐⭐⭐⭐
Cluster centroids undersampling is a data-balancing technique for imbalanced classification problems. It shrinks the majority class by grouping its samples into clusters (typically with k-means) and replacing each cluster with its centroid, so that one synthetic center point stands in for all the samples in that cluster. The goal is to reduce the model's bias toward the majority class while preserving the overall structure of the data distribution.
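The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes k-means as the clustering step, a binary problem, and a target of one centroid per minority sample. The function names (`kmeans_centroids`, `cluster_centroids_undersample`), the `majority_label` parameter, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def _mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def _sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_centroids(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k of the input points
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: _sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid, so k points are always returned.
        centroids = [_mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids


def cluster_centroids_undersample(X, y, majority_label, seed=0):
    """Replace the majority class with k-means centroids (assumed: binary labels)."""
    majority = [x for x, label in zip(X, y) if label == majority_label]
    rest = [(x, label) for x, label in zip(X, y) if label != majority_label]
    target = len(rest)  # assumption: shrink the majority class to the minority size
    centroids = kmeans_centroids(majority, target, seed=seed)
    X_res = [x for x, _ in rest] + centroids
    y_res = [label for _, label in rest] + [majority_label] * target
    return X_res, y_res


# Toy data: 90 majority samples (label 0) and 10 minority samples (label 1).
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(90)]
X += [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(10)]
y = [0] * 90 + [1] * 10

X_res, y_res = cluster_centroids_undersample(X, y, majority_label=0)
print(Counter(y_res))  # both classes now have 10 samples
```

Note that the minority samples pass through untouched, while the majority samples in the output are synthetic centroids rather than original observations.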
Key Features
- Reduces the size of the majority class data set through clustering
- Replaces clusters with their centroid points for simplified representation
- Can improve model performance on the minority class in imbalanced datasets
- Aims to preserve the overall shape of the majority class distribution while undersampling
- Integrates well with machine learning workflows, especially ensemble and tree-based models
Pros
- Effective at balancing datasets without heavily sacrificing data variance
- Reduces training time and computational load due to smaller dataset size
- Preserves critical information about data distribution through centroid representation
Cons
- Potential loss of detailed majority class information, since original samples are replaced by synthetic centroids
- Clustering parameters (e.g., number of clusters) require careful tuning
- Can oversimplify the data structure, which may reduce model performance in some cases
- Less effective when clusters are poorly defined or overlap
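One rough way to gauge how much detail is lost for a given number of clusters is to measure how far the original majority samples lie from the centroids that replace them. The sketch below is only an illustration under that assumption: the helper `quantization_error` and the hand-picked points and centroids are hypothetical, and mean squared distance is just one possible measure.

```python
def quantization_error(points, centroids):
    """Mean squared distance from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
    return total / len(points)


# Four majority samples forming two tight groups far apart.
majority = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_centroid = [(5.0, 1.0)]                 # a single cluster for everything
two_centroids = [(0.0, 1.0), (10.0, 1.0)]   # one cluster per group

print(quantization_error(majority, one_centroid))   # 26.0
print(quantization_error(majority, two_centroids))  # 1.0
```

With too few clusters, the single centroid sits in empty space between the two groups and represents neither well, which is the oversimplification risk listed above.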