RebalancedCV: Solving distributional bias during cross-validation

RebalancedCV contains the following three classes: RebalancedLeaveOneOut, RebalancedKFold, and RebalancedLeavePOut. These implement Leave One Out, Stratified K-Fold, and Leave P Out cross-validation, subsampling within each training set so that, for any of these cross-validation schemes, the class balance is identical across all training folds.

Background & Analysis

It was recently shown that removing a fraction of a dataset into a testing fold can artificially create a shift in label averages across training folds that is inversely correlated with that of their corresponding test folds. To address the issue, this package automatically subsamples points from within the training set to remove any differences in label average across training folds.
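The shift described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from this package: it uses only NumPy and randomly generated labels (the variable names are illustrative) to show that, under plain leave-one-out, the label average of each training fold is a decreasing linear function of the held-out label, so the two are perfectly anti-correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
y = (rng.random(100) > 0.5).astype(int)  # random binary labels
n = y.size

# Label average of each leave-one-out training fold:
# remove sample i, then average the remaining n - 1 labels.
train_means = (y.sum() - y) / (n - 1)

# Correlation between each held-out label and the label
# average of its own training fold.
corr = np.corrcoef(y, train_means)[0, 1]
print(f"correlation between test label and training mean: {corr:.2f}")
```

Because `train_means` equals a constant minus `y / (n - 1)`, the correlation is -1 whenever both classes are present. A model that learns nothing but the training-fold label average will therefore rank the held-out samples in exactly the wrong order.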

The following example illustrates how this small correction can have a large impact on machine learning performance metrics, with code demonstrations of scikit-learn’s LeaveOneOut and RebalancedCV’s RebalancedLeaveOneOut. This analysis uses a randomly generated observation matrix X and a binary outcome vector y. By construction, this is an example for which a fair evaluation of a machine learning model should be close to the expected performance of a random guess, which is 0.5.

import numpy as np 
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import LeaveOneOut
from rebalancedcv import RebalancedLeaveOneOut
from sklearn.metrics import roc_auc_score
np.random.seed(1)

## given some random `X` matrix, and a `y` binary vector
X = np.random.rand(100, 10)
y = np.random.rand(100) > 0.5

## Leave-one-out evaluation
loo = LeaveOneOut()
loocv_predictions = [ LogisticRegressionCV()\
                                .fit(X[train_index], y[train_index])\
                                .predict_proba(X[test_index]
                                            )[:, 1][0]
              for train_index, test_index in loo.split(X, y) ]

## Since all the data is random, a fair evaluation
## should yield an auROC close to 0.5
print('Leave One Out auROC: {:.2f}'\
              .format( roc_auc_score(y, loocv_predictions) ) )

## Rebalanced leave-one-out evaluation
rloo = RebalancedLeaveOneOut()
rloocv_predictions = [ LogisticRegressionCV()\
                                .fit(X[train_index], y[train_index])\
                                .predict_proba(X[test_index]
                                            )[:, 1][0]
              for train_index, test_index in rloo.split(X, y) ]

## Since all the data is random, a fair evaluation
## should yield an auROC close to 0.5
print('Rebalanced Leave-one-out auROC: {:.2f}'\
              .format(  roc_auc_score(y, rloocv_predictions) ) )

Leave One Out auROC: 0.00
Rebalanced Leave-one-out auROC: 0.48

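As this example demonstrates, neglecting to account for distributional bias in cross-validation can greatly depress evaluated model performance: on random data, where a fair evaluation should give an auROC close to 0.5, standard leave-one-out yields 0.00, while the rebalanced version yields 0.48. For more details on why this happens, please refer to our manuscript.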

The same code structure applies to this package's other classes, RebalancedKFold and RebalancedLeavePOut.

All classes in this package provide train/test indices to split data into train/test sets while rebalancing the training set to account for distributional bias. The package is designed to enable automated rebalancing for the cross-validation implementations in scikit-learn’s LeaveOneOut, StratifiedKFold, and LeavePOut, through the RebalancedCV classes RebalancedLeaveOneOut, RebalancedKFold, and RebalancedLeavePOut. These Rebalanced classes are designed to work in the same code structure and use cases as their scikit-learn equivalents, the only difference being a subsampling within the provided training indices.
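To make the idea of subsampling within the training indices concrete, here is a minimal sketch for the leave-one-out case with binary labels. It is an illustration under stated assumptions, not the package's implementation: the helper `rebalanced_loo_indices` is hypothetical, and the rule it uses (drop one randomly chosen training sample whose label is opposite to the held-out sample) is one simple way to equalize class balance across folds.

```python
import numpy as np

def rebalanced_loo_indices(y, seed=0):
    """Hypothetical helper, not part of RebalancedCV.

    Yields (train_idx, test_idx) pairs for leave-one-out, dropping one
    random training sample of the class opposite to the held-out sample.
    Assumes binary labels with at least one sample of each class.
    """
    y = np.asarray(y).astype(int)
    rng = np.random.default_rng(seed)
    all_idx = np.arange(y.size)
    for i in all_idx:
        train = all_idx[all_idx != i]
        # Training samples whose label differs from the held-out label.
        opposite = train[y[train] != y[i]]
        drop = rng.choice(opposite)
        yield train[train != drop], np.array([i])

y = np.array([0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
fold_means = [y[train].mean() for train, _ in rebalanced_loo_indices(y)]
print(min(fold_means), max(fold_means))
```

Every fold ends up with one sample fewer from each class than the full dataset, so the training label average is the same in every fold, whereas plain leave-one-out would give two different averages depending on the held-out label. The cost is one discarded training sample per fold.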

For more details, we provide further background and walkthroughs for each of these three classes in their respective documentation tabs.

Please report any issues you encounter while using RebalancedCV on our Issues page, or email gia2105@columbia.edu.
