Welcome to RebalancedCV!
This is a Python package designed to facilitate correcting for
distributional bias during cross-validation.
RebalancedCV contains the following four classes: RebalancedLeaveOneOut,
RebalancedKFold, RebalancedLeavePOut,
and RebalancedLeaveOneOutRegression.
These are implementations of Leave One Out, Stratified K-Fold, and
Leave P Out cross-validation with subsampling within training sets
to ensure that, for any cross-validation scheme, the class balances
across all training folds are identical.
For support using RebalancedCV, please use our issues page or email
gia2105@columbia.edu.
pip install RebalancedCV
The only dependencies are Python and scikit-learn.
from rebalancedcv import RebalancedLeaveOneOut, RebalancedKFold, RebalancedLeavePOut, RebalancedLeaveOneOutRegression
It was recently shown that removing a fraction of a dataset into a testing fold can artificially shift the label averages of training folds so that they are inversely correlated with those of their corresponding test folds. To address this issue, this package automatically subsamples points within the training set to remove any differences in label average across training folds.
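To make this concrete, below is a minimal sketch of how such a subsampling can work for binary leave-one-out: dropping one training point of the class opposite the held-out point leaves every training fold with identical class counts. The function and its name are illustrative only, not the package's actual implementation.

import numpy as np

def rebalanced_loo_split_sketch(y, seed=0):
    """Illustrative sketch, not the RebalancedCV implementation:
    yield (train, test) index arrays for binary leave-one-out,
    dropping one training point of the class opposite the held-out
    point so that every training fold shares one label average."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    indices = np.arange(len(y))
    for i in indices:
        train = np.delete(indices, i)
        ## training points whose label differs from the held-out label
        opposite = train[y[train] != y[i]]
        ## remove one of them at random to equalize class counts
        train = np.setdiff1d(train, [rng.choice(opposite)])
        yield train, np.array([i])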
The following example illustrates how this small correction can have
a large impact on machine learning performance metrics, with code
demonstrations of scikit-learn's LeaveOneOut and RebalancedCV's
RebalancedLeaveOneOut. This analysis uses a randomly generated
observation matrix X and a binary outcome vector y. By construction,
this is an example for which a fair evaluation of a machine learning
model should be close to the expected performance of a random guess,
which is an auROC of 0.5.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import LeaveOneOut
from rebalancedcv import RebalancedLeaveOneOut
from sklearn.metrics import roc_auc_score
np.random.seed(1)
## Given some random `X` matrix and binary `y` vector
X = np.random.rand(100, 10)
y = np.random.rand(100) > 0.5
## Leave-one-out evaluation
loo = LeaveOneOut()
loocv_predictions = [
    LogisticRegressionCV()
        .fit(X[train_index], y[train_index])
        .predict_proba(X[test_index])[:, 1][0]
    for train_index, test_index in loo.split(X, y)
]

## Since all the data is random, a fair evaluation
## should yield an auROC close to 0.5
print('Leave One Out auROC: {:.2f}'
      .format(roc_auc_score(y, loocv_predictions)))
## Rebalanced leave-one-out evaluation
rloo = RebalancedLeaveOneOut()
rloocv_predictions = [
    LogisticRegressionCV()
        .fit(X[train_index], y[train_index])
        .predict_proba(X[test_index])[:, 1][0]
    for train_index, test_index in rloo.split(X, y)
]

## Since all the data is random, a fair evaluation
## should yield an auROC close to 0.5
print('Rebalanced Leave-one-out auROC: {:.2f}'
      .format(roc_auc_score(y, rloocv_predictions)))
Leave One Out auROC: 0.00
Rebalanced Leave-one-out auROC: 0.48
As this example demonstrates, neglecting to account for distributional bias during cross-validation can greatly decrease evaluated model performance. For more details on why this happens, please refer to our manuscript.
We note that this example's code structure also applies to this
package's other classes, RebalancedKFold, RebalancedLeavePOut, and
RebalancedLeaveOneOutRegression, as sketched below.
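For instance, swapping RebalancedKFold into the example above requires only changing the splitter. The sketch below assumes the constructor accepts an n_splits argument like scikit-learn's StratifiedKFold, and reuses the X and y defined earlier:

from rebalancedcv import RebalancedKFold

rkf = RebalancedKFold(n_splits=5)  ## assumed to mirror StratifiedKFold
rkf_predictions = np.zeros(len(y))
for train_index, test_index in rkf.split(X, y):
    model = LogisticRegressionCV().fit(X[train_index], y[train_index])
    rkf_predictions[test_index] = model.predict_proba(X[test_index])[:, 1]
print('Rebalanced K-Fold auROC: {:.2f}'
      .format(roc_auc_score(y, rkf_predictions)))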
All classes from this package provide train/test indices to split
data into train/test sets while rebalancing the training set to account
for distributional bias. This package is designed to enable automated
rebalancing for the cross-validation implementations in scikit-learn's
LeaveOneOut, StratifiedKFold, and LeavePOut, through the RebalancedCV
classes RebalancedLeaveOneOut, RebalancedKFold, RebalancedLeavePOut,
and RebalancedLeaveOneOutRegression. These Rebalanced classes
are designed to work with the exact same code structure and in the
same use cases as their scikit-learn equivalents, the only difference
being a subsampling within the provided training indices.
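One quick way to see this subsampling in action is to compare training-fold label averages across the two splitters, reusing the X and y from the example above. Standard leave-one-out produces two distinct training-fold balances (depending on the class of the held-out point), while the rebalanced splitter should yield a single shared value:

## training-fold label averages under standard leave-one-out
print({round(y[train_index].mean(), 4)
       for train_index, test_index in LeaveOneOut().split(X, y)})
## and under rebalanced leave-one-out
print({round(y[train_index].mean(), 4)
       for train_index, test_index in RebalancedLeaveOneOut().split(X, y)})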
For more details, we provide further background and walkthroughs for each of these classes in their respective documentation tabs.
Austin, G.I. et al. “Distributional bias compromises leave-one-out cross-validation” (2024). https://arxiv.org/abs/2406.01652