rebalancedcv.RebalancedKFold

class rebalancedcv.RebalancedKFold(n_splits=5, shuffle=False, random_state=None)


Description

Stratified K-Fold cross-validator with rebalancing.

Provides train/test indices to split data in train/test sets, with sub-sampling within the training set to ensure that all training folds have identical class balances.

This class is designed to have the same functionality and implementation structure as scikit-learn’s StratifiedKFold.


Parameters

  • n_splits : int, default=5
    • Number of folds. Must be at least 2.

  • shuffle : bool, default=False
    • Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled.

  • random_state : int, RandomState instance or None, default=None
    • When shuffle is True, random_state affects the ordering of the indices, which controls the randomness of each fold for each class. Otherwise, leave random_state as None. Pass an int for reproducible output across multiple function calls. See the scikit-learn Glossary entry for random_state.

These parameters are designed to match the structure and functionality of scikit-learn’s StratifiedKFold.

Example

### Observing the indices on a small example dataset
import numpy as np
from rebalancedcv import RebalancedKFold

X = np.array([[1, 2, 1, 2, 1], [3, 4, 3, 4, 3]]).T
y = np.array([1, 2, 1, 2, 1])
rkf = RebalancedKFold(n_splits=2)
rkf.get_n_splits(X, y)
for i, (train_index, test_index) in enumerate(rkf.split(X, y)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")
Fold 0:
  Train: index=[3 4]
  Test:  index=[0 1 2]
Fold 1:
  Train: index=[0 1]
  Test:  index=[3 4]
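
The rebalancing idea behind the folds above can be sketched with NumPy alone. This is a minimal illustration, not the library's implementation: the helper name rebalance_training_folds is hypothetical, and the rule it uses (subsample every training fold down to the smallest per-class count seen in any training fold) is an assumption chosen because it gives every training fold the same class counts on this toy dataset.

```python
import numpy as np

def rebalance_training_folds(y, test_folds, seed=None):
    """Hypothetical helper: subsample each training fold to shared per-class counts."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    all_idx = np.arange(len(y))
    # The training indices of a fold are all indices outside its test fold.
    train_folds = [np.setdiff1d(all_idx, test) for test in test_folds]
    # Assumed rule: per-class target is the smallest count in any training fold.
    targets = {c: min(int(np.sum(y[tr] == c)) for tr in train_folds) for c in classes}
    balanced = []
    for tr in train_folds:
        # Draw exactly targets[c] samples of each class without replacement.
        keep = [rng.choice(tr[y[tr] == c], size=targets[c], replace=False) for c in classes]
        balanced.append(np.sort(np.concatenate(keep)))
    return balanced

y = np.array([1, 2, 1, 2, 1])
# Test folds taken from the example output above.
test_folds = [np.array([0, 1, 2]), np.array([3, 4])]
balanced = rebalance_training_folds(y, test_folds, seed=0)
# Per-class counts (class 1, class 2) in each rebalanced training fold.
counts = [np.bincount(y[tr], minlength=3)[1:] for tr in balanced]
for i, (tr, c) in enumerate(zip(balanced, counts)):
    print(f"Fold {i}: train={tr}, class counts={c}")
```

Under this assumed rule, both training folds keep one sample of each class, which mirrors the two-element training sets in the output above. Which class-1 sample survives in the second fold depends on the seed.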


Methods

The methods of the RebalancedKFold class are designed to provide the same functionality as those of scikit-learn’s StratifiedKFold.

split(X, y, groups=None, seed=None)
  • Generate indices to split data into training and test set.

    • Parameters
      • X : array-like of shape (n_samples, n_features)
        • Training data, where n_samples is the number of samples and n_features is the number of features.
      • y : array-like of shape (n_samples,)
        • The target variable for supervised learning problems.
      • groups : array-like of shape (n_samples,), default=None
        • Group labels for the samples used while splitting the dataset into train/test set.
      • seed : int, default=None
        • If provided, used as the random seed for subsampling the training set.

    • Yields
      • train : ndarray
        • The training set indices for that split.
      • test : ndarray
        • The testing set indices for that split.
See also:
RebalancedLeaveOneOut
       Leave-one-out iterator with training set rebalancing
RebalancedLeavePOut
       Leave-P-out iterator with training set rebalancing

For more background on Stratified K-Fold, refer to the scikit-learn User Guide.