rebalancedcv.RebalancedKFold

class rebalancedcv.RebalancedKFold(n_splits=5, shuffle=False, random_state=None)

Description

Stratified K-Fold cross-validator with rebalancing.

Provides train/test indices to split data into train/test sets, subsampling within each training set so that all training folds have identical class balances.

This class is designed to have the same functionality and implementation structure as scikit-learn’s StratifiedKFold.
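The subsampling idea described above can be illustrated with a short NumPy-only sketch. It is not the library's implementation: the helper name `rebalance_train_indices` is hypothetical, and equalizing class counts is just one simple rule that gives every training fold the same class balance. The rule actually used by RebalancedKFold may differ.

```python
import numpy as np


def rebalance_train_indices(train_index, y, rng):
    """Hypothetical helper: subsample train_index so each class appears equally often."""
    train_index = np.asarray(train_index)
    labels = y[train_index]
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()  # size of the smallest class in this training set
    kept = [
        # Draw n_keep indices of class c without replacement.
        rng.choice(train_index[labels == c], size=n_keep, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(kept))


y = np.array([1, 2, 1, 2, 1, 1, 1, 2])
train_index = np.array([0, 1, 2, 3, 4, 5])  # four samples of class 1, two of class 2
rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility

balanced = rebalance_train_indices(train_index, y, rng)
print(np.unique(y[balanced], return_counts=True))
```

Applying a rule like this to every training fold produced by a stratified splitter yields training sets whose class proportions do not vary from fold to fold.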

Parameters

n_splits : int, default=5
- Number of folds. Must be at least 2.
shuffle : bool, default=False
- Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled.
random_state : int, RandomState instance or None, default=None
- When shuffle is True, random_state affects the ordering of the indices, which controls the randomness of each fold for each class. Otherwise, leave random_state as None. Pass an int for reproducible output across multiple function calls. See the random_state entry in the scikit-learn Glossary.

These parameters are designed to match the structure and functionality of scikit-learn’s StratifiedKFold.

Example

### Observing the indices on a small example dataset
import numpy as np
from rebalancedcv import RebalancedKFold

X = np.array([[1, 2, 1, 2, 1], [3, 4, 3, 4, 3]]).T
y = np.array([1, 2, 1, 2, 1])
rkf = RebalancedKFold(n_splits=2)
rkf.get_n_splits(X, y)
for i, (train_index, test_index) in enumerate(rkf.split(X, y)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")

Fold 0:
  Train: index=[3 4]
  Test:  index=[0 1 2]
Fold 1:
  Train: index=[0 1]
  Test:  index=[3 4]

Methods

The methods of the RebalancedKFold class are designed to enable identical functionality to scikit-learn’s StratifiedKFold.

split(X, y, groups=None, seed=None)

Generate indices to split data into training and test set.
- Parameters
  - X : array-like of shape (n_samples, n_features)
    - Training data, where n_samples is the number of samples and n_features is the number of features.
  - y : array-like of shape (n_samples,)
    - The target variable for supervised learning problems.
  - groups : array-like of shape (n_samples,), default=None
    - Group labels for the samples used while splitting the dataset into train/test set.
  - seed : int, default=None
    - If provided, used to seed the random subsampling of the training set.
- Yields
  - train : ndarray
    - The training set indices for that split.
  - test : ndarray
    - The testing set indices for that split.

See also:
RebalancedLeaveOneOut
Leave-one-out iterator with training set rebalancing
RebalancedLeavePOut
Leave-P-out iterator with training set rebalancing

For more background on Stratified K-Fold, refer to the scikit-learn User Guide.