debiasm.MultitaskDebiasMClassifier

class debiasm.MultitaskDebiasMClassifier(batch_str='infer',
learning_rate=0.005,
min_epochs=25,
l2_strength=0,
w_l2=0,
random_state=None,
x_val=None,
prediction_loss=torch.nn.functional.binary_cross_entropy
)

The Multitask DEBIAS-M Classifier.

This class implements multiplicative bias-correction via DEBIAS-M for a multitask classifier. It receives as input an X matrix of n_samples x n_taxa read counts or relative abundances from multiple microbiome samples, along with binary y labels for at least two tasks.

The ‘batch_str’ parameter sets the strength of the enforced cross-batch similarity, ‘l2_strength’ sets the l2 regularization of the predictive parameters, and ‘w_l2’ sets the l2 regularization of the bias-correction parameters. ‘x_val’ holds the microbiome inputs for a held-out set for which the y labels are unavailable.

Parameters

batch_str: {‘infer’ or float}, default=‘infer’
- The weight of the enforced cross-batch similarity. Selecting ‘infer’ sets the weight to be inversely proportional to the number of pairs of batches and to the number of taxa in the input matrix. Larger values specify stronger regularization.
learning_rate: float, default=0.005
- The learning rate used while training the DEBIAS-M model.
min_epochs: int, default=25
- The minimum number of epochs completed during training.
l2_strength: float, default=0
- The l2 regularization of the linear predictive layer’s parameters. Larger values specify stronger regularization.
w_l2: float, default=0
- The l2 regularization of the multiplicative bias correction parameters (applied to the logarithm of the multiplicative parameters). Larger values specify stronger regularization.
random_state: int, default=None
- The random seed used during training, if specified.
x_val: {array-like, sparse matrix} of shape (n_samples, 1 + n_taxa), default=None
- An n_samples x (1 + n_taxa) matrix describing the read counts of held-out validation and/or test sets, whose labels are not available during training. The first column of x_val denotes the batch of each sample, as non-negative integers that are interpreted alongside the batches specified in the training inputs. Providing x_val allows DEBIAS-M to account for distribution shifts from these samples during training.
prediction_loss: loss function, default=torch.nn.functional.binary_cross_entropy
- The prediction loss function used during training.
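
To give a feel for how the ‘infer’ option of batch_str scales, the following is a minimal sketch of a weight that is inversely proportional to both the number of batch pairs and the number of taxa. The helper name `inferred_batch_weight` and the `scale` constant are hypothetical; the exact constant and formula used inside DEBIAS-M are not stated here and may differ.

```python
from math import comb


def inferred_batch_weight(n_batches: int, n_taxa: int, scale: float = 1.0) -> float:
    """Hypothetical illustration of the 'infer' scaling described above.

    `scale` is an assumed proportionality constant; it is NOT taken from
    the DEBIAS-M source.
    """
    n_batch_pairs = comb(n_batches, 2)  # number of unordered batch pairs
    if n_batch_pairs < 1 or n_taxa < 1:
        raise ValueError("need at least two batches and one taxon")
    return scale / (n_batch_pairs * n_taxa)


# 5 batches give 10 batch pairs; with 100 taxa the weight is 1 / (10 * 100)
weight = inferred_batch_weight(n_batches=5, n_taxa=100)
print(weight)  # 0.001
```

Under this assumed form, adding batches or taxa shrinks the per-term weight, so the total similarity penalty stays on a comparable scale across datasets of different sizes.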

Example

## import packages
import numpy as np
from sklearn.metrics import roc_auc_score
from debiasm import MultitaskDebiasMClassifier

## generate data for the example
np.random.seed(123)
n_samples = 96*5
n_batches = 5
n_features = 100
n_tasks=3

## the read count matrix
X = ( np.random.rand(n_samples, n_features) * 1000 ).astype(int)

## the labels
y = np.random.rand(n_samples, n_tasks)>0.5

## the batches
batches = ( np.random.rand(n_samples) * n_batches ).astype(int)

## we assume the batches are numbered ints starting at '0',
## and they are in the first column of the input X matrices
X_with_batch = np.hstack((batches[:, np.newaxis], X))
## set the validation batch to '4'
val_inds = batches==4
X_train, X_val = X_with_batch[~val_inds], X_with_batch[val_inds]
y_train, y_val = y[~val_inds], y[val_inds]

### Run multitask DEBIAS-M, using standard sklearn object methods
multitask_model = MultitaskDebiasMClassifier(x_val=X_val)
multitask_model.fit(X_train, y_train)

## Assess resulting scores
predicted_scores = multitask_model.predict_proba(X_val)

## extract the 'DEBIAS-ed' data for other downstream analyses, if applicable
X_debiased = multitask_model.transform(X_with_batch)

Methods

fit(X, y)
- Fit the model according to the given training data.
  - Parameters:
    - X : {array-like, sparse matrix} of shape (n_samples, 1 + n_taxa)
      - Training samples, where n_samples is the number of samples and n_taxa is the number of taxa. The first column of X denotes the batch of each sample, as non-negative integers, while the remaining n_taxa columns describe the read counts of each taxon. DEBIAS-M also supports relative abundance inputs.
    - y : array-like of shape (n_samples, n_tasks)
      - Target vectors relative to X, where n_tasks represents the number of training tasks.
  - Returns:
    - self
      - Fitted DEBIAS-M preprocessor and estimator
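
The input layout expected by fit can be checked before training. The sketch below is a hypothetical helper (`check_multitask_inputs` is not part of debiasm) that verifies the conventions described above for dense inputs: a leading batch column of non-negative integers, and a two-dimensional y with at least two tasks.

```python
import numpy as np


def check_multitask_inputs(X_with_batch, y):
    """Hypothetical pre-flight check; not part of the debiasm package.

    Returns (n_samples, n_taxa, n_tasks) if the layout looks valid.
    Dense inputs only: sparse matrices are not handled by this sketch.
    """
    X_with_batch = np.asarray(X_with_batch)
    y = np.asarray(y)

    if X_with_batch.ndim != 2 or X_with_batch.shape[1] < 2:
        raise ValueError("X must have shape (n_samples, 1 + n_taxa)")

    # The first column holds the batch label of each sample
    batch_column = X_with_batch[:, 0]
    if np.any(batch_column < 0) or np.any(batch_column != np.floor(batch_column)):
        raise ValueError("batch labels must be non-negative integers")

    if y.ndim != 2 or y.shape[0] != X_with_batch.shape[0]:
        raise ValueError("y must have shape (n_samples, n_tasks)")
    if y.shape[1] < 2:
        raise ValueError("the multitask classifier expects at least two tasks")

    n_samples, n_columns = X_with_batch.shape
    return n_samples, n_columns - 1, y.shape[1]


# Small synthetic example: 6 samples, 4 taxa, 3 batches, 3 tasks
rng = np.random.default_rng(0)
counts = rng.integers(0, 1000, size=(6, 4))
batches = np.array([0, 0, 1, 1, 2, 2])
X_with_batch = np.hstack((batches[:, np.newaxis], counts))
y = rng.random((6, 3)) > 0.5

shape_info = check_multitask_inputs(X_with_batch, y)
print(shape_info)  # (6, 4, 3)
```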

transform(X)
- Apply DEBIAS-M processing to X.
  - Parameters:
    - X : {array-like, sparse matrix} of shape (n_samples, 1 + n_taxa)
      - Samples to be transformed; n_samples is the number of samples and n_taxa is the number of taxa. The first column of X denotes the batch of each sample, as non-negative integers, while the remaining n_taxa columns describe the read counts of each taxon. DEBIAS-M also supports relative abundance inputs.
  - Returns:
    - X_debias : matrix of shape (n_samples, n_taxa)
      - The relative abundance matrix of X following bias-correction
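
As a rough illustration of what a multiplicative bias-correction followed by renormalization looks like, the sketch below scales each taxon by a batch-specific factor and then converts each row to relative abundances. It is not the debiasm implementation: the function `apply_multiplicative_correction` is hypothetical, and the assumption that there is one log-scale factor per batch and per taxon (`log_factors` of shape (n_batches, n_taxa)) is made only for this example. In the fitted model these factors are learned during fit.

```python
import numpy as np


def apply_multiplicative_correction(X_with_batch, log_factors):
    """Illustrative only; not the debiasm implementation.

    X_with_batch : array of shape (n_samples, 1 + n_taxa), batch in column 0
    log_factors  : assumed array of shape (n_batches, n_taxa) holding the
                   logarithm of each batch's multiplicative factors
    """
    X_with_batch = np.asarray(X_with_batch, dtype=float)
    batches = X_with_batch[:, 0].astype(int)
    counts = X_with_batch[:, 1:]

    # Look up each sample's batch factors and rescale its taxa
    scaled = counts * np.exp(log_factors[batches])

    # Renormalize each sample to relative abundances
    # (assumes every sample has a non-zero total)
    return scaled / scaled.sum(axis=1, keepdims=True)


X_small = np.array([[0, 10, 30],
                    [1, 20, 20]])

# Zero log-factors leave the data unchanged apart from normalization
X_identity = apply_multiplicative_correction(X_small, np.zeros((2, 2)))

# Up-weighting the first taxon in batch 1 by a factor of 3
log_factors = np.array([[0.0, 0.0],
                        [np.log(3.0), 0.0]])
X_corrected = apply_multiplicative_correction(X_small, log_factors)
print(X_corrected)
```

Because the output is renormalized row by row, it has shape (n_samples, n_taxa) and each row sums to one, matching the relative abundance matrix described for transform.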

predict_proba(X)
- Calculate DEBIAS-M classification probability estimates; the returned estimates for all classes are ordered by the label of classes.
  - Parameters:
    - X : {array-like, sparse matrix} of shape (n_samples, 1 + n_taxa)
      - Samples to obtain predictions for; n_samples is the number of samples and n_taxa is the number of taxa. The first column of X denotes the batch of each sample, as non-negative integers, while the remaining n_taxa columns describe the read counts of each taxon. DEBIAS-M also supports relative abundance inputs.
  - Returns:
    - P : list of n_tasks ndarrays, each of shape (n_samples, n_classes)
      - The probabilities of each sample for each task and class considered in the model.
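
Given the return format described above, a per-task AUROC can be computed by taking the positive-class column of each task's array. The sketch below uses randomly generated stand-in arrays in place of a real predict_proba output so that it runs without a fitted model. It assumes that column 1 corresponds to the positive class, following the statement that classes are ordered by label.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_samples, n_tasks = 50, 3

# Stand-in binary labels of shape (n_samples, n_tasks)
y_val = rng.random((n_samples, n_tasks)) > 0.5

# Stand-in for multitask_model.predict_proba(X_val):
# a list of n_tasks arrays, each of shape (n_samples, n_classes)
predicted_scores = []
for _ in range(n_tasks):
    p_positive = rng.random(n_samples)
    predicted_scores.append(np.column_stack([1 - p_positive, p_positive]))

# One AUROC per task, using the assumed positive-class column (index 1)
task_aucs = [
    roc_auc_score(y_val[:, task], predicted_scores[task][:, 1])
    for task in range(n_tasks)
]
for task, auc in enumerate(task_aucs):
    print(f"task {task}: AUROC = {auc:.3f}")
```

With a fitted model, `predicted_scores` would instead come from predict_proba, and `y_val` would hold the held-out labels for the same samples.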

See also:
Multitask DEBIAS-M Classifier Demo
The DEBIAS-M regressor
OnlineDebiasMClassifier
DEBIAS-M for online corrections

For more background on DEBIAS-M, refer to our manuscript.