debiasm.DebiasMRegressor


class debiasm.DebiasMRegressor(batch_str='infer',
                   learning_rate=0.005,
                   min_epochs=25,
                   l2_strength=0,
                   w_l2=0,
                   random_state=None,
                   x_val=None,
                   prediction_loss=torch.nn.functional.mse_loss
                   )


The DEBIAS-M Regressor.

This class implements multiplicative bias correction via DEBIAS-M for regression. It receives as input an n_samples x n_taxa matrix X of read counts or relative abundances from multiple microbiome samples, along with a continuous regression target y.

The ‘batch_str’ parameter sets the weight of the enforced cross-batch similarity, ‘l2_strength’ sets the l2 regularization of the predictive parameters, and ‘w_l2’ sets the l2 regularization of the bias-correction parameters. ‘x_val’ holds the microbiome inputs for a held-out set whose y values are unavailable.
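
To make the idea of a multiplicative correction concrete, the sketch below rescales each taxon by a per-batch factor and renormalizes every sample to relative abundances. It is only an illustration under stated assumptions: the function name `apply_multiplicative_correction`, the `log_factors` array, and the use of a natural-log parameterization are hypothetical and are not taken from the DEBIAS-M implementation, which learns its factors during training.

```python
import numpy as np

def apply_multiplicative_correction(X_with_batch, log_factors):
    """Illustrative only: rescale each taxon by a per-batch factor, then renormalize.

    X_with_batch : (n_samples, 1 + n_taxa), batch index in the first column
    log_factors  : (n_batches, n_taxa), hypothetical log-scale correction factors
    """
    X_with_batch = np.asarray(X_with_batch, dtype=float)
    batches = X_with_batch[:, 0].astype(int)   # first column holds the batch ids
    counts = X_with_batch[:, 1:]               # remaining columns hold the taxa
    # Multiply every taxon by the factor belonging to the sample's batch
    corrected = counts * np.exp(log_factors[batches])
    # Renormalize so that each sample sums to 1 (relative abundances)
    return corrected / corrected.sum(axis=1, keepdims=True)

# Toy data: 6 samples, 4 taxa, 3 batches (all values are made up)
rng = np.random.default_rng(0)
X = rng.integers(1, 1000, size=(6, 4))
batches = np.array([0, 0, 1, 1, 2, 2])
X_with_batch = np.hstack((batches[:, np.newaxis], X))
log_factors = rng.normal(scale=0.1, size=(3, 4))  # stand-in for learned parameters
X_corrected = apply_multiplicative_correction(X_with_batch, log_factors)
```

With all log-factors set to zero, the sketch reduces to plain relative-abundance normalization, which is one way to sanity-check it.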


Parameters

  • batch_str: {‘infer’ or float}, default=‘infer’
    • The weight of the enforced cross-batch similarity. Selecting ‘infer’ automatically sets the weight inversely proportional to both the number of pairs of batches and the number of taxa in the input matrix. Larger values specify stronger regularization.

  • learning_rate: float, default=0.005
    • The learning rate used while training the DEBIAS-M model.

  • min_epochs: int, default=25
    • The minimum number of epochs completed during training.

  • l2_strength: float, default=0
    • The l2 regularization of the linear predictive layer’s parameters. Larger values specify stronger regularization.

  • w_l2: float, default=0
    • The l2 regularization of the multiplicative bias correction parameters (applied to the logarithm of the multiplicative parameters). Larger values specify stronger regularization.

  • random_state: int, default=None
    • The seed used during training, if specified.

  • x_val: {array-like, sparse matrix} of shape (n_samples, 1 + n_taxa), default=None
    • An n_samples x (1 + n_taxa) matrix describing the read counts of held-out validation and/or test sets, whose labels are not available during training. The first column of x_val denotes the batch of each sample, as non-negative integers that are interpreted alongside the batches specified in the training inputs. Providing x_val allows DEBIAS-M to account for distribution shifts from these samples during training.

  • prediction_loss: loss function, default=torch.nn.functional.mse_loss
    • The prediction loss function used during training.
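
The ‘infer’ setting of batch_str is described above only as inversely proportional to the number of batch pairs and the number of taxa. The sketch below shows what that proportionality implies as the number of batches grows. The helper `inferred_batch_weight` and its `scale` constant are hypothetical; the exact formula and constant used by DEBIAS-M are not stated in this page.

```python
def inferred_batch_weight(n_batches, n_taxa, scale=1.0):
    """Hypothetical helper: weight inversely proportional to (#batch pairs * #taxa)."""
    if n_batches < 2 or n_taxa < 1:
        raise ValueError("need at least two batches and one taxon")
    n_pairs = n_batches * (n_batches - 1) // 2  # unordered pairs of batches
    return scale / (n_pairs * n_taxa)

# More batches means more pairs, so the assumed weight per pair shrinks
weight_small = inferred_batch_weight(n_batches=5, n_taxa=100)   # 10 batch pairs
weight_large = inferred_batch_weight(n_batches=10, n_taxa=100)  # 45 batch pairs
```

Passing a float for batch_str bypasses any such inference and sets the weight directly.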



Example

## import packages
import numpy as np
from sklearn.metrics import r2_score
from debiasm import DebiasMRegressor

## generate data for the example
np.random.seed(123)
n_samples = 96*5
n_batches = 5
n_features = 100

## the read count matrix
X = ( np.random.rand(n_samples, n_features) * 1000 ).astype(int)

## the labels
y = np.random.rand(n_samples)

## the batches
batches = ( np.random.rand(n_samples) * n_batches ).astype(int)

## we assume the batches are numbered ints starting at '0',
## and they are in the first column of the input X matrices
X_with_batch = np.hstack((batches[:, np.newaxis], X))
## set the validation batch to '4'
val_inds = batches==4
X_train, X_val = X_with_batch[~val_inds], X_with_batch[val_inds]
y_train, y_val = y[~val_inds], y[val_inds]

### Run DEBIAS-M, using standard sklearn object methods
dmc = DebiasMRegressor(x_val=X_val) ## give it the held-out inputs to account for
                                    ## those domain shifts while training
dmc.fit(X_train, y_train)

## Assess results
### should be ~0 in this example, since the data is all random
r2_score(y_val, dmc.predict(X_val)) 

## extract the 'DEBIAS-ed' data for other downstream analyses, if applicable
X_debiased = dmc.transform(X_with_batch)


Methods

  • fit(X, y)

    • Fit the model according to the given training data.

      • Parameters:
        • X : {array-like, sparse matrix} of shape (n_samples, 1 + n_taxa)
          • Training samples, where n_samples is the number of samples and n_taxa is the number of taxa. The first column of X denotes the batch of each sample, as non-negative integers, while the remaining n_taxa columns describe the read counts of each taxon. DEBIAS-M also supports relative abundance inputs.
        • y : array-like of shape (n_samples,)
          • Target values relative to X.

      • Returns:
        • self
          • Fitted DEBIAS-M preprocessor and estimator
  • transform(X)

    • Apply DEBIAS-M processing to X.

      • Parameters:
        • X : {array-like, sparse matrix} of shape (n_samples, 1 + n_taxa)
          • Samples to be transformed; n_samples is the number of samples and n_taxa is the number of taxa. The first column of X denotes the batch of each sample, as non-negative integers, while the remaining n_taxa columns describe the read counts of each taxon. DEBIAS-M also supports relative abundance inputs.

      • Returns:
        • X_debias : matrix of shape (n_samples, n_taxa)
          • The relative abundance matrix of X following bias-correction
  • predict(X)

    • Predict using the bias-corrected samples and the fitted DEBIAS-M linear model.

      • Parameters:
        • X : {array-like, sparse matrix} of shape (n_samples, 1 + n_taxa)
          • Samples to obtain predictions for; n_samples is the number of samples and n_taxa is the number of taxa. The first column of X denotes the batch of each sample, as non-negative integers, while the remaining n_taxa columns describe the read counts of each taxon. DEBIAS-M also supports relative abundance inputs.

      • Returns:
        • y_pred : array-like of shape (n_samples,)
          • The predicted values for each sample.
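
All three methods expect the same (n_samples, 1 + n_taxa) layout, with the batch in the first column. A small check before calling them can catch layout mistakes early. The helper below, `split_batch_column`, is a hypothetical convenience function and not part of debiasm; it is a minimal sketch that handles dense arrays only, not sparse matrices.

```python
import numpy as np

def split_batch_column(X_with_batch):
    """Hypothetical helper: validate the layout and return (batches, abundances)."""
    X_with_batch = np.asarray(X_with_batch)
    if X_with_batch.ndim != 2 or X_with_batch.shape[1] < 2:
        raise ValueError("expected shape (n_samples, 1 + n_taxa)")
    batch_col = X_with_batch[:, 0]
    # Batches must be non-negative whole numbers
    if np.any(batch_col < 0) or np.any(batch_col != np.floor(batch_col)):
        raise ValueError("batch column must hold non-negative integers")
    return batch_col.astype(int), X_with_batch[:, 1:]

# Toy input: 3 samples, 3 taxa, batches 0 and 1 (values are made up)
counts = np.array([[10, 0, 5], [3, 7, 1], [0, 2, 9]])
batches = np.array([0, 1, 1])
X_with_batch = np.hstack((batches[:, np.newaxis], counts))
batch_ids, abundances = split_batch_column(X_with_batch)
```

The returned `batch_ids` can also be used to pick a held-out batch, as the example above does with batch 4.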

See also:
MultitaskDebiasMRegressor
       The multitask DEBIAS-M regressor
MultitaskDebiasMClassifier
       Implementation of the multitask DEBIAS-M classifier

For more background on DEBIAS-M, refer to our manuscript.