debiasm.DebiasMClassifierLogAdd


class debiasm.DebiasMClassifierLogAdd(batch_str='infer',
                       learning_rate=0.005,
                       min_epochs=25,
                       l2_strength=0,
                       w_l2=0,
                       random_state=None,
                       x_val=None,
                       prediction_loss=torch.nn.functional.binary_cross_entropy
                      )


The DEBIAS-M Classifier implementation for logspace inputs.

This class implements additive DEBIAS-M bias-correction, which models the processing-bias mechanism in logarithmic space representations of read counts, such as the center log ratio transform.

The model takes a microbiome matrix X of log-processed read counts (n_samples x n_taxa), with samples drawn from multiple batches, along with a binary label y for each sample.

The ‘batch_str’ parameter sets the strength of the enforced cross-batch similarity, ‘l2_strength’ sets the l2 regularization of the predictive parameters, and ‘w_l2’ sets the l2 regularization of the bias-correction parameters. ‘x_val’ holds the microbiome inputs of a held-out set for which the y labels are unavailable.
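
To make the additive log-space formulation described above concrete, the following is a minimal NumPy sketch, not the library's implementation. It assumes a hypothetical per-taxon multiplicative bias vector `w` for a single batch and shows that, after a center log ratio transform, this bias appears as an additive per-taxon offset. The helper `clr_rows` and all variable names are illustrative assumptions.

```python
import numpy as np

def clr_rows(counts):
    """Center log ratio transform applied to each row (sample)."""
    logs = np.log(counts)
    return logs - logs.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical "true" positive counts for 4 samples and 6 taxa
true_counts = rng.integers(1, 1000, size=(4, 6)).astype(float)

# Hypothetical multiplicative per-taxon processing bias for one batch
w = rng.uniform(0.5, 2.0, size=6)
observed = true_counts * w

# In CLR space the multiplicative bias becomes an additive per-taxon offset
offset = np.log(w) - np.log(w).mean()
assert np.allclose(clr_rows(observed), clr_rows(true_counts) + offset)

# Subtracting the offset recovers the unbiased CLR profile
corrected = clr_rows(observed) - offset
```

In practice the bias is unknown and has to be learned; the sketch only illustrates why an additive correction is the natural choice for log-space inputs.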


Parameters

  • batch_str: {‘infer’ or float}, default=‘infer’
    • The weight of the enforced cross-batch similarity. Selecting ‘infer’ sets the weight automatically, inversely proportional to both the number of batch pairs and the number of taxa in the input matrix. Larger values specify stronger regularization.

  • learning_rate: float, default=0.005
    • The learning rate used during the DEBIAS-M model convergence.

  • min_epochs: int, default=25
    • The minimum number of epochs completed during training.

  • l2_strength: float, default=0
    • The l2 regularization of the linear predictive layer’s parameters. Larger values specify stronger regularization.

  • w_l2: float, default=0
    • The l2 regularization of the multiplicative bias correction parameters (applied to the logarithm of the multiplicative parameters). Larger values specify stronger regularization.

  • random_state: int, default=None
    • The random seed used during training, if provided.

  • x_val: {array-like, sparse matrix} of shape (n_samples, 1 + n_taxa), default=None
    • An n_samples x (1 + n_taxa) matrix describing the log-processed read counts of held-out validation and/or test sets, for which validation or testing labels are not available during training. The first column of x_val denotes the batch of each sample, as non-negative integers that are interpreted alongside the batches specified in the training inputs. Providing x_val allows DEBIAS-M to account for distribution shifts from these samples during training.

  • prediction_loss: loss function, default=torch.nn.functional.binary_cross_entropy
    • Used to specify the prediction loss function to be used during training.
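
The input layout expected for X and x_val (batch labels in the first column, followed by n_taxa log-processed columns) can be assembled as in the sketch below. The helper `add_batch_column` is hypothetical and not part of debiasm; its checks simply mirror the requirements stated in the parameter descriptions above.

```python
import numpy as np

def add_batch_column(log_abundances, batches):
    """Prepend integer batch labels as the first column (hypothetical helper)."""
    log_abundances = np.asarray(log_abundances, dtype=float)
    batches = np.asarray(batches)
    if batches.shape != (log_abundances.shape[0],):
        raise ValueError("need exactly one batch label per sample")
    if not np.issubdtype(batches.dtype, np.integer) or (batches < 0).any():
        raise ValueError("batch labels must be non-negative integers")
    return np.hstack((batches[:, np.newaxis], log_abundances))

log_abundances = np.zeros((3, 4))  # 3 samples, 4 taxa (placeholder values)
batches = np.array([0, 0, 1])
X_with_batch = add_batch_column(log_abundances, batches)
# X_with_batch has shape (3, 5): 1 batch column + 4 taxa columns
```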



Example

## import packages
import numpy as np
from sklearn.metrics import roc_auc_score
from skbio.stats.composition import clr
from debiasm import DebiasMClassifierLogAdd


## generate data for the example
np.random.seed(123)
n_samples = 96*5
n_batches = 5
n_features = 100

## the read count matrix, with a pseudocount
X = 1 + ( np.random.rand(n_samples, n_features) * 1000 ).astype(int)

## map into relative abundance, then center log ratio space
X = clr( X / X.sum(axis=1)[:, np.newaxis] )

## the labels
y = np.random.rand(n_samples)>0.5

## the batches
batches = ( np.random.rand(n_samples) * n_batches ).astype(int)

## we assume the batches are numbered ints starting at '0',
## and they are in the first column of the input X matrices
X_with_batch = np.hstack((batches[:, np.newaxis], X))
## set the validation batch to '4'
val_inds = batches==4
X_train, X_val = X_with_batch[~val_inds], X_with_batch[val_inds]
y_train, y_val = y[~val_inds], y[val_inds]

### Run DEBIAS-M, using standard sklearn object methods
dmc = DebiasMClassifierLogAdd(x_val=X_val) ## give it the held-out inputs to account for
                                    ## those domain shifts while training
dmc.fit(X_train, y_train)

## Assess results
### should be ~0.5 in this example, since the data is all random
roc_auc_score(y_val, dmc.predict_proba(X_val)[:, 1]) 

## extract the 'DEBIAS-ed' data for other downstream analyses, if applicable 
X_debiased = dmc.transform(X_with_batch)
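
The example above holds out a single batch. To repeat the evaluation with each batch held out in turn, a split generator along the following lines could be used. This is a sketch under stated assumptions: `leave_one_batch_out` is a hypothetical helper, not part of debiasm, and it relies only on the convention that the first column of the input holds integer batch labels.

```python
import numpy as np

def leave_one_batch_out(X_with_batch, y):
    """Yield (batch, X_train, y_train, X_val, y_val), holding out one batch at a time."""
    batch_col = X_with_batch[:, 0].astype(int)
    for b in np.unique(batch_col):
        val = batch_col == b
        yield b, X_with_batch[~val], y[~val], X_with_batch[val], y[val]

# Tiny hypothetical input: 6 samples, 3 batches, 2 taxa
X_demo = np.hstack((np.array([0, 0, 1, 1, 2, 2])[:, np.newaxis], np.zeros((6, 2))))
y_demo = np.array([0, 1, 0, 1, 0, 1])
splits = list(leave_one_batch_out(X_demo, y_demo))
```

Within each split, a fresh DebiasMClassifierLogAdd(x_val=X_val) would be fitted on X_train and y_train, as in the example above.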


Methods

  • fit(X, y)

    • Fit the model according to the given training data.

      • Parameters:
        • X : {array-like, sparse matrix} of shape (n_samples, 1 + n_taxa)
          • Training samples, where n_samples is the number of samples and n_taxa is the number of taxa. The first column of X denotes the batch of each sample, as non-negative integers, while the remaining n_taxa describe log-processed output of each taxon. DEBIAS-M also supports relative abundance inputs.
        • y : array-like of shape (n_samples,)
          • Target vector relative to X.

      • Returns:
        • self
          • Fitted DEBIAS-M preprocessor and estimator
  • transform(X)

    • Apply DEBIAS-M processing to X.

      • Parameters:
        • X : {array-like, sparse matrix} of shape (n_samples, 1 + n_taxa)
          • Log-space samples to be transformed; n_samples is the number of samples and n_taxa is the number of taxa. The first column of X denotes the batch of each sample, as non-negative integers, while the remaining n_taxa describe the log-processed output of each taxon. DEBIAS-M also supports relative abundance inputs.

      • Returns:
        • X_debias
          • Matrix of shape (n_samples, n_taxa): the relative abundance matrix of X following bias correction.
  • predict_proba(X)

    • Calculate DEBIAS-M classification probability estimates; the returned estimates for all classes are ordered by the label of classes.

      • Parameters:
        • X : {array-like, sparse matrix} of shape (n_samples, 1 + n_taxa)
          • Log-space samples to obtain predictions for; n_samples is the number of samples and n_taxa is the number of taxa. The first column of X denotes the batch of each sample, as non-negative integers, while the remaining n_taxa describe the log-processed output of each taxon. DEBIAS-M also supports relative abundance inputs.

      • Returns:
        • T : array-like of shape (n_samples, n_classes)
          • The probability of the sample for each class in the model
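
Since predict_proba returns one column per class, ordered by class label, the positive-class score for a binary problem is the second column. The sketch below uses a hypothetical probability array in place of real model output, and the 0.5 threshold is an assumption chosen for illustration.

```python
import numpy as np

# Hypothetical predict_proba output for 4 samples: columns are classes 0 and 1
proba = np.array([[0.9, 0.1],
                  [0.4, 0.6],
                  [0.5, 0.5],
                  [0.2, 0.8]])

positive_scores = proba[:, 1]    # probability of the positive class
labels = positive_scores > 0.5   # hard labels at an assumed 0.5 threshold
```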

See also:
DEBIAS-M Regression
       The DEBIAS-M regressor
DebiasMClassifier
       Implementation of a DEBIAS-M classifier

For more background on DEBIAS-M, refer to our manuscript.