DEBIAS-M: Domain adaptation with phenotype Estimation and Batch Integration Across Studies of the Microbiome

Welcome to DEBIAS-M! This is a Python package for processing-bias correction in microbiome studies. It facilitates data harmonization, domain adaptation, predictive modeling, and batch correction.

DEBIAS-M is designed to work across multiple microbiome studies, and provides several classes depending on the application:

DebiasMClassifier - classification
DebiasMRegressor - regression
MultitaskDebiasMClassifier - multitask classification
MultitaskDebiasMRegressor - multitask regression
OnlineDebiasMClassifier - online adaptation for classification
DebiasMClassifierLogAdd - bias correction in logarithmic space

For any support using DEBIAS-M, please use our issues page or email gia2105@columbia.edu.

Installation

pip install DEBIAS-M

The dependencies for installation are Python, scikit-learn, torch, and pytorch-lightning.

Loading

from debiasm import DebiasMClassifier, DebiasMRegressor, MultitaskDebiasMClassifier, MultitaskDebiasMRegressor, OnlineDebiasMClassifier, DebiasMClassifierLogAdd

Background

It has been shown that each experimental processing and bioinformatic step in microbiome analysis pipelines has distinct biases that multiplicatively over- and understate the relative abundance of various microbes. As a result, one can find significant differences even when the same sample is processed with different protocols. This typically manifests as strong study and batch effects. DEBIAS-M addresses this issue using a model of multiplicative taxon-specific biases combined with two key assumptions: (1) the average microbiome composition of different batches/studies should be similar; (2) there is an underlying association between the microbiome and the provided phenotype that is diminished by bias and becomes stronger when the bias is corrected. For more details, please refer to our manuscript, where we show that DEBIAS-M improves batch correction and predictive modeling across microbiome studies, and that it changes the underlying data using explainable and quasi-mechanistic parameters.
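
To make the multiplicative bias model concrete, the following is a minimal NumPy sketch, not DEBIAS-M's implementation. The bias factors, the `normalize` helper, and all variable names are hypothetical and chosen only for illustration. It shows how taxon-specific factors distort the relative abundances of one sample under two protocols, and that dividing by the factors (if they were known) and renormalizing recovers the original composition.

```python
import numpy as np

rng = np.random.default_rng(0)
n_taxa = 5

def normalize(v):
    """Rescale a non-negative vector so it sums to 1 (relative abundances)."""
    return v / v.sum()

# Hypothetical "true" composition of a single sample
true_abundance = normalize(rng.random(n_taxa))

# Hypothetical taxon-specific bias factors for two protocols (illustrative values)
bias_a = np.array([1.0, 2.0, 0.5, 1.5, 0.8])
bias_b = np.array([0.6, 1.0, 3.0, 0.9, 1.2])

# Each protocol scales every taxon by its own factor; the result is renormalized
observed_a = normalize(true_abundance * bias_a)
observed_b = normalize(true_abundance * bias_b)

# The same sample looks different under the two protocols
print("largest per-taxon difference:", np.abs(observed_a - observed_b).max())

# With known factors, dividing them out and renormalizing recovers the truth
recovered_a = normalize(observed_a / bias_a)
recovered_b = normalize(observed_b / bias_b)
print(np.allclose(recovered_a, true_abundance),
      np.allclose(recovered_b, true_abundance))
```

In practice the factors are not known in advance; DEBIAS-M has to infer them from the data under the two assumptions above, as described in the manuscript.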

The inputs, outputs, and syntax within the DEBIAS-M package are built using a scikit-learn base, and are designed to have similar functionality to standard scikit-learn classes, such as dmc.fit(X, Y), dmc.transform(X), and dmc.predict_proba(X). For example:

## import packages
import numpy as np
from sklearn.metrics import roc_auc_score
from debiasm import DebiasMClassifier

## generate data for the example
np.random.seed(123)
n_samples = 96*5
n_batches = 5
n_features = 100

## the read count matrix
X = ( np.random.rand(n_samples, n_features) * 1000 ).astype(int)

## the labels
y = np.random.rand(n_samples)>0.5

## the batches
batches = ( np.random.rand(n_samples) * n_batches ).astype(int)

## we assume the batches are numbered ints starting at '0',
## and they are in the first column of the input X matrices
X_with_batch = np.hstack((batches[:, np.newaxis], X))
## set the validation batch to '4'
val_inds = batches==4
X_train, X_val = X_with_batch[~val_inds], X_with_batch[val_inds]
y_train, y_val = y[~val_inds], y[val_inds]

### Run DEBIAS-M, using standard sklearn object methods
dmc = DebiasMClassifier(x_val=X_val) ## give it the held-out inputs to account for
                                    ## those domain shifts while training
dmc.fit(X_train, y_train)

## Assess results
### should be ~0.5 in this example, since the data is all random
print(roc_auc_score(y_val, dmc.predict_proba(X_val)[:, 1]))

## extract the debiased data for other downstream analyses, if applicable
X_debiased = dmc.transform(X_with_batch)
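
The example assumes that batches are integers starting at 0 and stored in the first column of the input matrix. Real datasets often identify studies by name, so the sketch below shows one way to build that layout with NumPy. The helper `add_batch_column` and the study names are hypothetical and are not part of the debiasm package.

```python
import numpy as np

def add_batch_column(X, batch_labels):
    """Prepend integer batch codes (0, 1, 2, ...) as the first column of X.

    Hypothetical helper for illustration; not part of the debiasm package.
    Returns the combined matrix and the sorted unique labels, where
    labels[i] is the original label that was mapped to code i.
    """
    labels, codes = np.unique(np.asarray(batch_labels), return_inverse=True)
    codes = codes.reshape(-1)  # ensure a flat vector of integer codes
    X_with_batch = np.hstack((codes[:, np.newaxis], np.asarray(X)))
    return X_with_batch, labels

# Toy read-count matrix: 4 samples x 3 taxa
X = np.array([[10, 0, 5],
              [3, 7, 1],
              [8, 2, 2],
              [0, 4, 9]])
study_names = ["study_B", "study_A", "study_B", "study_C"]

X_with_batch, labels = add_batch_column(X, study_names)
print(labels)        # position i holds the label mapped to code i
print(X_with_batch)  # first column holds the integer batch codes
```

Keeping the returned `labels` array makes it possible to map the integer codes back to the original study names later.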

For more details, we provide further background and walkthrough demonstrations for each of DEBIAS-M’s classes in their respective documentation tabs.
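
The example above holds out a single batch for validation. One way to repeat that split for every batch is sketched below using only NumPy. The generator `leave_one_batch_out` is a hypothetical helper, not part of the debiasm package, and the toy data is random. The commented lines indicate where the calls from the example above would go.

```python
import numpy as np

def leave_one_batch_out(X_with_batch, y):
    """Yield (held_out_batch, X_train, y_train, X_val, y_val) for every batch.

    Hypothetical helper; assumes batch indices are in the first column.
    """
    batches = X_with_batch[:, 0]
    for b in np.unique(batches):
        val_inds = batches == b
        yield (b, X_with_batch[~val_inds], y[~val_inds],
               X_with_batch[val_inds], y[val_inds])

# Toy data: 60 samples, 8 features, 3 batches of equal size
rng = np.random.default_rng(123)
n_samples, n_features, n_batches = 60, 8, 3
X = rng.integers(0, 1000, size=(n_samples, n_features))
y = rng.random(n_samples) > 0.5
batches = np.arange(n_samples) % n_batches
X_with_batch = np.hstack((batches[:, np.newaxis], X))

for b, X_train, y_train, X_val, y_val in leave_one_batch_out(X_with_batch, y):
    print(b, X_train.shape, X_val.shape)
    # With debiasm installed, each fold could be fit as in the example above:
    # dmc = DebiasMClassifier(x_val=X_val)
    # dmc.fit(X_train, y_train)
    # scores = dmc.predict_proba(X_val)[:, 1]
```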

Citation

Austin, G.I., Brown Kav, A., ElNaggar, S. et al. Processing-bias correction with DEBIAS-M improves cross-study generalization of microbiome-based prediction models. Nat Microbiol (2025). https://doi.org/10.1038/s41564-025-01954-4