This is the reComBat implementation as described in our recent paper. The paper generalizes the empirical Bayes batch correction method introduced in [1]. We use the two-design-matrix approach of Wachinger et al. [2].
reComBat is a PyPI package which can be installed via pip:
pip install reComBat
You can also clone the repository and install it locally via Poetry by executing
poetry install
in the repository directory.
The reComBat package is inspired by the code of [3] and also uses a scikit-learn-like API.
In a Python script, you can import it via
from reComBat import reComBat
combat = reComBat()
combat.fit(data, batches)
combat.transform(data, batches)
or
combat.fit_transform(data, batches)
The data and the design matrices X (desired variation) and C (undesired variation) are input as pandas dataframes. The format is (rows x columns) = (samples x features), and the index is an arbitrary sample index. The batches should be given as a pandas series. Design matrices have two types of columns, numerical and categorical. All columns in X and C are assumed categorical by default. If a column contains numerical covariates, its name should carry the suffix "_numerical".
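The column naming convention can be illustrated with a small sketch. The helper `split_design_columns` below is hypothetical and not part of reComBat; it only shows how column names would be sorted into numerical and categorical covariates under the "_numerical" suffix rule. The column names are made up for illustration.

```python
NUMERICAL_SUFFIX = "_numerical"


def split_design_columns(columns):
    """Split column names into (numerical, categorical) lists.

    Hypothetical helper, not part of reComBat: a name ending in
    "_numerical" is treated as numerical, any other name as categorical.
    """
    numerical = [c for c in columns if str(c).endswith(NUMERICAL_SUFFIX)]
    categorical = [c for c in columns if not str(c).endswith(NUMERICAL_SUFFIX)]
    return numerical, categorical


# Made-up column names of a design matrix X.
x_columns = ["age_numerical", "sex", "tissue", "bmi_numerical"]
numerical, categorical = split_design_columns(x_columns)
print(numerical)    # ['age_numerical', 'bmi_numerical']
print(categorical)  # ['sex', 'tissue']
```

Since the function only iterates over names, it should also accept the `columns` attribute of a pandas dataframe.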
There is also a command-line interface which can be called from a bash shell.
reComBat data_file.csv batch_file.csv --<optional args>
The reComBat class has many optional arguments (see below).
The fit, transform and fit_transform functions all take two arguments, data and batches. Both should be in the form described above.
The reComBat class has the following optional arguments:
- parametric: True or False. Chooses between the parametric and the non-parametric version of the empirical Bayes method. The default is True, i.e. the parametric method is performed. Note that the non-parametric method has a longer run time than the parametric one.
- model: Chooses which regression model is used to standardise the data. The options are linear, ridge, lasso and elastic_net regression. By default the elastic_net model is used.
- config: A Python dictionary specifying the keyword arguments for the relevant scikit-learn regression class.
For example, the LinearRegression class in scikit-learn currently has four non-deprecated keyword arguments, fit_intercept, copy_X, n_jobs, and positive. To specify each of them, we create a config dict
config = {'fit_intercept': False, 'copy_X': True, 'n_jobs': 1, 'positive': False}
Note that for reComBat to give the correct result, the fit_intercept parameter always needs to be set to False.
For further details refer to sklearn.linear_model. The default config is None.
- conv_criterion: The convergence criterion for the parametric empirical Bayes optimization. Relative, rather than absolute, convergence criteria are used. The default is 1e-4.
- max_iter: The maximum number of iterations for the parametric empirical Bayes optimization. The default is 1000.
- n_jobs: The number of parallel threads used in the non-parametric empirical Bayes optimization. A larger number of threads considerably speeds up the computation, but also has higher memory requirements. The default is the number of CPUs of the machine.
- mean_only: True or False. Chooses whether only the means are adjusted (no scaling is performed), or the full algorithm is run. The default is False.
- optimize_params: True or False. Chooses whether the Bayesian parameters should be optimised, or if the starting values should be used. The default is True.
- reference_batch: If the data contains a reference batch, then this can be specified here. The reference batch will not be adjusted. The default is None.
- verbose: True or False. Toggles verbose output. The default is True.
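As a minimal sketch of how the arguments above fit together, the snippet below assembles a config dictionary and the constructor keyword arguments as plain Python dicts. The helper `make_config` is hypothetical and not part of reComBat; it only enforces the requirement that fit_intercept be False. The ridge model and the value of alpha are arbitrary choices for illustration.

```python
def make_config(**sklearn_kwargs):
    """Return a config dict with fit_intercept forced to False.

    Hypothetical helper, not part of reComBat.
    """
    if sklearn_kwargs.get("fit_intercept", False):
        raise ValueError("reComBat requires fit_intercept=False")
    config = dict(sklearn_kwargs)
    config["fit_intercept"] = False
    return config


# Keyword arguments for a ridge regression (alpha is an arbitrary example value).
config = make_config(alpha=1e-10)

# Constructor arguments, using the names and defaults listed above.
recombat_kwargs = {
    "parametric": True,
    "model": "ridge",
    "config": config,
    "conv_criterion": 1e-4,
    "max_iter": 1000,
    "mean_only": False,
    "optimize_params": True,
    "reference_batch": None,
    "verbose": True,
}
print(recombat_kwargs["config"])
```

With the package installed, these arguments would presumably be passed as `reComBat(**recombat_kwargs)`.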
The command-line interface can take any of these arguments (except for config) via --<argument>=ARG. Any scikit-learn keyword arguments should be given explicitly, e.g. --alpha=1e-10. The command-line interface has the following additional optional arguments:
- X_file: The csv file containing the design matrix of desired variation. The default is None.
- C_file: The csv file containing the design matrix of undesired variation. The default is None.
- data_path: The path to the data/design matrices. The default is the current directory.
- out_path: The path where the output file should be stored. The default is the current directory.
- out_file: The name of the output file (with extension).
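The option syntax described above can be sketched as a small Python wrapper that builds the argument list for a command-line call. The function `build_recombat_command` and the file names X.csv and adjusted.csv are hypothetical; only the --<argument>=ARG form and the argument names come from the text above. The command is built but not executed here.

```python
import shlex


def build_recombat_command(data_file, batch_file, **options):
    """Build the argument list for a reComBat command-line call.

    Hypothetical helper: each option is rendered as --<argument>=ARG.
    """
    command = ["reComBat", data_file, batch_file]
    for name, value in options.items():
        command.append(f"--{name}={value}")
    return command


command = build_recombat_command(
    "data_file.csv",
    "batch_file.csv",
    X_file="X.csv",           # made-up design matrix file name
    model="ridge",
    alpha=1e-10,              # scikit-learn keyword argument given explicitly
    out_file="adjusted.csv",  # made-up output file name
)
print(shlex.join(command))
```

Once the package is installed, the resulting list could be handed to `subprocess.run(command, check=True)`.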
The transform method outputs a dataframe and the command-line interface outputs a csv file. Both contain the adjusted data in the form (samples x features).
We included a step-by-step tutorial in the tutorial folder of the GitHub repository. We also provide a PDF version which serves as a manual.
This code is developed and maintained by members of the Machine Learning and Computational Biology Lab of Prof. Dr. Karsten Borgwardt:
- Michael Adamer (GitHub)
- Sarah Brüningk (GitHub)
References:
[1] W. Evan Johnson, Cheng Li, Ariel Rabinovic, Adjusting batch effects in microarray expression data using empirical Bayes methods, Biostatistics, Volume 8, Issue 1, January 2007, Pages 118–127, https://doi.org/10.1093/biostatistics/kxj037
[2] Christian Wachinger, Anna Rieckmann, Sebastian Pölsterl. Detect and Correct Bias in Multi-Site Neuroimaging Datasets. arXiv:2002.05049
[3] pycombat, CoAxLab, https://github.com/CoAxLab/pycombat