Concatenating multimodal experiments

Concatenating multimodal experiments#

import warnings

import anndata as ad
import numpy as np
import pandas as pd
from mudata import MuData

warnings.simplefilter(action="ignore", category=FutureWarning)


Sometimes, you may want to concatenate 2 MuData objects because they represent complementary slices of the same dataset on which you have applied different processing. Think of analysing B and T cells separately for your PBMC typical dataset. Other times instead you need to concatenate 2 modalities into one AnnData because the tool you’re working with doesn’t currently support MuData (yeah we know, how dare they?). We will showcase here these 2 scenarios of concatenation.


Native concatenation of two MuData objects is currently discussed in scverse/mudata#20 and may eventually make parts of this tutorial obsolete.

Note that for some modalities, concatenation requires extra care. For instance, in the case of ATAC-seq, concatenation does not make sense unless fragments are aggregated first.

First, we need to import the raw data for a dataset of our choice. We use mudatasets package that conveniently collects some useful 10X single cell datasets that are publicly available. For this example we need a multimodal dataset, so select the citeseq 5k dataset, a collection of healthy PBMCs for which 2 modalities were profiled, RNA and PROTEINS.

import mudatasets as mds

pbmc5k = mds.load("pbmc5k_citeseq", files=["filtered_feature_bc_matrix.h5"])
■ File filtered_feature_bc_matrix.h5 from pbmc5k_citeseq has been found at /home/runner/mudatasets/pbmc5k_citeseq/filtered_feature_bc_matrix.h5
■ Checksum is validated (md5) for filtered_feature_bc_matrix.h5
■ Loading filtered_feature_bc_matrix.h5...
MuData object with n_obs × n_vars = 5247 × 33570
  var:	'gene_ids', 'feature_types', 'genome'
  2 modalities
    rna:	5247 x 33538
      var:	'gene_ids', 'feature_types', 'genome'
    prot:	5247 x 32
      var:	'gene_ids', 'feature_types', 'genome'

We create 2 different subsamples of the same underlying data for both RNA and PROT modalities.

rna = pbmc5k.mod["rna"]
prot = pbmc5k.mod["prot"]
rna_a = rna[np.arange(300), np.sort(np.random.choice(np.arange(1000), 1000, replace=False))].copy()
prot_a = prot[rna_a.obs_names,].copy()

rna_b = rna[np.arange(500, 900), np.sort(np.random.choice(np.arange(3000), 1000, replace=False))].copy()
prot_b = prot[rna_b.obs_names, np.arange(15)].copy()

And we create the respective MuData objects.

mdata_a = MuData({"prot": prot_a, "rna": rna_a})
mdata_b = MuData({"prot": prot_b, "rna": rna_b})
MuData object with n_obs × n_vars = 300 × 1032
  var:	'gene_ids', 'feature_types', 'genome'
  2 modalities
    prot:	300 x 32
      var:	'gene_ids', 'feature_types', 'genome'
    rna:	300 x 1000
      var:	'gene_ids', 'feature_types', 'genome'
MuData object with n_obs × n_vars = 400 × 1015
  var:	'gene_ids', 'feature_types', 'genome'
  2 modalities
    prot:	400 x 15
      var:	'gene_ids', 'feature_types', 'genome'
    rna:	400 x 1000
      var:	'gene_ids', 'feature_types', 'genome'

as you see, the 2 RNA subsamples don’t share any cells, but they share some features. It’s the same for the PROT assay.

len(list(set(rna_a.obs_names.tolist()) & set(rna_b.obs_names.tolist())))
len(list(set(rna_a.var_names.tolist()) & set(rna_b.var_names.tolist())))
len(list(set(prot_a.var_names.tolist()) & set(prot_b.var_names.tolist())))

1. Concatenate datasets, by modality#

In the AnnData convention, we store observations (samples or cells) in rows (axis=0)and variables (genes, proteins, atac regions, etc …) in columns (axis=1). Both the rows and columns of this matrix are indexed, which allows us to link between each other the structured layers of the AnnData object.

When we interact with both axes of these matrices, we modify the same axes on all the linked layers.

In scRNA-seq data, each row corresponds to a cell with a barcode, and each column corresponds to a gene with a gene id, but in the protein assay of a CITEseq experiment the cells are the same along the axis=0 and the features are different.

To collect all the cells and features from 2 datasets we first have to concatenate each anndata and then build a new mudata with these.

By default, anndata concatenates on axis=0

ad.concat([rna_a, rna_b])
AnnData object with n_obs × n_vars = 700 × 345
ad.concat([rna_a, rna_b], axis=0)
AnnData object with n_obs × n_vars = 700 × 345

You may have noticed that anndata also defaults to create a concatenated version of the 2 RNA subsets with only the features that the 2 matrices have in common. This is the default scenario obtained by setting the parameter join="inner".

There may be instances in which you don’t want to lose the features that are missing from one of the 2 RNA, so let’s try setting join="outer"

ad.concat([rna_a, rna_b], axis=0, join="outer")
AnnData object with n_obs × n_vars = 700 × 1655

Anndata is also filling the variables that don’t match with 0, instead of na values.

NB since axis=0 is the default behaviour, we will omit it in the future calls of the concat command for simplicity

ad.concat([rna_a, rna_b], join="outer").X.toarray()
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
np.isnan(ad.concat([rna_a, rna_b], join="outer").X.toarray()).any()

We can use the same convention to concatenate the two protein assays.

rna_c = ad.concat([rna_a, rna_b], join="outer")
prot_c = ad.concat([prot_a, prot_b], join="outer")

And now we create the new MuData object with the newly concatenated assays

full = MuData({"rna": rna_c, "prot": prot_c})
MuData object with n_obs × n_vars = 700 × 1687
  2 modalities
    rna:	700 x 1655
    prot:	700 x 32

2. Concatenating different modalities#

You may want to concatenate the RNA and the PROT modalities of the same cells. While we don’t recommend using this type of concatenation, because we believe that every basic operation you would want to perform on a multimodal object is covered by creating a MuData object instead, we know that some of the tools that deal with multimodal data integration have not implemented MuData support yet.

AnnData object with n_obs × n_vars = 300 × 1000
    var: 'gene_ids', 'feature_types', 'genome'
AnnData object with n_obs × n_vars = 300 × 32
    var: 'gene_ids', 'feature_types', 'genome'
adata_paired = ad.concat([rna_a, prot_a], axis=1)
AnnData object with n_obs × n_vars = 300 × 1032
    var: 'gene_ids', 'feature_types', 'genome'

we now have a concatenated anndata, whith 1032 .var and 600 .obs. Let’s take a look at the individual layers.


300 rows × 0 columns

gene_ids feature_types genome
MIR1302-2HG ENSG00000243485 Gene Expression GRCh38
FAM138A ENSG00000237613 Gene Expression GRCh38
OR4F5 ENSG00000186092 Gene Expression GRCh38
AL627309.1 ENSG00000238009 Gene Expression GRCh38
AL627309.3 ENSG00000239945 Gene Expression GRCh38
... ... ... ...
HLA-DR_TotalSeqB HLA-DR Antibody Capture
TIGIT_TotalSeqB TIGIT Antibody Capture
IgG1_control_TotalSeqB IgG1 Antibody Capture
IgG2a_control_TotalSeqB IgG2a Antibody Capture
IgG2b_control_TotalSeqB IgG2b Antibody Capture

1032 rows × 3 columns

the .obs layer is empty now, and we need to repopulate it.

rna_cols = rna_a.obs.columns
prot_cols = prot_a.obs.columns

rnaobs = rna_a.obs.copy()
rnaobs.columns = ["rna:" + x for x in rna_cols]
protobs = prot.obs.copy()
protobs.columns = ["prot:" + x for x in prot_cols]
adata_paired.obs = pd.merge(rnaobs, protobs, left_index=True, right_index=True)

For more information on how anndata perform concatenation please check this tutorial