Concatenating multimodal experiments#
import scanpy as sc
import mudata as md
from mudata import MuData
import anndata as ad
import numpy as np
import pandas as pd
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
np.random.seed(1979)
Sometimes, you may want to concatenate 2 MuData
objects because they represent complementary slices of the same dataset on which you have applied different processing. Think of analysing B and T cells separately for your PBMC typical dataset.
Other times instead you need to concatenate 2 modalities into one AnnData
because the tool you’re working with doesn’t currently support MuData
(yeah we know, how dare they?).
We will showcase here these 2 scenarios of concatenation.
Note
Native concatenation of two MuData
objects is currently discussed in
scverse/mudata#20 and may
eventually make parts of this tutorial obsolete.
Note that for some modalities, concatenation requires extra care. For instance, in the case of ATAC-seq, concatenation does not make sense unless fragments are aggregated first.
First, we need to import the raw data for a dataset of our choice. We use mudatasets package that conveniently collects some useful 10X single cell datasets that are publicly available. For this example we need a multimodal dataset, so select the citeseq 5k dataset, a collection of healthy PBMCs for which 2 modalities were profiled, RNA and PROTEINS.
import mudatasets as mds
mds.list_datasets()
['brain3k_multiome',
'pbmc5k_citeseq',
'pbmc10k_multiome',
'pbmc3k_multiome',
'brain9k_multiome']
mds.info("pbmc5k_citeseq")
pbmc5k = mds.load("pbmc5k_citeseq", files=["filtered_feature_bc_matrix.h5"])
■ File filtered_feature_bc_matrix.h5 from pbmc5k_citeseq has been found at /home/runner/mudatasets/pbmc5k_citeseq/filtered_feature_bc_matrix.h5
■ Checksum is validated (md5) for filtered_feature_bc_matrix.h5
■ Loading filtered_feature_bc_matrix.h5...
pbmc5k
MuData object with n_obs × n_vars = 5247 × 33570 var: 'gene_ids', 'feature_types', 'genome' 2 modalities rna: 5247 x 33538 var: 'gene_ids', 'feature_types', 'genome' prot: 5247 x 32 var: 'gene_ids', 'feature_types', 'genome'
We create 2 different subsamples of the same underlying data for both RNA and PROT modalities.
rna = pbmc5k.mod["rna"]
prot = pbmc5k.mod["prot"]
rna_a = rna[np.arange(300), np.sort(np.random.choice(np.arange(1000), 1000, replace=False))].copy()
prot_a = prot[rna_a.obs_names,].copy()
rna_b = rna[np.arange(500, 900), np.sort(np.random.choice(np.arange(3000), 1000, replace=False))].copy()
prot_b = prot[rna_b.obs_names, np.arange(15)].copy()
And we create the respective MuData
objects.
mdata_a = MuData({"prot": prot_a, "rna": rna_a})
mdata_b = MuData({"prot": prot_b, "rna": rna_b})
mdata_a
MuData object with n_obs × n_vars = 300 × 1032 var: 'gene_ids', 'feature_types', 'genome' 2 modalities prot: 300 x 32 var: 'gene_ids', 'feature_types', 'genome' rna: 300 x 1000 var: 'gene_ids', 'feature_types', 'genome'
mdata_b
MuData object with n_obs × n_vars = 400 × 1015 var: 'gene_ids', 'feature_types', 'genome' 2 modalities prot: 400 x 15 var: 'gene_ids', 'feature_types', 'genome' rna: 400 x 1000 var: 'gene_ids', 'feature_types', 'genome'
as you see, the 2 RNA subsamples don’t share any cells, but they share some features. It’s the same for the PROT assay.
len(list(set(rna_a.obs_names.tolist()) & set(rna_b.obs_names.tolist())))
0
len(list(set(rna_a.var_names.tolist()) & set(rna_b.var_names.tolist())))
345
len(list(set(prot_a.var_names.tolist()) & set(prot_b.var_names.tolist())))
15
1. Concatenate datasets, by modality#
In the AnnData
convention, we store observations (samples or cells) in rows (axis=0
)and variables (genes, proteins, atac regions, etc …) in columns (axis=1
).
Both the rows and columns of this matrix are indexed, which allows us to link between each other the structured layers of the AnnData object.
When we interact with both axes of these matrices, we modify the same axes on all the linked layers.
In scRNA-seq data, each row corresponds to a cell with a barcode, and each column corresponds to a gene with a gene id, but in the protein assay of a CITEseq experiment the cells are the same along the axis=0
and the features are different.
To collect all the cells and features from 2 datasets we first have to concatenate each anndata and then build a new mudata with these.
By default, anndata concatenates on axis=0
ad.concat([rna_a, rna_b])
AnnData object with n_obs × n_vars = 700 × 345
ad.concat([rna_a, rna_b], axis=0)
AnnData object with n_obs × n_vars = 700 × 345
You may have noticed that anndata also defaults to create a concatenated version of the 2 RNA subsets with only the features that the 2 matrices have in common. This is the default scenario obtained by setting the parameter join="inner"
.
There may be instances in which you don’t want to lose the features that are missing from one of the 2 RNA, so let’s try setting join="outer"
ad.concat([rna_a, rna_b], axis=0, join="outer")
AnnData object with n_obs × n_vars = 700 × 1655
Anndata is also filling the variables that don’t match with 0
, instead of na
values.
NB since axis=0
is the default behaviour, we will omit it in the future calls of the concat
command for simplicity
ad.concat([rna_a, rna_b], join="outer").X.toarray()
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
np.isnan(ad.concat([rna_a, rna_b], join="outer").X.toarray()).any()
False
We can use the same convention to concatenate the two protein assays.
rna_c = ad.concat([rna_a, rna_b], join="outer")
prot_c = ad.concat([prot_a, prot_b], join="outer")
And now we create the new MuData
object with the newly concatenated assays
full = MuData({"rna": rna_c, "prot": prot_c})
full
MuData object with n_obs × n_vars = 700 × 1687 2 modalities rna: 700 x 1655 prot: 700 x 32
2. Concatenating different modalities#
You may want to concatenate the RNA and the PROT modalities of the same cells. While we don’t recommend using this type of concatenation, because we believe that every basic operation you would want to perform on a multimodal object is covered by creating a MuData
object instead, we know that some of the tools that deal with multimodal data integration have not implemented MuData support yet.
rna_a
AnnData object with n_obs × n_vars = 300 × 1000
var: 'gene_ids', 'feature_types', 'genome'
prot_a
AnnData object with n_obs × n_vars = 300 × 32
var: 'gene_ids', 'feature_types', 'genome'
adata_paired = ad.concat([rna_a, prot_a], axis=1)
adata_paired
AnnData object with n_obs × n_vars = 300 × 1032
var: 'gene_ids', 'feature_types', 'genome'
we now have a concatenated anndata, whith 1032 .var
and 600 .obs
. Let’s take a look at the individual layers.
adata_paired.obs
AAACCCAAGAGACAAG-1 |
---|
AAACCCAAGGCCTAGA-1 |
AAACCCAGTCGTGCCA-1 |
AAACCCATCGTGCATA-1 |
AAACGAAAGACAAGCC-1 |
... |
ACACTGAAGTTCCGGC-1 |
ACACTGAGTGCCCGTA-1 |
ACACTGAGTTCGTTCC-1 |
ACAGAAAAGGTACTGG-1 |
ACAGAAACAAATGGAT-1 |
300 rows × 0 columns
adata_paired.var
gene_ids | feature_types | genome | |
---|---|---|---|
MIR1302-2HG | ENSG00000243485 | Gene Expression | GRCh38 |
FAM138A | ENSG00000237613 | Gene Expression | GRCh38 |
OR4F5 | ENSG00000186092 | Gene Expression | GRCh38 |
AL627309.1 | ENSG00000238009 | Gene Expression | GRCh38 |
AL627309.3 | ENSG00000239945 | Gene Expression | GRCh38 |
... | ... | ... | ... |
HLA-DR_TotalSeqB | HLA-DR | Antibody Capture | |
TIGIT_TotalSeqB | TIGIT | Antibody Capture | |
IgG1_control_TotalSeqB | IgG1 | Antibody Capture | |
IgG2a_control_TotalSeqB | IgG2a | Antibody Capture | |
IgG2b_control_TotalSeqB | IgG2b | Antibody Capture |
1032 rows × 3 columns
the .obs
layer is empty now, and we need to repopulate it.
rna_cols = rna_a.obs.columns
prot_cols = prot_a.obs.columns
rnaobs = rna_a.obs.copy()
rnaobs.columns = ["rna:" + x for x in rna_cols]
protobs = prot.obs.copy()
protobs.columns = ["prot:" + x for x in prot_cols]
adata_paired.obs = pd.merge(rnaobs, protobs, left_index=True, right_index=True)
For more information on how anndata perform concatenation please check this tutorial