Working with scverse objects in backed mode#
In this tutorial we show how to work with scverse data objects without loading the full datasets in memory. AnnData and MuData objects are saved to disk as .h5ad
and .h5mu
files. These formats save HDF5 stores (via h5py) that can be accessed without loading heavy data matrices in memory (i.e. in backed mode), as a “read-only” object.
In this tutorial, we demonstrate how to work in backed mode for:
quickly inspecting the content of an h5ad/h5mu file
exploratory data analysis (e.g. plotting embeddings)
selecting a subset of interest in a very large dataset
import anndata
import mudata
import scanpy as sc
import pandas as pd
import muon as mu
# for datasets
import mudatasets
from warnings import filterwarnings
filterwarnings("ignore", category=FutureWarning)
filterwarnings("ignore", category=pd.errors.SettingWithCopyWarning)
sc.set_figure_params(fontsize=18, figsize=[6, 6])
Working with backed AnnData objects#
Let’s load an anndata object in memory
adata = sc.datasets.pbmc3k_processed()
adata = adata.raw.to_adata() # Store raw counts in adata.X
Let’s check the size of this AnnData object (i.e. the amount of RAM needed to load it)
# Convenience method for computing the size of objects
def print_size_in_MB(x):
print("{:.3} MB".format(x.__sizeof__() / 1e6))
print_size_in_MB(adata)
21.2 MB
Now let’s save this object in .h5ad
format and re-load it in backed (on-disk) mode.
adata.write_h5ad("./data/pbmc3k_processed_raw.h5ad")
adata = sc.read_h5ad("./data/pbmc3k_processed_raw.h5ad", backed=True)
Now the size of our AnnData object is about 3 MB, with the gene expression matrix saved only on disk
print_size_in_MB(adata)
3.28 MB
We can check whether an anndata object is loaded in backed mode using the adata.isbacked
attribute
adata.isbacked
True
while the path to the file on disk is stored in adata.filename
adata.filename
PosixPath('data/pbmc3k_processed_raw.h5ad')
Exploratory data analysis#
When loading an h5ad file in backed mode, we can use all scverse functions that require to read the data without having in memory the full adata.X
matrix. For example we can plot embeddings or gene expression matrices. The gene expression values for a given gene or set of genes are loaded from disk on the function call.
sc.pl.umap(adata, color=["louvain", "CD4"], wspace=0.6)

sc.pl.dotplot(adata, var_names=["NKG7", "CD4", "CD3E"], groupby="louvain")

The full gene expression matrix in adata.X
is not loaded, but all the cell- and gene-level metadata is in memory and can be modified. Therefore we can run methods that modify obs/obsm/obsp. For example, we can compute KNN graphs and nearest neighbor embeddings
sc.pp.neighbors(adata, n_neighbors=20)
sc.tl.umap(adata)
Or we can perform clustering
sc.tl.leiden(adata, resolution=0.5, key_added="leiden_res0.5")
adata.obs
n_genes | percent_mito | n_counts | louvain | leiden_res0.5 | |
---|---|---|---|---|---|
index | |||||
AAACATACAACCAC-1 | 781 | 0.030178 | 2419.0 | CD4 T cells | 0 |
AAACATTGAGCTAC-1 | 1352 | 0.037936 | 4903.0 | B cells | 3 |
AAACATTGATCAGC-1 | 1131 | 0.008897 | 3147.0 | CD4 T cells | 0 |
AAACCGTGCTTCCG-1 | 960 | 0.017431 | 2639.0 | CD14+ Monocytes | 1 |
AAACCGTGTATGCG-1 | 522 | 0.012245 | 980.0 | NK cells | 2 |
... | ... | ... | ... | ... | ... |
TTTCGAACTCTCAT-1 | 1155 | 0.021104 | 3459.0 | CD14+ Monocytes | 1 |
TTTCTACTGAGGCA-1 | 1227 | 0.009294 | 3443.0 | B cells | 3 |
TTTCTACTTCCTCG-1 | 622 | 0.021971 | 1684.0 | B cells | 3 |
TTTGCATGAGAGGC-1 | 454 | 0.020548 | 1022.0 | B cells | 3 |
TTTGCATGCCTCAC-1 | 724 | 0.008065 | 1984.0 | CD4 T cells | 0 |
2638 rows × 5 columns
sc.pl.umap(adata, color=["louvain", "leiden_res0.5"])

Extracting a subset of AnnData#
Another common use-case for backed mode is to inspect a dataset to then select and save a subset of data that we are interested in using for analysis.
For example, if we are interested in performing analysis just on T cells, we can subset the AnnData object and load in memory just the gene expression matrix of the subset of interest
T_clusters = ["CD4 T cells", "CD8 T cells"]
adata_subset = adata[adata.obs["louvain"].isin(T_clusters)].to_memory()
adata_subset
AnnData object with n_obs × n_vars = 1460 × 13714
obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain', 'leiden_res0.5'
var: 'n_cells'
uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap', 'leiden', 'leiden_res0.5_colors'
obsm: 'X_draw_graph_fr', 'X_pca', 'X_tsne', 'X_umap'
obsp: 'connectivities', 'distances'
adata_subset.isbacked
False
If we want to keep working in backed mode with this data subset, we can copy it specifying a new .h5ad
file on disk where this will be saved.
adata_subset = adata[adata.obs["louvain"].isin(T_clusters)].copy(filename="./data/pbmc3k_processed_CD4Tcells.h5ad")
adata_subset
AnnData object with n_obs × n_vars = 1460 × 13714 backed at 'data/pbmc3k_processed_CD4Tcells.h5ad'
obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain', 'leiden_res0.5'
var: 'n_cells'
uns: 'draw_graph', 'leiden', 'leiden_res0.5_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
obsm: 'X_draw_graph_fr', 'X_pca', 'X_tsne', 'X_umap'
obsp: 'connectivities', 'distances'
adata_subset.isbacked
True
Things you can’t do with AnnData in backed mode#
Backed mode loads gene expression matrices in “read-only” mode, so functions that attempt to modify adata.X
will throw errors.
For example, normalizing with
sc.pp.normalize_total(adata)
Throws the following error
AttributeError: 'SparseDataset' object has no attribute 'sum'
but the gene expression matrix can be accessed directly using numerical indices.
adata.X[:, 0 : adata.n_vars]
<2638x13714 sparse matrix of type '<class 'numpy.float32'>'
with 2238732 stored elements in Compressed Sparse Row format>
Of note, in the current version of anndata
the matrices in adata.layers
are not backed.
Another caveat is that an object can be opened in backed mode only from one process/notebook simultaneously. Attempting to load it (with sc.read_h5ad
) will throw an Input/Output error.
Save changes made to backed AnnData#
We can save the changes we’ve made to the backed anndata by writing back to the linked h5ad file.
adata.write_h5ad()
If we re-load this object from file, we will find all the updated information in the obs
slot
adata = sc.read_h5ad("./data/pbmc3k_processed_raw.h5ad", backed=True)
adata
AnnData object with n_obs × n_vars = 2638 × 13714 backed at 'data/pbmc3k_processed_raw.h5ad'
obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain', 'leiden_res0.5'
var: 'n_cells'
uns: 'draw_graph', 'leiden', 'leiden_res0.5_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
obsm: 'X_draw_graph_fr', 'X_pca', 'X_tsne', 'X_umap'
obsp: 'connectivities', 'distances'
We can also save to a different file
adata.write_h5ad("./data/pbmc3k_processed_newclustering.h5ad")
Working with backed MuData objects#
Working with data on disk is particularly useful when working with data modalities different from gene expression, where the number of measured features is typically larger. For example, in data from scATAC-seq experiments accessibility is usually measured at hundreds of thousands of genomic regions, making feature x cell matrices prohibitively large to load in memory.
For this application, also MuData objects support working in backed mode.
# Download mudata object and save in h5mu format
mdata = mudatasets.load("pbmc10k_multiome")
mdata["rna"].var_names_make_unique()
mdata.write_h5mu("./data/pbmc10k_multiome.h5mu")
■ Loading filtered_feature_bc_matrix.h5...
Added `interval` annotation for features from /home/runner/mudatasets/pbmc10k_multiome/filtered_feature_bc_matrix.h5
We can see that the slot storing the scATAC accessibility matrix is quite large when loaded in memory
mdata["atac"].isbacked
False
print_size_in_MB(mdata["atac"])
7.16e+02 MB
Let’s now load the .h5mu
object in backed mode using the read_h5mu
function from the muon package.
mdata = mu.read_h5mu("./data/pbmc10k_multiome.h5mu", backed=True)
mdata
MuData object with n_obs × n_vars = 11909 × 144978 backed at 'data/pbmc10k_multiome.h5mu' var: 'gene_ids', 'feature_types', 'genome', 'interval' 2 modalities rna: 11909 x 36601 var: 'gene_ids', 'feature_types', 'genome', 'interval' atac: 11909 x 108377 var: 'gene_ids', 'feature_types', 'genome', 'interval'
Now the feature x cell matrices for both modalities are on-disk, and the object takes up a lot less memory.
print_size_in_MB(mdata["atac"])
31.6 MB
Working with only one modality in memory#
If we are analysing a multi-modal dataset, we might need to handle and process only one modality at a time. In this case we can keep the other modality on disk to reduce memory usage. For example, we can load the RNA modality only, to run clustering.
mdata.mod["rna"] = mdata.mod["rna"].to_memory()
mdata["rna"].isbacked
False
mdata["atac"].isbacked
True
Now we can run preprocessing of the expression data
rna = mdata.mod["rna"]
## Normalization
sc.pp.normalize_total(rna, target_sum=1e4)
sc.pp.log1p(rna)
## HVG selection
sc.pp.highly_variable_genes(rna, min_mean=0.02, max_mean=4, min_disp=0.5)
## Dim reduction and embeddings
sc.tl.pca(rna, svd_solver="arpack")
sc.pp.neighbors(rna, n_neighbors=10, n_pcs=20)
sc.tl.umap(rna)
sc.tl.leiden(rna, resolution=1.0)
sc.pl.umap(rna, color="leiden", legend_loc="on data")

Now all the changes are saved in the MuData object.
mdata["rna"]
AnnData object with n_obs × n_vars = 11909 × 36601
obs: 'leiden'
var: 'gene_ids', 'feature_types', 'genome', 'interval', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'distances', 'connectivities'
We can save changes to the h5mu file on disk, without needing to load the ATAC modality dataset
## Save log-normalized files in layers
# (write_h5mu doesn't rewrite .X if the file is in backed mode)
mdata["rna"].layers["lognorm_counts"] = mdata["rna"].X.copy()
mdata.write_h5mu()
Now we can reload the updated object to have both modalities on disk
mdata = mu.read_h5mu("./data/pbmc10k_multiome.h5mu", backed=True)
mdata
MuData object with n_obs × n_vars = 11909 × 144978 backed at 'data/pbmc10k_multiome.h5mu' var: 'gene_ids', 'feature_types', 'genome', 'interval' 2 modalities rna: 11909 x 36601 obs: 'leiden' var: 'gene_ids', 'feature_types', 'genome', 'interval', 'highly_variable', 'means', 'dispersions', 'dispersions_norm' uns: 'hvg', 'leiden', 'leiden_colors', 'log1p', 'neighbors', 'pca', 'umap' obsm: 'X_pca', 'X_umap' varm: 'PCs' layers: 'lognorm_counts' obsp: 'connectivities', 'distances' atac: 11909 x 108377 var: 'gene_ids', 'feature_types', 'genome', 'interval'
mdata["atac"].isbacked and mdata["rna"].isbacked
True
Extracting a MuData subset#
Similarly to the AnnData case, we can use the backed mode to identify a subset of data that we are interested in, and save it as a smaller object to load into memory.
For example, we could subset the PBMC dataset to include only the T and NK cells (clusters expressing CD2)
sc.pl.umap(mdata["rna"], color=["leiden", "CD2"], layer="lognorm_counts", legend_loc="on data")

sc.pl.dotplot(mdata["rna"], var_names=["CD2"], groupby="leiden", figsize=(10, 3), dendrogram=True, swap_axes=True)
WARNING: dendrogram data not found (using key=dendrogram_leiden). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.

nkt_cell_clusters = ["19", "20", "13", "11", "9", "14", "19", "5", "0", "1", "3"]
mdata[mdata.obs["rna:leiden"].isin(nkt_cell_clusters)]
View of MuData object with n_obs × n_vars = 6469 × 144978 backed at 'data/pbmc10k_multiome.h5mu' var: 'gene_ids', 'feature_types', 'genome', 'interval' 2 modalities rna: 6469 x 36601 obs: 'leiden' var: 'gene_ids', 'feature_types', 'genome', 'interval', 'highly_variable', 'means', 'dispersions', 'dispersions_norm' uns: 'hvg', 'leiden', 'leiden_colors', 'log1p', 'neighbors', 'pca', 'umap', 'dendrogram_leiden' obsm: 'X_pca', 'X_umap' varm: 'PCs' layers: 'lognorm_counts' obsp: 'connectivities', 'distances' atac: 6469 x 108377 var: 'gene_ids', 'feature_types', 'genome', 'interval'
We can save this data subset in a new .h5mu object using the copy method
mdata_nkt = mdata[mdata.obs["rna:leiden"].isin(nkt_cell_clusters)].copy(
filename="./data/pbmc10k_multiome.NKTcells.h5mu"
)
mdata_nkt
MuData object with n_obs × n_vars = 6469 × 144978 backed at 'data/pbmc10k_multiome.NKTcells.h5mu' var: 'gene_ids', 'feature_types', 'genome', 'interval' 2 modalities rna: 6469 x 36601 obs: 'leiden' var: 'gene_ids', 'feature_types', 'genome', 'interval', 'highly_variable', 'means', 'dispersions', 'dispersions_norm' uns: 'dendrogram_leiden', 'hvg', 'leiden', 'leiden_colors', 'log1p', 'neighbors', 'pca', 'umap' obsm: 'X_pca', 'X_umap' varm: 'PCs' layers: 'lognorm_counts' obsp: 'connectivities', 'distances' atac: 6469 x 108377 var: 'gene_ids', 'feature_types', 'genome', 'interval'
We can load this object in memory by setting mdata.filename
to None
mdata_nkt
MuData object with n_obs × n_vars = 6469 × 144978 backed at 'data/pbmc10k_multiome.NKTcells.h5mu' var: 'gene_ids', 'feature_types', 'genome', 'interval' 2 modalities rna: 6469 x 36601 obs: 'leiden' var: 'gene_ids', 'feature_types', 'genome', 'interval', 'highly_variable', 'means', 'dispersions', 'dispersions_norm' uns: 'dendrogram_leiden', 'hvg', 'leiden', 'leiden_colors', 'log1p', 'neighbors', 'pca', 'umap' obsm: 'X_pca', 'X_umap' varm: 'PCs' layers: 'lognorm_counts' obsp: 'connectivities', 'distances' atac: 6469 x 108377 var: 'gene_ids', 'feature_types', 'genome', 'interval'
mdata_nkt.filename = None
mdata_nkt.isbacked
False
or reading from file
mdata_nkt = mu.read_h5mu("./data/pbmc10k_multiome.NKTcells.h5mu")
mdata_nkt
MuData object with n_obs × n_vars = 6469 × 144978 var: 'gene_ids', 'feature_types', 'genome', 'interval' 2 modalities rna: 6469 x 36601 obs: 'leiden' var: 'gene_ids', 'feature_types', 'genome', 'interval', 'highly_variable', 'means', 'dispersions', 'dispersions_norm' uns: 'dendrogram_leiden', 'hvg', 'leiden', 'leiden_colors', 'log1p', 'neighbors', 'pca', 'umap' obsm: 'X_pca', 'X_umap' varm: 'PCs' layers: 'lognorm_counts' obsp: 'connectivities', 'distances' atac: 6469 x 108377 var: 'gene_ids', 'feature_types', 'genome', 'interval'
mdata_nkt.isbacked
False
Things you can’t do with MuData in backed mode#
The current MuData implementation doesn’t support having a different number of obs between modalities in backed mode.