
Integrate scRNA-seq datasets

scRNA-seq data integration combines data from several single-cell RNA sequencing experiments to uncover shared or distinct biological patterns.

Here, we'll demonstrate how to fetch two scRNA-seq datasets by registered metadata, such as cell types, and then integrate them.

Setup

!lamin load test-scrna
💡 found cached instance metadata: /home/runner/.lamin/instance--testuser1--test-scrna.env
✅ loaded instance: testuser1/test-scrna

import lamindb as ln
import lnschema_bionty as lb
import pandas as pd
import anndata as ad
✅ loaded instance: testuser1/test-scrna (lamindb 0.51.2)
ln.track()
💡 notebook imports: anndata==0.9.2 lamindb==0.51.2 lnschema_bionty==0.30.2 pandas==1.5.3
✅ saved: Transform(id='agayZTonayqAz8', name='Integrate scRNA-seq datasets', short_name='scrna2', version='0', type=notebook, updated_at=2023-08-31 00:32:13, created_by_id='DzTjkKse')
✅ saved: Run(id='KCdYSjc09PHVn5JwlTNJ', run_at=2023-08-31 00:32:13, transform_id='agayZTonayqAz8', created_by_id='DzTjkKse')

Access

Query files by provenance metadata

users = ln.User.lookup()
ln.Transform.filter(created_by=users.testuser1).search("register scrna")
name                                    id              __ratio__
Validate & register scRNA-seq datasets  Nv48yAceNSh8z8  53.846154
Integrate scRNA-seq datasets            agayZTonayqAz8  47.619048
transform = ln.Transform.filter(id="Nv48yAceNSh8z8").one()
ln.File.filter(transform=transform).df()
id                    storage_id  key   suffix  accessor  description            version  initial_version_id  size      hash                    hash_type  transform_id    run_id                updated_at           created_by_id
vDMpekU3pOyqlg5TTfLc  emG4Bk3m    None  .h5ad   AnnData   Conde22                None     None                28049505  WEFcMZxJNmMiUOFrcSTaig  md5        Nv48yAceNSh8z8  MdQkKuMtppXchQY1C70T  2023-08-31 00:31:41  DzTjkKse
Hl9GtFwV9gghkmif8pyJ  emG4Bk3m    None  .h5ad   AnnData   10x reference pbmc68k  None     None                589484    eKVXV5okt5YRYjySMTKGEw  md5        Nv48yAceNSh8z8  MdQkKuMtppXchQY1C70T  2023-08-31 00:32:04  DzTjkKse
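
The `__ratio__` column returned by `search` above is a 0-100 string-similarity score between the query and each transform name. The two values shown are consistent with a normalized longest-common-subsequence (LCS) similarity computed on lowercased strings. Whether lamindb uses exactly this scorer internally is an assumption, and the helper names below are hypothetical. A minimal pure-Python sketch:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity_ratio(query: str, candidate: str) -> float:
    """Hypothetical 0-100 score: 200 * LCS / (len(query) + len(candidate))."""
    q, c = query.lower(), candidate.lower()
    if not q and not c:
        return 100.0
    return 200.0 * lcs_length(q, c) / (len(q) + len(c))


names = [
    "Validate & register scRNA-seq datasets",
    "Integrate scRNA-seq datasets",
]
scores = {name: similarity_ratio("register scrna", name) for name in names}

# Rank candidates from best to worst match, as the search result does.
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.6f}")
```

Under these assumptions, the sketch ranks "Validate & register scRNA-seq datasets" first, which is why its id is used in the filter that follows the search.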

Query files based on biological metadata

assays = lb.ExperimentalFactor.lookup()
species = lb.Species.lookup()
cell_types = lb.CellType.lookup()
query = ln.File.filter(
    experimental_factors=assays.single_cell_rna_sequencing,
    species=species.human,
    cell_types=cell_types.conventional_dendritic_cell,
)
query.df()
id                    storage_id  key   suffix  accessor  description            version  initial_version_id  size      hash                    hash_type  transform_id    run_id                updated_at           created_by_id
Hl9GtFwV9gghkmif8pyJ  emG4Bk3m    None  .h5ad   AnnData   10x reference pbmc68k  None     None                589484    eKVXV5okt5YRYjySMTKGEw  md5        Nv48yAceNSh8z8  MdQkKuMtppXchQY1C70T  2023-08-31 00:32:04  DzTjkKse
vDMpekU3pOyqlg5TTfLc  emG4Bk3m    None  .h5ad   AnnData   Conde22                None     None                28049505  WEFcMZxJNmMiUOFrcSTaig  md5        Nv48yAceNSh8z8  MdQkKuMtppXchQY1C70T  2023-08-31 00:31:41  DzTjkKse

Transform

Compare gene sets

Get file objects:

file1, file2 = query.list()
file1.describe()
💡 File(id='Hl9GtFwV9gghkmif8pyJ', suffix='.h5ad', accessor='AnnData', description='10x reference pbmc68k', size=589484, hash='eKVXV5okt5YRYjySMTKGEw', hash_type='md5', updated_at=2023-08-31 00:32:04)

Provenance:
    🗃️ storage: Storage(id='emG4Bk3m', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-08-31 00:32:11, created_by_id='DzTjkKse')
    📔 transform: Transform(id='Nv48yAceNSh8z8', name='Validate & register scRNA-seq datasets', short_name='scrna', version='0', type='notebook', updated_at=2023-08-31 00:32:04, created_by_id='DzTjkKse')
    👣 run: Run(id='MdQkKuMtppXchQY1C70T', run_at=2023-08-31 00:30:45, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
    👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-31 00:32:11)
Features:
  var (X):
    🔗 index (695, bionty.Gene.id): ['Y7ufrND14Kay', 'KSe7z0us4sFY', 'RrWFxYmu7u59', 'wgO8E8BPRVSZ', '20jirs0l8k0p'...]
  external:
    🔗 assay (1, bionty.ExperimentalFactor): ['single-cell RNA sequencing']
    🔗 species (1, bionty.Species): ['human']
  obs (metadata):
    🔗 cell_type (9, bionty.CellType): ['B cell, CD19-positive', 'CD14-positive, CD16-negative classical monocyte', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'dendritic cell', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated']
file1.view_flow()
https://d33wubrfki0l68.cloudfront.net/37fd40bef00b22d46085af7d48bf05cac2d067a1/224fc/_images/c8083dad5bb38678d58d86a98a138eeb59924580c5bbe035d6c3fcedd6d69eb0.svg
file2.describe()
💡 File(id='vDMpekU3pOyqlg5TTfLc', suffix='.h5ad', accessor='AnnData', description='Conde22', size=28049505, hash='WEFcMZxJNmMiUOFrcSTaig', hash_type='md5', updated_at=2023-08-31 00:31:41)

Provenance:
    🗃️ storage: Storage(id='emG4Bk3m', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-08-31 00:32:11, created_by_id='DzTjkKse')
    📔 transform: Transform(id='Nv48yAceNSh8z8', name='Validate & register scRNA-seq datasets', short_name='scrna', version='0', type='notebook', updated_at=2023-08-31 00:32:04, created_by_id='DzTjkKse')
    👣 run: Run(id='MdQkKuMtppXchQY1C70T', run_at=2023-08-31 00:30:45, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
    👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-31 00:32:11)
Features:
  var (X):
    🔗 index (36503, bionty.Gene.id): ['AZXeLz7ACIvE', '7NLIjs50oZ0F', 'kkQS5uRD6ael', 's9dhYXK0eu4A', 'xn9ypGwN0mSz'...]
  obs (metadata):
    🔗 cell_type (32, bionty.CellType): ['dendritic cell, human', 'plasmablast', 'animal cell', 'CD8-positive, alpha-beta memory T cell', 'CD4-positive helper T cell']
    🔗 assay (4, bionty.ExperimentalFactor): ['single-cell RNA sequencing', "10x 3' v3", "10x 5' v2", "10x 5' v1"]
    🔗 tissue (17, bionty.Tissue): ['lamina propria', 'thoracic lymph node', 'mesenteric lymph node', 'jejunal epithelium', 'transverse colon']
    🔗 donor (12, core.Label): ['D503', '637C', 'A37', '640C', 'A35']
    ๐Ÿ”— donor (12, core.Label): ['D503', '637C', 'A37', '640C', 'A35']
file2.view_flow()
https://d33wubrfki0l68.cloudfront.net/6941b02636730ddf88238220a2716b1a000771c9/8a80f/_images/063591d5ca25d1c26d7d9c857a9abf03d6ad8ed1a2d79daec35427158da96407.svg

Load files into memory:

file1_adata = file1.load()
file2_adata = file2.load()
💡 adding file Hl9GtFwV9gghkmif8pyJ as input for run KCdYSjc09PHVn5JwlTNJ, adding parent transform Nv48yAceNSh8z8
💡 adding file vDMpekU3pOyqlg5TTfLc as input for run KCdYSjc09PHVn5JwlTNJ, adding parent transform Nv48yAceNSh8z8

The shared genes can be computed from the registered feature sets alone, without loading the files:

file1_genes = file1.features["var"]
file2_genes = file2.features["var"]

shared_genes = file1_genes & file2_genes
len(shared_genes)
695
shared_genes.list("symbol")[:10]
['MRPL21',
 'PRMT2',
 'CALM1',
 'TRAM1',
 'PRR7',
 'CD3G',
 'HLA-DQB1',
 'SESN2',
 'IFITM3',
 'CAPN1']
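
Conceptually, the `&` above is a set intersection over registered gene records. The shared count (695) equals the size of file1's gene index reported by `describe()`, which suggests that every gene measured in file1 is also present in file2. A toy sketch of that reasoning with plain Python sets, using made-up gene identifiers:

```python
# Hypothetical gene identifiers standing in for registered Gene records.
small_panel = {"gene_a", "gene_b", "gene_c"}
large_panel = {"gene_a", "gene_b", "gene_c", "gene_d", "gene_e"}

shared = small_panel & large_panel  # set intersection

# If the intersection is as large as the smaller panel,
# the smaller panel is fully contained in the larger one.
is_subset = len(shared) == len(small_panel)
print(sorted(shared), is_subset)
```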

The genes in file2 are indexed by ensembl_gene_id, so we also need to rename them to gene symbols before the two datasets can be concatenated:

mapper = pd.DataFrame(shared_genes.values_list("ensembl_gene_id", "symbol")).set_index(
    0
)[1]
mapper.head()
0
ENSG00000197345    MRPL21
ENSG00000160310     PRMT2
ENSG00000198668     CALM1
ENSG00000067167     TRAM1
ENSG00000131188      PRR7
Name: 1, dtype: object
file2_adata.var.rename(index=mapper, inplace=True)
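
One detail worth keeping in mind: when `rename` receives a Series as the mapper, index labels that are missing from the mapper are left unchanged rather than dropped. A small sketch with a toy var table (the unmapped ID `ENSG_NOT_SHARED` is invented for illustration):

```python
import pandas as pd

# Toy var table indexed by Ensembl gene IDs; the last ID is made up.
var = pd.DataFrame(index=["ENSG00000197345", "ENSG00000160310", "ENSG_NOT_SHARED"])

# Mapper from Ensembl gene ID to symbol, covering only the shared genes.
mapper = pd.Series({"ENSG00000197345": "MRPL21", "ENSG00000160310": "PRMT2"})

# Labels found in the mapper are replaced; all others keep their original value.
renamed = var.rename(index=mapper)
print(list(renamed.index))
```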

Compare cell types

file1_celltypes = file1.cell_types.all()
file2_celltypes = file2.cell_types.all()

shared_celltypes = file1_celltypes & file2_celltypes
shared_celltypes_names = shared_celltypes.list("name")
shared_celltypes_names
['conventional dendritic cell',
 'CD16-positive, CD56-dim natural killer cell, human']

We can now subset the two datasets by shared cell types:

file1_adata_subset = file1_adata[
    file1_adata.obs["cell_type"].isin(shared_celltypes_names)
]

file2_adata_subset = file2_adata[
    file2_adata.obs["cell_type"].isin(shared_celltypes_names)
]
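
The subsetting above relies on a boolean mask built with `isin`. A minimal pandas sketch of the same pattern, using an invented obs table and cell names:

```python
import pandas as pd

# Toy obs table; cell names and the 'plasmablast' row are illustrative only.
obs = pd.DataFrame(
    {"cell_type": ["conventional dendritic cell", "plasmablast", "conventional dendritic cell"]},
    index=["cell_1", "cell_2", "cell_3"],
)
shared_names = [
    "conventional dendritic cell",
    "CD16-positive, CD56-dim natural killer cell, human",
]

mask = obs["cell_type"].isin(shared_names)  # True where the label is shared
subset = obs[mask]
print(subset.index.tolist())
```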

Concatenate subsetted datasets:

adata_concat = ad.concat(
    [file1_adata_subset, file2_adata_subset],
    label="file",
    keys=[file1.description, file2.description],
)
adata_concat
AnnData object with n_obs × n_vars = 126 × 695
    obs: 'cell_type', 'file'
    obsm: 'X_umap'
adata_concat.obs.value_counts()
cell_type                                           file                 
CD16-positive, CD56-dim natural killer cell, human  Conde22                  114
conventional dendritic cell                         Conde22                    7
CD16-positive, CD56-dim natural killer cell, human  10x reference pbmc68k      3
conventional dendritic cell                         10x reference pbmc68k      2
dtype: int64
# clean up test instance
!lamin delete --force test-scrna
!rm -r ./test-scrna
💡 deleting instance testuser1/test-scrna
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-scrna.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna