Jupyter Notebook

Validate & register multi-modal data#

scRNA data has moved beyond just RNA and can also include the measurements of other modalities such as chromatin accessibility, surface proteins or adaptive immune receptors. ECCITE-seq is designed to enable interrogation of single-cell transcriptomes together with surface protein markers in the context of CRISPR screens.

Here, we’ll showcase how to curate and register ECCITE-seq data from Papalexi21 in the form of MuData objects.

Setup#

Hide code cell content
!lamin init --storage ./test-multimodal --schema bionty
πŸ’‘ creating schemas: core==0.46.3 bionty==0.30.2 
βœ… saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-31 00:34:21)
βœ… saved: Storage(id='YH8MCTlq', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal', type='local', updated_at=2023-08-31 00:34:21, created_by_id='DzTjkKse')
βœ… loaded instance: testuser1/test-multimodal
πŸ’‘ did not register local instance on hub (if you want, call `lamin register`)

import lamindb as ln
import lnschema_bionty as lb

lb.settings.species = "human"
ln.settings.verbosity = 3
βœ… loaded instance: testuser1/test-multimodal (lamindb 0.51.2)
ln.track()
πŸ’‘ notebook imports: lamindb==0.51.2 lnschema_bionty==0.30.2
βœ… saved: Transform(id='yMWSFirS6qv2z8', name='Validate & register multi-modal data', short_name='multimodal', version='0', type=notebook, updated_at=2023-08-31 00:34:24, created_by_id='DzTjkKse')
βœ… saved: Run(id='fNR29iikre30hSCopWVS', run_at=2023-08-31 00:34:24, transform_id='yMWSFirS6qv2z8', created_by_id='DzTjkKse')

Papalexi21#

Let’s use a MuData object:

Transform #

Hide code cell content
mdata = ln.dev.datasets.mudata_papalexi21_subset()
mdata
MuData object with n_obs Γ— n_vars = 200 Γ— 300
  var:	'name'
  4 modalities
    rna:	200 x 173
      obs:	'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var:	'name'
    adt:	200 x 4
      obs:	'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var:	'name'
    hto:	200 x 12
      obs:	'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var:	'name'
    gdo:	200 x 111
      obs:	'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var:	'name'

MuData objects build on top of AnnData objects to store and serialize multimodal data. More information can be found on the MuData documentation.

First we register the file:

file = ln.File(
    "papalexi21_subset.h5mu", description="Sub-sampled MuData from Papalexi21"
)
file.save()
βœ… storing file 'eQFeys58seNFN57rAbkR' at '.lamindb/eQFeys58seNFN57rAbkR.h5mu'

Now let’s validate and register the 3 feature sets this data contains:

  1. RNA (gene expression)

  2. ADT (antibody derived tags reflecting surface proteins)

  3. obs (metadata)

For the two modalities rna and adt, we use bionty tables as the reference:

Validate #

mdata["rna"].var_names[:5]
Index(['RP5-827C21.6', 'XX-CR54.1', 'SH2D6', 'RP11-379B18.5', 'RP11-778D9.12'], dtype='object', name='index')
lb.Gene.validate(mdata["rna"].var_names, lb.Gene.symbol);
❗ 173 terms (100.00%) are not validated for symbol: RP5-827C21.6, XX-CR54.1, SH2D6, RP11-379B18.5, RP11-778D9.12, RP11-703G6.1, AC005150.1, RP11-717H13.1, CTC-498J12.1, CTC-467M3.1, ARHGAP26-AS1, GABRA1, HIST1H4K, HLA-DQB1-AS1, RP11-524H19.2, SPACA1, VNN1, AC006042.7, AC002066.1, AC073934.6, ...
genes = lb.Gene.from_values(mdata["rna"].var_names, lb.Gene.symbol)
ln.save(genes)
βœ… created 77 Gene records from Bionty matching symbol: SH2D6, ARHGAP26-AS1, GABRA1, HLA-DQB1-AS1, SPACA1, VNN1, CTAGE15, PFKFB1, TRPC5, RBPMS-AS1, CA8, CSMD3, ZNF483, AK8, TMEM72-AS1, ARAP1-AS2, CRYAB, HOXC-AS2, LRRIQ1, TUBA3C, ...
βœ… created 12 Gene records from Bionty matching synonyms: CTC-467M3.1, HIST1H4K, CASC1, LARGE, NBPF16, C1orf65, IBA57-AS1, KIAA1239, TMEM75, AP003419.16, FAM65C, C14orf177
❗ ambiguous validation in Bionty for 6 records: HLA-DQB1-AS1, CTAGE15, CTRB2, LGALS9C, PCDHB11, TBC1D3G
❗ did not create Gene records for 84 non-validated symbols: AC002066.1, AC004019.13, AC005150.1, AC006042.7, AC011558.5, AC026471.6, AC073934.6, AC091132.1, AC092295.4, AC092687.5, AE000662.93, AL132989.1, AP000442.4, CTA-373H7.7, CTB-134F13.1, CTB-31O20.9, CTC-498J12.1, CTD-2562J17.2, CTD-3012A18.1, CTD-3065B20.2, ...
mdata["rna"].var_names = lb.Gene.standardize(mdata["rna"].var_names, lb.Gene.symbol)
πŸ’‘ standardized 89/173 terms
validated = lb.Gene.validate(mdata["rna"].var_names, lb.Gene.symbol)
βœ… 89 terms (51.40%) are validated for symbol
❗ 84 terms (48.60%) are not validated for symbol: RP5-827C21.6, XX-CR54.1, RP11-379B18.5, RP11-778D9.12, RP11-703G6.1, AC005150.1, RP11-717H13.1, CTC-498J12.1, RP11-524H19.2, AC006042.7, AC002066.1, AC073934.6, RP11-268G12.1, U52111.14, RP11-235C23.5, RP11-12J10.3, RP11-324E6.9, RP11-187A9.3, RP11-365N19.2, RP11-346D14.1, ...
new_genes = [
    lb.Gene(symbol=symbol, species=lb.settings.species)
    for symbol in mdata["rna"].var_names[~validated]
]
ln.save(new_genes)
lb.Gene.validate(mdata["rna"].var_names, lb.Gene.symbol);
βœ… 173 terms (100.00%) are validated for symbol
feature_set_rna = ln.FeatureSet.from_values(
    mdata["rna"].var_names, field=lb.Gene.symbol
)
βœ… 173 terms (100.00%) are validated for symbol
mdata["adt"].var_names
Index(['CD86', 'PDL1', 'PDL2', 'CD366'], dtype='object', name='index')
lb.CellMarker.validate(mdata["adt"].var_names, field=lb.CellMarker.name);
❗ 4 terms (100.00%) are not validated for name: CD86, PDL1, PDL2, CD366
markers = lb.CellMarker.from_values(mdata["adt"].var_names, field=lb.CellMarker.name)
ln.save(markers)
βœ… created 4 CellMarker records from Bionty matching name: CD86, PDL1, PDL2, CD366
lb.CellMarker.validate(mdata["adt"].var_names, field=lb.CellMarker.name);
βœ… 4 terms (100.00%) are validated for name

Register #

feature_set_adt = ln.FeatureSet.from_values(
    mdata["adt"].var_names, field=lb.CellMarker.name
)
βœ… 4 terms (100.00%) are validated for name

Link them to file:

file.features.add_feature_set(feature_set_rna, slot="rna")
file.features.add_feature_set(feature_set_adt, slot="adt")

The 3rd feature set is the obs:

obs = mdata["rna"].obs

We’re only interested in a single metadata column:

ln.Feature(name="gene_target", type="category").save()
features = ln.Feature.from_df(obs)
ln.save(features)
feature_set_obs = ln.FeatureSet.from_df(obs)
βœ… 19 terms (100.00%) are validated for name
file.features.add_feature_set(feature_set_obs, slot="obs")
gene_targets = lb.Gene.from_values(obs["gene_target"], lb.Gene.symbol)
ln.save(gene_targets)
file.add_labels(gene_targets, feature="gene_target")
βœ… created 23 Gene records from Bionty matching symbol: IFNGR1, CAV1, IRF7, ATF2, NFKBIA, STAT1, SPI1, JAK2, STAT2, IFNGR2, CD86, STAT5A, SMAD4, ETV7, IRF1, UBE2L6, PDCD1LG2, BRD4, POU2F2, STAT3, ...
βœ… created 1 Gene record from Bionty matching synonyms: MARCH8
❗ ambiguous validation in Bionty for 4 records: MARCHF8, IRF7, IFNGR2, TNFRSF14
❗ did not create Gene record for 1 non-validated symbol: NT
βœ… linked feature 'gene_target' to registry 'bionty.Gene'
nt = ln.Label(name="NT", description="Non-targeting control of perturbations")
nt.save()
file.add_labels(nt, feature="gene_target")
βœ… linked feature 'gene_target' to registry 'core.Label'
for col in ["orig.ident", "perturbation", "replicate", "Phase", "guide_ID"]:
    labels = [ln.Label(name=name) for name in obs[col].unique()]
    ln.save(labels)
βœ… loaded Label record with exact same name: 'NT'

Because none of these labels seem like something we’d want to track in the registry or validate, we don’t link them to the file.

file.features
'rna': FeatureSet(id='IbnlrBXLe1xcnsnTnx2P', n=184, type='numeric', registry='bionty.Gene', hash='Y8lsRtXCZKyPPberKAF0', updated_at=2023-08-31 00:34:32, created_by_id='DzTjkKse')
'adt': FeatureSet(id='14DjqUgwaX9Uw70nKiyy', n=4, type='numeric', registry='bionty.CellMarker', hash='b-CtyjgPRO0WN27lTOqC', updated_at=2023-08-31 00:34:32, created_by_id='DzTjkKse')
'obs': FeatureSet(id='kD5d0YHr8XmIDyBneujw', n=19, registry='core.Feature', hash='99-Pafh1DXs3QkUD6baP', updated_at=2023-08-31 00:34:33, created_by_id='DzTjkKse')
file.describe()
πŸ’‘ File(id='eQFeys58seNFN57rAbkR', suffix='.h5mu', accessor='MuData', description='Sub-sampled MuData from Papalexi21', size=606320, hash='RaivS3NesDOP-6kNIuaC3g', hash_type='md5', updated_at=2023-08-31 00:34:25)

Provenance:
    πŸ—ƒοΈ storage: Storage(id='YH8MCTlq', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal', type='local', updated_at=2023-08-31 00:34:21, created_by_id='DzTjkKse')
    πŸ’« transform: Transform(id='yMWSFirS6qv2z8', name='Validate & register multi-modal data', short_name='multimodal', version='0', type=notebook, updated_at=2023-08-31 00:34:25, created_by_id='DzTjkKse')
    πŸ‘£ run: Run(id='fNR29iikre30hSCopWVS', run_at=2023-08-31 00:34:24, transform_id='yMWSFirS6qv2z8', created_by_id='DzTjkKse')
    πŸ‘€ created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-31 00:34:21)
Features:
  adt:
    πŸ”— index (4, bionty.CellMarker.id): ['BK30rjK34sZd', '82nG0xqSuEQD', 'kbrA7wdDuqDK', 'L0m6f7FPiDeg'...]
  rna:
    πŸ”— index (184, bionty.Gene.id): ['wkGaHGTuTC00', 'WefD8C4raEa5', 'nfMjnjieml5b', 'lsBbhfQnMVj7', 'tLV3P0Sj0ayW'...]
  obs (metadata):
    πŸ”— gene_target (bionty.Gene|core.Label)
        πŸ”— gene_target (28, bionty.Gene): ['IFNGR2', 'NFKBIA', 'IRF7', 'BRD4', 'STAT2']
        πŸ”— gene_target (1, core.Label): ['NT']
file.view_flow()
https://d33wubrfki0l68.cloudfront.net/910c553dbe209dd6e1bd6421c3d3c750fff58854/cbf7a/_images/3818332427455c61e9ed2a435ac7e99c41b0d1c67564e6b379ca400748fb1094.svg
Hide code cell content
!lamin delete --force test-multimodal
!rm -r test-multimodal
πŸ’‘ deleting instance testuser1/test-multimodal
βœ…     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-multimodal.env
βœ…     instance cache deleted
βœ…     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal