
Validate & register flow cytometry data#

Flow cytometry is a technique used to analyze and sort cells or particles based on their physical and chemical characteristics as they flow in a fluid stream through a laser beam.

Here, we’ll transform, validate and register two flow cytometry datasets (Alpert19 and FlowIO sample) to demonstrate how to create and query a custom flow cytometry registry.

!lamin init --storage ./test-flow --schema bionty
💡 creating schemas: core==0.46.3 bionty==0.30.2 
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-31 00:33:05)
✅ saved: Storage(id='NiTcUlxC', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-flow', type='local', updated_at=2023-08-31 00:33:05, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/test-flow
💡 did not register local instance on hub (if you want, call `lamin register`)

import lamindb as ln
import lnschema_bionty as lb
import readfcs

lb.settings.species = "human"
✅ loaded instance: testuser1/test-flow (lamindb 0.51.2)
ln.track()
💡 notebook imports: lamindb==0.51.2 lnschema_bionty==0.30.2 readfcs==1.1.6
✅ saved: Transform(id='OWuTtS4SAponz8', name='Validate & register flow cytometry data', short_name='flow', version='0', type=notebook, updated_at=2023-08-31 00:33:08, created_by_id='DzTjkKse')
✅ saved: Run(id='brWmI66Jm8AhdPJqkSMs', run_at=2023-08-31 00:33:08, transform_id='OWuTtS4SAponz8', created_by_id='DzTjkKse')

Alpert19#

Transform #

(Here we skip the data transformation steps, which often include filtering, normalizing, or formatting data.)

We start with a flow cytometry file from Alpert19:

ln.dev.datasets.file_fcs_alpert19(
    populate_registries=True,  # pre-populate registries to simulate a used instance
)


PosixPath('Alpert19.fcs')

Use readfcs to read the fcs file into memory:

adata = readfcs.read("Alpert19.fcs")
adata
AnnData object with n_obs × n_vars = 166537 × 40
    var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnR'
    uns: 'meta'

Validate #

First, let’s validate the features in .var.

We’ll use the CellMarker reference to link features:

lb.CellMarker.validate(adata.var.index, "name");
27 terms (67.50%) are validated for name
13 terms (32.50%) are not validated for name: Time, Cell_length, Dead, (Ba138)Dd, Bead, CD19, CD4, IgD, CD11b, CD14, CCR6, CCR7, PD-1

13 of the 40 features aren’t validated. Some of them may simply be synonyms of registered names, so let’s standardize the identifiers first:

adata.var.index = lb.CellMarker.standardize(adata.var.index)
💡 standardized 35/40 terms

Great, now we can validate our markers once more:

validated = lb.CellMarker.validate(adata.var.index, "name")
35 terms (87.50%) are validated for name
5 terms (12.50%) are not validated for name: Time, Cell_length, Dead, (Ba138)Dd, Bead
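
To make the validate, standardize, validate loop concrete, here is a minimal pure-Python sketch. It is not how `lb.CellMarker` is implemented: the `REGISTRY` dict and the `validate` and `standardize` functions are hypothetical stand-ins, and resolving names by case-insensitive and synonym matching is an assumption based on the outputs in this notebook.

```python
# Hypothetical excerpt of a marker registry: canonical name -> synonyms.
REGISTRY = {
    "CD3": [],
    "CD8": [],
    "Cd19": [],
    "Cd4": [],
    "Igd": [],
    "CD11B": [],
    "Cd14": [],
    "Ccr6": [],
    "Ccr7": [],
    "PD1": ["PID1", "PD-1", "PD 1"],
}


def validate(names):
    """Return one boolean per name: True only for exact canonical names."""
    return [name in REGISTRY for name in names]


def standardize(names):
    """Replace names that match a canonical name or synonym (ignoring case)."""
    lookup = {}
    for canonical, synonyms in REGISTRY.items():
        for alias in [canonical, *synonyms]:
            lookup[alias.lower()] = canonical
    # Names without a match are returned unchanged.
    return [lookup.get(name.lower(), name) for name in names]


# A subset of the channel names from the Alpert19 file.
raw = ["CD3", "CD8", "Time", "Cell_length", "Dead", "(Ba138)Dd", "Bead",
       "CD19", "CD4", "IgD", "CD11b", "CD14", "CCR6", "CCR7", "PD-1"]

standardized = standardize(raw)
not_validated = [n for n, ok in zip(standardized, validate(standardized)) if not ok]
print(sum(validate(raw)), "validated before,", sum(validate(standardized)), "after")
print("still not validated:", not_validated)
```

In this toy setup, the names that remain unvalidated after standardization are exactly the five metadata-like channels reported above.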

Things look much better, but 5 features remain unvalidated, and they look more like metadata than cell markers. Hence, let’s curate the AnnData object a bit more.

Let’s move metadata (non-validated cell markers) into adata.obs:

adata.obs = adata[:, ~validated].to_df()
adata = adata[:, validated].copy()
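
The split above relies on a boolean mask with one entry per column. A minimal sketch of the same idea on plain Python lists follows; the `split_by_mask` helper and the toy values are hypothetical and not part of AnnData or lamindb.

```python
def split_by_mask(columns, rows, mask):
    """Split a table into (kept, dropped) parts using a per-column boolean mask."""
    keep_idx = [i for i, flag in enumerate(mask) if flag]
    drop_idx = [i for i, flag in enumerate(mask) if not flag]

    def take(indices):
        return {
            "columns": [columns[i] for i in indices],
            "rows": [[row[i] for i in indices] for row in rows],
        }

    return take(keep_idx), take(drop_idx)


columns = ["CD3", "Time", "CD8", "Dead"]
validated = [True, False, True, False]  # toy mask; in the notebook it comes from validate()
rows = [[1.2, 0.0, 3.4, 0.0],
        [0.5, 1.0, 2.2, 1.0]]

markers, metadata = split_by_mask(columns, rows, validated)
print(markers["columns"])   # columns that stay in the marker panel
print(metadata["columns"])  # columns that move to per-cell metadata
```

Note the order of the two notebook lines: the metadata columns are copied into `.obs` before `adata` is subset, because they are no longer part of `adata` afterwards.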

Now we have a clean panel of 35 cell markers:

lb.CellMarker.validate(adata.var.index, "name");
35 terms (100.00%) are validated for name

Next, let’s register the metadata features we moved to .obs:

# Feature.from_df creates feature records with type auto-populated
features = ln.Feature.from_df(adata.obs)
ln.add(features)

In addition, we’d like to link this file with external features:

ln.Feature.validate("assay", "name")
lb.ExperimentalFactor.validate("FACS", "name");
1 term (100.00%) is validated for name
1 term (100.00%) is not validated for name: FACS

Since “FACS” isn’t validated yet, let’s search the public ontology for it and register the matching term:

lb.ExperimentalFactor.bionty().search("FACS").head(2)
name                                 ontology_id  definition                                         synonyms                                           parents        molecule  instrument  measurement  __ratio__
fluorescence-activated cell sorting  EFO:0009108  A Flow Cytometry Assay That Provides A Method ...  FACS|FAC sorting                                   []             None      None        None         100.000000
acute chest syndrome                 EFO:0007129  A Vaso-Occlusive Crisis Of The Pulmonary Vascu...  ACS|Acute Chest Syndrome|acute chest syndrome|...  [EFO:0003818]  None      None        None         85.714286
facs = lb.ExperimentalFactor.from_bionty(ontology_id="EFO:0009108")
facs.save()
✅ created 1 ExperimentalFactor record from Bionty matching ontology_id: EFO:0009108
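
The `__ratio__` column ranks candidates by string similarity to the query. As a rough illustration only, the sketch below scores each record by its best-matching name or synonym using Python's `difflib`. The scoring used by the library may differ, and the `RECORDS` list is a hand-copied excerpt of the two rows shown above (the truncated synonym list is left truncated).

```python
from difflib import SequenceMatcher

# Excerpt of the two search hits; only synonyms visible in the output are included.
RECORDS = [
    {"name": "fluorescence-activated cell sorting",
     "ontology_id": "EFO:0009108",
     "synonyms": ["FACS", "FAC sorting"]},
    {"name": "acute chest syndrome",
     "ontology_id": "EFO:0007129",
     "synonyms": ["ACS", "Acute Chest Syndrome", "acute chest syndrome"]},
]


def similarity(a, b):
    """Similarity of two strings on a 0-100 scale."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def search(query, records):
    """Rank records by the best similarity over their name and synonyms."""
    scored = []
    for record in records:
        candidates = [record["name"], *record["synonyms"]]
        best = max(similarity(query, c) for c in candidates)
        scored.append((best, record["ontology_id"], record["name"]))
    return sorted(scored, reverse=True)


results = search("FACS", RECORDS)
for score, ontology_id, name in results:
    print(f"{score:6.2f}  {ontology_id}  {name}")
```

The point of the sketch is that a hit can rank first because of a synonym rather than its name, which is why "fluorescence-activated cell sorting" is the top result for "FACS".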

Adding a new modality:

modality = ln.Modality(name="protein", description="readouts of protein abundance")
modality.save()

Register #

file = ln.File.from_anndata(adata, description="Alpert19", var_ref=lb.CellMarker.name)
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/5OYjUKWnLnHsRjqirdpn.h5ad')
💡 parsing feature names of X stored in slot 'var'
35 terms (100.00%) are validated for name
✅    linked: FeatureSet(id='7QCcV3DoHj2jx4shgdlS', n=35, type='float', registry='bionty.CellMarker', hash='ldY9_GmptHLCcT7Nrpgo', created_by_id='DzTjkKse')
💡 parsing feature names of slot 'obs'
5 terms (100.00%) are validated for name
✅    linked: FeatureSet(id='MBDhHQxRTnN1J6Lt22oA', n=5, registry='core.Feature', hash='Ji1mZjV_jv2kGaWJrobX', modality_id='Rkh7Kh9E', created_by_id='DzTjkKse')
file.save()
✅ saved 2 feature sets for slots: 'var','obs'
✅ storing file '5OYjUKWnLnHsRjqirdpn' at '.lamindb/5OYjUKWnLnHsRjqirdpn.h5ad'
file.add_labels(facs, "assay")
file.add_labels(lb.settings.species, "species")
✅ linked new feature 'assay' together with new feature set FeatureSet(id='flrGQGa1ziUVN4lpvpbr', n=1, registry='core.Feature', hash='kGU8J8-duA5oc0mrtTkd', updated_at=2023-08-31 00:33:16, modality_id='Rkh7Kh9E', created_by_id='DzTjkKse')
💡 no file links to it anymore, deleting feature set FeatureSet(id='flrGQGa1ziUVN4lpvpbr', n=1, registry='core.Feature', hash='kGU8J8-duA5oc0mrtTkd', updated_at=2023-08-31 00:33:16, modality_id='Rkh7Kh9E', created_by_id='DzTjkKse')
✅ linked new feature 'species' together with new feature set FeatureSet(id='hvSLpsFsLAqafGjjkyI6', n=2, registry='core.Feature', hash='0gUpeO4ClC0Hhhygx6HB', updated_at=2023-08-31 00:33:16, modality_id='Rkh7Kh9E', created_by_id='DzTjkKse')
var_feature_set = file.features.get_feature_set("var")
var_feature_set.modality = modality
var_feature_set.save()
file.features
'var': FeatureSet(id='7QCcV3DoHj2jx4shgdlS', n=35, type='float', registry='bionty.CellMarker', hash='ldY9_GmptHLCcT7Nrpgo', updated_at=2023-08-31 00:33:16, modality_id='kIzH3RJp', created_by_id='DzTjkKse')
'obs': FeatureSet(id='MBDhHQxRTnN1J6Lt22oA', n=5, registry='core.Feature', hash='Ji1mZjV_jv2kGaWJrobX', updated_at=2023-08-31 00:33:16, modality_id='Rkh7Kh9E', created_by_id='DzTjkKse')
'external': FeatureSet(id='hvSLpsFsLAqafGjjkyI6', n=2, registry='core.Feature', hash='0gUpeO4ClC0Hhhygx6HB', updated_at=2023-08-31 00:33:16, modality_id='Rkh7Kh9E', created_by_id='DzTjkKse')

Check a few validated cell markers in .var:

file.features["var"].df().head(10)
id            name    synonyms              gene_symbol  ncbi_gene_id  uniprotkb_id  species_id  bionty_source_id  updated_at           created_by_id
ljp5UfCF9HCi  TCRgd   TCRGAMMADELTA|TCRγδ   None         None          None          uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
Nb2sscq9cBcB  CD57                          B3GAT1       27087         Q9P2W7        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
a4hvNp34IYP0  CD3                           None         None          None          uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
N2F6Qv9CxJch  CD11B                         ITGAM        3684          P11215        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
0qCmUijBeByY  CD94                          KLRD1        3824          Q13241        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
hVNEgxlcDV10  CD127                         IL7R         3575          P16871        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
2VeZenLi2dj5  PD1     PID1|PD-1|PD 1        PDCD1        5133          A0A0M3M0G7    uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
50v4SaR2m5zQ  CD25                          IL2RA        3559          P01589        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
gEfe8qTsIHl0  CD24                          CD24         100133941     B6EC88        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
ttBc0Fs01sYk  CD8                           CD8A         925           P01732        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse

FlowIO sample#

Let’s transform, validate and register another flow file:

Transform #

There are no further transformations necessary.

adata2 = readfcs.read(ln.dev.datasets.file_fcs())

Validate #

We’d like to track all features in .var, so we register them:

adata2.var.index = lb.CellMarker.bionty().standardize(adata2.var.index)
💡 standardized 14/16 terms
markers = lb.CellMarker.from_values(adata2.var.index, "name")
ln.save(markers)
✅ loaded 10 CellMarker records matching name: CD3, CD28, CD8, Cd4, CD57, Cd14, Cd19, CD27, Ccr7, CD127
✅ created 4 CellMarker records from Bionty matching name: CCR5, CD45RO, Ki67, SSC-A
did not create CellMarker records for 2 non-validated names: FSC-A, FSC-H
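
The three log lines above correspond to a three-way split of the input names: already in the local registry, creatable from the public reference, or unknown. A minimal sketch of that partition follows, using plain sets as hypothetical stand-ins for the local registry and the public reference. The names are copied from the log output; their order in `panel` is illustrative.

```python
def partition_names(names, local_registry, public_reference):
    """Split names into (loaded, created, unknown), preserving input order."""
    loaded, created, unknown = [], [], []
    for name in names:
        if name in local_registry:
            loaded.append(name)       # a record already exists locally
        elif name in public_reference:
            created.append(name)      # a new record can be built from the reference
        else:
            unknown.append(name)      # nothing to create a record from
    return loaded, created, unknown


# Stand-in data, taken from the log output above.
local_registry = {"CD3", "CD28", "CD8", "Cd4", "CD57", "Cd14", "Cd19", "CD27", "Ccr7", "CD127"}
public_reference = local_registry | {"CCR5", "CD45RO", "Ki67", "SSC-A"}
panel = ["CD3", "CD28", "CD8", "Cd4", "CD57", "Cd14", "Cd19", "CD27", "Ccr7", "CD127",
         "CCR5", "CD45RO", "Ki67", "SSC-A", "FSC-A", "FSC-H"]

loaded, created, unknown = partition_names(panel, local_registry, public_reference)
print(len(loaded), "loaded,", len(created), "created, unknown:", unknown)
```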

Standardize synonyms against the local registry and validate again; only FSC-A and FSC-H, which have no CellMarker record, remain unvalidated:

adata2.var.index = lb.CellMarker.standardize(adata2.var.index)
💡 standardized 14/16 terms
lb.CellMarker.validate(adata2.var.index, "name");
14 terms (87.50%) are validated for name
2 terms (12.50%) are not validated for name: FSC-A, FSC-H

Register #

file2 = ln.File.from_anndata(
    adata2, description="My fcs file", var_ref=lb.CellMarker.name
)
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/0aSp7Tesy9yOPoLrEilW.h5ad')
💡 parsing feature names of X stored in slot 'var'
14 terms (87.50%) are validated for name
2 terms (12.50%) are not validated for name: FSC-A, FSC-H
✅    linked: FeatureSet(id='uGQpF1iJVNBUBtuNGl1S', n=14, type='float', registry='bionty.CellMarker', hash='npy5P7AYbjKLInpXlNvb', created_by_id='DzTjkKse')
file2.save()
✅ saved 1 feature set for slot: 'var'
✅ storing file '0aSp7Tesy9yOPoLrEilW' at '.lamindb/0aSp7Tesy9yOPoLrEilW.h5ad'
file2.add_labels(facs, "assay")
file2.add_labels(lb.settings.species, "species")
✅ linked new feature 'assay' together with new feature set FeatureSet(id='E20DXLnp8xpoiGXDIEhe', n=1, registry='core.Feature', hash='kGU8J8-duA5oc0mrtTkd', updated_at=2023-08-31 00:33:21, modality_id='Rkh7Kh9E', created_by_id='DzTjkKse')
✅ loaded: FeatureSet(id='hvSLpsFsLAqafGjjkyI6', n=2, registry='core.Feature', hash='0gUpeO4ClC0Hhhygx6HB', updated_at=2023-08-31 00:33:16, modality_id='Rkh7Kh9E', created_by_id='DzTjkKse')
✅ linked new feature 'species' together with new feature set FeatureSet(id='hvSLpsFsLAqafGjjkyI6', n=2, registry='core.Feature', hash='0gUpeO4ClC0Hhhygx6HB', updated_at=2023-08-31 00:33:21, modality_id='Rkh7Kh9E', created_by_id='DzTjkKse')
var_feature_set = file2.features.get_feature_set("var")
var_feature_set.modality = modality
var_feature_set.save()
file2.features
'var': FeatureSet(id='uGQpF1iJVNBUBtuNGl1S', n=14, type='float', registry='bionty.CellMarker', hash='npy5P7AYbjKLInpXlNvb', updated_at=2023-08-31 00:33:21, modality_id='kIzH3RJp', created_by_id='DzTjkKse')
'external': FeatureSet(id='hvSLpsFsLAqafGjjkyI6', n=2, registry='core.Feature', hash='0gUpeO4ClC0Hhhygx6HB', updated_at=2023-08-31 00:33:21, modality_id='Rkh7Kh9E', created_by_id='DzTjkKse')
file2.view_flow()
https://d33wubrfki0l68.cloudfront.net/fe94c4fadbfa16a4707230bbff5c03c617c3896b/a800e/_images/08c828427cd56b694f0b46a22d7d40ad41f4d96f68195e99aba793904d708214.svg

Query by cell markers #

Which datasets have CD14 in the flow panel?

cell_markers = lb.CellMarker.lookup()
cell_markers.cd14
CellMarker(id='roEbL8zuLC5k', name='Cd14', synonyms='', gene_symbol='CD14', ncbi_gene_id='4695', uniprotkb_id='O43678', updated_at=2023-08-31 00:33:12, species_id='uHJU', bionty_source_id='qb2y', created_by_id='DzTjkKse')
panels_with_cd14 = ln.FeatureSet.filter(cell_markers=cell_markers.cd14).all()
ln.File.filter(feature_sets__in=panels_with_cd14).df()
id                    storage_id  key   suffix  accessor  description  version  initial_version_id  size      hash                    hash_type  transform_id    run_id                updated_at           created_by_id
5OYjUKWnLnHsRjqirdpn  NiTcUlxC    None  .h5ad   AnnData   Alpert19     None     None                33367624  14w5ElNsR_MqdiJtvnS1aw  md5        OWuTtS4SAponz8  brWmI66Jm8AhdPJqkSMs  2023-08-31 00:33:16  DzTjkKse
0aSp7Tesy9yOPoLrEilW  NiTcUlxC    None  .h5ad   AnnData   My fcs file  None     None                6876232   Cf4Fhfw_RDMtKd5amM6Gtw  md5        OWuTtS4SAponz8  brWmI66Jm8AhdPJqkSMs  2023-08-31 00:33:21  DzTjkKse
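
Conceptually, this query walks from a marker to the feature sets that contain it, and from those feature sets to the files linked to them. Below is a minimal sketch of that two-step lookup with plain dictionaries. The dictionaries are hypothetical stand-ins for the `FeatureSet` and `File` registries, the ids and descriptions are copied from the outputs above, and the marker lists are abbreviated.

```python
# feature set id -> marker names (abbreviated panels)
feature_sets = {
    "7QCcV3DoHj2jx4shgdlS": {"CD3", "CD8", "Cd14", "CD57", "PD1"},
    "uGQpF1iJVNBUBtuNGl1S": {"CD3", "CD8", "Cd14", "CD57", "CCR5"},
}
# file description -> ids of linked feature sets
files = {
    "Alpert19": ["7QCcV3DoHj2jx4shgdlS"],
    "My fcs file": ["uGQpF1iJVNBUBtuNGl1S"],
}


def files_with_marker(marker):
    """Return descriptions of files whose panel contains the marker."""
    # Step 1: feature sets that contain the marker.
    matching_sets = {fs_id for fs_id, markers in feature_sets.items() if marker in markers}
    # Step 2: files linked to at least one of those feature sets.
    return [name for name, fs_ids in files.items() if matching_sets.intersection(fs_ids)]


print(files_with_marker("Cd14"))  # both panels contain Cd14
print(files_with_marker("CCR5"))  # only the FlowIO sample contains CCR5
```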

Shared cell markers between two files:

files = ln.File.filter(feature_sets__in=panels_with_cd14, species__name="human").list()
file1, file2 = files[0], files[1]
file1_markers = file1.features["var"]
file2_markers = file2.features["var"]

shared_markers = file1_markers & file2_markers
shared_markers.list("name")
['CD57', 'CD3', 'CD127', 'Cd4', 'CD28', 'CD27', 'CD8', 'Cd19', 'Ccr7', 'Cd14']
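
The `&` operator here behaves like a set intersection over the markers of the two panels. A minimal sketch with built-in sets follows; the FlowIO names come from the log output in the FlowIO section, and the Alpert19 set is an excerpt limited to markers that appear in this notebook's outputs.

```python
# Excerpt of the Alpert19 panel: 16 of its 35 markers.
alpert19_panel = {
    "TCRgd", "CD57", "CD3", "CD11B", "CD94", "CD127", "PD1", "CD25", "CD24", "CD8",
    "Cd4", "CD28", "CD27", "Cd19", "Ccr7", "Cd14",
}
# The 14 validated markers of the FlowIO sample.
flowio_panel = {
    "CD3", "CD28", "CD8", "Cd4", "CD57", "Cd14", "Cd19", "CD27", "Ccr7", "CD127",
    "CCR5", "CD45RO", "Ki67", "SSC-A",
}

shared = alpert19_panel & flowio_panel        # markers measured in both panels
only_flowio = flowio_panel - alpert19_panel   # markers unique to the FlowIO sample

print(sorted(shared))
print(sorted(only_flowio))
```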

Flow marker registry#

Check out your CellMarker registry:

lb.CellMarker.filter().df()
id            name    synonyms              gene_symbol  ncbi_gene_id  uniprotkb_id  species_id  bionty_source_id  updated_at           created_by_id
ljp5UfCF9HCi  TCRgd   TCRGAMMADELTA|TCRγδ   None         None          None          uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
Nb2sscq9cBcB  CD57                          B3GAT1       27087         Q9P2W7        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
a4hvNp34IYP0  CD3                           None         None          None          uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
N2F6Qv9CxJch  CD11B                         ITGAM        3684          P11215        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
0qCmUijBeByY  CD94                          KLRD1        3824          Q13241        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
hVNEgxlcDV10  CD127                         IL7R         3575          P16871        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
2VeZenLi2dj5  PD1     PID1|PD-1|PD 1        PDCD1        5133          A0A0M3M0G7    uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
50v4SaR2m5zQ  CD25                          IL2RA        3559          P01589        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
gEfe8qTsIHl0  CD24                          CD24         100133941     B6EC88        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
ttBc0Fs01sYk  CD8                           CD8A         925           P01732        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
L0WKZ3fufq0J  CD11c                         ITGAX        3687          P20702        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
sYcK7uoWCtco  Ccr7                          CCR7         1236          P32248        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
a624IeIqbchl  CD45RA                        None         None          None          uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
L0m6f7FPiDeg  CD86                          CD86         942           A8K632        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
lRZYuH929QDw  CD85j                         None         None          None          uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
cFJEI6e6wml3  CD20                          MS4A1        931           A0A024R507    uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
4uiPHmCPV5i1  CXCR5                         CXCR5        643           A0N0R2        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
0vAls2cmLKWq  ICOS                          ICOS         29851         Q53QY6        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
n40112OuX7Cq  CD123                         IL3RA        3563          P26951        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
YA5Ezh6SAy10  DNA1                          None         None          None          uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
k0zGbSgZEX3q  HLADR   HLA‐DR|HLA-DR|HLA DR  None         None          None          uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
yCyTIVxZkIUz  DNA2                          DNA2         1763          P51530        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
uThe3c0V3d4i  CD27                          CD27         939           P26842        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
bspnQ0igku6c  CD16                          FCGR3A       2215          O75015        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
8OhpfB7wwV32  Cd19                          CD19         930           P15391        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
c3dZKHFOdllB  CD33                          CD33         945           P20138        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
fpPkjlGv15C9  Ccr6                          CCR6         1235          P51684        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
0evamYEdmaoY  Igd                           None         None          None          uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
4EojtgN0CjBH  CD161                         KLRB1        3820          Q12918        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
HEK41hvaIazP  Cd4                           CD4          920           B4DT49        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
CLFUvJpioHoA  CD28                          CD28         940           B4E0L1        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
agQD0dEzuoNA  CXCR3                         CXCR3        2833          P49682        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
h4rkCALR5WfU  CD56                          NCAM1        4684          P13591        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
roEbL8zuLC5k  Cd14                          CD14         4695          O43678        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
CR7DAHxybgyi  CD38                          CD38         952           B4E006        uHJU        qb2y              2023-08-31 00:33:12  DzTjkKse
UMsp5g0fgMwY  CCR5                          CCR5         1234          P51681        uHJU        qb2y              2023-08-31 00:33:20  DzTjkKse
Qa4ozz9tyesQ  Ki67    Ki-67|KI 67           None         None          None          uHJU        qb2y              2023-08-31 00:33:20  DzTjkKse
VZBURNy04vBi  SSC-A   SSC A|SSCA            None         None          None          uHJU        qb2y              2023-08-31 00:33:20  DzTjkKse
XvpJ6oL3SG7w  CD45RO                        None         None          None          uHJU        qb2y              2023-08-31 00:33:20  DzTjkKse
# a few tests
assert set(shared_markers.list("name")) == set(
    [
        "Ccr7",
        "CD3",
        "Cd14",
        "Cd19",
        "CD127",
        "CD27",
        "CD28",
        "CD8",
        "Cd4",
        "CD57",
    ]
)
ln.File.filter(feature_sets__in=panels_with_cd14).exists()
True
# clean up test instance
!lamin delete --force test-flow
!rm -r test-flow
💡 deleting instance testuser1/test-flow
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-flow.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-flow