Protein: uniprot, 2024-03
Track current notebook
!lamin load laminlabs/bionty-assets
💡 connected lamindb: laminlabs/bionty-assets
import lamindb as ln
import bionty as bt
ln.settings.transform.stem_uid = "Aas9oWTlUts6"
ln.settings.transform.version = "1"
run = ln.track()
new_ontology = ln.ULabel.filter(name="new_ontology").one()
run.transform.ulabels.add(new_ontology)
💡 connected lamindb: laminlabs/bionty-assets
💡 notebook imports: bionty==0.48.0 lamindb==0.75.0 pandas==2.0.0
💡 loaded: Transform(uid='Aas9oWTlUts65zKv', version='1', name='`Protein`: uniprot, 2024-03', key='protein-uniprot-2024-03', type='notebook', created_by_id=1, updated_at='2024-08-13 13:28:42 UTC')
💡 loaded: Run(uid='H98Sd5WQoG1Kvs4vqOG0', started_at='2024-08-13 14:32:30 UTC', is_consecutive=True, transform_id=3, created_by_id=1)
Curate source
import pandas as pd
import re
def parse_protein_names(protein_names):
    # Split the string by parentheses or semicolons
    names = re.split(r'\(|\)|\s*;\s*', protein_names)
    # Remove empty strings and strip whitespace
    names = [name.strip() for name in names if name.strip()]
    if names:
        # The first entry is the recommended name, the rest are synonyms
        recommended_name = names[0]
        synonyms = names[1:]
        return recommended_name, synonyms
    else:
        return '', []
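To make the behavior of `parse_protein_names` concrete, here is a minimal standalone sketch that repeats the function and runs it on a few strings. `Beclin-1` and `Presenilin (EC 3.4.23.-)` appear in the human table output; `Protein A (Alias 1) (Alias 2)` is a made-up input used only for illustration.

```python
import re


def parse_protein_names(protein_names):
    """Return (recommended_name, synonyms) for a UniProt 'Protein names' string."""
    # Split on opening/closing parentheses or semicolons
    names = re.split(r"\(|\)|\s*;\s*", protein_names)
    # Drop empty pieces and surrounding whitespace
    names = [name.strip() for name in names if name.strip()]
    if not names:
        return "", []
    return names[0], names[1:]


examples = [
    "Beclin-1",                       # from the human table output
    "Presenilin (EC 3.4.23.-)",       # from the human table output
    "Protein A (Alias 1) (Alias 2)",  # hypothetical input
    "",
]
results = {text: parse_protein_names(text) for text in examples}
for text, (name, synonyms) in results.items():
    # Synonyms are joined with "|" in the same way the notebook stores them
    print(repr(text), "->", repr(name), repr("|".join(synonyms)))
```

Because every parenthesis is treated as a delimiter, a name that itself contains parentheses would be split as well, which is worth keeping in mind when inspecting the synonyms column.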
# Files are downloaded from: https://www.uniprot.org/uniprotkb
version = "2024-03"
filepaths = {
    "human": f"https://bionty-assets.s3.amazonaws.com/uniprot-human-{version}.tsv.gz",
    "mouse": f"https://bionty-assets.s3.amazonaws.com/uniprot-mouse-{version}.tsv.gz",
}
df_filenames = {}
for organism, filepath in filepaths.items():
    print(f"Loading {organism} data...")
    df = pd.read_csv(filepath, sep="\t")
    print(f"shape: {df.shape}")
    display(df.head())
    # split "Protein names" into a recommended name and a "|"-joined synonym string
    df['name'], df['Synonyms'] = zip(*df['Protein names'].apply(parse_protein_names))
    df['Synonyms'] = df['Synonyms'].apply(lambda x: '|'.join(x) if x else '')
    df = df.rename(
        columns={
            "Entry": "uniprotkb_id",
            "Synonyms": "synonyms",
            "Length": "length",
            "Gene Names (primary)": "gene_symbol",
            "Ensembl": "ensembl_gene_ids",
        }
    )
    # drop rows without a uniprotkb id, sort by uniprotkb id, reset index
    df = df[~df["uniprotkb_id"].isnull()]
    df = df.sort_values("uniprotkb_id").reset_index(drop=True)
    # keep the text before the first delimiter as name, move the rest to description
    split_columns = df["name"].str.split(r", |\. |\s\[|: |/", expand=True, regex=True)
    df["name"] = split_columns[0]
    df["description"] = split_columns.loc[:, 1:].apply(lambda x: ', '.join(x.dropna()), axis=1)
    df = df[["uniprotkb_id", "name", "description", "length", "synonyms", "gene_symbol", "ensembl_gene_ids"]]
    print(f"shape: {df.shape}, unique: {df.uniprotkb_id.is_unique}")
    display(df.head())
    filename = f"df_{organism}__uniprot__{version}__Protein.parquet"
    df.to_parquet(filename)
    df_filenames[organism] = filename
    print(f"Wrote {filename}.")
    print("------------------------------------------------")
Loading human data...
shape: (204088, 8)
| | Entry | Reviewed | Entry Name | Protein names | Organism | Length | Gene Names (primary) | Ensembl |
|---|---|---|---|---|---|---|---|---|
| 0 | A0A024R1X5 | unreviewed | A0A024R1X5_HUMAN | Beclin-1 | Homo sapiens (Human) | 450 | BECN1 | NaN |
| 1 | A0A024R274 | unreviewed | A0A024R274_HUMAN | Mothers against decapentaplegic homolog (MAD h... | Homo sapiens (Human) | 552 | SMAD4 | NaN |
| 2 | A0A024R324 | unreviewed | A0A024R324_HUMAN | Transforming protein RhoA | Homo sapiens (Human) | 193 | RHOA | NaN |
| 3 | A0A024R5Z7 | unreviewed | A0A024R5Z7_HUMAN | Annexin | Homo sapiens (Human) | 339 | ANXA2 | NaN |
| 4 | A0A024R6A3 | unreviewed | A0A024R6A3_HUMAN | Presenilin (EC 3.4.23.-) | Homo sapiens (Human) | 467 | PSEN1 | NaN |
shape: (204088, 7), unique: True
| | uniprotkb_id | name | description | length | synonyms | gene_symbol | ensembl_gene_ids |
|---|---|---|---|---|---|---|---|
| 0 | A0A023HJ61 | Ras-related protein Rab-4A | | 121 | | RAB4A | NaN |
| 1 | A0A023HN28 | SRSF3 | USP6 fusion protein | 16 | | NaN | NaN |
| 2 | A0A023I7F4 | Cytochrome b | | 380 | | CYTB | NaN |
| 3 | A0A023I7H2 | NADH-ubiquinone oxidoreductase chain 5 | | 603 | EC 7.1.1.2 | ND5 | NaN |
| 4 | A0A023I7H5 | ATP synthase subunit a | | 226 | | ATP6 | NaN |
Wrote df_human__uniprot__2024-03__Protein.parquet.
------------------------------------------------
Loading mouse data...
shape: (85830, 8)
| | Entry | Reviewed | Entry Name | Protein names | Organism | Length | Ensembl | Gene Names (primary) |
|---|---|---|---|---|---|---|---|---|
| 0 | A0A075F5C6 | unreviewed | A0A075F5C6_MOUSE | Heat shock factor 1 (Heat shock transcription ... | Mus musculus (Mouse) | 531 | ENSMUST00000228371.2; | Hsf1 |
| 1 | A0A087WPF7 | reviewed | AUTS2_MOUSE | Autism susceptibility gene 2 protein homolog | Mus musculus (Mouse) | 1261 | ENSMUST00000161226.11 [A0A087WPF7-1];ENSMUST00... | Auts2 |
| 2 | A0A087WPU4 | unreviewed | A0A087WPU4_MOUSE | FAT atypical cadherin 1 | Mus musculus (Mouse) | 159 | ENSMUST00000186342.3; | Fat1 |
| 3 | A0A087WRK1 | unreviewed | A0A087WRK1_MOUSE | Predicted gene, 20814 (Predicted gene, 20855) ... | Mus musculus (Mouse) | 222 | ENSMUST00000185240.2;ENSMUST00000185245.2;ENSM... | Gm20905 |
| 4 | A0A087WRT4 | unreviewed | A0A087WRT4_MOUSE | FAT atypical cadherin 1 | Mus musculus (Mouse) | 4602 | ENSMUST00000189017.8; | Fat1 |
shape: (85830, 7), unique: True
| | uniprotkb_id | name | description | length | synonyms | gene_symbol | ensembl_gene_ids |
|---|---|---|---|---|---|---|---|
| 0 | A0A023JDV8 | Creatine transporter SLC6A8 variant D | | 224 | | Slc6a8 | NaN |
| 1 | A0A023NCR8 | Cytochrome b | | 233 | Complex III subunit 3\|Complex III subunit III\|... | cytB | NaN |
| 2 | A0A023NCS0 | Cytochrome b | | 222 | Complex III subunit 3\|Complex III subunit III\|... | cytB | NaN |
| 3 | A0A023ND59 | Cytochrome b | | 227 | Complex III subunit 3\|Complex III subunit III\|... | cytB | NaN |
| 4 | A0A023NDP0 | Cytochrome b | | 242 | Complex III subunit 3\|Complex III subunit III\|... | cytB | NaN |
Wrote df_mouse__uniprot__2024-03__Protein.parquet.
------------------------------------------------
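The name/description split used in the loop above can be illustrated without pandas. The sketch below applies the same delimiter pattern with `re.split` as a stand-in for `Series.str.split(..., expand=True)`; the helper name `split_name_and_description` is hypothetical, and `SRSF3/USP6 fusion protein` is an assumed input reconstructed from the `SRSF3` / `USP6 fusion protein` row in the human output.

```python
import re

# Same delimiters as in the notebook: ", ", ". ", whitespace + "[", ": ", "/"
NAME_DELIMITERS = re.compile(r", |\. |\s\[|: |/")


def split_name_and_description(name):
    """Hypothetical helper: text before the first delimiter is the name,
    the remaining pieces are joined with ', ' as the description."""
    parts = NAME_DELIMITERS.split(name)
    return parts[0], ", ".join(parts[1:])


examples = [
    "SRSF3/USP6 fusion protein",  # assumed input, see lead-in
    "Predicted gene, 20814",      # recommended name from the mouse table
    "Cytochrome b",               # no delimiter, description stays empty
]
for text in examples:
    name, description = split_name_and_description(text)
    print(f"{text!r} -> name={name!r}, description={description!r}")
```

Since many proteins share the text before the first delimiter, `name` is not expected to be unique after this step; the notebook checks uniqueness on `uniprotkb_id` only.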
Register in laminlabs/bionty-assets
Important
Please make sure the source_record has been added to the bionty.Source registry!
1. Modify the source.yaml file in bionty.base to add the new source.
2. Load laminlabs/bionty-assets and run bionty.core.sync_all_sources_to_latest().
3. Reload the instance via lamin load laminlabs/bionty-assets.
from bionty.core._bionty import register_source_in_bionty_assets
df_filenames
{'human': 'df_human__uniprot__2024-03__Protein.parquet',
'mouse': 'df_mouse__uniprot__2024-03__Protein.parquet'}
for organism, filename in df_filenames.items():
    source_record = bt.Source.filter(name="uniprot", organism=organism, version=version, entity="bionty.Protein").one()
    register_source_in_bionty_assets(filepath=filename, source=source_record)
💡 returning existing artifact with same hash: Artifact(uid='4OH11KRwXhIN0NbiAJpF', key='df_human__uniprot__2024-03__Protein.parquet', suffix='.parquet', size=6221769, hash='tbnnZFBltLMYcRTwfj9ALw', _hash_type='md5', visibility=1, _key_is_virtual=False, created_by_id=1, storage_id=1, transform_id=2, run_id=2, updated_at='2024-08-13 12:05:46 UTC')
💡 returning existing artifact with same hash: Artifact(uid='R1nwLHai3OaaxdWw32gJ', key='df_mouse__uniprot__2024-03__Protein.parquet', suffix='.parquet', size=3298948, hash='sbahluuFMIjTYZjY43SexA', _hash_type='md5', visibility=1, _key_is_virtual=False, created_by_id=1, storage_id=1, transform_id=2, run_id=2, updated_at='2024-08-13 12:05:58 UTC')
ln.finish()