Protein: uniprot, 2024-03

Track current notebook

!lamin load laminlabs/bionty-assets
💡 connected lamindb: laminlabs/bionty-assets
import lamindb as ln
import bionty as bt

ln.settings.transform.stem_uid = "Aas9oWTlUts6"
ln.settings.transform.version = "1"
run = ln.track()

new_ontology = ln.ULabel.filter(name="new_ontology").one()
run.transform.ulabels.add(new_ontology)
💡 connected lamindb: laminlabs/bionty-assets
💡 notebook imports: bionty==0.48.0 lamindb==0.75.0 pandas==2.0.0
💡 loaded: Transform(uid='Aas9oWTlUts65zKv', version='1', name='`Protein`: uniprot, 2024-03', key='protein-uniprot-2024-03', type='notebook', created_by_id=1, updated_at='2024-08-13 13:28:42 UTC')
💡 loaded: Run(uid='H98Sd5WQoG1Kvs4vqOG0', started_at='2024-08-13 14:32:30 UTC', is_consecutive=True, transform_id=3, created_by_id=1)

Curate source

import pandas as pd
import re
def parse_protein_names(protein_names):
    """Split a UniProt "Protein names" string into (recommended_name, synonyms)."""
    # Split the string on parentheses or semicolons
    names = re.split(r'\(|\)|\s*;\s*', protein_names)
    # Remove empty strings and strip whitespace
    names = [name.strip() for name in names if name.strip()]

    if not names:
        return '', []
    # The first entry is the recommended name; any remaining entries are synonyms
    return names[0], names[1:]
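To make the splitting rule concrete, here is a minimal standalone sketch of what the function returns for a few inputs. The first string appears in the human preview output of this notebook. The nested-parentheses string is a hypothetical input, included only to show that every parenthesis level ends up in the flat synonym list.

```python
import re


def parse_protein_names(protein_names):
    # Same logic as the notebook cell, repeated so this snippet runs on its own
    names = re.split(r"\(|\)|\s*;\s*", protein_names)
    names = [name.strip() for name in names if name.strip()]
    if not names:
        return "", []
    return names[0], names[1:]


# Value that appears in the "Protein names" column of the human preview
simple = parse_protein_names("Presenilin (EC 3.4.23.-)")

# Hypothetical input: nested parentheses are flattened into the synonym list
nested = parse_protein_names("A (B (C))")

# An empty string yields an empty name and no synonyms
empty = parse_protein_names("")

print(simple)
print(nested)
print(empty)
```

Because the split treats every parenthesis as a delimiter, names that legitimately contain parentheses would also be broken apart, so spot-checking a sample of parsed names is probably worthwhile.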
# Files were originally downloaded from https://www.uniprot.org/uniprotkb;
# the copies read below are hosted in the bionty-assets S3 bucket
version = "2024-03"
filepaths = {
    "human": f"https://bionty-assets.s3.amazonaws.com/uniprot-human-{version}.tsv.gz",
    "mouse": f"https://bionty-assets.s3.amazonaws.com/uniprot-mouse-{version}.tsv.gz",
}
df_filenames = {}

for organism, filepath in filepaths.items():
    print(f"Loading {organism} data...")

    df = pd.read_csv(filepath, sep="\t")

    print(f"shape: {df.shape}")
    display(df.head())

    df['name'], df['Synonyms'] = zip(*df['Protein names'].apply(parse_protein_names))
    df['Synonyms'] = df['Synonyms'].apply(lambda x: '|'.join(x) if x else '')

    df = df.rename(
        columns={
            "Entry": "uniprotkb_id",
            "Synonyms": "synonyms",
            "Length": "length",
            "Gene Names (primary)": "gene_symbol",
            "Ensembl": "ensembl_gene_ids",
        }
    )

    # drop rows without a UniProtKB ID, then sort by ID and reset the index
    df = df[df["uniprotkb_id"].notna()]
    df = df.sort_values("uniprotkb_id").reset_index(drop=True)

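    # split names on ", ", ". ", whitespace + "[", ": " or "/";
    # the first piece is kept as the name, the rest become the description
    split_columns = df["name"].str.split(r", |\. |\s\[|: |/", expand=True, regex=True)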
    df["name"] = split_columns[0]
    df["description"] = split_columns.loc[:, 1:].apply(lambda x: ', '.join(x.dropna()), axis=1)

    df = df[["uniprotkb_id", "name", "description", "length", "synonyms", "gene_symbol", "ensembl_gene_ids"]]

    print(f"shape: {df.shape}, unique: {df.uniprotkb_id.is_unique}")
    display(df.head())

    filename = f"df_{organism}__uniprot__{version}__Protein.parquet"
    df.to_parquet(filename)
    df_filenames[organism] = filename

    print(f"Wrote {filename}.")
    print("------------------------------------------------")
Loading human data...
shape: (204088, 8)
   Entry       Reviewed    Entry Name        Protein names                                      Organism              Length  Gene Names (primary)  Ensembl
0  A0A024R1X5  unreviewed  A0A024R1X5_HUMAN  Beclin-1                                           Homo sapiens (Human)  450     BECN1                 NaN
1  A0A024R274  unreviewed  A0A024R274_HUMAN  Mothers against decapentaplegic homolog (MAD h...  Homo sapiens (Human)  552     SMAD4                 NaN
2  A0A024R324  unreviewed  A0A024R324_HUMAN  Transforming protein RhoA                          Homo sapiens (Human)  193     RHOA                  NaN
3  A0A024R5Z7  unreviewed  A0A024R5Z7_HUMAN  Annexin                                            Homo sapiens (Human)  339     ANXA2                 NaN
4  A0A024R6A3  unreviewed  A0A024R6A3_HUMAN  Presenilin (EC 3.4.23.-)                           Homo sapiens (Human)  467     PSEN1                 NaN
shape: (204088, 7), unique: True
   uniprotkb_id  name                                    description          length  synonyms    gene_symbol  ensembl_gene_ids
0  A0A023HJ61    Ras-related protein Rab-4A                                   121                 RAB4A        NaN
1  A0A023HN28    SRSF3                                   USP6 fusion protein  16                  NaN          NaN
2  A0A023I7F4    Cytochrome b                                                 380                 CYTB         NaN
3  A0A023I7H2    NADH-ubiquinone oxidoreductase chain 5                       603     EC 7.1.1.2  ND5          NaN
4  A0A023I7H5    ATP synthase subunit a                                       226                 ATP6         NaN
Wrote df_human__uniprot__2024-03__Protein.parquet.
------------------------------------------------
Loading mouse data...
shape: (85830, 8)
   Entry       Reviewed    Entry Name        Protein names                                      Organism              Length  Ensembl                                            Gene Names (primary)
0  A0A075F5C6  unreviewed  A0A075F5C6_MOUSE  Heat shock factor 1 (Heat shock transcription ...  Mus musculus (Mouse)  531     ENSMUST00000228371.2;                              Hsf1
1  A0A087WPF7  reviewed    AUTS2_MOUSE       Autism susceptibility gene 2 protein homolog       Mus musculus (Mouse)  1261    ENSMUST00000161226.11 [A0A087WPF7-1];ENSMUST00...  Auts2
2  A0A087WPU4  unreviewed  A0A087WPU4_MOUSE  FAT atypical cadherin 1                            Mus musculus (Mouse)  159     ENSMUST00000186342.3;                              Fat1
3  A0A087WRK1  unreviewed  A0A087WRK1_MOUSE  Predicted gene, 20814 (Predicted gene, 20855) ...  Mus musculus (Mouse)  222     ENSMUST00000185240.2;ENSMUST00000185245.2;ENSM...  Gm20905
4  A0A087WRT4  unreviewed  A0A087WRT4_MOUSE  FAT atypical cadherin 1                            Mus musculus (Mouse)  4602    ENSMUST00000189017.8;                              Fat1
shape: (85830, 7), unique: True
   uniprotkb_id  name                                   description  length  synonyms                                           gene_symbol  ensembl_gene_ids
0  A0A023JDV8    Creatine transporter SLC6A8 variant D               224                                                        Slc6a8       NaN
1  A0A023NCR8    Cytochrome b                                        233     Complex III subunit 3|Complex III subunit III|...  cytB         NaN
2  A0A023NCS0    Cytochrome b                                        222     Complex III subunit 3|Complex III subunit III|...  cytB         NaN
3  A0A023ND59    Cytochrome b                                        227     Complex III subunit 3|Complex III subunit III|...  cytB         NaN
4  A0A023NDP0    Cytochrome b                                        242     Complex III subunit 3|Complex III subunit III|...  cytB         NaN
Wrote df_mouse__uniprot__2024-03__Protein.parquet.
------------------------------------------------
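The second parsing step in the loop, which separates `name` from `description`, can be illustrated without pandas. The sketch below applies the same regular expression with plain `re`. The helper `split_name_description` is hypothetical and not part of the notebook. The third input is an assumed raw name, chosen to show how a "/" inside a name moves the remainder into the description.

```python
import re

# Same pattern as in the loop: ", ", ". ", whitespace followed by "[", ": ", or "/"
NAME_SPLIT = re.compile(r", |\. |\s\[|: |/")


def split_name_description(name):
    # Hypothetical helper: the first piece is the name,
    # the remaining pieces are joined with ", " as the description
    parts = NAME_SPLIT.split(name)
    return parts[0], ", ".join(parts[1:])


examples = [
    "Beclin-1",                   # no delimiter: description stays empty
    "Predicted gene, 20814",      # split on ", "
    "SRSF3/USP6 fusion protein",  # assumed input: split on "/"
]
results = [split_name_description(example) for example in examples]

for example, (name, description) in zip(examples, results):
    print(f"{example!r} -> name={name!r}, description={description!r}")
```

For non-null strings this should match what `str.split(..., expand=True)` followed by the `', '.join(...)` step produces, although the pandas version additionally handles missing values through `dropna()`.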

Register in laminlabs/bionty-assets

Important

Make sure the source record has been added to the bionty.Source registry before running the registration cell:

  1. Modify the source.yaml file in bionty.base to add the new source

  2. Load laminlabs/bionty-assets and run bionty.core.sync_all_sources_to_latest()

  3. Reload the instance via lamin load laminlabs/bionty-assets

from bionty.core._bionty import register_source_in_bionty_assets
df_filenames
{'human': 'df_human__uniprot__2024-03__Protein.parquet',
 'mouse': 'df_mouse__uniprot__2024-03__Protein.parquet'}
for organism, filename in df_filenames.items():
    source_record = bt.Source.filter(name="uniprot", organism=organism, version=version, entity="bionty.Protein").one()
    register_source_in_bionty_assets(filepath=filename, source=source_record)
💡 returning existing artifact with same hash: Artifact(uid='4OH11KRwXhIN0NbiAJpF', key='df_human__uniprot__2024-03__Protein.parquet', suffix='.parquet', size=6221769, hash='tbnnZFBltLMYcRTwfj9ALw', _hash_type='md5', visibility=1, _key_is_virtual=False, created_by_id=1, storage_id=1, transform_id=2, run_id=2, updated_at='2024-08-13 12:05:46 UTC')
💡 returning existing artifact with same hash: Artifact(uid='R1nwLHai3OaaxdWw32gJ', key='df_mouse__uniprot__2024-03__Protein.parquet', suffix='.parquet', size=3298948, hash='sbahluuFMIjTYZjY43SexA', _hash_type='md5', visibility=1, _key_is_virtual=False, created_by_id=1, storage_id=1, transform_id=2, run_id=2, updated_at='2024-08-13 12:05:58 UTC')
ln.finish()