`Protein`: uniprot, 2024-03

Track current notebook

!lamin load laminlabs/bionty-assets

💡 connected lamindb: laminlabs/bionty-assets

import lamindb as ln
import bionty as bt

ln.settings.transform.stem_uid = "Aas9oWTlUts6"
ln.settings.transform.version = "1"
run = ln.track()

new_ontology = ln.ULabel.filter(name="new_ontology").one()
run.transform.ulabels.add(new_ontology)

💡 connected lamindb: laminlabs/bionty-assets
💡 notebook imports: bionty==0.48.0 lamindb==0.75.0 pandas==2.0.0
💡 loaded: Transform(uid='Aas9oWTlUts65zKv', version='1', name='`Protein`: uniprot, 2024-03', key='protein-uniprot-2024-03', type='notebook', created_by_id=1, updated_at='2024-08-13 13:28:42 UTC')
💡 loaded: Run(uid='H98Sd5WQoG1Kvs4vqOG0', started_at='2024-08-13 14:32:30 UTC', is_consecutive=True, transform_id=3, created_by_id=1)

Curate source

import pandas as pd
import re

def parse_protein_names(protein_names):
    """Split a UniProt "Protein names" string into (recommended name, synonyms)."""
    # Guard against missing values (NaN), which are not strings
    if not isinstance(protein_names, str):
        return '', []
    # Split the string by parentheses or semicolons
    names = re.split(r'\(|\)|\s*;\s*', protein_names)
    # Remove empty strings and strip whitespace
    names = [name.strip() for name in names if name.strip()]

    if names:
        # The first entry is the recommended name; any remaining entries are synonyms
        return names[0], names[1:]
    return '', []

# Files are downloaded from: https://www.uniprot.org/uniprotkb
version = "2024-03"
filepaths = {
    "human": f"https://bionty-assets.s3.amazonaws.com/uniprot-human-{version}.tsv.gz",
    "mouse": f"https://bionty-assets.s3.amazonaws.com/uniprot-mouse-{version}.tsv.gz",
}

df_filenames = {}

for organism, filepath in filepaths.items():
    print(f"Loading {organism} data...")

    df = pd.read_csv(filepath, sep="\t")

    print(f"shape: {df.shape}")
    display(df.head())

    df['name'], df['Synonyms'] = zip(*df['Protein names'].apply(parse_protein_names))
    df['Synonyms'] = df['Synonyms'].apply(lambda x: '|'.join(x) if x else '')

    df = df.rename(
        columns={
            "Entry": "uniprotkb_id",
            "Synonyms": "synonyms",
            "Length": "length",
            "Gene Names (primary)": "gene_symbol",
            "Ensembl": "ensembl_gene_ids",
        }
    )

    # drop rows without a uniprotkb id, then sort by id and reset index
    df = df[~df["uniprotkb_id"].isnull()]
    df = df.sort_values("uniprotkb_id").reset_index(drop=True)

    split_columns = df["name"].str.split(r", |\. |\s\[|: |/", expand=True, regex=True)
    df["name"] = split_columns[0]
    df["description"] = split_columns.loc[:, 1:].apply(lambda x: ', '.join(x.dropna()), axis=1)

    df = df[["uniprotkb_id", "name", "description", "length", "synonyms", "gene_symbol", "ensembl_gene_ids"]]

    print(f"shape: {df.shape}, unique: {df.uniprotkb_id.is_unique}")
    display(df.head())

    filename = f"df_{organism}__uniprot__{version}__Protein.parquet"
    df.to_parquet(filename)
    df_filenames[organism] = filename

    print(f"Wrote {filename}.")
    print("------------------------------------------------")

Loading human data...
shape: (204088, 8)

	Entry	Reviewed	Entry Name	Protein names	Organism	Length	Gene Names (primary)	Ensembl
0	A0A024R1X5	unreviewed	A0A024R1X5_HUMAN	Beclin-1	Homo sapiens (Human)	450	BECN1	NaN
1	A0A024R274	unreviewed	A0A024R274_HUMAN	Mothers against decapentaplegic homolog (MAD h...	Homo sapiens (Human)	552	SMAD4	NaN
2	A0A024R324	unreviewed	A0A024R324_HUMAN	Transforming protein RhoA	Homo sapiens (Human)	193	RHOA	NaN
3	A0A024R5Z7	unreviewed	A0A024R5Z7_HUMAN	Annexin	Homo sapiens (Human)	339	ANXA2	NaN
4	A0A024R6A3	unreviewed	A0A024R6A3_HUMAN	Presenilin (EC 3.4.23.-)	Homo sapiens (Human)	467	PSEN1	NaN

shape: (204088, 7), unique: True

	uniprotkb_id	name	description	length	synonyms	gene_symbol	ensembl_gene_ids
0	A0A023HJ61	Ras-related protein Rab-4A		121		RAB4A	NaN
1	A0A023HN28	SRSF3	USP6 fusion protein	16		NaN	NaN
2	A0A023I7F4	Cytochrome b		380		CYTB	NaN
3	A0A023I7H2	NADH-ubiquinone oxidoreductase chain 5		603	EC 7.1.1.2	ND5	NaN
4	A0A023I7H5	ATP synthase subunit a		226		ATP6	NaN

Wrote df_human__uniprot__2024-03__Protein.parquet.
------------------------------------------------
Loading mouse data...
shape: (85830, 8)

	Entry	Reviewed	Entry Name	Protein names	Organism	Length	Ensembl	Gene Names (primary)
0	A0A075F5C6	unreviewed	A0A075F5C6_MOUSE	Heat shock factor 1 (Heat shock transcription ...	Mus musculus (Mouse)	531	ENSMUST00000228371.2;	Hsf1
1	A0A087WPF7	reviewed	AUTS2_MOUSE	Autism susceptibility gene 2 protein homolog	Mus musculus (Mouse)	1261	ENSMUST00000161226.11 [A0A087WPF7-1];ENSMUST00...	Auts2
2	A0A087WPU4	unreviewed	A0A087WPU4_MOUSE	FAT atypical cadherin 1	Mus musculus (Mouse)	159	ENSMUST00000186342.3;	Fat1
3	A0A087WRK1	unreviewed	A0A087WRK1_MOUSE	Predicted gene, 20814 (Predicted gene, 20855) ...	Mus musculus (Mouse)	222	ENSMUST00000185240.2;ENSMUST00000185245.2;ENSM...	Gm20905
4	A0A087WRT4	unreviewed	A0A087WRT4_MOUSE	FAT atypical cadherin 1	Mus musculus (Mouse)	4602	ENSMUST00000189017.8;	Fat1

shape: (85830, 7), unique: True

	uniprotkb_id	name	description	length	synonyms	gene_symbol	ensembl_gene_ids
0	A0A023JDV8	Creatine transporter SLC6A8 variant D		224		Slc6a8	NaN
1	A0A023NCR8	Cytochrome b		233	Complex III subunit 3|Complex III subunit III|...	cytB	NaN
2	A0A023NCS0	Cytochrome b		222	Complex III subunit 3|Complex III subunit III|...	cytB	NaN
3	A0A023ND59	Cytochrome b		227	Complex III subunit 3|Complex III subunit III|...	cytB	NaN
4	A0A023NDP0	Cytochrome b		242	Complex III subunit 3|Complex III subunit III|...	cytB	NaN

Wrote df_mouse__uniprot__2024-03__Protein.parquet.
------------------------------------------------
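
To make the two-stage name cleanup above concrete, here is a minimal standalone sketch that uses only the standard library. It repeats the parenthesis/semicolon split from `parse_protein_names` and then applies the same separator pattern as the name/description step, with `re.split` standing in for pandas' `str.split`. The helper `split_name_description` and the input `"SRSF3/USP6 fusion protein"` are assumptions made for illustration (the latter is a guess at the raw value behind the `SRSF3` row); the other inputs appear in the tables above.

```python
import re


def parse_protein_names(protein_names):
    """Return (recommended_name, synonyms) from a 'Protein names' string."""
    parts = re.split(r"\(|\)|\s*;\s*", protein_names)
    parts = [p.strip() for p in parts if p.strip()]
    if not parts:
        return "", []
    return parts[0], parts[1:]


def split_name_description(name):
    """Hypothetical helper: pure-Python stand-in for the pandas str.split step.

    The first piece becomes the name; the remaining pieces are joined
    with ", " to form the description.
    """
    parts = re.split(r", |\. |\s\[|: |/", name)
    return parts[0], ", ".join(parts[1:])


# Stage 1: recommended name plus pipe-joined synonyms
name, synonyms = parse_protein_names("Presenilin (EC 3.4.23.-)")
print(repr(name), repr("|".join(synonyms)))

# Stage 2: the recommended name is cut at the first separator
for raw in ["Cytochrome b", "Predicted gene, 20814", "SRSF3/USP6 fusion protein"]:
    print(split_name_description(raw))
```

Because the separator pattern includes `, ` and `/`, a recommended name that legitimately contains those characters is cut at the first occurrence and the remainder lands in `description`.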

Register in `laminlabs/bionty-assets`

Important

Make sure the source record has been added to the `bionty.Source` registry before registering the files:

1. Modify the `source.yaml` file in `bionty.base` to add the new source.
2. Load `laminlabs/bionty-assets` and run `bionty.core.sync_all_sources_to_latest()`.
3. Reload the instance via `lamin load laminlabs/bionty-assets`.

from bionty.core._bionty import register_source_in_bionty_assets

df_filenames

{'human': 'df_human__uniprot__2024-03__Protein.parquet',
 'mouse': 'df_mouse__uniprot__2024-03__Protein.parquet'}

for organism, filename in df_filenames.items():
    source_record = bt.Source.filter(name="uniprot", organism=organism, version=version, entity="bionty.Protein").one()
    register_source_in_bionty_assets(filepath=filename, source=source_record)

💡 returning existing artifact with same hash: Artifact(uid='4OH11KRwXhIN0NbiAJpF', key='df_human__uniprot__2024-03__Protein.parquet', suffix='.parquet', size=6221769, hash='tbnnZFBltLMYcRTwfj9ALw', _hash_type='md5', visibility=1, _key_is_virtual=False, created_by_id=1, storage_id=1, transform_id=2, run_id=2, updated_at='2024-08-13 12:05:46 UTC')
💡 returning existing artifact with same hash: Artifact(uid='R1nwLHai3OaaxdWw32gJ', key='df_mouse__uniprot__2024-03__Protein.parquet', suffix='.parquet', size=3298948, hash='sbahluuFMIjTYZjY43SexA', _hash_type='md5', visibility=1, _key_is_virtual=False, created_by_id=1, storage_id=1, transform_id=2, run_id=2, updated_at='2024-08-13 12:05:58 UTC')

ln.finish()
