UniProtKB table -> bionty.Protein().df
import pandas as pd
import lamindb as ln
from lnschema_bionty import id
ln.nb.header()
author | Sunny Sun (sunnyosun) |
id | uV9o7RZmv6rG |
version | 1 |
time_init | 2022-09-26 21:17 |
time_run | 2022-10-25 16:23 |
consecutive_cells | True |
pypackage | lamindb==0.6.0 lnschema_bionty==0.4.3 pandas==1.5.0 |
Files are downloaded from: https://www.uniprot.org/uniprotkb
# Downloaded on 2022-09-26
filepaths = {
    "human": "https://bionty-assets.s3.amazonaws.com/uniprot-human.tsv.gz",
    "mouse": "https://bionty-assets.s3.amazonaws.com/uniprot-mouse.tsv.gz",
}
Curate the tables
allids = []
for species, filepath in filepaths.items():
    print(f"Loading {species} data...")
    df = pd.read_csv(filepath, sep="\t")
    # assign a newly generated protein id to each entry
    ids = [id.protein() for _ in df.index]
    df.index = ids
    df.index.name = "id"
    allids += ids
    print(f"shape: {df.shape}")
    display(df.head())
    filename = f"uniprot-{species}.parquet"
    df.to_parquet(filename)
    print(f"Wrote {filename}.")
    print("------------------------------------------------")
# ids must be unique across both species
assert len(allids) == len(set(allids))
Loading human data...
shape: (204961, 9)
id | Entry | Entry Name | Protein names | Length | Organism (ID) | Gene Names (primary) | Gene Names (synonym) | Ensembl | GeneID |
---|---|---|---|---|---|---|---|---|---|
1zrr8Wy | A0A024QZ08 | A0A024QZ08_HUMAN | Intraflagellar transport 20 homolog (Chlamydom... | 132 | 9606 | IFT20 | NaN | NaN | 90410; |
xNgxtFu | A0A024QZ86 | A0A024QZ86_HUMAN | T-box 2, isoform CRA_a | 712 | 9606 | TBX2 | NaN | NaN | 6909; |
X9K8OgK | A0A024QZA8 | A0A024QZA8_HUMAN | Receptor protein-tyrosine kinase, EC 2.7.10.1 | 976 | 9606 | EPHA2 | NaN | NaN | 1969; |
8jW9Ci4 | A0A024QZB8 | A0A024QZB8_HUMAN | Battenin | 438 | 9606 | CLN3 | NaN | NaN | 1201; |
nZNsA6F | A0A024QZQ1 | A0A024QZQ1_HUMAN | Sirtuin (Silent mating type information regula... | 747 | 9606 | SIRT1 | NaN | NaN | 23411; |
Wrote uniprot-human.parquet.
------------------------------------------------
Loading mouse data...
shape: (86436, 9)
id | Entry | Entry Name | Protein names | Length | Organism (ID) | Gene Names (primary) | Gene Names (synonym) | Ensembl | GeneID |
---|---|---|---|---|---|---|---|---|---|
oWysKQr | A0A075F5C6 | A0A075F5C6_MOUSE | Heat shock factor protein 1 (Heat shock transc... | 531 | 10090 | Hsf1 | NaN | ENSMUST00000228371.2; | 15499; |
IGupwHD | A0A087WPF7 | AUTS2_MOUSE | Autism susceptibility gene 2 protein homolog | 1261 | 10090 | Auts2 | Kiaa0442 | ENSMUST00000161226 [A0A087WPF7-1];ENSMUST00000... | NaN |
XrEF1mC | A0A087WPT2 | A0A087WPT2_MOUSE | Prostaglandin G/H synthase 2 | 62 | 10090 | Ptgs2 | NaN | ENSMUST00000190784.2; | NaN |
qACNsPf | A0A087WPU4 | A0A087WPU4_MOUSE | FAT atypical cadherin 1 | 159 | 10090 | Fat1 | NaN | ENSMUST00000186342.3; | NaN |
izgkQbe | A0A087WRK1 | A0A087WRK1_MOUSE | Predicted gene, 20814 (Predicted gene, 20850) ... | 222 | 10090 | Gm20850 | Gm20814 Gm20835 Gm20855 Gm20869 Gm20870 Gm2088... | ENSMUST00000185240.2;ENSMUST00000185245.2;ENSM... | 100042201;100042279;100042594;100861691;108167... |
Wrote uniprot-mouse.parquet.
------------------------------------------------
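The final `assert` in the cell above matters because the ids are generated independently for each row. The following is a minimal sketch of why, not the `lnschema_bionty` implementation: it assumes the ids are 7 characters drawn uniformly from a 62-character alphabet (the length matches the ids shown above; the alphabet is a guess). `random_id`, `assign_unique_ids`, and `collision_probability` are hypothetical helpers. Under these assumptions, the birthday-bound estimate for the combined human and mouse tables comes out on the order of 1%, so a collision is unlikely but not negligible.

```python
import math
import secrets
import string

# Assumption: ids are drawn uniformly from a 62-character alphabet.
# This is a stand-in for id.protein(), not its actual implementation.
BASE62 = string.digits + string.ascii_letters


def random_id(n_char: int = 7) -> str:
    """Return one random id of n_char base62 characters."""
    return "".join(secrets.choice(BASE62) for _ in range(n_char))


def assign_unique_ids(n: int, n_char: int = 7) -> list[str]:
    """Generate n ids, redrawing whenever a candidate was already used."""
    seen: set[str] = set()
    ids: list[str] = []
    while len(ids) < n:
        candidate = random_id(n_char)
        if candidate not in seen:
            seen.add(candidate)
            ids.append(candidate)
    return ids


def collision_probability(n: int, n_char: int = 7) -> float:
    """Birthday-bound estimate of at least one collision among n random ids."""
    space = len(BASE62) ** n_char
    return 1.0 - math.exp(-n * (n - 1) / (2 * space))


# Row counts taken from the shapes printed above (human + mouse).
n_entries = 204961 + 86436
print(f"Estimated collision probability: {collision_probability(n_entries):.2%}")
```

Redrawing on collision, as `assign_unique_ids` does, makes uniqueness hold by construction; the notebook instead checks it after the fact with the `assert`.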
Push to bionty-assets.lndb
!lndb load bionty-assets
migrate-unnecessary
!lndb login sunnyosun
ingest = ln.db.Ingest()
ingest.add("uniprot-human.parquet")
ingest.add("uniprot-mouse.parquet");
ingest.commit()
✅ Cell numbers increase consecutively: Awesome!
2022-10-25 18:22:19,238:INFO - Found credentials in shared credentials file: ~/.aws/credentials
Upload /Users/sunnysun/Documents/repos.nosync/bionty-assets/docs/ingest/uniprot-human.parquet: 1.00
Upload /Users/sunnysun/Documents/repos.nosync/bionty-assets/docs/ingest/uniprot-mouse.parquet: 1.00
ℹ️ Added notebook 'UniProtKB table -> `bionty.Protein().df`' (uV9o7RZmv6rG, 1) by user sunnyosun.
✅ Ingested the following dobjects:
+---+-----------------------------------------------+--------------------------------------------------------------+----------------------+
| | dobject | jupynb | user |
+---+-----------------------------------------------+--------------------------------------------------------------+----------------------+
| 0 | uniprot-human.parquet (5WBmdkTO4JCFzPzBcDOJ3) | 'UniProtKB table -> `bionty.Protein().df`' (uV9o7RZmv6rG, 1) | sunnyosun (kmvZDIX9) |
| 1 | uniprot-mouse.parquet (6vgntdGiAbz5bEYP53sma) | 'UniProtKB table -> `bionty.Protein().df`' (uV9o7RZmv6rG, 1) | sunnyosun (kmvZDIX9) |
+---+-----------------------------------------------+--------------------------------------------------------------+----------------------+
ℹ️ Set notebook version to 1 & wrote pypackages.
Now on S3:
human: https://bionty-assets.s3.amazonaws.com/5WBmdkTO4JCFzPzBcDOJ3.parquet
mouse: https://bionty-assets.s3.amazonaws.com/6vgntdGiAbz5bEYP53sma.parquet
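The two URLs above follow the pattern bucket URL, dobject id, file suffix. As a hedged sketch of how a consumer might read the tables back, the snippet below rebuilds the URLs from the dobject ids reported in the ingest output. `dobject_url` and `load_protein_table` are hypothetical helpers, not part of lamindb or bionty, and loading assumes the objects are publicly readable and that pandas has a parquet engine installed.

```python
BUCKET_URL = "https://bionty-assets.s3.amazonaws.com"

# dobject ids as reported in the ingest output above
DOBJECT_IDS = {
    "human": "5WBmdkTO4JCFzPzBcDOJ3",
    "mouse": "6vgntdGiAbz5bEYP53sma",
}


def dobject_url(dobject_id: str, suffix: str = ".parquet") -> str:
    """Build the storage URL of an ingested dobject from its id."""
    return f"{BUCKET_URL}/{dobject_id}{suffix}"


def load_protein_table(species: str):
    """Read one curated table back as a DataFrame (requires network access)."""
    import pandas as pd  # imported lazily so building URLs needs no dependencies

    return pd.read_parquet(dobject_url(DOBJECT_IDS[species]))


for species, dobject_id in DOBJECT_IDS.items():
    print(f"{species}: {dobject_url(dobject_id)}")
```

Calling `load_protein_table("human")` should return the table with the generated `id` column as its index, provided the file is still available at that location.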