`CellLine`: clo; 2022-03-21

The CLO OWL files lack metadata such as definitions and synonyms, so we parse these fields manually from the CSV export.

Download clo.csv.gz from: https://data.bioontology.org/ontologies/CLO/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb&download_format=csv

Ontology page: https://bioportal.bioontology.org/ontologies/CLO
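The download step could also be scripted. The following is a minimal sketch that only assumes the URL pattern shown above; `build_download_url` and `download_ontology_csv` are hypothetical helper names, the API key is a placeholder for your own BioPortal key, and whether the response is already gzip-compressed is an assumption based on the file name.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

# Base of the BioPortal REST endpoint, taken from the download URL above
BASE = "https://data.bioontology.org/ontologies"


def build_download_url(ontology: str, apikey: str, fmt: str = "csv") -> str:
    """Assemble the download URL for one ontology (hypothetical helper)."""
    query = urlencode({"apikey": apikey, "download_format": fmt})
    return f"{BASE}/{ontology}/download?{query}"


def download_ontology_csv(ontology: str, apikey: str, dest: str) -> str:
    """Fetch the CSV export and save it to dest (requires network access)."""
    urlretrieve(build_download_url(ontology, apikey), dest)
    return dest


# "MY_KEY" is a placeholder, not a real key
print(build_download_url("CLO", "MY_KEY"))
```

Calling `download_ontology_csv("CLO", "MY_KEY", "clo.csv.gz")` would then save the file under the name used in the rest of this notebook.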

import pandas as pd


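def df_from_csv(csv_filepath, prefix):
    """Parse a BioPortal CSV export into a table indexed by ontology ID."""
    # low_memory=False reads the file in one pass and avoids mixed-dtype columns
    df = pd.read_csv(csv_filepath, low_memory=False)
    # df = df[~df["Obsolete"]]

    # turn class IRIs into compact IDs, e.g. ".../obo/CLO_0000001" -> "CLO:0000001"
    df["ontology_id"] = (
        df["Class ID"]
        .str.replace("http://purl.obolibrary.org/obo/", "", regex=False)
        .str.replace("_", ":", regex=False)
    )
    # keep only terms of this ontology; copy so later edits do not act on a view
    df = df[df["ontology_id"].str.startswith(prefix)].copy()

    # drop the existing lowercase "definition" column so that renaming
    # "Definitions" below does not produce a duplicate column
    df = df.drop(columns=["definition"])
    df = df.rename(
        columns={
            "Preferred Label": "name",
            "Synonyms": "synonyms",
            "Definitions": "definition",
            "Parents": "parents",
        }
    )

    # parents are "|"-separated IRIs; keep only parents from the same ontology
    parents = []
    for p in df["parents"]:
        try:
            plist = [
                i
                for i in p.replace("http://purl.obolibrary.org/obo/", "")
                .replace("_", ":")
                .split("|")
                if i.startswith(prefix)
            ]
            parents.append(plist)
        except AttributeError:
            # a missing value (NaN) has no .replace: the term has no parents
            parents.append([])
    df["parents"] = parents
    df = df[["ontology_id", "name", "definition", "synonyms", "parents"]]
    df = df.sort_values("ontology_id")

    # drop duplicated names, keep the last record
    df = df.drop_duplicates("name", keep="last")

    return df.set_index("ontology_id")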

df = df_from_csv("clo.csv.gz", "CLO")


df

	name	definition	synonyms	parents
ontology_id
CLO:0000000	cell line cell culturing	a maintaining cell culture process that keeps ...	NaN	[]
CLO:0000001	cell line cell	A cultured cell that is part of a cell line - ...	NaN	[]
CLO:0000002	suspension cell line culturing	suspension cell line culturing is a cell line ...	NaN	[CLO:0000000]
CLO:0000003	adherent cell line culturing	adherent cell line culturing is a cell line cu...	NaN	[CLO:0000000]
CLO:0000004	cell line cell modification	a material processing that modifies an existin...	NaN	[]
...	...	...	...	...
CLO:0051617	RCB0187 cell	A immortal medaka cell line cell that has the ...	RCB0187|OLHE-131	[CLO:0009822]
CLO:0051618	RCB2945 cell	A immortal medaka cell line cell that has the ...	RCB2945|DIT29	[CLO:0009822]
CLO:0051619	RCB0184 cell	A immortal medaka cell line cell that has the ...	OLF-136|RCB0184	[CLO:0009822]
CLO:0051620	RCB0188 cell	A immortal medaka cell line cell that has the ...	RCB0188|OLME-104	[CLO:0009822]
CLO:0051621	RCB2319 cell	A immortal cell line cell that has the charact...	LACF-NaNaI|RCB2319	[CLO:0000019]

39037 rows × 4 columns
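To make the ID and parent handling in `df_from_csv` easier to follow, here is a minimal sketch of the same normalization on a two-row toy frame. The rows are made up for illustration (they are not read from clo.csv.gz), and `iri_to_curie` and `parse_parents` are hypothetical helper names that do not appear in the function above.

```python
import pandas as pd

OBO = "http://purl.obolibrary.org/obo/"


def iri_to_curie(iri: str) -> str:
    """Turn ".../obo/CLO_0000001" into "CLO:0000001"."""
    return iri.replace(OBO, "").replace("_", ":")


def parse_parents(value, prefix):
    """Split a "|"-separated parent string and keep IDs with the given prefix."""
    if not isinstance(value, str):  # a missing value (NaN) means no parents
        return []
    ids = (iri_to_curie(i) for i in value.split("|"))
    return [i for i in ids if i.startswith(prefix)]


# Toy input: the second parent of the first row comes from another ontology
toy = pd.DataFrame(
    {
        "Class ID": [OBO + "CLO_0000002", OBO + "CLO_0000001"],
        "Parents": [OBO + "CLO_0000000|" + OBO + "BFO_0000015", float("nan")],
    }
)
toy["ontology_id"] = toy["Class ID"].map(iri_to_curie)
toy["parents"] = [parse_parents(p, "CLO") for p in toy["Parents"]]
print(toy[["ontology_id", "parents"]])
```

Parents from other ontologies are filtered out, which is why some rows in the table above show an empty list.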

df.loc["CLO:0007050"]

name                                          K 562 cell
definition            disease: leukemia, chronic myeloid
synonyms      K-562|KO|GM05372E|K.562|K562|GM05372|K 562
parents                                    [CLO:0000511]
Name: CLO:0007050, dtype: object
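Because synonyms are stored as one `|`-separated string, finding a record by an alternative name takes an extra step. The sketch below shows one possible way to build a synonym-to-ID lookup; the toy frame uses an abbreviated synonym list, and the `lookup` dictionary is an illustration rather than part of this notebook.

```python
import pandas as pd

# Toy frame shaped like the table above (synonym list abbreviated)
toy = pd.DataFrame(
    {
        "name": ["K 562 cell", "cell line cell"],
        "synonyms": ["K-562|K562|K 562", float("nan")],
    },
    index=pd.Index(["CLO:0007050", "CLO:0000001"], name="ontology_id"),
)

# One row per synonym; records without synonyms are dropped
syn = toy["synonyms"].str.split("|").explode().dropna()

# Map each synonym back to the ontology ID it belongs to
lookup = dict(zip(syn.values, syn.index))
print(lookup["K562"])
```

If the same synonym occurs under several IDs, this dictionary keeps only the last one, so a real lookup may need to collect lists of IDs instead.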

# add RPE1, RPE-1, and RPE as synonyms because these names are used often

df.loc["CLO:0004290", "synonyms"] += "|RPE1|RPE-1|RPE"

df.loc["CLO:0004290", "synonyms"]

'hTERT RPE-1|RPE1|RPE-1|RPE'
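Appending with `+=` works here because this record already has a synonym string. Many records have `NaN` synonyms, and rerunning the cell would append the same names twice. A minimal sketch of a more defensive variant follows; `add_synonyms` is a hypothetical helper, not something this notebook defines, and the toy frame stands in for `df`.

```python
import pandas as pd


def add_synonyms(df, ontology_id, new_synonyms):
    """Append synonyms to one record, skipping names that are already present."""
    current = df.loc[ontology_id, "synonyms"]
    # a missing value (NaN) is treated as an empty synonym list
    merged = current.split("|") if isinstance(current, str) else []
    for s in new_synonyms:
        if s not in merged:
            merged.append(s)
    df.loc[ontology_id, "synonyms"] = "|".join(merged)
    return df


# Toy frame standing in for df
toy = pd.DataFrame(
    {"synonyms": ["hTERT RPE-1", float("nan")]},
    index=pd.Index(["CLO:0004290", "CLO:0000001"], name="ontology_id"),
)
add_synonyms(toy, "CLO:0004290", ["RPE1", "RPE-1", "RPE"])
add_synonyms(toy, "CLO:0004290", ["RPE1"])  # second call changes nothing
add_synonyms(toy, "CLO:0000001", ["example"])  # works on a NaN entry too
print(toy)
```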

df.to_parquet("df_all__clo__2022-03-21__CellLine.parquet")
