Annotating Gene Sets with Biological Processes Using an LLM
This notebook shows how to use AnnDictionary’s functions to have an LLM label a set of genes with the biological process they may represent. To demonstrate this functionality, we’ll re-annotate some known gene sets from GO Biological Process (GOBP).
Skip the tutorial
In case you want to skip the tutorial, here’s all the code you need to run this type of annotation.
To annotate a list of genes:
import anndict as adt
gene_list = ['gene1', 'gene2', 'gene3']
annotation = adt.ai_biological_process(gene_list=gene_list)
To annotate the results of rank_genes_groups in an AnnData object:
import anndict as adt
# The results will be returned and also stored in `adata.obs['ai_biological_process']`.
annotation_df = adt.ai_annotate_biological_process(adata, groupby='disease vs. control', n_top_genes=10, new_label_column='ai_biological_process')
Begin the tutorial
import anndict as adt
import gseapy as gp
# Configure LLM backend
adt.configure_llm_backend(
    provider='anthropic',
    model='claude-3-5-sonnet-20240620',
    api_key='my-anthropic-api-key',
    requests_per_minute=100,
)
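Hard-coding an API key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, assuming the key is stored in an environment variable named `ANTHROPIC_API_KEY`, the keyword arguments for the call above could be built at runtime. The helper `build_llm_config` and the variable name are illustrative assumptions, not part of AnnDictionary; the keyword names simply mirror the call shown above.

```python
import os


def build_llm_config(env=None):
    """Build keyword arguments for adt.configure_llm_backend.

    Hypothetical helper: reads the API key from the ANTHROPIC_API_KEY
    environment variable (an assumed name) instead of hard-coding it.
    """
    env = os.environ if env is None else env
    api_key = env.get("ANTHROPIC_API_KEY")
    if not api_key:
        raise RuntimeError("Set ANTHROPIC_API_KEY before configuring the LLM backend.")
    return {
        "provider": "anthropic",
        "model": "claude-3-5-sonnet-20240620",
        "api_key": api_key,
        "requests_per_minute": 100,
    }


# Usage (requires anndict and a real key in the environment):
# adt.configure_llm_backend(**build_llm_config())
```

Passing a plain dictionary as `env` makes the helper easy to check without touching the real environment.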
We’ll use gseapy to access the GOBP database and re-annotate some gene lists whose biological process is already known.
# Download the Human GOBP gene set collection (2023 release on Enrichr)
gobp = gp.get_library(name="GO_Biological_Process_2023", organism="Human") # dict: {term: [genes]}
# Inspect how many gene sets were retrieved
print(f"Total GOBP terms: {len(gobp):,}")
Total GOBP terms: 5,406
# Grab any three terms
terms_of_interest = list(gobp.keys())[:3]
# Build a dictionary of the selected gene sets
selected_gene_sets = {term: gobp[term] for term in terms_of_interest}
# Display the gene symbols for each selected term (truncate for readability)
for term, genes in selected_gene_sets.items():
    print(f"\n{term} ({len(genes)} genes)")
    print(", ".join(genes[:15]), "...")  # first 15 genes as a preview
'De Novo' AMP Biosynthetic Process (GO:0044208) (6 genes)
ATIC, PAICS, PFAS, ADSS1, ADSS2, GART ...
'De Novo' Post-Translational Protein Folding (GO:0051084) (32 genes)
SDF2L1, HSPA9, CCT2, HSPA6, ST13, ENTPD5, HSPA1L, HSPA5, PTGES3, HSPA8, HSPA7, DNAJB13, HSPA2, DNAJB14, HSPE1 ...
2-Oxoglutarate Metabolic Process (GO:0006103) (14 genes)
IDH1, PHYH, GOT2, MRPS36, GOT1, IDH2, ADHFE1, GPT2, TAT, DLST, OGDHL, L2HGDH, D2HGDH, OGDH ...
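Taking the first three keys is an arbitrary choice. To pick gene sets by keyword and size instead, one option is a small filter over the `{term: [genes]}` dictionary. This is a sketch only: the helper name `select_gene_sets` and its default size limits are assumptions made for illustration, and the toy library in the usage example stands in for `gobp`.

```python
def select_gene_sets(library, keyword, min_genes=5, max_genes=50, limit=3):
    """Return up to `limit` gene sets whose term name contains `keyword`.

    Hypothetical helper. `library` is a {term: [genes]} dictionary such as
    the one returned for GOBP above; matching is case-insensitive.
    """
    selected = {}
    for term, genes in library.items():
        if len(selected) >= limit:
            break
        name_matches = keyword.lower() in term.lower()
        size_ok = min_genes <= len(genes) <= max_genes
        if name_matches and size_ok:
            selected[term] = genes
    return selected


# Toy library standing in for `gobp`
toy_library = {
    "Protein Folding (GO:0000001)": ["A", "B", "C", "D", "E"],
    "Protein Import (GO:0000002)": ["A", "B"],
    "Lipid Transport (GO:0000003)": ["F", "G", "H", "I", "J", "K"],
}
print(select_gene_sets(toy_library, "protein"))
```

With the real library, a call such as `select_gene_sets(gobp, "metabolic process")` would replace the `list(gobp.keys())[:3]` step.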
# Now, let's annotate these known gene sets with an LLM
llm_annotation = {}
for term, genes in selected_gene_sets.items():
    llm_annotation[term] = adt.ai_biological_process(gene_list=genes)
/Users/geocr/repos-local/anndict/anndict/llm/base_llm_initializer.py:49: LangChainBetaWarning: Introduced in 0.2.24. API subject to change.
return InMemoryRateLimiter(
/Users/geocr/repos-local/anndict/anndict/llm/llm_manager.py:309: LangChainDeprecationWarning: The method `BaseChatModel.__call__` was deprecated in langchain-core 0.1.7 and will be removed in 1.0. Use invoke instead.
response = llm(langchain_messages, **kwargs)
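When annotating many gene sets, a single failed request (for example, a rate limit or network error) would abort the loop above and discard the labels collected so far. The sketch below wraps the same loop so that failures are recorded per term. The helper `annotate_gene_sets` is hypothetical; it assumes only that the annotation function accepts a `gene_list` keyword, as `adt.ai_biological_process` does in the cell above.

```python
def annotate_gene_sets(gene_sets, annotate_fn):
    """Label each gene set with annotate_fn, collecting failures separately.

    Hypothetical helper. Returns (labels, errors), where `errors` maps each
    failed term to a string describing the exception.
    """
    labels, errors = {}, {}
    for term, genes in gene_sets.items():
        try:
            labels[term] = annotate_fn(gene_list=genes)
        except Exception as exc:  # keep going if one request fails
            errors[term] = repr(exc)
    return labels, errors


# Usage with AnnDictionary (requires a configured LLM backend):
# llm_annotation, failed = annotate_gene_sets(selected_gene_sets, adt.ai_biological_process)
```

Because the annotation function is passed in as an argument, the loop can be tried out with a stand-in function before spending API calls.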
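Now we can view the results. The keys are the GOBP labels of each pathway, and the values are the LLM-derived labels for the same gene sets. In these three examples, the LLM labels agree well with the GOBP terms, although they are somewhat broader (for instance, “Purine biosynthesis pathway” for the de novo AMP biosynthesis gene set).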
# View the results
llm_annotation
{"'De Novo' AMP Biosynthetic Process (GO:0044208)": 'Purine biosynthesis pathway',
"'De Novo' Post-Translational Protein Folding (GO:0051084)": 'Protein folding and chaperone-mediated quality control.',
'2-Oxoglutarate Metabolic Process (GO:0006103)': 'Mitochondrial tricarboxylic acid (TCA) cycle and related metabolic pathways.'}
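To review agreement term by term, it can help to lay the GOBP label and the LLM label side by side. The following is a minimal sketch using only the standard library. The helper name `comparison_rows` is hypothetical, and the example dictionary copies the results shown above rather than calling an LLM.

```python
import re

# Enrichr-style GOBP keys look like "Term Name (GO:0044208)".
TERM_PATTERN = re.compile(r"^(.*)\s+\((GO:\d+)\)$")


def comparison_rows(annotations):
    """Return (go_id, go_name, llm_label) tuples for manual review.

    Hypothetical helper. Keys without a trailing GO ID are kept whole with
    an empty go_id; trailing periods are stripped from the LLM labels.
    """
    rows = []
    for term, label in annotations.items():
        match = TERM_PATTERN.match(term)
        if match:
            go_name, go_id = match.group(1), match.group(2)
        else:
            go_name, go_id = term, ""
        rows.append((go_id, go_name, label.rstrip(".")))
    return rows


# Results copied from the output above
example = {
    "'De Novo' AMP Biosynthetic Process (GO:0044208)": "Purine biosynthesis pathway",
    "'De Novo' Post-Translational Protein Folding (GO:0051084)": "Protein folding and chaperone-mediated quality control.",
    "2-Oxoglutarate Metabolic Process (GO:0006103)": "Mitochondrial tricarboxylic acid (TCA) cycle and related metabolic pathways.",
}
for go_id, go_name, llm_label in comparison_rows(example):
    print(f"{go_id}\t{go_name}\t{llm_label}")
```

The rows are plain tuples, so they can be passed to a DataFrame constructor or written to a file for curation.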