Annotating Gene Sets with Biological Processes Using an LLM

This notebook shows how to use AnnDictionary’s functions to have an LLM label a set of genes with the biological process it may represent. To demonstrate this functionality, we’ll re-annotate some known gene sets from GO Biological Process (GOBP).

Skip the tutorial

In case you want to skip the tutorial, here’s all the code you need to run this type of annotation.

To annotate a list of genes:

import anndict as adt
gene_list = ['gene1', 'gene2', 'gene3']
annotation = adt.ai_biological_process(gene_list=gene_list)

To annotate the results of rank_genes_groups in an AnnData object:

import anndict as adt

# The results are returned and also stored in `adata.obs['ai_biological_process']`.
annotation_df = adt.ai_annotate_biological_process(adata, groupby='disease vs. control', n_top_genes=10, new_label_column='ai_biological_process')

Begin the tutorial

import anndict as adt
import gseapy as gp

# Configure LLM backend
adt.configure_llm_backend(provider='anthropic',
                          model='claude-3-5-sonnet-20240620',
                          api_key='my-anthropic-api-key',
                          requests_per_minute=100
                          )
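The configuration above passes the API key as a literal string. One common alternative is to read the key from an environment variable so it never appears in the notebook. The sketch below is a minimal illustration of that idea; the helper name `read_api_key` and the variable name `ANTHROPIC_API_KEY` are assumptions made for this example and are not part of AnnDictionary.

```python
import os


def read_api_key(env_var: str = "ANTHROPIC_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise a clear error.

    `read_api_key` and the default variable name are hypothetical choices
    for this sketch, not something AnnDictionary provides.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before configuring the LLM backend."
        )
    return key


# Usage (commented out so the sketch runs without anndict installed or a key set):
# adt.configure_llm_backend(provider='anthropic',
#                           model='claude-3-5-sonnet-20240620',
#                           api_key=read_api_key(),
#                           requests_per_minute=100)
```

Failing early with an explicit message makes a missing key easier to diagnose than an authentication error raised later by the provider.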

We’ll use gseapy to access the GOBP database and re-annotate a few gene lists whose biological process is already known.

# Download the latest Human GOBP gene set collection (2023 release on Enrichr)
gobp = gp.get_library(name="GO_Biological_Process_2023", organism="Human")  # dict: {term: [genes]}

# Inspect how many gene sets were retrieved
print(f"Total GOBP terms: {len(gobp):,}")

Total GOBP terms: 5,406

# Grab any three terms
terms_of_interest = list(gobp.keys())[:3]

# Build a dictionary of the selected gene sets
selected_gene_sets = {term: gobp[term] for term in terms_of_interest}

# Display the gene symbols for each selected term (truncate for readability)
for term, genes in selected_gene_sets.items():
    print(f"\n{term} ({len(genes)} genes)")
    print(", ".join(genes[:15]), "...")       # first 15 genes as a preview

'De Novo' AMP Biosynthetic Process (GO:0044208) (6 genes)
ATIC, PAICS, PFAS, ADSS1, ADSS2, GART ...

'De Novo' Post-Translational Protein Folding (GO:0051084) (32 genes)
SDF2L1, HSPA9, CCT2, HSPA6, ST13, ENTPD5, HSPA1L, HSPA5, PTGES3, HSPA8, HSPA7, DNAJB13, HSPA2, DNAJB14, HSPE1 ...

2-Oxoglutarate Metabolic Process (GO:0006103) (14 genes)
IDH1, PHYH, GOT2, MRPS36, GOT1, IDH2, ADHFE1, GPT2, TAT, DLST, OGDHL, L2HGDH, D2HGDH, OGDH ...
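As the printed terms show, each key in this library combines a process name with a GO identifier in parentheses. If the two parts are needed separately (for example, to display only the name), they can be split with a regular expression. This is a minimal sketch that assumes every key follows the `Name (GO:1234567)` pattern seen above; `split_go_term` is a hypothetical helper, not a gseapy or AnnDictionary function.

```python
import re
from typing import Optional, Tuple

# Pattern for keys shaped like "Name (GO:0044208)", as seen in the output above.
_GO_TERM = re.compile(r"^(?P<name>.*?)\s*\((?P<go_id>GO:\d{7})\)\s*$")


def split_go_term(term: str) -> Tuple[str, Optional[str]]:
    """Split a term key into (name, GO ID); the ID is None if no match is found."""
    match = _GO_TERM.match(term)
    if match is None:
        return term.strip(), None
    return match.group("name"), match.group("go_id")


name, go_id = split_go_term("2-Oxoglutarate Metabolic Process (GO:0006103)")
print(name, go_id)
```

Returning `None` for the ID instead of raising keeps the helper usable on libraries whose keys do not carry an identifier.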

# Now, let's annotate these known gene sets with an LLM
llm_annotation = {}
for term, genes in selected_gene_sets.items():
    llm_annotation[term] = adt.ai_biological_process(gene_list=genes)

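Each iteration of the annotation loop makes a network call to the LLM provider, so a single failed request would stop the whole loop. The sketch below shows one way to keep going and record failures per term. It takes the annotation function as an argument so it can be tried without an API key; `annotate_gene_sets` is a hypothetical helper written for this example.

```python
from typing import Callable, Dict, List, Tuple


def annotate_gene_sets(
    gene_sets: Dict[str, List[str]],
    annotate: Callable[[List[str]], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Apply `annotate` to each gene list, collecting labels and errors separately."""
    labels: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for term, genes in gene_sets.items():
        try:
            labels[term] = annotate(genes)
        except Exception as exc:  # keep going; record what failed and why
            errors[term] = repr(exc)
    return labels, errors


# Usage with the objects from this tutorial (commented out: needs a configured backend):
# llm_annotation, failed = annotate_gene_sets(
#     selected_gene_sets,
#     lambda genes: adt.ai_biological_process(gene_list=genes),
# )
```

Terms listed in the second dictionary can then be retried on their own rather than rerunning every gene set.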

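Now we can view the results. The keys are the GOBP labels of the gene sets, and the values are the LLM-derived labels for the same gene sets. In these three examples, the LLM labels agree well with the GOBP terms, although they tend to be broader: for instance, the de novo AMP biosynthesis set is labeled as purine biosynthesis.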

# View the results
llm_annotation

{"'De Novo' AMP Biosynthetic Process (GO:0044208)": 'Purine biosynthesis pathway',
 "'De Novo' Post-Translational Protein Folding (GO:0051084)": 'Protein folding and chaperone-mediated quality control.',
 '2-Oxoglutarate Metabolic Process (GO:0006103)': 'Mitochondrial tricarboxylic acid (TCA) cycle and related metabolic pathways.'}
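Agreement between the GOBP term and the LLM label is judged by eye here. For a larger number of gene sets, a rough first-pass screen could be word overlap between the two labels. The sketch below computes a Jaccard index over lowercase word tokens. It is a crude lexical proxy and not something AnnDictionary provides: labels that are biologically equivalent but worded differently (such as "AMP biosynthetic process" and "purine biosynthesis") will score low, so low scores mark pairs to review manually, not disagreements.

```python
import re
from typing import Set


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric word tokens of `text`."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(label_a: str, label_b: str) -> float:
    """Jaccard index of the two labels' token sets (0.0 if both are empty)."""
    a, b = tokens(label_a), tokens(label_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Labels copied from the results above.
results = {
    "'De Novo' Post-Translational Protein Folding (GO:0051084)":
        "Protein folding and chaperone-mediated quality control.",
    "'De Novo' AMP Biosynthetic Process (GO:0044208)":
        "Purine biosynthesis pathway",
}
for term, label in results.items():
    term_name = re.sub(r"\s*\(GO:\d+\)\s*$", "", term)  # drop the GO identifier
    print(f"{jaccard(term_name, label):.2f}  {term_name!r} vs {label!r}")
```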