Iterate Over an AdataDict

This tutorial demonstrates how to iterate over an AdataDict.

Iteration over an AdataDict is handled by the .fapply() method, which wraps the function adata_dict_fapply(). You can also call adata_dict_fapply() directly if you prefer a plain function call.

In general, the way to think about working with an AdataDict is to:

  1. Think about what you want to do in terms of operating on a single AnnData.

  2. Write a function my_func that does those operations.

  3. Apply it to each AnnData in the AdataDict with .fapply(my_func).

Below is an example of this design.

First, build the AdataDict.

import anndict as adt
import scanpy as sc

# Load an anndata
adata = sc.datasets.pbmc3k_processed()

# Rename obs column (for legibility)
adata.obs['cell_type'] = adata.obs['louvain']

# Build an AdataDict from this anndata
adata_dict = adt.build_adata_dict(adata, strata_keys=['cell_type'])

Then, let’s say we have an analysis pipeline called my_analysis_pipeline that we want to run on each cell type separately. Here’s how we’d do that with an AdataDict.

First, define a function that takes adata and runs the analysis pipeline on it. That would look like this:

def my_analysis_pipeline(adata, some_param=0.5, other_param=None, **kwargs):
    """
    This is my analysis pipeline. It takes an anndata and two parameters for processing.
    """

    # First, normalize the adata to 10k reads per cell
    sc.pp.normalize_total(adata, target_sum=1e4)

    # Then, run some function on it
    # (my_custom_func is a placeholder for your own analysis function)
    result_of_my_pipeline = my_custom_func(adata, some_param=some_param)

    return result_of_my_pipeline

This pipeline does the following:

  • normalizes the adata

  • runs some other function that takes the keyword argument some_param

  • returns the result of this function (whatever that result is: a number, a plot, a dataframe, an anndata, etc.)

Then, to run this pipeline on each adata in adata_dict, we’d use .fapply as follows.

This code will run my_analysis_pipeline on each adata in adata_dict, passing some_param=0.5 each time.

#Run the function on each adata in adata_dict
all_results = adata_dict.fapply(my_analysis_pipeline, some_param=0.5)

The return behavior of adata_dict.fapply(func) is governed by the return behavior of func. Calling .fapply returns one of the following:

  • a dictionary with the same structure and keys as adata_dict, containing the return values of func as its values.

  • None if func returns None on each adata in adata_dict

In this case, all_results will be a dictionary of return values because we’ve defined my_analysis_pipeline to return some value.
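The return rule above can be illustrated with a plain-Python stand-in. This is only a sketch and not anndict's implementation: toy_fapply is a hypothetical helper that works on an ordinary flat dict, and plain lists stand in for the AnnData objects.

```python
def toy_fapply(data_dict, func, **kwargs):
    """Hypothetical stand-in for .fapply(): apply func to every value of a flat dict."""
    results = {key: func(value, **kwargs) for key, value in data_dict.items()}
    # Mirror the rule described above: return None when func returned None for every entry
    if all(result is None for result in results.values()):
        return None
    return results


# Plain lists stand in for the AnnData objects of an AdataDict
toy_dict = {("B cells",): [1, 2, 3], ("NK cells",): [4, 5]}


def count_items(values):
    # Returns a value, so toy_fapply returns a dictionary of results
    return len(values)


def modify_in_place(values):
    # Works in place and returns None, so toy_fapply returns None
    values.append(0)


counts = toy_fapply(toy_dict, count_items)
nothing = toy_fapply(toy_dict, modify_in_place)
```

Here counts has the same keys as toy_dict, while nothing is None because modify_in_place only changes its input in place.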

Now, let’s consider a slightly more complicated case. Let’s say we want the value of some_param to be different for each adata in adata_dict. In this case, you can pass a dictionary (with the same keys and structure as adata_dict) to the some_param argument, and .fapply will pass the right value of some_param for each adata.

# We'll need an entry for each cell type:
adata_dict.keys()
dict_keys([('CD4 T cells',), ('CD14+ Monocytes',), ('B cells',), ('CD8 T cells',), ('NK cells',), ('FCGR3A+ Monocytes',), ('Dendritic cells',), ('Megakaryocytes',)])

# Manually define the dictionary to pass to some_param
some_param_dict = {
    ('CD4 T cells',): 0.5,
    ('CD14+ Monocytes',): 0.6,
    ('B cells',): 0.4,
    ('CD8 T cells',): 0.7,
    ('NK cells',): 0.3,
    ('FCGR3A+ Monocytes',): 0.6,
    ('Dendritic cells',): 0.8,
    ('Megakaryocytes',): 0.9
}
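When most groups share the same value, typing out every entry is not strictly necessary. The sketch below builds such a dictionary from a list of keys and then overrides one entry. The names keys and default_param_dict, the shortened key list, and the values 0.5 and 0.3 are illustrative assumptions; in practice the keys would come from adata_dict.keys().

```python
# Stand-in for adata_dict.keys(); shortened to three entries for the example
keys = [("CD4 T cells",), ("B cells",), ("NK cells",)]

# Start every group at a shared default value
default_param_dict = {key: 0.5 for key in keys}

# Override the value for one specific group
default_param_dict[("NK cells",)] = 0.3
```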

Because .fapply returns a dictionary with the same structure as the adata_dict it was called on, we can also use .fapply itself to create the argument dictionary:

def determine_param(adata):
    """
    This function will calculate the argument to pass to some_param.
    """
    param_value = some_func(adata)
    return param_value

some_param_dict = adata_dict.fapply(determine_param) # This will return a dictionary with the correct structure and keys.

Then pass the dictionary argument to .fapply():

all_results = adata_dict.fapply(my_analysis_pipeline, some_param=some_param_dict) # This gives a different value of some_param for each cell type.

.fapply() can handle a mix of global and adata-specific arguments. For example, you can do this:

all_results = adata_dict.fapply(my_analysis_pipeline, some_param=some_param_dict, other_param=0.5)
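To make the distinction between global and adata-specific arguments concrete, here is one plausible way such arguments could be resolved for each key. This is a sketch under stated assumptions, not a description of anndict's internals: resolve_kwargs is a hypothetical helper, and it treats an argument as adata-specific only when it is a dictionary whose keys exactly match the keys being iterated over.

```python
def resolve_kwargs(key, all_keys, kwargs):
    """Pick the per-key value for dict arguments whose keys match all_keys; pass others unchanged."""
    resolved = {}
    for name, value in kwargs.items():
        if isinstance(value, dict) and set(value.keys()) == set(all_keys):
            resolved[name] = value[key]  # adata-specific argument
        else:
            resolved[name] = value  # global argument, same for every key
    return resolved


# Two keys stand in for the keys of an AdataDict
keys = [("B cells",), ("NK cells",)]
some_param_dict = {("B cells",): 0.4, ("NK cells",): 0.3}

# Resolve the arguments that each call of the pipeline function would receive
per_key_kwargs = {
    key: resolve_kwargs(key, keys, {"some_param": some_param_dict, "other_param": 0.5})
    for key in keys
}
```

Each entry of per_key_kwargs holds its own some_param value, while other_param is 0.5 everywhere.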

Finally, you can define func with a parameter called adt_key. When func is passed to .fapply, the key of the adata currently being processed is made available to func through that parameter.

def my_analysis_pipeline_with_adt_key(adata, some_param=0.5, other_param=None, adt_key=None, **kwargs):
    """
    This is my analysis pipeline. It takes an anndata and two parameters for processing and prints the current key being processed.
    """

    print(f"Processing: {adt_key}")
    # First, normalize the adata to 10k reads per cell
    sc.pp.normalize_total(adata, target_sum=1e4)

    # Then, run some function on it
    result_of_my_pipeline = my_custom_func(adata, some_param=some_param)

    return result_of_my_pipeline

all_results = adata_dict.fapply(my_analysis_pipeline_with_adt_key, some_param=some_param_dict, other_param=0.5) # This will now print out each key of adata_dict as it is processed.
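For intuition, the sketch below shows one way a wrapper could hand the current key to functions that ask for it while leaving other functions untouched. It uses the standard-library inspect module. The helper call_with_optional_key and the two example functions are hypothetical, and the mechanism anndict actually uses may differ.

```python
import inspect


def call_with_optional_key(func, value, key, **kwargs):
    """Pass adt_key=key only if func declares a parameter named adt_key."""
    if "adt_key" in inspect.signature(func).parameters:
        kwargs["adt_key"] = key
    return func(value, **kwargs)


def describe(values, adt_key=None):
    # Declares adt_key, so it receives the current key
    return f"{adt_key}: {len(values)} items"


def count_only(values):
    # Does not declare adt_key, so it is called without it
    return len(values)


with_key = call_with_optional_key(describe, [1, 2, 3], ("B cells",))
without_key = call_with_optional_key(count_only, [1, 2, 3], ("B cells",))
```

Checking the signature first means that functions written without adt_key keep working unchanged.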