# Manipulate the Hierarchy of an AdataDict
This tutorial demonstrates how to manipulate the hierarchy of an AdataDict.

Several operations are predefined, including:
1. Set the hierarchy
2. Add a stratification
3. Flatten the hierarchy

## Set the hierarchy
An AdataDict is a dictionary of AnnData objects that can be nested.

The [`.hierarchy`](https://ggit12.github.io/anndictionary/api/adata_dict/adata_dict.html#anndict.AdataDict.hierarchy) attribute describes the current nesting structure of the object.

The [`.set_hierarchy()`](https://ggit12.github.io/anndictionary/api/adata_dict/generated/anndict.AdataDict.set_hierarchy.html#anndict.AdataDict.set_hierarchy) method manipulates the nesting structure (aka the hierarchy).


In [1]:
import anndict as adt
import scanpy as sc

# Load an example dataset from scanpy
adata = sc.datasets.pbmc68k_reduced()

# Rename obs column (for legibility)
adata.obs['cell_type'] = adata.obs['bulk_labels']


# Build the AdataDict.
# 'cell_type' and 'phase' are used because they're already in .obs; you could use any categorical columns.
adata_dict = adt.build_adata_dict(adata, strata_keys=['cell_type', 'phase'])

`adata_dict` is now a flat dictionary: each key is a tuple, and each value is an AnnData containing a single cell type-phase group.

We can tell the nesting structure is flat by looking at the `.hierarchy` attribute:

In [2]:
adata_dict.hierarchy

('cell_type', 'phase')

And here's what the actual object looks like:

In [3]:
adata_dict

{('CD4+/CD25 T Reg',
  'G1'): AnnData object with n_obs × n_vars = 42 × 765
     obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain', 'cell_type'
     var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
     uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
     obsm: 'X_pca', 'X_umap'
     varm: 'PCs'
     obsp: 'distances', 'connectivities',
 ('CD4+/CD25 T Reg',
  'G2M'): AnnData object with n_obs × n_vars = 10 × 765
     obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain', 'cell_type'
     var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
     uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
     obsm: 'X_pca', 'X_umap'
     varm: 'PCs'
     obsp: 'distances', 'connectivities',
 ('CD4+/CD25 T Reg',
  'S'): AnnData object with n_obs × n_va
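
To make the flat, tuple-keyed layout concrete, here is a minimal plain-Python sketch of grouping records by several keys. It is an illustration only, not anndict's implementation: `group_by_strata` and the toy `cells` records are hypothetical stand-ins for `build_adata_dict` and the rows of `.obs`.

```python
from collections import defaultdict


def group_by_strata(records, strata_keys):
    """Group plain-dict records into a flat dict keyed by tuples of strata values."""
    groups = defaultdict(list)
    for record in records:
        # One tuple element per strata key, in the order the keys were given
        key = tuple(record[name] for name in strata_keys)
        groups[key].append(record)
    return dict(groups)


# Toy records standing in for rows of .obs (hypothetical data)
cells = [
    {"cell_type": "T", "phase": "G1"},
    {"cell_type": "T", "phase": "S"},
    {"cell_type": "B", "phase": "G1"},
    {"cell_type": "T", "phase": "G1"},
]

flat = group_by_strata(cells, ["cell_type", "phase"])
print(sorted(flat))  # [('B', 'G1'), ('T', 'G1'), ('T', 'S')]
```

Each key has one element per strata key, which is why a flat hierarchy such as `('cell_type', 'phase')` corresponds to two-element tuple keys.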

We can rearrange the structure with the `.set_hierarchy()` method. To use it, pass a list nested in a way that reflects your desired hierarchy.
Here is an example; try experimenting with it yourself to see how it works.

### Let's say we want the top level to be cell_type and the second level to be phase

This could be useful for a pipeline that expects its input to be a single cell type, with a separate AnnData for each phase.

In [4]:
adata_dict.set_hierarchy(['cell_type', ['phase']]) # Note the nesting structure of the input list

And here's what the adata_dict looks like now:

In [5]:
adata_dict.hierarchy

('cell_type', ('phase',))

In [6]:
adata_dict 

{('CD4+/CD25 T Reg',): {('G1',): AnnData object with n_obs × n_vars = 42 × 765
      obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain', 'cell_type'
      var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
      uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
      obsm: 'X_pca', 'X_umap'
      varm: 'PCs'
      obsp: 'distances', 'connectivities',
  ('G2M',): AnnData object with n_obs × n_vars = 10 × 765
      obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain', 'cell_type'
      var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
      uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
      obsm: 'X_pca', 'X_umap'
      varm: 'PCs'
      obsp: 'distances', 'connectivities',
  ('S',): AnnData object with n_obs × n_vars = 16 × 765
      obs

Note that `adata_dict[('CD4+/CD25 T Reg',)]` now gives an AdataDict containing all the phases for that cell type (with a separate key-value pair for each phase).

In [7]:
adata_dict[('CD4+/CD25 T Reg',)]

{('G1',): AnnData object with n_obs × n_vars = 42 × 765
     obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain', 'cell_type'
     var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
     uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
     obsm: 'X_pca', 'X_umap'
     varm: 'PCs'
     obsp: 'distances', 'connectivities',
 ('G2M',): AnnData object with n_obs × n_vars = 10 × 765
     obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain', 'cell_type'
     var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
     uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
     obsm: 'X_pca', 'X_umap'
     varm: 'PCs'
     obsp: 'distances', 'connectivities',
 ('S',): AnnData object with n_obs × n_vars = 16 × 765
     obs: 'bulk_labels', 'n_genes', 'percent_m
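
The regrouping shown above can be sketched with plain dictionaries. This is a rough illustration under stated assumptions, not how `.set_hierarchy()` is implemented: the `nest` helper is hypothetical, integers stand in for AnnData objects, and every field of the flat keys is assumed to appear exactly once in the nested list.

```python
def nest(flat, flat_keys, hierarchy):
    """Regroup a flat tuple-keyed dict to match a nested hierarchy list.

    In `hierarchy`, plain strings name the key fields of the current level,
    and a list element describes the next level down,
    e.g. ["cell_type", ["phase"]].
    """
    level = [h for h in hierarchy if isinstance(h, str)]
    deeper = [h for h in hierarchy if isinstance(h, list)]
    positions = [flat_keys.index(name) for name in level]

    if not deeper:
        # Last level: keep only this level's fields in the key
        return {tuple(key[i] for i in positions): value for key, value in flat.items()}

    # Bucket entries by this level's fields, then recurse into each bucket
    buckets = {}
    for key, value in flat.items():
        buckets.setdefault(tuple(key[i] for i in positions), {})[key] = value
    return {top: nest(sub, flat_keys, deeper[0]) for top, sub in buckets.items()}


# Integers stand in for AnnData objects (toy data)
flat = {("T", "G1"): 1, ("T", "S"): 2, ("B", "G1"): 3}
nested = nest(flat, ["cell_type", "phase"], ["cell_type", ["phase"]])
print(nested[("T",)])  # {('G1',): 1, ('S',): 2}
```

As in the real object, indexing the top level with a one-element tuple returns the inner dictionary for that cell type.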

Note the nested AdataDicts, where only terminal nodes are AnnData:

In [11]:
adata_dict.__class__

anndict.adata_dict.adata_dict.AdataDict

In [16]:
adata_dict[('CD4+/CD25 T Reg',)].__class__

anndict.adata_dict.adata_dict.AdataDict

In [15]:
adata_dict[('CD4+/CD25 T Reg',)][('G1',)].__class__ # Only the terminal nodes are AnnData

anndata._core.anndata.AnnData

Although we show the type of each level of the dictionary here, in practice you can generally ignore it. The main method of an AdataDict ([`.fapply()`](https://ggit12.github.io/anndictionary/api/adata_dict/generated/anndict.AdataDict.fapply.html#anndict.AdataDict.fapply), covered in the "Iterate Over AdataDict" tutorial) is meant to abstract this away and automatically handles looping over the nested structure of the AdataDict.
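
To give a feel for what looping over a nested structure involves, here is a small plain-Python sketch that applies a function to every terminal value of a nested dictionary. It only illustrates the idea: `apply_to_leaves` is a hypothetical helper and does not reflect the signature or implementation of `.fapply()`. The integers are the cell counts from the output above, standing in for AnnData objects.

```python
def apply_to_leaves(nested, func):
    """Apply func to every non-dict value, preserving the nesting structure."""
    return {
        key: apply_to_leaves(value, func) if isinstance(value, dict) else func(value)
        for key, value in nested.items()
    }


# Cell counts standing in for the AnnData at each terminal node
counts = {("CD4+/CD25 T Reg",): {("G1",): 42, ("G2M",): 10, ("S",): 16}}

doubled = apply_to_leaves(counts, lambda n: n * 2)
print(doubled[("CD4+/CD25 T Reg",)][("G1",)])  # 84
```

The caller never writes a loop per level; the recursion follows whatever depth the dictionary happens to have.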

## Add a stratification
Now, let's say we also want to group by another variable (like `tissue` or `is_control`). Here, we'll use the `louvain` column because it's already in the dataset.

We can add this as a grouping variable with the [`.add_stratification()`](https://ggit12.github.io/anndictionary/api/adata_dict/generated/anndict.AdataDict.add_stratification.html#anndict.AdataDict.add_stratification) method as follows. This will set 'louvain' as the top-level key and nest the existing structure below that.

In [None]:
adata_dict.add_stratification(['louvain'])

Note the new hierarchy:

In [18]:
adata_dict.hierarchy

('louvain', ('cell_type', ('phase',)))

In [19]:
adata_dict

{('0',): {('CD4+/CD25 T Reg',): {('G1',): AnnData object with n_obs × n_vars = 41 × 765
       obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain', 'cell_type'
       var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
       uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
       obsm: 'X_pca', 'X_umap'
       varm: 'PCs'
       obsp: 'distances', 'connectivities',
   ('G2M',): AnnData object with n_obs × n_vars = 1 × 765
       obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain', 'cell_type'
       var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
       uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
       obsm: 'X_pca', 'X_umap'
       varm: 'PCs'
       obsp: 'distances', 'connectivities',
   ('S',): AnnData object with n_obs × n_var

Note that this AdataDict has three levels of nesting, so to access an actual AnnData, you need to index once per level:

In [20]:
adata_dict[('0',)][('CD4+/CD25 T Reg',)][('S',)]

AnnData object with n_obs × n_vars = 12 × 765
    obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain', 'cell_type'
    var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
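
The effect of adding a top-level grouping can be sketched with plain dictionaries whose terminal values are lists of records. This is an illustration under assumptions, not how `.add_stratification()` works internally: `add_top_level` is a hypothetical helper, and dropping groups that end up empty is a choice made for this sketch.

```python
def add_top_level(nested, label_of):
    """Split every leaf list by label_of(item) and use the labels as new top-level keys."""

    def labels(node):
        # Collect every label that occurs anywhere below this node
        if isinstance(node, dict):
            return set().union(*(labels(child) for child in node.values()))
        return {label_of(item) for item in node}

    def keep(node, label):
        # Keep only the items carrying `label`, dropping branches left empty
        if isinstance(node, dict):
            kept = {key: keep(child, label) for key, child in node.items()}
            return {key: child for key, child in kept.items() if child}
        return [item for item in node if label_of(item) == label]

    return {(label,): keep(nested, label) for label in sorted(labels(nested))}


# Toy records standing in for cells; 'louvain' is the new grouping variable
nested = {
    ("T",): {
        ("G1",): [{"louvain": "0"}, {"louvain": "1"}],
        ("S",): [{"louvain": "0"}],
    }
}

stratified = add_top_level(nested, lambda cell: cell["louvain"])
print(sorted(stratified))  # [('0',), ('1',)]
```

The existing `cell_type` and `phase` levels are preserved unchanged beneath each new top-level key, which mirrors the `('louvain', ('cell_type', ('phase',)))` hierarchy shown above.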

## Flatten the hierarchy

Sometimes it's useful to flatten the adata_dict (i.e. remove the nesting). This can be done with the [`.flatten()`](https://ggit12.github.io/anndictionary/api/adata_dict/generated/anndict.AdataDict.flatten.html) method:

In [21]:
adata_dict.flatten()

Note that the hierarchy is now flat:

In [22]:
adata_dict.hierarchy

('louvain', 'cell_type', 'phase')

In [23]:
adata_dict

{('0',
  'CD4+/CD25 T Reg',
  'G1'): AnnData object with n_obs × n_vars = 41 × 765
     obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain', 'cell_type'
     var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
     uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
     obsm: 'X_pca', 'X_umap'
     varm: 'PCs'
     obsp: 'distances', 'connectivities',
 ('0',
  'CD4+/CD25 T Reg',
  'G2M'): AnnData object with n_obs × n_vars = 1 × 765
     obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain', 'cell_type'
     var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
     uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
     obsm: 'X_pca', 'X_umap'
     varm: 'PCs'
     obsp: 'distances', 'connectivities',
 ('0',
  'CD4+/CD25 T Reg',
  'S'): AnnData obje

In [24]:
adata_dict[('0', 'CD4+/CD25 T Reg', 'G1')] # Each AnnData is indexed by a tuple of strings

AnnData object with n_obs × n_vars = 41 × 765
    obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain', 'cell_type'
    var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
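
Flattening can be pictured as concatenating the one-element keys along each path from the top level to a terminal node. The sketch below shows this with plain dictionaries. It is not anndict's implementation: `flatten_nested` is a hypothetical helper that returns a new dictionary, whereas the `.flatten()` call above modifies `adata_dict` in place. The integers are the cell counts from the output above, standing in for AnnData objects.

```python
def flatten_nested(nested, prefix=()):
    """Collapse nested tuple-keyed dicts into one flat dict with concatenated keys."""
    flat = {}
    for key, value in nested.items():
        full_key = prefix + key  # tuple concatenation
        if isinstance(value, dict):
            flat.update(flatten_nested(value, full_key))
        else:
            flat[full_key] = value
    return flat


# Cell counts standing in for the AnnData at each terminal node
nested = {("0",): {("CD4+/CD25 T Reg",): {("G1",): 41, ("G2M",): 1, ("S",): 12}}}

flat = flatten_nested(nested)
print(flat[("0", "CD4+/CD25 T Reg", "G1")])  # 41
```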