UC-6.5 — Chemical–Enzymatic Hierarchy

Module: 6 – Hierarchical and Flow-based Functional Analysis
Visualization type: Treemap (three-level hierarchical composition)
Primary inputs: BioRemPP results table with compoundclass, enzyme_activity, genesymbol, and compoundname
Primary outputs: Hierarchical partitioning of substrate scope across chemical classes → enzyme activities → genes


Scientific Question and Rationale

Question: For each class of chemical compounds, which enzymatic functions are most frequently co-annotated, and which specific genes show the broadest compound co-annotation coverage within that class?

This use case provides a chemical-first, top-down view of the compound co-annotation landscape. Starting from broad compound classes, it traces how these classes are co-annotated with different enzyme activities, which are in turn associated with specific genes found in the available biological samples. By quantifying, for each branch, the number of unique compounds involved, the treemap may highlight which combinations of chemical class, enzymatic function, and gene correspond to the broadest compound co-annotation coverage and thus could represent prominent annotation patterns in the dataset.


Data and Inputs

  • Primary data source: BioRemPP_Results.xlsx or BioRemPP_Results.csv
  • Key columns:
      • compoundclass – high-level chemical class or category
      • enzyme_activity – functional label for the enzymatic activity
      • genesymbol – gene symbol or identifier associated with that activity
      • compoundname – specific compound name or identifier
  • Accepted format: semicolon-delimited text table (.txt or .csv)

  • Hierarchical structure (outermost to innermost):
      • Compound Class (compoundclass)
      • Enzyme Activity (enzyme_activity)
      • Gene Symbol (genesymbol)
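
As a minimal sketch of the input contract described above, the following reads a semicolon-delimited table with Python's standard csv module and checks that the four key columns are present. The function name read_results and the sample rows are hypothetical and only illustrate the expected shape; they are not taken from BioRemPP output, and the sketch does not cover the .xlsx variant.

```python
import csv
import io

REQUIRED_COLUMNS = ("compoundclass", "enzyme_activity", "genesymbol", "compoundname")


def read_results(handle):
    """Read a semicolon-delimited table and return its rows as dicts.

    Hypothetical helper: raises ValueError if a key column is missing.
    """
    reader = csv.DictReader(handle, delimiter=";")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return list(reader)


# Hypothetical rows, for illustration only.
sample = io.StringIO(
    "compoundclass;enzyme_activity;genesymbol;compoundname\n"
    "ClassA;activity_1;geneX;compound_1\n"
    "ClassA;activity_1;geneX;compound_2\n"
)
rows = read_results(sample)
```

For a file on disk, the same function could be called on a handle opened with open(path, newline="", encoding="utf-8"); the encoding is an assumption, not something the source specifies.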

Analytical Workflow

  1. Data Loading
    The primary results table (BioRemPP_Results.xlsx or BioRemPP_Results.csv) is loaded from its semicolon-delimited format.

  2. Hierarchy Definition
    A three-level hierarchy is defined:
      • Level 1: compoundclass
      • Level 2: enzyme_activity (nested within each class)
      • Level 3: genesymbol (nested within each enzyme activity)

  3. Aggregation of Compound Co-annotation Breadth
    The data are grouped by each unique (compoundclass, enzyme_activity, genesymbol) path:
      • for each group, the number of distinct compoundname entries is computed (e.g., via nunique()),
      • this count represents the compound co-annotation breadth (number of unique compounds co-annotated) associated with that gene under that class–activity context.

  4. Value Propagation for Treemap
    The unique compound counts at the lowest level (per gene) are used as the basic values:
      • higher-level values for enzyme_activity and compoundclass nodes are obtained by summing the values of all nested nodes,
      • this yields total compound co-annotation breadth at each level of the hierarchy.

  5. Rendering
    The aggregated data is rendered as an interactive treemap:
      • each rectangle represents a node in the hierarchy (compound class, enzyme activity, gene),
      • the area of the rectangle is proportional to its total unique compound count,
      • color is also mapped to the unique compound count to reinforce the visual encoding.
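
The aggregation and value-propagation steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the tool's actual implementation: the records are hypothetical, and sets stand in for the distinct count that nunique() would give on a grouped pandas column.

```python
from collections import defaultdict

# Hypothetical records (class, activity, gene, compound); illustrative only.
records = [
    ("ClassA", "activity_1", "geneX", "compound_1"),
    ("ClassA", "activity_1", "geneX", "compound_2"),
    ("ClassA", "activity_1", "geneX", "compound_2"),  # duplicate row, counted once
    ("ClassA", "activity_1", "geneY", "compound_2"),
    ("ClassA", "activity_2", "geneZ", "compound_3"),
    ("ClassB", "activity_1", "geneX", "compound_4"),
]

# Aggregation: distinct compounds per (class, activity, gene) path.
compounds_by_path = defaultdict(set)
for cls, activity, gene, compound in records:
    compounds_by_path[(cls, activity, gene)].add(compound)
gene_values = {path: len(names) for path, names in compounds_by_path.items()}

# Value propagation: each parent value is the sum of its children's values.
activity_values = defaultdict(int)
class_values = defaultdict(int)
for (cls, activity, _gene), count in gene_values.items():
    activity_values[(cls, activity)] += count
    class_values[cls] += count
```

Here geneX under ClassA / activity_1 gets the value 2, the activity_1 node in ClassA gets 3, and ClassA as a whole gets 4. A treemap library that takes leaf values plus a hierarchy path would typically perform the summation itself, so only gene_values would need to be passed to it; which library the tool uses is not stated in the source.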

How to Read the Plot

  • Nested Rectangles (Hierarchy)
    The treemap uses nested rectangles to represent:
      • outer rectangles: compound classes (compoundclass),
      • within each class, inner rectangles: enzyme activities (enzyme_activity),
      • within each activity, the smallest rectangles: genes (genesymbol).

  • Area (Values)
    The area of each rectangle is proportional to its unique compound count:
      • for a gene node, area reflects how many distinct compounds that gene is co-annotated with within a specific class and activity,
      • for an enzyme activity node, area reflects the sum of the unique compound counts of all genes under that activity in that class,
      • for a compound class node, area reflects the sum of those counts across all activities and genes in the class.

  • Color Encoding
    Rectangle color also encodes the unique compound count:
      • brighter or warmer colors indicate broader compound co-annotation coverage,
      • cooler colors indicate a more limited set of co-annotated compounds.

  • Interactivity
    In the interactive view:
      • clicking on a rectangle zooms in to that part of the hierarchy,
      • hovering shows labels (compound class, enzyme activity, gene) and their associated unique compound counts.

Representative Output

The image below illustrates a representative output generated by this use case using the example dataset.

Representative output for UC-6.5


Interpretation and Key Messages

  • Broadly Co-annotated Compound Classes
    The largest top-level rectangles may identify compound classes that:
      • are associated with the broadest enzymatic and genetic co-annotation coverage,
      • could represent widely annotated compound groups in the dataset.

  • Prominent Enzymatic Co-annotation Patterns within Classes
    Within each compound class, the largest enzyme activity rectangles may highlight:
      • the primary co-annotation patterns observed for that class,
      • for example, whether oxidation, hydrolysis, or other activities are most frequently co-annotated.

  • Broadly Co-annotated Genes
    At the lowest level, large gene rectangles may identify genes with broad co-annotation coverage:
      • genes co-annotated with many distinct compounds within a given class–activity context,
      • candidates for prioritization in further investigation based on their annotation breadth (experimental validation required to confirm functional roles).

  • Comparative Co-annotation Architecture
    Overall, the treemap may reveal:
      • how enzymatic and genetic co-annotations are distributed across chemical classes,
      • which combinations of class, activity, and gene are most prominent in terms of compound co-annotation coverage,
      • where there may be gaps (small or absent rectangles) indicating limited annotation coverage for certain combinations.

Reproducibility and Assumptions

  • Input Format
    The analysis assumes a semicolon-delimited table containing the columns compoundclass, enzyme_activity, genesymbol, and compoundname.

  • Value Definition
      • The fundamental value driving the visualization is the count of unique compound names per (compoundclass, enzyme_activity, genesymbol) group.
      • Higher-level node values are computed as sums of these counts across nested nodes.

  • Interpretation Scope
      • Unique-compound count is used as a measure of compound co-annotation breadth; it does not encode enzyme kinetics, expression levels, or confirmed in situ functional capacity.
      • The treemap should be interpreted as a structural and comparative map of where co-annotation coverage is concentrated within the chemical space of interest, not as direct evidence of enzymatic or degradation capability.
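
One consequence of the sum-based value definition is worth keeping in mind when reading parent nodes: a compound co-annotated with more than one gene is counted once per gene, so a parent node's value can exceed the number of distinct compounds in that branch. The short sketch below illustrates this with hypothetical gene and compound names.

```python
# Hypothetical leaf data: the distinct compounds behind each gene node
# within one (class, activity) branch. Names are illustrative only.
gene_compounds = {
    "geneX": {"compound_1", "compound_2"},
    "geneY": {"compound_2", "compound_3"},
}

# Node value as defined above: sum of the per-gene unique counts.
summed_value = sum(len(names) for names in gene_compounds.values())

# Distinct compounds across the whole branch, for comparison.
distinct_in_branch = len(set().union(*gene_compounds.values()))

print(summed_value, distinct_in_branch)  # 4 3
```

In this example compound_2 contributes to both gene nodes, so the activity node would be drawn with value 4 although only 3 distinct compounds are involved.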

Activity diagram of the use case

Activity diagram of the use case