UC-6.2 — Biological Interaction Flow

Module: 6 – Hierarchical and Flow-based Functional Analysis
Visualization type: Three-stage alluvial / Sankey diagram
Primary inputs: BioRemPP results table with the columns sample, compoundclass, and enzyme_activity
Primary outputs: Multi-stage flow network from samples → compound classes → enzyme activities

Scientific Question and Rationale

Question: How are each sample's co-annotations distributed across chemical classes, and which enzymatic activities are most frequently co-annotated with each class?

This use case traces how each biological sample's annotations are distributed across chemical classes and enzyme activities. A three-stage alluvial diagram shows which compound classes are most frequently co-annotated with each sample and which enzymatic functions appear most often in those contexts. The thickness of each flow encodes how frequently a given combination occurs, giving a quantitative view of broad versus narrow annotation profiles and of enzyme annotation breadth within the dataset.

Data and Inputs

Primary data source: BioRemPP_Results.xlsx or BioRemPP_Results.csv
Key columns:
sample – identifier for each biological sample
compoundclass – chemical class/category of the compound
enzyme_activity – functional label for the enzymatic activity (e.g., monooxygenase, dehydrogenase)
Accepted format: semicolon-delimited text table (.txt or .csv)
Conceptual flow (stages):
Sample (sample)
Compound Class (compoundclass)
Enzyme Activity (enzyme_activity)
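
As a minimal sketch of the input contract described above, the following reads a semicolon-delimited table and checks for the three key columns using only the Python standard library. The function name `load_rows`, the inline example data, and the choice to drop rows with an empty stage are assumptions made for illustration; they are not taken from the BioRemPP implementation.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample", "compoundclass", "enzyme_activity")


def load_rows(handle):
    """Read a semicolon-delimited table and return its rows as dictionaries."""
    reader = csv.DictReader(handle, delimiter=";")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    # Assumption: rows with an empty stage are dropped, so that every kept
    # row defines a complete sample -> compoundclass -> enzyme_activity path.
    return [row for row in reader if all(row[col] for col in REQUIRED_COLUMNS)]


# Hypothetical inline data standing in for BioRemPP_Results.csv.
example = io.StringIO(
    "sample;compoundclass;enzyme_activity\n"
    "S1;Aromatic;monooxygenase\n"
    "S1;Aromatic;monooxygenase\n"
    "S2;Aliphatic;dehydrogenase\n"
    "S2;Aliphatic;\n"
)
rows = load_rows(example)
print(len(rows), "complete rows loaded")
```

For a file on disk, the same function could be called as `load_rows(open(path, newline="", encoding="utf-8"))`, where `path` is whatever location the export was saved to.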

Analytical Workflow

Data Loading
The primary results table (BioRemPP_Results.xlsx or BioRemPP_Results.csv) is loaded; text exports (.txt or .csv) are parsed as semicolon-delimited.
Path Definition
A three-stage path is defined for each row using:
sample → compoundclass → enzyme_activity.
Each complete combination represents a co-annotation flow from a given sample, through a chemical class, to a specific enzymatic function.
Aggregation of Flows
The data is grouped by each unique three-step path:
for every unique (sample, compoundclass, enzyme_activity) combination,
the number of occurrences is counted.
This count becomes the flow value that determines ribbon thickness.
Link Construction for Sankey / Alluvial Diagram
The aggregated paths are transformed into a set of linked pairs suitable for a Sankey diagram:
Stage 1 → Stage 2: sample → compoundclass
Stage 2 → Stage 3: compoundclass → enzyme_activity
Node indices and link values are encoded in the input format required by the plotting library.
Rendering
The data is rendered as an interactive alluvial (Sankey) diagram:
three vertical columns represent the stages,
nodes within each column represent unique entities (samples, compound classes, enzyme activities),
ribbons connecting them represent flows weighted by their aggregated counts.
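
The aggregation and link-construction steps above can be sketched as follows, using only the standard library. The example rows are hypothetical, and the final dictionary layout (node labels plus parallel `source`, `target`, and `value` lists) assumes a Plotly-style Sankey input; the source does not name the plotting library, so treat that layout as an assumption.

```python
from collections import Counter

# Hypothetical rows as (sample, compoundclass, enzyme_activity) tuples;
# in practice these would come from the results table.
rows = [
    ("S1", "Aromatic", "monooxygenase"),
    ("S1", "Aromatic", "monooxygenase"),
    ("S1", "Aliphatic", "dehydrogenase"),
    ("S2", "Aromatic", "dehydrogenase"),
]

# Aggregation of flows: count how often each three-step path occurs.
path_counts = Counter(rows)

# Link construction: collapse the paths into the two adjacent-stage link sets.
stage12 = Counter()  # sample -> compoundclass
stage23 = Counter()  # compoundclass -> enzyme_activity
for (sample, cls, activity), n in path_counts.items():
    stage12[(sample, cls)] += n
    stage23[(cls, activity)] += n

# Nodes are keyed by (stage, name) so that an identical label appearing in
# two different stages still maps to two distinct nodes.
nodes = {}


def node_index(stage, name):
    """Return the integer index of a node, assigning the next one if new."""
    return nodes.setdefault((stage, name), len(nodes))


source, target, value = [], [], []
for (sample, cls), n in stage12.items():
    source.append(node_index(0, sample))
    target.append(node_index(1, cls))
    value.append(n)
for (cls, activity), n in stage23.items():
    source.append(node_index(1, cls))
    target.append(node_index(2, activity))
    value.append(n)

sankey_input = {
    "node": {"label": [name for _, name in nodes]},
    "link": {"source": source, "target": target, "value": value},
}
print(sankey_input)
```

One consequence of this construction is worth keeping in mind: once paths are collapsed into pairwise links, a ribbon from a compound class to an enzyme activity sums over all samples, so the diagram alone no longer shows which sample contributed to it. The per-path counts in `path_counts` retain that information.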

How to Read the Plot

Vertical Columns (Stages)
From left to right, the columns represent:
Sample
Compound Class
Enzyme Activity
Nodes within Columns
Each node is a unique entity at that stage:
a specific sample, a specific compound class, or a specific enzyme activity.
Node size (height) is proportional to the total flow entering or leaving that node.
Flows (Ribbons)
The ribbons connecting nodes represent co-annotation flows:
a ribbon from a sample to a compound class shows how often that sample is annotated with that class,
a ribbon from a compound class to an enzyme activity shows how often that class is annotated with that enzymatic function.
Flow Thickness
The thickness of each ribbon is proportional to the number of co-occurrences of that specific path in the data:
thicker ribbons indicate more frequently observed co-annotation combinations.
Interactivity
In the interactive version:
hovering over nodes or flows reveals labels and numeric values (counts),
nodes can be dragged vertically to reduce overlap and improve readability.
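
To make the node-size rule concrete, the sketch below derives node totals from a small set of hypothetical links. Taking the larger of the incoming and outgoing totals is an assumption about how Sankey layouts typically size nodes, and the sketch assumes that node names are unique across stages.

```python
# Hypothetical links as (source_node, target_node, count).
links = [
    ("S1", "Aromatic", 2),
    ("S1", "Aliphatic", 1),
    ("S2", "Aromatic", 1),
    ("Aromatic", "monooxygenase", 2),
    ("Aromatic", "dehydrogenase", 1),
    ("Aliphatic", "dehydrogenase", 1),
]

inflow = {}
outflow = {}
for src, dst, count in links:
    outflow[src] = outflow.get(src, 0) + count
    inflow[dst] = inflow.get(dst, 0) + count

# Assumed sizing rule: the larger of total incoming and total outgoing flow.
node_size = {
    name: max(inflow.get(name, 0), outflow.get(name, 0))
    for name in set(inflow) | set(outflow)
}

for name in sorted(node_size):
    print(name, node_size[name])
```

Because every row contributes one unit to both of its links, each compound class node has equal incoming and outgoing totals (here, 3 in and 3 out for `Aromatic`), so the middle column is sized consistently from either side.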

Representative Output

The image below illustrates a representative output generated by this use case using the example dataset.


Interpretation and Key Messages

Dominant Co-annotation Patterns per Sample
The thickest flows emerging from a given sample may highlight its primary co-annotation patterns:
for example, a strong flow from a sample to an aromatic compound class and then to a monooxygenase activity may suggest that these annotations co-occur frequently in that sample's data.
Broad vs. Narrow Annotation Profiles
A sample dominated by a few very thick flows (one or two classes and enzyme activities) shows narrower co-annotation coverage concentrated in specific categories.
A sample whose flows branch out across many compound classes and enzyme activities shows broader co-annotation coverage across diverse chemical and enzymatic categories.
Broadly Co-annotated Enzyme Activities
Enzyme activity nodes that receive flows from many different compound classes, or from many samples, may indicate broadly co-annotated enzymatic functions that appear across diverse annotation contexts.
These broadly annotated activities may be candidates for deeper mechanistic investigation.
Chemical Class Co-annotation Profiles
The intermediate column for compound classes may reveal:
which classes are most frequently co-annotated in the dataset,
and which enzymatic activities are most often co-annotated with each class.
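
The notions of broad versus narrow profiles can also be checked numerically. The sketch below counts distinct compound classes per sample and distinct classes and samples per enzyme activity; the example rows and the variable names are hypothetical, and "breadth as a count of distinct partners" is one simple reading of the text above rather than a metric defined by the source.

```python
from collections import defaultdict

# Hypothetical rows as (sample, compoundclass, enzyme_activity) tuples.
rows = [
    ("S1", "Aromatic", "monooxygenase"),
    ("S1", "Aromatic", "monooxygenase"),
    ("S1", "Aliphatic", "dehydrogenase"),
    ("S2", "Aromatic", "dehydrogenase"),
]

classes_per_sample = defaultdict(set)
classes_per_activity = defaultdict(set)
samples_per_activity = defaultdict(set)
for sample, cls, activity in rows:
    classes_per_sample[sample].add(cls)
    classes_per_activity[activity].add(cls)
    samples_per_activity[activity].add(sample)

# Breadth of a sample: number of distinct compound classes it flows into.
sample_breadth = {s: len(c) for s, c in classes_per_sample.items()}
# Breadth of an activity: distinct classes and distinct samples feeding it.
activity_class_breadth = {a: len(c) for a, c in classes_per_activity.items()}
activity_sample_breadth = {a: len(s) for a, s in samples_per_activity.items()}

print(sample_breadth)
print(activity_class_breadth)
print(activity_sample_breadth)
```

In this toy data, `S1` spans two classes while `S2` spans one, and `dehydrogenase` is reached from two classes and two samples, which matches what a visually "branching" node would look like in the diagram.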

Reproducibility and Assumptions

Input Format
The analysis assumes a semicolon-delimited table containing at least the columns:
sample, compoundclass, and enzyme_activity.
Flow Definition
Each row of the table contributes a unit count to the path defined by its (sample, compoundclass, enzyme_activity) combination.
The strength of a flow is defined as the total count of co-occurrences for that path in the raw data.
Scope and Limitations
The alluvial diagram encodes frequency of observed co-annotation combinations, not reaction rates, kinetic efficiencies, or thermodynamic feasibility.
It should be interpreted as a structural and comparative overview of which annotation paths are most frequently represented, and not as a direct measure of in situ activity or performance.

Activity diagram of the use case
