UC-6.1 — Regulatory-to-Molecular Interaction Flow

Module: 6 – Hierarchical and Flow-based Functional Analysis
Visualization type: Four-stage alluvial / Sankey diagram
Primary inputs: BioRemPP results table with referenceAG, sample, genesymbol, and compoundname
Primary outputs: Multi-stage flow network from regulatory agencies → samples → genes → compounds

Scientific Question and Rationale

Question: How do high-level regulatory contexts flow through specific samples and their gene co-annotations to reach individual chemical compounds?

This use case traces co-annotation paths from environmental or regulatory agencies, through biological samples, down to genes and ultimately compounds. By representing these connections as an alluvial diagram with four ordered stages, the analysis quantifies which regulatory frameworks are most strongly co-annotated with which samples, which genes appear most frequently in those co-annotation contexts, and which chemical compounds emerge as key co-annotation endpoints. The width of each flow encodes how frequently a given path appears, providing a system-level view of how regulatory, sample, gene, and compound annotations are interconnected in the dataset.

Data and Inputs

Primary data source: BioRemPP_Results.xlsx or BioRemPP_Results.csv
Key columns:
referenceAG – regulatory or scientific agency label
sample – identifier for each biological sample
genesymbol – gene symbol or identifier
compoundname – chemical compound name or identifier
Accepted format: semicolon-delimited text table (.txt or .csv)
Conceptual flow (stages):
Regulatory Agency (referenceAG)
Sample (sample)
Gene Symbol (genesymbol)
Compound Name (compoundname)
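
As a minimal sketch of how these inputs might be read, the snippet below parses a semicolon-delimited table with Python's standard `csv` module and yields one four-stage tuple per row. The inline sample content, the `read_paths` helper, and the choice to skip rows with an empty stage are illustrative assumptions, not part of BioRemPP itself.

```python
import csv
import io

# Hypothetical sample content; a real file would be opened with
# open(path, newline="", encoding="utf-8") instead of io.StringIO.
RAW = """referenceAG;sample;genesymbol;compoundname
AgencyA;S1;gene1;cmpdX
AgencyA;S1;gene1;cmpdX
AgencyA;S2;gene2;cmpdY
AgencyB;S2;gene2;cmpdY
"""

# The four key columns, in stage order (left to right in the diagram).
STAGES = ("referenceAG", "sample", "genesymbol", "compoundname")


def read_paths(handle):
    """Yield one (agency, sample, gene, compound) tuple per complete row."""
    reader = csv.DictReader(handle, delimiter=";")
    for row in reader:
        values = tuple((row.get(col) or "").strip() for col in STAGES)
        if all(values):  # assumption: rows with an empty stage are skipped
            yield values


paths = list(read_paths(io.StringIO(RAW)))
print(len(paths), paths[0])
```

Repeated rows are kept on purpose, since the number of repeats is what later determines flow width.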

Analytical Workflow

Data Loading
The primary results table (BioRemPP_Results.xlsx or BioRemPP_Results.csv) is loaded; the text form is read as a semicolon-delimited table.
Path Definition
A four-stage path is defined for each row using:
referenceAG → sample → genesymbol → compoundname.
Each complete combination represents a single regulatory-to-molecular interaction path.
Aggregation of Flows
The data is grouped by each unique four-step path:
for every unique (referenceAG, sample, genesymbol, compoundname) combination,
the number of occurrences is counted.
This count becomes the flow value that determines ribbon thickness.
Link Construction for Sankey / Alluvial Diagram
The aggregated paths are transformed into a set of linked pairs suitable for a Sankey diagram:
Stage 1 → Stage 2: referenceAG → sample
Stage 2 → Stage 3: sample → genesymbol
Stage 3 → Stage 4: genesymbol → compoundname
Node indices and link values are encoded in the format required by the plotting library.
Rendering
The data is rendered as an interactive alluvial (Sankey) diagram:
vertical columns represent the four stages,
nodes within each column represent unique entities at that stage,
ribbons between columns represent aggregated flows weighted by their counts.
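
The aggregation and link-construction steps above can be sketched as follows, using only the Python standard library. The example rows are hypothetical, and keying each node by a `(stage, label)` pair is an assumption made here so that a label appearing in two stages is not merged into one node. The resulting `labels` / `source` / `target` / `value` arrays follow the shape that Sankey plotting APIs such as Plotly's Sankey trace accept; the source does not name the plotting library, so that mapping is also an assumption.

```python
from collections import Counter

# Hypothetical rows, one tuple per table row:
# (referenceAG, sample, genesymbol, compoundname).
rows = [
    ("AgencyA", "S1", "gene1", "cmpdX"),
    ("AgencyA", "S1", "gene1", "cmpdX"),
    ("AgencyA", "S2", "gene2", "cmpdY"),
    ("AgencyB", "S2", "gene2", "cmpdY"),
]

# Aggregation: count occurrences of each unique four-stage path.
path_counts = Counter(rows)

# Link construction: project each path onto its three adjacent stage pairs
# and sum the path counts that share a pair.
link_values = Counter()
for path, count in path_counts.items():
    for stage in range(3):
        source = (stage, path[stage])
        target = (stage + 1, path[stage + 1])
        link_values[(source, target)] += count

# Assign integer indices to nodes, ordered by stage and then by label.
nodes = sorted({node for pair in link_values for node in pair})
index = {node: i for i, node in enumerate(nodes)}

sankey = {
    "labels": [label for _, label in nodes],
    "source": [index[s] for s, _ in link_values],
    "target": [index[t] for _, t in link_values],
    "value": list(link_values.values()),
}
print(sankey)
```

Each row contributes to exactly three links, so the link values always sum to three times the number of rows, which is a convenient sanity check before rendering.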

How to Read the Plot

Vertical Columns (Stages)
From left to right, the four columns represent:
Regulatory Agencies
Samples
Gene Symbols
Compound Names
Nodes within Columns
Each node is a unique entity at that stage:
a specific agency, sample, gene, or compound.
Node size (height) is proportional to the total flow entering or leaving that node.
Flows (Ribbons)
The ribbons connecting nodes represent co-annotation flows:
a ribbon from an agency to a sample may indicate that the sample is co-annotated with compounds monitored by that agency,
a ribbon from a sample to a gene may indicate that the gene is co-annotated with that sample,
a ribbon from a gene to a compound may indicate that the gene is co-annotated with that compound.
Flow Thickness
The thickness of each ribbon is proportional to the aggregated count for the pair of adjacent nodes it connects, that is, the summed counts of every full path that passes through both nodes.
Thicker ribbons indicate relationships that are observed more frequently in the dataset.
Interactivity
In the interactive version:
hovering over nodes or flows displays labels and numeric values (counts),
nodes may be dragged vertically to improve visual separation of overlapping flows.
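
To make the relationship between ribbon thickness and node height concrete, the sketch below totals the flow entering and leaving each node from a small set of hypothetical aggregated links. Sizing a node by the larger of its inflow and outflow is an assumption about how a Sankey layout typically behaves, not a statement about the specific renderer used here.

```python
from collections import defaultdict

# Hypothetical aggregated links: (source node, target node) -> count.
# Nodes are (stage, label) pairs; stage 0 = agency ... stage 3 = compound.
links = {
    ((0, "AgencyA"), (1, "S1")): 2,
    ((0, "AgencyA"), (1, "S2")): 1,
    ((0, "AgencyB"), (1, "S2")): 1,
    ((1, "S1"), (2, "gene1")): 2,
    ((1, "S2"), (2, "gene2")): 2,
    ((2, "gene1"), (3, "cmpdX")): 2,
    ((2, "gene2"), (3, "cmpdY")): 2,
}

# Total flow entering and leaving each node.
inflow = defaultdict(int)
outflow = defaultdict(int)
for (source, target), value in links.items():
    outflow[source] += value
    inflow[target] += value

# Assumed sizing rule: node height follows the larger of inflow and outflow.
node_size = {
    node: max(inflow.get(node, 0), outflow.get(node, 0))
    for node in set(inflow) | set(outflow)
}

# Sample and gene nodes (stages 1 and 2) should be balanced, because every
# row passes through all four stages.
balanced = all(
    inflow[node] == outflow[node] for node in node_size if node[0] in (1, 2)
)
print(node_size[(0, "AgencyA")], balanced)
```

When links are derived from complete four-stage rows, an unbalanced sample or gene node usually points to a problem in link construction rather than to a property of the data.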

Representative Output

The image below illustrates a representative output generated by this use case using the example dataset.


Interpretation and Key Messages

Tracing Dominant Co-annotation Pathways
The thickest ribbons may highlight the most prominent co-annotation paths:
from a given Regulatory Agency through one or more Samples,
via specific Genes,
down to their co-annotated Compounds.
These flows may reveal where regulatory compound lists, sample annotations, and gene co-annotations most strongly overlap.
Identifying Broadly Co-annotated Samples and Genes
A large sample node with many incoming and outgoing flows may indicate a broadly co-annotated sample that connects multiple agencies to multiple genes and compounds.
A large gene node that aggregates flows from many samples to many compounds may suggest a widely co-annotated gene present across multiple annotation contexts.
Regulatory Co-annotation Footprint
By following flows from left to right, one may observe:
which agencies are most broadly co-annotated across the observed gene and compound annotations,
and which compounds ultimately constitute the main annotation endpoints in the regulatory context.
System-Level Annotation Overview
The diagram can provide an annotation-level overview:
it visually integrates regulatory context, sample annotations, gene co-annotations, and compound co-annotations into a single representation,
potentially enabling the identification of annotation concentration points, redundancies, and gaps to guide downstream experimental prioritization.
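
The visual impression of a "broadly co-annotated" gene can be cross-checked numerically. The sketch below counts, for each gene, how many distinct samples feed into it and how many distinct compounds it leads to. The rows and the ranking rule (sum of the two counts, ties broken alphabetically) are illustrative assumptions rather than a metric defined by this use case.

```python
from collections import defaultdict

# Hypothetical rows: (referenceAG, sample, genesymbol, compoundname).
rows = [
    ("AgencyA", "S1", "gene1", "cmpdX"),
    ("AgencyA", "S2", "gene1", "cmpdY"),
    ("AgencyB", "S2", "gene1", "cmpdY"),
    ("AgencyB", "S3", "gene2", "cmpdZ"),
]

# Distinct upstream samples and downstream compounds per gene.
gene_samples = defaultdict(set)
gene_compounds = defaultdict(set)
for _agency, sample, gene, compound in rows:
    gene_samples[gene].add(sample)
    gene_compounds[gene].add(compound)

# breadth[gene] = (number of distinct samples, number of distinct compounds)
breadth = {
    gene: (len(gene_samples[gene]), len(gene_compounds[gene]))
    for gene in gene_samples
}

# Assumed ranking: widest genes first, ties broken by gene label.
ranked = sorted(breadth, key=lambda g: (-sum(breadth[g]), g))
print(ranked, breadth)
```

The same pattern applies to samples by collecting distinct agencies upstream and distinct genes downstream.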

Reproducibility and Assumptions

Input Format
The analysis assumes a semicolon-delimited table containing at least the columns:
referenceAG, sample, genesymbol, and compoundname.
Flow Definition
Each row carrying a given (referenceAG, sample, genesymbol, compoundname) combination contributes a unit count to the corresponding path.
The strength of a flow (ribbon thickness) is defined as the total count of co-occurrences for that path in the raw data.
Scope and Limitations
The alluvial diagram encodes frequency of observation, not kinetic rates, toxicity levels, or regulatory severity.
It should be interpreted as a structural mapping of how regulatory contexts, samples, genes, and compounds are linked, serving as a guide for more detailed downstream analyses rather than a complete risk or performance assessment on its own.
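
The input-format assumption can be checked before any aggregation is attempted. The following is a minimal sketch using the standard `csv` module; the `check_columns` helper and the example headers are hypothetical. A comma-delimited file fails the check because its header is read as a single field when parsed with a semicolon delimiter.

```python
import csv
import io

REQUIRED = ("referenceAG", "sample", "genesymbol", "compoundname")


def check_columns(handle):
    """Return the header if all required columns are present, else raise."""
    reader = csv.reader(handle, delimiter=";")
    header = next(reader, [])
    missing = [col for col in REQUIRED if col not in header]
    if missing:
        raise ValueError(f"missing required columns: {', '.join(missing)}")
    return header


# A semicolon-delimited header with one extra (hypothetical) column passes.
header = check_columns(
    io.StringIO("referenceAG;sample;genesymbol;compoundname;extra\n")
)

# A comma-delimited header is rejected.
try:
    check_columns(io.StringIO("referenceAG,sample,genesymbol,compoundname\n"))
    comma_file_rejected = False
except ValueError:
    comma_file_rejected = True

print(header, comma_file_rejected)
```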

Activity diagram of the use case
