UC-6.1 — Regulatory-to-Molecular Interaction Flow

Module: 6 – Hierarchical and Flow-based Functional Analysis
Visualization type: Four-stage alluvial / Sankey diagram
Primary inputs: BioRemPP results table with referenceAG, sample, genesymbol, and compoundname
Primary outputs: Multi-stage flow network from regulatory agencies → samples → genes → compounds


Scientific Question and Rationale

Question: How do high-level regulatory contexts flow through specific samples and their gene co-annotations to reach individual chemical compounds?

This use case traces co-annotation paths from environmental or regulatory agencies, through biological samples, down to genes and ultimately compounds. By representing these connections as an alluvial diagram with four ordered stages, the analysis quantifies which regulatory frameworks are most strongly co-annotated with which samples, which genes appear most frequently in those co-annotation contexts, and which chemical compounds emerge as key co-annotation endpoints. The width of each flow encodes how frequently a given path appears, providing a system-level view of how regulatory, sample, gene, and compound annotations are interconnected in the dataset.


Data and Inputs

  • Primary data source: BioRemPP_Results.xlsx or BioRemPP_Results.csv
  • Key columns:
      • referenceAG – regulatory or scientific agency label
      • sample – identifier for each biological sample
      • genesymbol – gene symbol or identifier
      • compoundname – chemical compound name or identifier
  • Accepted format: semicolon-delimited text table (.txt or .csv)

  • Conceptual flow (stages):
      1. Regulatory Agency (referenceAG)
      2. Sample (sample)
      3. Gene Symbol (genesymbol)
      4. Compound Name (compoundname)
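
As a minimal sketch of how the input described above could be read and checked, the snippet below parses a semicolon-delimited table with the Python standard library and keeps one four-stage tuple per row. The function name load_paths, the demo rows, and the choice to skip rows with an empty stage are assumptions made for illustration; they are not taken from the BioRemPP implementation or its example dataset.

```python
import csv
import io

# Column names as listed under "Key columns", in stage order.
REQUIRED_COLUMNS = ("referenceAG", "sample", "genesymbol", "compoundname")


def load_paths(handle):
    """Read a semicolon-delimited table and return one 4-tuple per complete row."""
    reader = csv.DictReader(handle, delimiter=";")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    paths = []
    for row in reader:
        values = tuple((row[c] or "").strip() for c in REQUIRED_COLUMNS)
        if all(values):  # assumption: rows with an empty stage are skipped
            paths.append(values)
    return paths


# Hypothetical demo rows (column order in the file does not matter).
demo = io.StringIO(
    "sample;genesymbol;compoundname;referenceAG\n"
    "S1;geneA;benzene;AgencyX\n"
    "S1;geneA;benzene;AgencyX\n"
    "S2;geneB;;AgencyY\n"
)
paths = load_paths(demo)
print(paths)
```

In practice the handle would come from something like open("BioRemPP_Results.csv", newline="", encoding="utf-8"); the encoding is an assumption.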

Analytical Workflow

  1. Data Loading
    The primary results table (BioRemPP_Results.xlsx or BioRemPP_Results.csv) is loaded from its semicolon-delimited format.

  2. Path Definition
    A four-stage path is defined for each row using:
      • referenceAG → sample → genesymbol → compoundname.
    Each complete combination represents a single regulatory-to-molecular interaction path.

  3. Aggregation of Flows
    The data is grouped by each unique four-step path:
      • for every unique (referenceAG, sample, genesymbol, compoundname) combination,
      • the number of occurrences is counted.
    This count becomes the flow value that determines ribbon thickness.

  4. Link Construction for Sankey / Alluvial Diagram
    The aggregated paths are transformed into a set of linked pairs suitable for a Sankey diagram:
      • Stage 1 → Stage 2: referenceAG → sample
      • Stage 2 → Stage 3: sample → genesymbol
      • Stage 3 → Stage 4: genesymbol → compoundname
    Node indices and link values are encoded in the format required by the plotting library.

  5. Rendering
    The data is rendered as an interactive alluvial (Sankey) diagram:
      • vertical columns represent the four stages,
      • nodes within each column represent unique entities at that stage,
      • ribbons between columns represent aggregated flows weighted by their counts.
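
The aggregation and link-construction steps above can be sketched as follows. This is an illustrative outline using only the standard library, not the BioRemPP code: the function name build_sankey_inputs, the demo paths, and the decision to key nodes by (stage, label) so that an identical string in two stages yields two separate nodes are all assumptions.

```python
from collections import Counter

STAGES = ("referenceAG", "sample", "genesymbol", "compoundname")


def build_sankey_inputs(paths):
    """Count four-stage paths, then derive pairwise links between adjacent stages."""
    # Aggregation: one count per unique (agency, sample, gene, compound) path.
    path_counts = Counter(paths)

    # Link construction: each path adds its count to three adjacent-stage links.
    link_counts = Counter()
    for path, n in path_counts.items():
        staged = list(zip(STAGES, path))  # (stage, label) pairs
        for src, dst in zip(staged, staged[1:]):
            link_counts[(src, dst)] += n

    # Node indices: assign an integer to each (stage, label) in first-seen order.
    index, nodes = {}, []
    for src, dst in link_counts:
        for node in (src, dst):
            if node not in index:
                index[node] = len(nodes)
                nodes.append(node)

    links = {
        "source": [index[s] for s, _ in link_counts],
        "target": [index[t] for _, t in link_counts],
        "value": list(link_counts.values()),
    }
    labels = [label for _, label in nodes]
    return path_counts, labels, links


# Hypothetical demo paths.
demo_paths = [
    ("AgencyX", "S1", "geneA", "benzene"),
    ("AgencyX", "S1", "geneA", "benzene"),
    ("AgencyX", "S1", "geneB", "toluene"),
]
path_counts, labels, links = build_sankey_inputs(demo_paths)
print(labels)          # ['AgencyX', 'S1', 'geneA', 'benzene', 'geneB', 'toluene']
print(links["value"])  # [3, 2, 2, 1, 1]
```

Note that the AgencyX → S1 link carries a value of 3 because both distinct paths pass through it, which is the sense in which a ribbon reflects the aggregated count of a partial path.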

How to Read the Plot

  • Vertical Columns (Stages)
    From left to right, the four columns represent:
      • Regulatory Agencies
      • Samples
      • Gene Symbols
      • Compound Names

  • Nodes within Columns
    Each node is a unique entity at that stage:
      • a specific agency, sample, gene, or compound.
    Node size (height) is proportional to the total flow entering or leaving that node.

  • Flows (Ribbons)
    The ribbons connecting nodes represent co-annotation flows:
      • a ribbon from an agency to a sample may indicate that the sample is co-annotated with compounds monitored by that agency,
      • a ribbon from a sample to a gene may indicate that the gene is co-annotated with that sample,
      • a ribbon from a gene to a compound may indicate that the gene is co-annotated with that compound.

  • Flow Thickness
    The thickness of each ribbon is proportional to the number of co-occurrences (the aggregated count for that partial path).
    Thicker ribbons may indicate more frequently observed regulatory-to-molecular relationships.

  • Interactivity
    In the interactive version:
      • hovering over nodes or flows displays labels and numeric values (counts),
      • nodes may be dragged vertically to improve visual separation of overlapping flows.
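
To make the reading rules above concrete, the sketch below computes the height each node would receive (the larger of its incoming and outgoing flow) and shows one way such a diagram could be rendered. The source does not name its plotting library; Plotly is assumed here purely for illustration, and node_heights, make_sankey_figure, and the demo values are hypothetical names and data.

```python
def node_heights(n_nodes, source, target, value):
    """Return, per node, the larger of total incoming and total outgoing flow."""
    inflow = [0] * n_nodes
    outflow = [0] * n_nodes
    for s, t, v in zip(source, target, value):
        outflow[s] += v
        inflow[t] += v
    return [max(i, o) for i, o in zip(inflow, outflow)]


def make_sankey_figure(labels, source, target, value):
    """Build an interactive Sankey figure (assumes Plotly is installed)."""
    import plotly.graph_objects as go  # lazy import: Plotly is an assumption

    sankey = go.Sankey(
        # "perpendicular" restricts dragging to the axis across the flow,
        # i.e. vertical movement in a left-to-right diagram.
        arrangement="perpendicular",
        node=dict(label=labels, pad=15, thickness=20),
        link=dict(source=source, target=target, value=value),
    )
    fig = go.Figure(data=[sankey])
    fig.update_layout(title_text="Regulatory-to-Molecular Interaction Flow")
    return fig


# Hypothetical demo data: six nodes and five links.
labels = ["AgencyX", "S1", "geneA", "benzene", "geneB", "toluene"]
source = [0, 1, 2, 1, 4]
target = [1, 2, 3, 4, 5]
value = [3, 2, 2, 1, 1]
heights = node_heights(len(labels), source, target, value)
print(dict(zip(labels, heights)))
```

With Plotly available, make_sankey_figure(labels, source, target, value).show() would open the interactive view; hovering and dragging are then handled by the library.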

Representative Output

The image below illustrates a representative output generated by this use case using the example dataset.

Representative output for UC-6.1


Interpretation and Key Messages

  • Tracing Dominant Co-annotation Pathways
    The thickest ribbons may highlight the most prominent co-annotation paths:
      • from a given Regulatory Agency through one or more Samples,
      • via specific Genes,
      • down to their co-annotated Compounds.
    These flows may reveal where regulatory compound lists, sample annotations, and gene co-annotations most strongly overlap.

  • Identifying Broadly Co-annotated Samples and Genes
      • A large sample node with many incoming and outgoing flows may indicate a broadly co-annotated sample that connects multiple agencies to multiple genes and compounds.
      • A large gene node that aggregates flows from many samples to many compounds may suggest a widely co-annotated gene present across multiple annotation contexts.

  • Regulatory Co-annotation Footprint
    By following flows from left to right, one may observe:
      • which agencies are most broadly co-annotated across the observed gene and compound annotations,
      • and which compounds ultimately constitute the main annotation endpoints in the regulatory context.

  • System-Level Annotation Overview
    The diagram can provide an annotation-level overview:
      • it visually integrates regulatory context, sample annotations, gene co-annotations, and compound co-annotations into a single representation,
      • potentially enabling the identification of annotation concentration points, redundancies, and gaps to guide downstream experimental prioritization.
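
The idea of a "widely co-annotated gene" discussed above can also be checked numerically rather than by eye. The sketch below counts, for each gene, how many distinct samples and compounds it connects. The function name gene_breadth, the demo paths, and the ranking rule (samples times compounds) are illustrative assumptions, not a metric defined by this use case.

```python
from collections import defaultdict


def gene_breadth(paths):
    """Return (gene, n_samples, n_compounds) tuples, broadest genes first."""
    samples = defaultdict(set)
    compounds = defaultdict(set)
    for _agency, sample, gene, compound in paths:
        samples[gene].add(sample)
        compounds[gene].add(compound)
    rows = [(g, len(samples[g]), len(compounds[g])) for g in samples]
    # Assumed ranking: product of distinct samples and compounds, ties by name.
    return sorted(rows, key=lambda r: (-r[1] * r[2], r[0]))


# Hypothetical demo paths: (referenceAG, sample, genesymbol, compoundname).
demo_paths = [
    ("AgencyX", "S1", "geneA", "benzene"),
    ("AgencyX", "S2", "geneA", "toluene"),
    ("AgencyY", "S2", "geneB", "toluene"),
]
breadth = gene_breadth(demo_paths)
print(breadth)  # [('geneA', 2, 2), ('geneB', 1, 1)]
```

The same pattern applies to samples by swapping which tuple positions are collected.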

Reproducibility and Assumptions

  • Input Format
    The analysis assumes a semicolon-delimited table containing at least the columns referenceAG, sample, genesymbol, and compoundname.

  • Flow Definition
      • Each row (occurrence) of a (referenceAG, sample, genesymbol, compoundname) combination contributes a unit count to the corresponding path.
      • The strength of a flow (ribbon thickness) is defined as the total count of co-occurrences for that path in the raw data.

  • Scope and Limitations
      • The alluvial diagram encodes frequency of observation, not kinetic rates, toxicity levels, or regulatory severity.
      • It should be interpreted as a structural mapping of how regulatory contexts, samples, genes, and compounds are linked, serving as a guide for more detailed downstream analyses rather than a complete risk or performance assessment on its own.
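
One way to formalize the flow definition above is sketched below. The notation is an assumption introduced here, not taken from the source: n(a, s, g, c) denotes the number of rows with agency a, sample s, gene g, and compound c, and each ribbon between adjacent stages is taken to sum over the stages it does not involve.

```latex
\begin{align}
  w_{1 \to 2}(a, s) &= \sum_{g} \sum_{c} n(a, s, g, c) \\
  w_{2 \to 3}(s, g) &= \sum_{a} \sum_{c} n(a, s, g, c) \\
  w_{3 \to 4}(g, c) &= \sum_{a} \sum_{s} n(a, s, g, c)
\end{align}
```

Under this reading, a sample → gene ribbon pools rows from every agency, so a single ribbon cannot by itself be traced back to one specific agency; the full four-stage counts n(a, s, g, c) are needed for that.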

Activity diagram of the use case

Activity diagram of the use case