UC-1.2: Overlap of Compounds Across Regulatory References
Module: 1 – Comparative Assessment of Databases, Samples, and Regulatory Frameworks
Visualization type: UpSet plot (set intersections of compound names)
Primary inputs: BioRemPP results table (BioRemPP_Results.xlsx or BioRemPP_Results.csv)
Primary outputs: Intersection cardinalities of compound lists across regulatory agencies, UpSet visualization
Scientific Question and Rationale
Question: To what extent do the lists of monitored chemical compounds overlap between different environmental regulatory agencies, and which compounds appear in multiple regulatory lists?
This use case quantifies the overlap and uniqueness of compounds across the different environmental and regulatory references (referenceAG) cited in the BioRemPP dataset. The UpSet plot provides a systematic view of which compounds are unique to the scope of a single agency versus those that are shared concerns across multiple regulatory bodies. This perspective can be useful for identifying widely recognized pollutants, locating regional or thematic specializations, and assessing the degree of harmonization between regulatory frameworks.
Data and Inputs
- Primary data source: BioRemPP_Results.xlsx or BioRemPP_Results.csv
- Key columns:
  - referenceAG: identifier for the regulatory or scientific agency (e.g., WFD, CONAMA, EPC)
  - compoundname: name of the monitored chemical compound
- Accepted format: semicolon-delimited text table (.txt or .csv)
- Entity of interest: unique compound names per agency
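As a minimal sketch of reading the accepted format, the snippet below parses a semicolon-delimited table with Python's standard `csv` module and checks that the two key columns are present. The helper name `load_rows` and the in-memory sample rows are hypothetical and only illustrate the expected layout; the compound names are made up and are not taken from the BioRemPP dataset.

```python
import csv
import io

# Columns named in the "Key columns" list above.
REQUIRED_COLUMNS = {"referenceAG", "compoundname"}


def load_rows(handle):
    """Read a semicolon-delimited table and return its rows as dicts.

    `load_rows` is a hypothetical helper, not part of BioRemPP.
    """
    reader = csv.DictReader(handle, delimiter=";")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return list(reader)


# Illustrative in-memory sample standing in for a real results file.
sample = io.StringIO(
    "referenceAG;compoundname\n"
    "WFD;Benzene\n"
    "EPC;Benzene\n"
    "CONAMA;Toluene\n"
)
rows = load_rows(sample)
print(len(rows), rows[0])
```

For a file on disk, the same helper could be called on an open file handle, for example `open("BioRemPP_Results.csv", newline="", encoding="utf-8")`; the encoding here is an assumption and may need adjusting.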
Analytical Workflow
- Data Loading
  The primary results table (BioRemPP_Results.xlsx or BioRemPP_Results.csv) is loaded into memory.
- Filtering and Cleaning
  The dataset is filtered to retain only complete entries containing both a non-empty referenceAG and a non-empty compoundname. Compound names and agency identifiers are standardized (e.g., trimming whitespace, harmonizing case) to ensure consistent matching.
- Set Construction
  The cleaned data are grouped by each unique referenceAG. For every agency, a set of all unique compound names associated with it is constructed. These sets represent the compounds monitored or referenced by each regulatory body.
- Intersection Calculation and Rendering
  Using the per-agency compound sets, all relevant intersections (single, pairwise, and higher-order) are computed. An UpSet plot is then generated to visualize:
  - the size of each individual set, and
  - the size of all intersections, typically ranked by cardinality.
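The filtering, set construction, and intersection steps above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the BioRemPP implementation: the function names are hypothetical, agency identifiers are upper-cased and compound names lower-cased as one possible way to harmonize case, and the sample rows are made up. Intersections are computed as exclusive regions (compounds listed by exactly a given combination of agencies), which matches how the plot is read; rendering is left out.

```python
from collections import defaultdict


def clean(value):
    """Trim and collapse whitespace; treat a missing value as empty."""
    return " ".join((value or "").split())


def build_agency_sets(rows):
    """Map each agency to the set of unique compound names it lists."""
    sets = defaultdict(set)
    for row in rows:
        agency = clean(row.get("referenceAG")).upper()      # assumed case rule
        compound = clean(row.get("compoundname")).lower()   # assumed case rule
        if agency and compound:  # keep only complete entries
            sets[agency].add(compound)
    return dict(sets)


def exclusive_intersections(agency_sets):
    """Group compounds by the exact combination of agencies listing them."""
    membership = defaultdict(set)
    for agency, compounds in agency_sets.items():
        for compound in compounds:
            membership[compound].add(agency)
    regions = defaultdict(set)
    for compound, agencies in membership.items():
        regions[frozenset(agencies)].add(compound)
    # Rank by cardinality (largest first), then by agency names for stability.
    return sorted(regions.items(), key=lambda kv: (-len(kv[1]), sorted(kv[0])))


# Made-up rows for illustration only.
rows = [
    {"referenceAG": "WFD", "compoundname": "Benzene"},
    {"referenceAG": " wfd", "compoundname": "benzene "},  # duplicate after cleaning
    {"referenceAG": "EPC", "compoundname": "Benzene"},
    {"referenceAG": "CONAMA", "compoundname": "Benzene"},
    {"referenceAG": "WFD", "compoundname": "Toluene"},
    {"referenceAG": "EPC", "compoundname": "Toluene"},
    {"referenceAG": "WFD", "compoundname": "Atrazine"},
    {"referenceAG": "EPC", "compoundname": "Atrazine"},
    {"referenceAG": "CONAMA", "compoundname": "Lindane"},
    {"referenceAG": "CONAMA", "compoundname": ""},  # incomplete, dropped
]

agency_sets = build_agency_sets(rows)
ranked = exclusive_intersections(agency_sets)

for agency, compounds in sorted(agency_sets.items()):
    print(agency, len(compounds))
for agencies, compounds in ranked:
    print(" & ".join(sorted(agencies)), len(compounds))
```

Grouping compounds by their membership signature avoids enumerating every possible agency combination: only combinations that actually occur in the data are produced.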
How to Read the Plot
The UpSet plot is composed of three main components:
- Set Size (Left Bar Chart)
  Displays the total number of unique compounds associated with each individual regulatory agency (referenceAG). Larger bars indicate agencies with broader monitored chemical lists.
- Intersection Matrix (Bottom)
  The connected dots in the matrix define a specific intersection of agencies. For example, dots connected for "EPC" and "WFD" (and not for others) represent the set of compounds that are monitored by both EPC and WFD, but not by the other agencies.
- Intersection Size (Top Bar Chart)
  The height of each bar corresponds to the number of compounds in the intersection defined by the matrix directly below it. Taller bars indicate a larger number of shared compounds among the selected agencies.
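Because each matrix column denotes compounds listed by the marked agencies and by no others, the bar above it can be smaller than the plain set intersection of those agencies. The short sketch below contrasts the two counts; the function names and the per-agency sets are hypothetical and the compound names are made up.

```python
def inclusive_intersection(agency_sets, selected):
    """Compounds listed by every selected agency, regardless of the others."""
    return set.intersection(*(agency_sets[a] for a in selected))


def exclusive_intersection(agency_sets, selected):
    """Compounds listed by every selected agency and by no other agency."""
    shared = inclusive_intersection(agency_sets, selected)
    others = [s for a, s in agency_sets.items() if a not in selected]
    return shared.difference(*others)


# Made-up per-agency sets for illustration only.
agency_sets = {
    "WFD": {"benzene", "toluene", "atrazine"},
    "EPC": {"benzene", "toluene", "atrazine"},
    "CONAMA": {"benzene", "lindane"},
}

pair = ["EPC", "WFD"]
plain = inclusive_intersection(agency_sets, pair)
column = exclusive_intersection(agency_sets, pair)
print(sorted(plain))   # all compounds shared by EPC and WFD
print(sorted(column))  # those shared by EPC and WFD only
```

In this sample, benzene is shared by EPC and WFD but also appears under CONAMA, so it is counted in the three-agency column rather than in the EPC and WFD column.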
Representative Output
The image below illustrates a representative output generated by this use case using the example dataset.
Interpretation and Key Messages
- Shared Regulatory Focus
  Large bars above intersections containing multiple agencies represent compounds that appear in more than one regulatory list. Presence in multiple lists may reflect broader regulatory recognition, though inclusion criteria and risk frameworks vary across jurisdictions and should be consulted directly for context.
- Agency-Specific Focus
  Bars above single, unconnected dots correspond to compounds that appear in only one agency's list. These patterns can suggest specialized regulatory scopes, such as regional priorities, specific industrial sectors, or targeted classes of pollutants (e.g., certain pesticides or industrial chemicals).
- Harmonization Patterns
  The extent and structure of overlaps can provide insight into the degree of alignment between regulatory frameworks. Large overlaps may suggest coordinated monitoring approaches, while limited overlaps may reflect different regional priorities, risk assessment methodologies, or stages of regulatory development.
Reproducibility and Assumptions
- Input Format
  The analysis assumes a semicolon-delimited table containing at least the columns referenceAG and compoundname.
- Identifier Handling
  Compound names and agency identifiers are treated as strings and normalized (e.g., trimming whitespace, harmonizing case) prior to set construction to ensure consistent matching.
- Uniqueness Definition
  All counts are based on unique compound names per agency. Duplicate entries within the same referenceAG group are removed before computing set sizes and intersections.
Activity diagram of the use case