Use Case YAML Configuration Methodology¶

Overview¶

BioRemPP employs a declarative YAML-based configuration system to define analytical use cases. This approach separates visualization logic from business rules, enabling reproducible, auditable, and maintainable analytical configurations.

Each use case is fully described through YAML files that specify:

What data to process
How to transform the data
How to visualize the results
What validation rules to apply
How to handle errors

This methodology supports the scientific requirement of traceability: analytical outputs can be linked to a specific, versioned configuration state.

Design Principles¶

Declarative Over Imperative¶

Configuration files describe the desired outcome, not the procedural steps to achieve it. The system interprets these declarations and executes the appropriate logic.

# Declarative: WHAT to do
processing:
  steps:
    - name: "group_and_count"
      params:
        group_by: "Sample"
        agg_function: "nunique"

Separation of Concerns¶

Layer	Responsibility	File
Configuration	What to compute and display	`*_config.yaml`
Scientific Context	Interpretation guidelines	`*_panel.yaml`
Implementation	How to execute	Python strategies

Reproducibility¶

Given identical: - Input data - Configuration version - Database snapshot

The system produces identical outputs.

File Structure¶

Each analytical module contains use case configurations organized by module number:

src/infrastructure/plot_configs/
├── module1/
│   ├── uc_1_1_config.yaml
│   ├── uc_1_1_panel.yaml
│   ├── uc_1_2_config.yaml
│   └── ...
├── module2/
├── module3/
└── ...

Naming Convention¶

Pattern	Description	Example
`uc_X_Y_config.yaml`	Plot configuration	`uc_2_1_config.yaml`
`uc_X_Y_panel.yaml`	Scientific panel	`uc_2_1_panel.yaml`

Where X = module number, Y = use case number within module.

Data Flow¶

┌─────────────────┐
│   YAML Config   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Data Loading   │ ← Store ID from config
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Validation    │ ← Rules from config
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Processing    │ ← Steps from config
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Visualization  │ ← Strategy from config
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Plotly Figure │
└─────────────────┘

Configuration Schema¶

Plot Configuration (`*_config.yaml`)¶

A complete plot configuration contains the following sections:

1. Metadata Section¶

Identifies the use case and provides context.

Field	Required	Description
`use_case_id`	Yes	Unique identifier (e.g., "UC-2.1")
`module`	Yes	Parent module (e.g., "module2")
`title`	Yes	Human-readable title
`description`	Yes	Brief explanation of the visualization
`version`	Yes	Semantic version (e.g., "1.0.0")
`plot_type`	Yes	Chart type (bar_chart, heatmap, upset, etc.)
`tags`	No	Categorization keywords
`scientific_context`	No	Domain, application, interpretation

2. Data Section¶

Specifies data source and processing pipeline.

Field	Required	Description
`source`	Yes	Store identifier (e.g., "biorempp-merged-data")
`source_type`	Yes	Source type ("store", "file", "api")
`required_columns`	Yes	Columns that must exist in input data
`optional_columns`	No	Columns that enhance functionality
`processing.steps`	Yes	Ordered list of transformation steps

Processing Steps:

Step Name	Purpose	Key Parameters
`validate`	Check data validity	`validator`
`group_and_count`	Aggregate data	`group_by`, `agg_column`, `agg_function`, `result_column`
`sort`	Order results	`by`, `ascending`
`filter_range`	Apply value filters	`column`, `min`, `max`
`limit`	Restrict row count	`n`
`normalize_identifiers`	Clean identifiers	`strip_whitespace`, `convert_uppercase`

3. Visualization Section¶

Controls chart rendering.

Field	Required	Description
`strategy`	Yes	Rendering strategy class
`plotly.chart_type`	Yes	Plotly chart type
`plotly.x`	Depends	X-axis column
`plotly.y`	Depends	Y-axis column
`plotly.orientation`	No	"v" (vertical) or "h" (horizontal)
`plotly.color_discrete_sequence`	No	Color palette
`plotly.layout`	Yes	Chart layout configuration

Available Strategies¶

Total strategies: 19

Strategy	Plot Type	Use Case
`BarChartStrategy`	Bar charts	Rankings, counts
`StackedBarChartStrategy`	Stacked bar charts	Composition across groups
`BoxScatterStrategy`	Box + scatter	Distribution + point-level variability
`DensityPlotStrategy`	Density plots	Distribution shape comparison
`DotPlotStrategy`	Dot plots	Compact comparisons across categories
`HeatmapStrategy`	Heatmaps	Matrices, intensities, correlations
`HeatmapScoredStrategy`	Scored heatmaps	Heatmaps with scoring or ranking overlays
`FacetedHeatmapStrategy`	Multi-panel heatmaps	Category-based comparisons
`CorrelogramStrategy`	Correlograms	Correlation structure analysis
`HierarchicalClusteringStrategy`	Clustering / dendrograms	Similarity grouping and cluster discovery
`PcaStrategy`	PCA plots	Dimensionality reduction and sample separation
`RadarChartStrategy`	Radar charts	Multi-metric profiling
`UpSetStrategy`	UpSet plots	Set intersection analysis
`SankeyStrategy`	Sankey diagrams	Flow and transition visualization
`ChordStrategy`	Chord diagrams	Inter-group relationships
`NetworkStrategy`	Network graphs	Connectivity and interaction patterns
`SunburstStrategy`	Sunburst charts	Hierarchical composition
`TreemapStrategy`	Treemaps	Hierarchical part-to-whole analysis
`FrozensetStrategy`	Set-based utilities	Canonical set handling and deterministic grouping

4. Interactivity Section¶

Defines UI component interactions.

Field	Required	Description
`triggers`	Yes	Components that trigger rendering
`outputs`	Yes	Components that receive output
`states`	No	Additional state dependencies

Trigger Types:

accordion_open — Render when accordion expands
button_click — Render on button click
dropdown_selection — Render on dropdown change
filter_change — Update on filter modification

5. Validation Section¶

Ensures data meets requirements before rendering.

Rule	Purpose	Parameters
`not_empty`	Check data exists	—
`required_columns`	Check columns exist	`columns`
`no_nulls`	Check for null values	`columns`
`minimum_samples`	Check minimum row count	`min_count`
`minimum_databases`	Check database count	`min_count`
`numeric_scores`	Check numeric type	`column`

6. Performance Section¶

Controls caching and logging.

Field	Description
`cache.enabled`	Enable/disable caching
`cache.layers`	Cache layer definitions
`cache.layers[].ttl`	Time-to-live in seconds
`cache.layers[].key_template`	Cache key pattern
`cache.invalidation`	Invalidation triggers
`logging.enabled`	Enable performance logging
`logging.level`	Log verbosity

7. Error Handling Section¶

Defines error responses.

Error Type	Action Options
`missing_columns`	`display_message`, `display_placeholder`
`empty_dataframe`	`display_placeholder`
`processing_errors`	`log_and_notify`, `display_error_message`

Panel Configuration (`*_panel.yaml`)¶

Provides scientific context for users.

Field	Required	Description
`use_case_id`	Yes	Identifier matching config file
`scientific_question`	Yes	Research question addressed
`description`	Yes	What the visualization shows
`visual_elements`	Yes	Explanation of chart elements
`interpretation_guidelines`	Yes	How to interpret results
`color_scheme`	Yes	Bootstrap color scheme

Aggregation Functions Reference¶

Function	Description	Example Result
`nunique`	Count unique values	15 unique KOs
`count`	Count all occurrences	150 total entries
`sum`	Sum numeric values	1,250 total abundance
`mean`	Average of values	0.75 mean score
`median`	Median of values	0.68 median score
`min`	Minimum value	0.1 minimum
`max`	Maximum value	0.95 maximum

Color Scales Reference¶

Scale	Use Case	Description
`Reds`	Toxicity	Intuitive danger gradient
`Blues`	General	Neutral gradient
`Viridis`	Scientific	Perceptually uniform, colorblind-friendly
`Plasma`	High contrast	Wide color range
`Greens`	Positive values	Growth, abundance

Layout Templates Reference¶

Template	Description
`simple_white`	Clean white background
`plotly_white`	Plotly white theme
`plotly_dark`	Dark theme
`ggplot2`	R ggplot2 style
`seaborn`	Seaborn style

Best Practices¶

Version all configurations: Include version field and track changes in version control
Use descriptive identifiers: Use case IDs should follow UC-X.Y pattern
Document scientific context: Panel files should explain interpretation to users
Set appropriate cache TTL: Balance freshness vs. performance
Define comprehensive validation: Catch data issues before rendering
Use consistent naming: Follow established patterns across modules

YAML Configuration Overview — Quick configuration guide for deployment
Methods Overview — High-level methodological framework
Data Sources — Database inventory and provenance
Mapping Strategy — Technical mapping pipeline and join logic
Limitations and Scope Boundaries — Interpretation constraints and usage restrictions
Use Cases Index — Analytical use case catalog
Environment Variables — Runtime configuration for YAML processing