Infrastructure Layer¶
Overview¶
The Infrastructure Layer provides concrete implementations for external concerns like data persistence, caching, and configuration management. It implements the interfaces defined by the domain layer and provides the technical capabilities needed by the application layer.
Following Clean Architecture principles, this layer: - Implements domain repository interfaces - Provides caching mechanisms - Manages application configuration - Handles external dependencies (CSV files, YAML configs) - Remains isolated from business logic
Architecture¶
graph TB
subgraph "Infrastructure Layer"
Persistence[Persistence<br/>CSV Repositories]
Cache[Cache<br/>Multi-tier System]
Config[Configuration<br/>Settings & DI]
PlotConfigs[Plot Configs<br/>YAML Files]
end
Domain[Domain Layer<br/>Interfaces]
Application[Application Layer]
Application --> Persistence
Application --> Cache
Application --> Config
Application --> PlotConfigs
Persistence -.implements.-> Domain
Cache -.-> Domain
style Persistence fill:#e1f5ff
style Cache fill:#fff4e1
style Config fill:#e8f5e9
style PlotConfigs fill:#f3e5f5 Module Structure¶
1. Persistence (infrastructure/persistence)¶
Purpose: Data access layer for CSV-based databases.
Components: - csv_database_repository.py - Generic CSV repository with caching - biorempp_repository.py - BioRemPP database access - kegg_repository.py - KEGG database access - hadeg_repository.py - HADEG database access - toxcsm_repository.py - ToxCSM database access
Key Features:
- Template Method Pattern (base repository with specialized implementations)
- Lazy Loading (databases loaded on first access)
- DataFrame Caching (in-memory caching for performance)
- Flexible Querying (support for single and batch KO lookups)
- Error Handling (graceful handling of missing files/data)
Database Locations:
biorempp_web/data/databases
├── biorempp_table.py → BioRemPP data
├── kegg_table.py → KEGG pathways
├── hadeg_table.py → HADEG annotations
└── toxcsm_table.py → ToxCSM predictions
2. Cache (infrastructure/cache)¶
Purpose: Multi-tier caching system for performance optimization.
Cache Types:
Memory Cache (memory_cache.py)¶
- Type: Generic in-memory cache
- Features: TTL, size limits, LRU eviction
- Use Case: General-purpose caching
DataFrame Cache (dataframe_cache.py)¶
- Type: Specialized for pandas DataFrames
- Features: Hash-based keys, memory-efficient
- Use Case: Database query results, merged data
Graph Cache (graph_cache.py)¶
- Type: Plotly figure caching
- Features: Serialization support, TTL
- Use Case: Generated plots and visualizations
Graph Cache Manager (graph_cache_manager.py)¶
- Type: Centralized cache orchestration
- Features: Multi-layer coordination, invalidation
- Use Case: Plot service caching strategy
Caching Strategy:
graph LR
Request[Plot Request] --> L1{DataFrame<br/>Cache?}
L1 -->|Hit| L2{Graph<br/>Cache?}
L1 -->|Miss| Generate[Generate<br/>DataFrame]
L2 -->|Hit| Return[Return Plot]
L2 -->|Miss| Plot[Generate Plot]
Generate --> Plot
Plot --> Store[Store in<br/>Both Caches]
Store --> Return 3. Configuration (infrastructure/config)¶
Purpose: Application settings and dependency injection.
Components:
Settings (settings.py)¶
- Singleton Pattern (single configuration instance)
- Environment Support (development, production, testing)
-
Features:
-
Database paths configuration
- Cache settings (TTL, sizes)
- Logging configuration
- Feature flags
Database Config (database_config.py)¶
- Purpose: Database connection and path management
-
Features:
-
Path validation
- Database availability checks
- Lazy initialization
- Error handling for missing databases
Dependency Injection (dependency_injection.py)¶
- Pattern: Service Locator + Factory
-
Features:
-
Centralized dependency creation
- Lifecycle management
- Singleton services
- Easy testing (mock injection)
Analysis Registry (analysis_registry.py)¶
- Purpose: Register and discover analysis strategies
-
Features:
-
Strategy registration
- Dynamic discovery
- Metadata management
- Validation
Configuration Hierarchy:
Settings (Singleton)
├── Database Config
│ ├── BioRemPP paths
│ ├── KEGG paths
│ ├── HADEG paths
│ └── ToxCSM paths
├── Cache Config
│ ├── TTL settings
│ ├── Size limits
│ └── Eviction policies
└── Application Config
├── Logging levels
├── Feature flags
└── Environment settings
4. Plot Configurations (infrastructure/plot_configs)¶
Purpose: YAML-based plot configuration files.
Structure:
plot_configs/
├── module1/
│ ├── uc_1_1_config.yaml
│ ├── uc_1_2_config.yaml
│ └── ...
├── module2/
│ ├── uc_2_1_config.yaml
│ ├── uc_2_2_config.yaml
│ └── ...
├── module3/
│ └── ...
└── module4/
└── ...
Configuration Schema:
metadata:
use_case_id: "UC-2.1"
title: "Ranking of Samples by Functional Richness"
description: "Bar chart showing KO count per sample"
module: 2
visualization:
strategy: "BarChartStrategy"
type: "bar"
data:
x_column: "Sample"
y_column: "KO_Count"
aggregation: "count"
layout:
title: "Sample Functional Richness"
xaxis_title: "Sample ID"
yaxis_title: "Number of KOs"
height: 600
cache:
dataframe_ttl: 3600
graph_ttl: 1800
key_template: "uc_{use_case_id}_{data_hash}_{filters_hash}"
Design Patterns¶
Repository Pattern¶
All database access through repository interfaces
Singleton Pattern¶
Configuration and services use singleton
Template Method Pattern¶
Base repository defines algorithm, subclasses customize
Strategy Pattern¶
Cache strategies for different data types
Performance Optimizations¶
1. Multi-Tier Caching¶
Impact: 90%+ reduction in database reads
2. Lazy Loading¶
Impact: Faster startup, lower memory footprint
- Databases loaded only when first accessed
- Configuration loaded on-demand
- Repositories initialized lazily
3. Batch Operations¶
Impact: 10x faster for multiple KO lookups
4. DataFrame Caching¶
Impact: Sub-millisecond retrieval vs. seconds for CSV parsing
- Hash-based cache keys
- TTL-based expiration
- LRU eviction when full
Cache Configuration¶
Default Settings¶
| Cache Type | Max Size | TTL | Eviction |
|---|---|---|---|
| Memory | 100 items | 3600s | LRU |
| DataFrame | 50 items | 3600s | LRU |
| Graph | 30 items | 1800s | LRU |
Cache Key Strategies¶
DataFrame Cache:
Graph Cache:
File Organization¶
infrastructure/
├── __init__.py
├── persistence/ # Data access layer
│ ├── __init__.py
│ ├── csv_database_repository.py
│ ├── biorempp_repository.py
│ ├── kegg_repository.py
│ ├── hadeg_repository.py
│ └── toxcsm_repository.py
├── cache/ # Caching system
│ ├── __init__.py
│ ├── memory_cache.py
│ ├── dataframe_cache.py
│ ├── graph_cache.py
│ └── graph_cache_manager.py
├── config/ # Configuration
│ ├── __init__.py
│ ├── settings.py
│ ├── database_config.py
│ ├── dependency_injection.py
│ ├── analysis_registry.py
│ └── download_config.yaml
└── plot_configs/ # Plot YAML configs
├── module1/
├── module2/
├── module3/
└── module4/
Version¶
Current Version: 1.0.0
See Also¶
- Application Layer Documentation
- Domain Layer Documentation
- Data Tables Documentation