Computational Performance Profiling¶
Version: 1.0.0
Profiling Suite Version: v1.0
Last Profiling Run: 2026-01-17
1. Purpose of Computational Profiling¶
1.1 Rationale¶
Computational profiling characterizes the runtime behavior of the BioRemPP web service as part of its internal validation and reproducibility framework.
Profiling provides empirical evidence of:
- Resource consumption: CPU time, memory allocation, and I/O throughput for each pipeline stage
- Computational consistency: Deterministic function call patterns and stable memory usage across runs
- Performance baselines: Reference metrics for regression detection during software updates
1.2 Relationship to Internal Validation¶
Profiling complements biological validation by providing computational reproducibility evidence:
| Validation Type | Scope | Evidence |
|---|---|---|
| Biological Validation | Data accuracy | Database content verification against source databases |
| Computational Profiling | Performance consistency | Execution metrics, resource usage, function call patterns |
| Reproducibility | Cross-run stability | Deterministic outputs, versioned databases, checksummed files |
Profiling data is collected alongside database checksums and validation snapshots, enabling complete audit trails for computational behavior. This integrated approach supports the FAIR principles by documenting the computational transparency required for reproducible bioinformatics analyses.
2. Profiling Suite Overview¶
BioRemPP utilizes a dedicated Profiling Suite for systematic performance characterization. The suite is designed as a modular, target-based profiling framework that characterizes specific functional components of the data processing pipeline.
2.1 Suite Architecture¶
The Profiling Suite is implemented in profiling_biorempp/scripts/run_profiling.py and generates structured reports in profiling_biorempp/reports/. Suite components include:
profiling_biorempp/
├── scripts/
│   └── run_profiling.py          # Core profiling engine
└── reports/
    ├── *.stats                   # Binary cProfile statistics
    ├── *.txt                     # Function call reports
    ├── profiling_summary_*.json  # Structured metrics
    └── profiling_report_*.md     # Documentation-ready report
2.2 Instrumentation Stack¶
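The suite combines two Python standard library profilers with one third-party package:
- cProfile: Deterministic CPU profiling with cumulative time sorting
- tracemalloc: Memory allocation tracking for peak usage analysis
- psutil: Process-level memory monitoring for resource characterization (third-party, >= 5.9.0)
This stack covers CPU time, Python-level allocations, and process-level memory, with psutil as its only external dependency. Because cProfile and tracemalloc instrument every call and allocation, they add measurable overhead: absolute timings are higher than in an uninstrumented run and are most meaningful for run-to-run comparison.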
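The relationship between the binary `.stats` files and the text reports can be sketched with the standard library alone. This is a minimal illustration, not the suite's actual code: `sample_workload` and `write_stats_and_report` are hypothetical names, and the workload is a stand-in for a real pipeline stage.

```python
import cProfile
import io
import pstats
import tempfile
from pathlib import Path


def sample_workload(n: int = 20_000) -> int:
    """Hypothetical stand-in for a pipeline stage."""
    return sum(i * i for i in range(n))


def write_stats_and_report(out_dir: Path, name: str, top: int = 10) -> str:
    """Profile sample_workload, save <name>.stats and <name>.txt, return the report."""
    profiler = cProfile.Profile()
    profiler.enable()
    sample_workload()
    profiler.disable()

    # Binary cProfile statistics, reloadable later with pstats.
    stats_path = out_dir / f"{name}.stats"
    profiler.dump_stats(str(stats_path))

    # Text report sorted by cumulative time, captured in memory first.
    buffer = io.StringIO()
    stats = pstats.Stats(str(stats_path), stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    report = buffer.getvalue()
    (out_dir / f"{name}.txt").write_text(report, encoding="utf-8")
    return report


with tempfile.TemporaryDirectory() as tmp:
    report_text = write_stats_and_report(Path(tmp), "sample_target")
    stats_file_exists = (Path(tmp) / "sample_target.stats").exists()
```

Keeping the binary file alongside the text report means a later analysis can re-sort or filter the same run without re-executing the target.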
3. Profiling Methodology¶
3.1 Target-Based Profiling Strategy¶
The Profiling Suite organizes measurements by profiling targets, where each target represents a functional component of the BioRemPP pipeline. This approach provides:
- Interpretability: Each target maps to a logical pipeline stage
- Isolation: Targets execute independently, preventing cross-contamination of metrics
- Comparability: Metrics can be compared across runs for regression detection
Target-based profiling enables attribution of computational costs to specific operations rather than aggregating measurements across the entire application.
3.2 Deterministic Execution Model¶
Each profiling target:
- Executes in isolation with consistent input data
- Produces identical outputs given identical database versions
- Generates timestamped reports for audit trails
- Outputs structured JSON summaries for programmatic comparison
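The execution model above can be sketched as a small target registry. The target functions, the `TARGETS` mapping, and `run_targets` are hypothetical names chosen for illustration; the actual structure of `run_profiling.py` is not shown in this document.

```python
import json
from datetime import datetime
from typing import Callable, Dict


# Hypothetical targets standing in for real pipeline stages.
def target_sum() -> dict:
    return {"total": sum(range(100))}


def target_sorted() -> dict:
    return {"rows": len(sorted([3, 1, 2]))}


TARGETS: Dict[str, Callable[[], dict]] = {
    "target_sum": target_sum,
    "target_sorted": target_sorted,
}


def run_targets(targets: Dict[str, Callable[[], dict]]) -> dict:
    """Run each target on its own so one failure does not abort the others."""
    results = {}
    for name, func in targets.items():
        try:
            results[name] = {"status": "OK", "output": func()}
        except Exception as exc:  # record the failure and continue
            results[name] = {"status": "FAILED", "error": repr(exc)}
    return {
        "timestamp": datetime.now().isoformat(timespec="seconds"),
        "targets": results,
    }


first = run_targets(TARGETS)
second = run_targets(TARGETS)

# Determinism check: everything except the timestamp should match.
outputs_identical = first["targets"] == second["targets"]
summary_json = json.dumps(first, indent=2, sort_keys=True)
```

Sorting keys in the JSON output keeps summaries from different runs diff-friendly.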
3.3 Metrics Collected¶
For each profiling target, the suite collects:
| Metric | Unit | Description |
|---|---|---|
| execution_time_sec | seconds | Wall-clock time for target completion |
| memory_start_mb | MB | Process memory before execution |
| memory_end_mb | MB | Process memory after execution |
| memory_peak_mb | MB | Maximum memory during execution |
| memory_delta_mb | MB | Net memory change (end - start) |
| function_calls | count | Total function invocations |
| primitive_calls | count | Non-recursive function calls |
Together, these metrics characterize time cost (execution time and call counts) and memory cost (the four memory metrics). I/O cost is not part of this per-target metric set; it is characterized through the export file sizes reported in Section 5.4.
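A minimal sketch of how several of these metrics could be collected for one target follows. It is an illustration under stated assumptions, not the suite's implementation: `profile_target` and `fib` are hypothetical names, and the psutil-based metrics (`memory_start_mb`, `memory_end_mb`, `memory_delta_mb`) are omitted to keep the example within the standard library.

```python
import cProfile
import pstats
import time
import tracemalloc
from typing import Any, Callable, Dict, Tuple

MB = 1024 * 1024


def profile_target(func: Callable[[], Any]) -> Tuple[Any, Dict[str, float]]:
    """Run func under cProfile and tracemalloc and return (result, metrics)."""
    profiler = cProfile.Profile()
    tracemalloc.start()
    start = time.perf_counter()
    profiler.enable()
    try:
        result = func()
    finally:
        profiler.disable()
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()

    stats = pstats.Stats(profiler)
    metrics = {
        "execution_time_sec": round(elapsed, 3),
        "memory_peak_mb": round(peak_bytes / MB, 3),  # Python-level peak only
        "function_calls": stats.total_calls,          # includes recursive calls
        "primitive_calls": stats.prim_calls,          # excludes recursive calls
    }
    return result, metrics


def fib(n: int) -> int:
    """Recursive on purpose, so total and primitive call counts differ."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


result, metrics = profile_target(lambda: fib(10))
```

The recursive workload shows why the two call counters are reported separately: every recursive re-entry increases `function_calls` but not `primitive_calls`.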
4. Profiled Pipeline Targets¶
The current profiling run evaluated five targets representing critical stages of the BioRemPP data processing pipeline:
4.1 database_load¶
Pipeline stage: Initialization
Function: Load all four databases (BioRemPP, KEGG, HADEG, ToxCSM) into memory
This target characterizes the cost of loading 12,961 database records from CSV files into pandas DataFrames, including data type optimization and memory allocation.
4.2 biorempp_operations¶
Pipeline stage: Core Processing
Function: Filter, transform, and reshape BioRemPP data
This target characterizes in-memory DataFrame operations, including filtering by user criteria and transformation from wide to long format (pd.melt).
4.3 io_operations¶
Pipeline stage: Output Generation
Function: Export data to Excel and JSON formats
This target characterizes serialization costs for user-facing export operations, dominated by Excel XML generation and JSON encoding.
4.4 batch_export¶
Pipeline stage: Batch Processing
Function: Multi-format export (CSV, XLSX, JSON)
This target characterizes the cost of simultaneous export to multiple formats, reflecting batch download functionality.
4.5 data_transforms¶
Pipeline stage: Advanced Processing
Function: Normalization and aggregation operations
This target characterizes advanced analytical transformations, including pathway aggregation and normalization operations that may require scikit-learn and scipy imports.
5. Summary of Profiling Results¶
5.1 Aggregate Performance Metrics¶
Profiling Timestamp: 2026-01-17T01:21:12
| Target | Status | Time (s) | Memory Delta (MB) | Peak (MB) | Function Calls |
|---|---|---|---|---|---|
| database_load | OK | 2.884 | 84.8 | 38.4 | 443,164 |
| biorempp_operations | OK | 0.270 | -1.2 | 6.3 | 17,622 |
| io_operations | OK | 2.313 | 9.0 | 8.1 | 669,364 |
| batch_export | OK | 2.664 | 3.7 | 3.3 | 697,125 |
| data_transforms | OK | 4.367 | 71.0 | 36.7 | 727,729 |
Total Targets: 5
Successful: 5 (100%)
Total Execution Time: 12.50 seconds
Total Memory Delta: 167.3 MB (sum of per-target memory deltas)
Total Function Calls: 2,555,004
5.2 Database Loading (2.88s, 84.8 MB)¶
Loaded 12,961 records across four databases:
| Database | Records |
|---|---|
| BioRemPP | 10,869 |
| HADEG | 867 |
| KEGG | 855 |
| ToxCSM | 370 |
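Primary cost contributors: pandas import chain (2.27s cumulative), CSV parsing (0.52s), DataFrame dtype optimization (0.44s). These figures sum to more than the 2.88s target total, which indicates that they overlap: cumulative times include nested calls.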
5.3 Core Operations (0.27s, -1.2 MB)¶
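Processed 10,869 BioRemPP records, producing 76,083 long-format rows after the melt transformation (exactly seven rows per input record). The negative memory delta indicates that the process held less memory at the end of this target than at the start, most likely because garbage collection released earlier allocations.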
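The row counts above follow from how a wide-to-long melt works: each input record yields one output row per melted column. The sketch below uses pandas (a third-party dependency) with hypothetical column names; the seven-column figure is inferred from 76,083 / 10,869 = 7 and is not stated in the profiling output.

```python
import pandas as pd

# Hypothetical wide-format frame: one identifier column, two value columns.
wide = pd.DataFrame(
    {
        "sample": ["S1", "S2", "S3"],
        "col_a": [1, 0, 2],
        "col_b": [0, 1, 1],
    }
)

# Wide to long: every value column becomes (feature, value) rows.
long = pd.melt(wide, id_vars=["sample"], var_name="feature", value_name="value")

value_columns = len(wide.columns) - 1
expected_rows = len(wide) * value_columns

# The same relationship applied to the reported figures.
reported_rows_per_record = 76_083 / 10_869
```

Because the output row count is a fixed multiple of the input row count, a change in that ratio across runs would signal a schema change rather than a performance change.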
5.4 Export Operations (2.31s + 2.66s, 12.7 MB)¶
Generated multi-format exports:
| Format | Size (bytes) |
|---|---|
| CSV | 99,345 |
| Excel (.xlsx) | 44,197 |
| JSON | 260,370 |
Primary cost contributors: Excel XML generation (1.80s), cell writing operations (1.60s), JSON serialization (0.24s).
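The size differences between formats can be explored with a small experiment. The sketch below uses only the standard library `csv` and `json` modules rather than the pandas and openpyxl exporters, and the records are hypothetical. It assumes a record-oriented JSON layout, which repeats every field name in every row while CSV writes the header once; whether the BioRemPP export uses that layout is an assumption. Excel is left out because `.xlsx` needs a third-party writer; `.xlsx` files are ZIP-compressed containers, which is consistent with the Excel export being the smallest of the three.

```python
import csv
import io
import json
from typing import Dict, List

# Hypothetical records; field names are illustrative only.
rows: List[Dict[str, object]] = [
    {"ko": f"K{i:05d}", "sample": "S1", "count": i} for i in range(100)
]


def csv_bytes(records: List[Dict[str, object]]) -> bytes:
    """Serialize records to CSV with a single header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue().encode("utf-8")


def json_bytes(records: List[Dict[str, object]]) -> bytes:
    """Serialize records as a JSON array of objects (keys repeat per row)."""
    return json.dumps(records).encode("utf-8")


sizes = {"csv": len(csv_bytes(rows)), "json": len(json_bytes(rows))}
```

Recording serialized sizes alongside timings gives a cheap reproducibility check, since identical inputs should yield identical byte counts.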
5.5 Data Transformations (4.37s, 71.0 MB)¶
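Produced 71 aggregated pathway rows. Primary cost contributors: scikit-learn import chain (4.01s), scipy.stats loading (2.40s). Added together, these figures exceed the 4.37s target total, which indicates overlapping cumulative times; the scipy.stats load presumably occurs within the scikit-learn import chain.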
6. Interpretation of Computational Behavior¶
6.1 Cost Distribution¶
The profiling results characterize three distinct cost categories:
| Cost Type | Primary Contributors | Proportion |
|---|---|---|
| CPU-bound | data_transforms (sklearn/scipy imports) | 35% of total time |
| Memory-bound | database_load, data_transforms | 93% of total memory |
| I/O-bound | io_operations, batch_export | 40% of total time |
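The proportions in this table can be reproduced from the per-target figures in Section 5.1. The short calculation below is a worked check; the variable and function names are illustrative.

```python
# Per-target time (s) and memory delta (MB) from the Section 5.1 table.
targets = {
    "database_load": {"time_s": 2.884, "delta_mb": 84.8},
    "biorempp_operations": {"time_s": 0.270, "delta_mb": -1.2},
    "io_operations": {"time_s": 2.313, "delta_mb": 9.0},
    "batch_export": {"time_s": 2.664, "delta_mb": 3.7},
    "data_transforms": {"time_s": 4.367, "delta_mb": 71.0},
}

total_time = sum(t["time_s"] for t in targets.values())
total_delta = sum(t["delta_mb"] for t in targets.values())


def share(names, key, total):
    """Percentage of `total` contributed by the named targets."""
    return 100 * sum(targets[name][key] for name in names) / total


cpu_share = share(["data_transforms"], "time_s", total_time)
io_share = share(["io_operations", "batch_export"], "time_s", total_time)
memory_share = share(["database_load", "data_transforms"], "delta_mb", total_delta)
```

Note that the memory share is computed against the net sum of deltas, which includes the negative delta of biorempp_operations.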
6.2 Cost Attribution¶
Observed costs reflect:
- Library import overhead: First-time module imports dominate database_load (81%) and data_transforms (92%)
- DataFrame allocation: Memory costs are expected to scale approximately linearly with database record counts
- Serialization libraries: Excel export is dominated by openpyxl XML generation, JSON export by the standard library encoder
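The import overhead noted above is a cold-start effect: Python caches loaded modules in `sys.modules`, so only the first import pays the full cost. The sketch below illustrates this with an arbitrary standard library module; `timed_import` is a hypothetical helper, and the timings it produces depend on the machine.

```python
import importlib
import sys
import time


def timed_import(module_name: str) -> float:
    """Return the wall-clock seconds spent importing module_name."""
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


# Arbitrary stdlib module, chosen only because it is rarely preloaded.
module_name = "xml.dom.minidom"
already_loaded = module_name in sys.modules

first_import_sec = timed_import(module_name)   # cold unless already loaded
second_import_sec = timed_import(module_name)  # served from sys.modules
```

On a typical run the second call returns far faster than the first. The same effect means that import-dominated targets would look much cheaper if profiled in a process that had already loaded pandas or scikit-learn.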
6.3 Baseline Characterization¶
These results establish a computational baseline snapshot for:
- Expected memory footprint under normal operation (<85 MB per operation)
- Anticipated execution time for pipeline stages (2.3-4.4s for import- and I/O-heavy targets, about 0.3s for in-memory operations)
- Function call patterns for regression detection (about 2.56M calls across all targets)
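A baseline like this can be checked automatically against later runs. The sketch below is one possible approach, not the suite's own comparison logic: `find_regressions`, the 20% time tolerance, and the `current` values are all assumptions, while the baseline figures come from Section 5.1.

```python
from typing import Dict, List

Metrics = Dict[str, Dict[str, float]]


def find_regressions(
    baseline: Metrics, current: Metrics, time_tolerance: float = 0.20
) -> List[str]:
    """List targets that are missing, slower than allowed, or call-count drifted."""
    findings = []
    for target, base in baseline.items():
        new = current.get(target)
        if new is None:
            findings.append(f"{target}: missing from current run")
            continue
        limit = base["execution_time_sec"] * (1 + time_tolerance)
        if new["execution_time_sec"] > limit:
            findings.append(
                f"{target}: {new['execution_time_sec']:.3f}s exceeds {limit:.3f}s"
            )
        # Call counts are expected to be stable, so any change is reported.
        if new["function_calls"] != base["function_calls"]:
            findings.append(f"{target}: function_calls changed")
    return findings


baseline = {
    "database_load": {"execution_time_sec": 2.884, "function_calls": 443_164},
    "biorempp_operations": {"execution_time_sec": 0.270, "function_calls": 17_622},
}
# Hypothetical later run: one target is within tolerance, one is not.
current = {
    "database_load": {"execution_time_sec": 3.100, "function_calls": 443_164},
    "biorempp_operations": {"execution_time_sec": 0.400, "function_calls": 17_622},
}

findings = find_regressions(baseline, current)
```

Wall-clock time gets a tolerance because it varies with machine load, whereas call counts are compared exactly because they should not change unless the code or the data changes.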
7. Reproducibility and Validation Context¶
7.1 Integration with Versioned Databases¶
Profiling runs are associated with specific database versions, verified via SHA-256 checksums:
| Database | Rows | Checksum Status |
|---|---|---|
| biorempp_db.csv | 10,869 | SHA-256 validated |
| hadeg_db.csv | 867 | SHA-256 validated |
| kegg_db.csv | 855 | SHA-256 validated |
| toxcsm_db.csv | 370 | SHA-256 validated |
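Checksum validation of this kind can be sketched with `hashlib`. The helper names and the demo file below are hypothetical, and the expected digests for the real database files are not reproduced here.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: Path, expected_hex: str) -> bool:
    """Compare the file's SHA-256 digest with a recorded hex digest."""
    return sha256_of_file(path) == expected_hex.lower()


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_db.csv"  # hypothetical stand-in for a database file
    content = b"id,value\n1,a\n"
    demo.write_bytes(content)

    expected = hashlib.sha256(content).hexdigest()
    checksum_ok = verify_checksum(demo, expected)
    tamper_detected = not verify_checksum(demo, "0" * 64)
```

Storing the digest next to each profiling summary ties the recorded metrics to the exact database content that produced them.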
7.2 Deterministic Output Verification¶
The Profiling Suite ensures reproducibility through:
- Deterministic targets: Each target function produces identical outputs given identical inputs
- Timestamped reports: All outputs include generation timestamps
- Structured JSON: Enables programmatic comparison across profiling runs
- Stable call patterns: Function call counts remain consistent across runs with identical database versions
7.3 Audit Trail Support¶
Profiling data contributes to computational auditability by documenting:
- Execution time stability across runs (evidence of algorithmic consistency)
- Memory usage patterns (evidence of expected resource consumption)
- Function call counts (evidence of deterministic execution paths)
- Export file sizes (evidence of reproducible serialization)
8. Scope and Limitations¶
8.1 What Profiling Documents¶
The Profiling Suite characterizes:
- Computational cost: Time and memory for each pipeline stage under controlled conditions
- Resource allocation: Memory footprint of data structures
- I/O throughput: Serialization performance for supported export formats
- Function call patterns: Internal execution traces for reproducibility verification
8.2 What Profiling Does Not Validate¶
The Profiling Suite explicitly excludes:
- Biological accuracy: Correctness of KO annotations, pathway mappings, or toxicity predictions is validated separately through biological validation procedures
- Experimental validation: Profiling does not replace wet-lab validation of bioremediation predictions
- Production-scale concurrency: Profiling measurements are collected under single-user, sequential execution conditions
- Comparative benchmarking: No comparisons with other bioinformatics tools are performed; profiling is internal-only
- Predictive accuracy: Machine learning model performance is outside the scope of computational profiling
8.3 Snapshot-Based Nature¶
Profiling results represent a computational snapshot under specific conditions:
- Single-threaded execution (no parallelism)
- Controlled environment (development machine, not production server)
- Cold-start conditions (no warm caches)
- Standard database sizes (12,961 records total)
Results establish expected computational behavior baselines but do not guarantee performance under all deployment scenarios.
8.4 Interpretation Guidelines¶
Profiling results should be interpreted as:
- Baseline characterization: Establishing normal performance expectations
- Regression detection: Identifying performance changes across software versions
- Transparency documentation: Providing reviewers with computational behavior evidence
Profiling results should not be interpreted as:
- Validation of scientific correctness
- Guarantee of performance under production load
- Comparative advantage claims against alternative tools
References¶
Profiling Infrastructure¶
| Component | Version | Purpose |
|---|---|---|
| Python cProfile | stdlib | CPU profiling |
| tracemalloc | stdlib | Memory allocation tracking |
| psutil | >= 5.9.0 | Process memory measurement |
| pstats | stdlib | Statistics analysis and reporting |
Report Artifacts¶
All profiling outputs are available in profiling_biorempp/reports/:
- profiling_summary_20260117_012112.json: Structured metrics (JSON)
- profiling_report_20260117_012112.md: Human-readable report
- database_load.txt: Function-level analysis for database loading
- biorempp_operations.txt: Core operations analysis
- io_operations.txt: Export operations analysis
- batch_export.txt: Multi-format export analysis
- data_transforms.txt: Transformation analysis