Computational Performance Profiling¶

Version: 1.0.0
Profiling Suite Version: v1.0 Last Profiling Run: 2026-01-17

1. Purpose of Computational Profiling¶

1.1 Rationale¶

Computational profiling characterizes the runtime behavior of the BioRemPP web service as part of its internal validation and reproducibility framework.

Profiling provides empirical evidence of:

Resource consumption: CPU time, memory allocation, and I/O throughput for each pipeline stage
Computational consistency: Deterministic function call patterns and stable memory usage across runs
Performance baselines: Reference metrics for regression detection during software updates

1.2 Relationship to Internal Validation¶

Profiling complements biological validation by providing computational reproducibility evidence:

Validation Type	Scope	Evidence
Biological Validation	Data accuracy	Database content verification against source databases
Computational Profiling	Performance consistency	Execution metrics, resource usage, function call patterns
Reproducibility	Cross-run stability	Deterministic outputs, versioned databases, checksummed files

Profiling data is collected alongside database checksums and validation snapshots, enabling complete audit trails for computational behavior. This integrated approach supports the FAIR principles by documenting the computational transparency required for reproducible bioinformatics analyses.

2. Profiling Suite Overview¶

BioRemPP utilizes a dedicated Profiling Suite for systematic performance characterization. The suite is designed as a modular, target-based profiling framework that characterizes specific functional components of the data processing pipeline.

2.1 Suite Architecture¶

The Profiling Suite is implemented in profiling_biorempp/scripts/run_profiling.py and generates structured reports in profiling_biorempp/reports/. Suite components include:

profiling_biorempp/
├── scripts/
│   └── run_profiling.py      # Core profiling engine
└── reports/
    ├── *.stats               # Binary cProfile statistics
    ├── *.txt                 # Function call reports
    ├── profiling_summary_*.json    # Structured metrics
    └── profiling_report_*.md       # Documentation-ready report

2.2 Instrumentation Stack¶

The suite employs Python standard library profiling tools:

cProfile: Deterministic CPU profiling with cumulative time sorting
tracemalloc: Memory allocation tracking for peak usage analysis
psutil: Process-level memory monitoring for resource characterization

This instrumentation stack provides comprehensive performance visibility without introducing external dependencies or instrumentation overhead.

3. Profiling Methodology¶

3.1 Target-Based Profiling Strategy¶

The Profiling Suite organizes measurements by profiling targets, where each target represents a functional component of the BioRemPP pipeline. This approach provides:

Interpretability: Each target maps to a logical pipeline stage
Isolation: Targets execute independently, preventing cross-contamination of metrics
Comparability: Metrics can be compared across runs for regression detection

Target-based profiling enables attribution of computational costs to specific operations rather than aggregating measurements across the entire application.

3.2 Deterministic Execution Model¶

Each profiling target:

Executes in isolation with consistent input data
Produces identical outputs given identical database versions
Generates timestamped reports for audit trails
Outputs structured JSON summaries for programmatic comparison

3.3 Metrics Collected¶

For each profiling target, the suite collects:

Metric	Unit	Description
`execution_time_sec`	seconds	Wall-clock time for target completion
`memory_start_mb`	MB	Process memory before execution
`memory_end_mb`	MB	Process memory after execution
`memory_peak_mb`	MB	Maximum memory during execution
`memory_delta_mb`	MB	Net memory change (end - start)
`function_calls`	count	Total function invocations
`primitive_calls`	count	Non-recursive function calls

These metrics collectively characterize time complexity (via execution time and call counts), space complexity (via memory metrics), and I/O complexity (via file sizes and throughput).

4. Profiled Pipeline Targets¶

The current profiling run evaluated five targets representing critical stages of the BioRemPP data processing pipeline:

4.1 database_load¶

Pipeline stage: Initialization
Function: Load all four databases (BioRemPP, KEGG, HADEG, ToxCSM) into memory

This target characterizes the cost of loading 12,961 database records from CSV files into pandas DataFrames, including data type optimization and memory allocation.

4.2 biorempp_operations¶

Pipeline stage: Core Processing
Function: Filter, transform, and reshape BioRemPP data

This target characterizes in-memory DataFrame operations, including filtering by user criteria and transformation from wide to long format (pd.melt).

4.3 io_operations¶

Pipeline stage: Output Generation
Function: Export data to Excel and JSON formats

This target characterizes serialization costs for user-facing export operations, dominated by Excel XML generation and JSON encoding.

4.4 batch_export¶

Pipeline stage: Batch Processing
Function: Multi-format export (CSV, XLSX, JSON)

This target characterizes the cost of simultaneous export to multiple formats, reflecting batch download functionality.

4.5 data_transforms¶

Pipeline stage: Advanced Processing
Function: Normalization and aggregation operations

This target characterizes advanced analytical transformations, including pathway aggregation and normalization operations that may require scikit-learn and scipy imports.

5. Summary of Profiling Results¶

5.1 Aggregate Performance Metrics¶

Profiling Timestamp: 2026-01-17T01:21:12

Target	Status	Time (s)	Memory Delta (MB)	Peak (MB)	Function Calls
database_load	OK	2.884	84.8	38.4	443,164
biorempp_operations	OK	0.270	-1.2	6.3	17,622
io_operations	OK	2.313	9.0	8.1	669,364
batch_export	OK	2.664	3.7	3.3	697,125
data_transforms	OK	4.367	71.0	36.7	727,729

Total Targets: 5
Successful: 5 (100%)
Total Execution Time: 12.50 seconds
Total Memory Allocated: 167.3 MB
Total Function Calls: 2,555,004

5.2 Database Loading (2.88s, 84.8 MB)¶

Loaded 12,961 records across four databases:

Database	Records
BioRemPP	10,869
HADEG	867
KEGG	855
ToxCSM	370

Primary cost contributors: pandas import chain (2.27s cumulative), CSV parsing (0.52s), DataFrame dtype optimization (0.44s).

5.3 Core Operations (0.27s, -1.2 MB)¶

Processed 10,869 BioRemPP records, producing 76,083 long-format rows after melt transformation. Negative memory delta indicates memory release after garbage collection.

5.4 Export Operations (2.31s + 2.66s, 12.7 MB)¶

Generated multi-format exports:

Format	Size (bytes)
CSV	99,345
Excel (.xlsx)	44,197
JSON	260,370

Primary cost contributors: Excel XML generation (1.80s), cell writing operations (1.60s), JSON serialization (0.24s).

5.5 Data Transformations (4.37s, 71.0 MB)¶

Produced 71 aggregated pathway rows. Primary cost contributors: scikit-learn import chain (4.01s), scipy.stats loading (2.40s).

6. Interpretation of Computational Behavior¶

6.1 Cost Distribution¶

The profiling results characterize three distinct cost categories:

Cost Type	Primary Contributors	Proportion
CPU-bound	`data_transforms` (sklearn/scipy imports)	35% of total time
Memory-bound	`database_load`, `data_transforms`	93% of total memory
I/O-bound	`io_operations`, `batch_export`	40% of total time

6.2 Cost Attribution¶

Observed costs reflect:

Library import overhead: First-time module imports dominate database_load (81%) and data_transforms (92%)
DataFrame allocation: Memory costs scale linearly with database record counts
Serialization libraries: Excel export dominated by openpyxl XML generation, JSON export by standard library encoder

6.3 Baseline Characterization¶

These results establish a computational baseline snapshot for:

Expected memory footprint under normal operation (<85 MB per operation)
Anticipated execution time for pipeline stages (2.3-4.4s for I/O-heavy operations)
Function call patterns for regression detection (2.5M calls across all targets)

7. Reproducibility and Validation Context¶

7.1 Integration with Versioned Databases¶

Profiling runs are associated with specific database versions, verified via SHA-256 checksums:

Database	Rows	Checksum Status
biorempp_db.csv	10,869	SHA-256 validated
hadeg_db.csv	867	SHA-256 validated
kegg_db.csv	855	SHA-256 validated
toxcsm_db.csv	370	SHA-256 validated

7.2 Deterministic Output Verification¶

The Profiling Suite ensures reproducibility through:

Deterministic targets: Each target function produces identical outputs given identical inputs
Timestamped reports: All outputs include generation timestamps
Structured JSON: Enables programmatic comparison across profiling runs
Stable call patterns: Function call counts remain consistent across runs with identical database versions

7.3 Audit Trail Support¶

Profiling data contributes to computational auditability by documenting:

Execution time stability across runs (evidence of algorithmic consistency)
Memory usage patterns (evidence of expected resource consumption)
Function call counts (evidence of deterministic execution paths)
Export file sizes (evidence of reproducible serialization)

8. Scope and Limitations¶

8.1 What Profiling Documents¶

The Profiling Suite characterizes:

Computational cost: Time and memory for each pipeline stage under controlled conditions
Resource allocation: Memory footprint of data structures
I/O throughput: Serialization performance for supported export formats
Function call patterns: Internal execution traces for reproducibility verification

8.2 What Profiling Does Not Validate¶

The Profiling Suite explicitly excludes:

Biological accuracy: Correctness of KO annotations, pathway mappings, or toxicity predictions is validated separately through biological validation procedures
Experimental validation: Profiling does not replace wet-lab validation of bioremediation predictions
Production-scale concurrency: Profiling measurements are collected under single-user, sequential execution conditions
Comparative benchmarking: No comparisons with other bioinformatics tools are performed; profiling is internal-only
Predictive accuracy: Machine learning model performance is outside the scope of computational profiling

8.3 Snapshot-Based Nature¶

Profiling results represent a computational snapshot under specific conditions:

Single-threaded execution (no parallelism)
Controlled environment (development machine, not production server)
Cold-start conditions (no warm caches)
Standard database sizes (12,961 records total)

Results establish expected computational behavior baselines but do not guarantee performance under all deployment scenarios.

8.4 Interpretation Guidelines¶

Profiling results should be interpreted as:

Baseline characterization: Establishing normal performance expectations
Regression detection: Identifying performance changes across software versions
Transparency documentation: Providing reviewers with computational behavior evidence

Profiling results should not be interpreted as:

Validation of scientific correctness
Guarantee of performance under production load
Comparative advantage claims against alternative tools

References¶

Profiling Infrastructure¶

Component	Version	Purpose
Python cProfile	stdlib	CPU profiling
tracemalloc	stdlib	Memory allocation tracking
psutil	>= 5.9.0	Process memory measurement
pstats	stdlib	Statistics analysis and reporting

Report Artifacts¶

All profiling outputs are available in profiling_biorempp/reports/:

profiling_summary_20260117_012112.json – Structured metrics (JSON)
profiling_report_20260117_012112.md – Human-readable report
database_load.txt – Function-level analysis for database loading
biorempp_operations.txt – Core operations analysis
io_operations.txt – Export operations analysis
batch_export.txt – Multi-format export analysis
data_transforms.txt – Transformation analysis