Skip to content

Computational Performance Profiling

Version: 1.0.0
Profiling Suite Version: v1.0 Last Profiling Run: 2026-01-17


1. Purpose of Computational Profiling

1.1 Rationale

Computational profiling characterizes the runtime behavior of the BioRemPP web service as part of its internal validation and reproducibility framework.

Profiling provides empirical evidence of:

  • Resource consumption: CPU time, memory allocation, and I/O throughput for each pipeline stage
  • Computational consistency: Deterministic function call patterns and stable memory usage across runs
  • Performance baselines: Reference metrics for regression detection during software updates

1.2 Relationship to Internal Validation

Profiling complements biological validation by providing computational reproducibility evidence:

Validation Type Scope Evidence
Biological Validation Data accuracy Database content verification against source databases
Computational Profiling Performance consistency Execution metrics, resource usage, function call patterns
Reproducibility Cross-run stability Deterministic outputs, versioned databases, checksummed files

Profiling data is collected alongside database checksums and validation snapshots, enabling complete audit trails for computational behavior. This integrated approach supports the FAIR principles by documenting the computational transparency required for reproducible bioinformatics analyses.


2. Profiling Suite Overview

BioRemPP utilizes a dedicated Profiling Suite for systematic performance characterization. The suite is designed as a modular, target-based profiling framework that characterizes specific functional components of the data processing pipeline.

2.1 Suite Architecture

The Profiling Suite is implemented in profiling_biorempp/scripts/run_profiling.py and generates structured reports in profiling_biorempp/reports/. Suite components include:

profiling_biorempp/
├── scripts/
│   └── run_profiling.py      # Core profiling engine
└── reports/
    ├── *.stats               # Binary cProfile statistics
    ├── *.txt                 # Function call reports
    ├── profiling_summary_*.json    # Structured metrics
    └── profiling_report_*.md       # Documentation-ready report

2.2 Instrumentation Stack

The suite employs Python standard library profiling tools:

  • cProfile: Deterministic CPU profiling with cumulative time sorting
  • tracemalloc: Memory allocation tracking for peak usage analysis
  • psutil: Process-level memory monitoring for resource characterization

This instrumentation stack provides comprehensive performance visibility without introducing external dependencies or instrumentation overhead.


3. Profiling Methodology

3.1 Target-Based Profiling Strategy

The Profiling Suite organizes measurements by profiling targets, where each target represents a functional component of the BioRemPP pipeline. This approach provides:

  • Interpretability: Each target maps to a logical pipeline stage
  • Isolation: Targets execute independently, preventing cross-contamination of metrics
  • Comparability: Metrics can be compared across runs for regression detection

Target-based profiling enables attribution of computational costs to specific operations rather than aggregating measurements across the entire application.

3.2 Deterministic Execution Model

Each profiling target:

  • Executes in isolation with consistent input data
  • Produces identical outputs given identical database versions
  • Generates timestamped reports for audit trails
  • Outputs structured JSON summaries for programmatic comparison

3.3 Metrics Collected

For each profiling target, the suite collects:

Metric Unit Description
execution_time_sec seconds Wall-clock time for target completion
memory_start_mb MB Process memory before execution
memory_end_mb MB Process memory after execution
memory_peak_mb MB Maximum memory during execution
memory_delta_mb MB Net memory change (end - start)
function_calls count Total function invocations
primitive_calls count Non-recursive function calls

These metrics collectively characterize time complexity (via execution time and call counts), space complexity (via memory metrics), and I/O complexity (via file sizes and throughput).


4. Profiled Pipeline Targets

The current profiling run evaluated five targets representing critical stages of the BioRemPP data processing pipeline:

4.1 database_load

Pipeline stage: Initialization
Function: Load all four databases (BioRemPP, KEGG, HADEG, ToxCSM) into memory

This target characterizes the cost of loading 12,961 database records from CSV files into pandas DataFrames, including data type optimization and memory allocation.

4.2 biorempp_operations

Pipeline stage: Core Processing
Function: Filter, transform, and reshape BioRemPP data

This target characterizes in-memory DataFrame operations, including filtering by user criteria and transformation from wide to long format (pd.melt).

4.3 io_operations

Pipeline stage: Output Generation
Function: Export data to Excel and JSON formats

This target characterizes serialization costs for user-facing export operations, dominated by Excel XML generation and JSON encoding.

4.4 batch_export

Pipeline stage: Batch Processing
Function: Multi-format export (CSV, XLSX, JSON)

This target characterizes the cost of simultaneous export to multiple formats, reflecting batch download functionality.

4.5 data_transforms

Pipeline stage: Advanced Processing
Function: Normalization and aggregation operations

This target characterizes advanced analytical transformations, including pathway aggregation and normalization operations that may require scikit-learn and scipy imports.


5. Summary of Profiling Results

5.1 Aggregate Performance Metrics

Profiling Timestamp: 2026-01-17T01:21:12

Target Status Time (s) Memory Delta (MB) Peak (MB) Function Calls
database_load OK 2.884 84.8 38.4 443,164
biorempp_operations OK 0.270 -1.2 6.3 17,622
io_operations OK 2.313 9.0 8.1 669,364
batch_export OK 2.664 3.7 3.3 697,125
data_transforms OK 4.367 71.0 36.7 727,729

Total Targets: 5
Successful: 5 (100%)
Total Execution Time: 12.50 seconds
Total Memory Allocated: 167.3 MB
Total Function Calls: 2,555,004

5.2 Database Loading (2.88s, 84.8 MB)

Loaded 12,961 records across four databases:

Database Records
BioRemPP 10,869
HADEG 867
KEGG 855
ToxCSM 370

Primary cost contributors: pandas import chain (2.27s cumulative), CSV parsing (0.52s), DataFrame dtype optimization (0.44s).

5.3 Core Operations (0.27s, -1.2 MB)

Processed 10,869 BioRemPP records, producing 76,083 long-format rows after melt transformation. Negative memory delta indicates memory release after garbage collection.

5.4 Export Operations (2.31s + 2.66s, 12.7 MB)

Generated multi-format exports:

Format Size (bytes)
CSV 99,345
Excel (.xlsx) 44,197
JSON 260,370

Primary cost contributors: Excel XML generation (1.80s), cell writing operations (1.60s), JSON serialization (0.24s).

5.5 Data Transformations (4.37s, 71.0 MB)

Produced 71 aggregated pathway rows. Primary cost contributors: scikit-learn import chain (4.01s), scipy.stats loading (2.40s).


6. Interpretation of Computational Behavior

6.1 Cost Distribution

The profiling results characterize three distinct cost categories:

Cost Type Primary Contributors Proportion
CPU-bound data_transforms (sklearn/scipy imports) 35% of total time
Memory-bound database_load, data_transforms 93% of total memory
I/O-bound io_operations, batch_export 40% of total time

6.2 Cost Attribution

Observed costs reflect:

  • Library import overhead: First-time module imports dominate database_load (81%) and data_transforms (92%)
  • DataFrame allocation: Memory costs scale linearly with database record counts
  • Serialization libraries: Excel export dominated by openpyxl XML generation, JSON export by standard library encoder

6.3 Baseline Characterization

These results establish a computational baseline snapshot for:

  • Expected memory footprint under normal operation (<85 MB per operation)
  • Anticipated execution time for pipeline stages (2.3-4.4s for I/O-heavy operations)
  • Function call patterns for regression detection (2.5M calls across all targets)

7. Reproducibility and Validation Context

7.1 Integration with Versioned Databases

Profiling runs are associated with specific database versions, verified via SHA-256 checksums:

Database Rows Checksum Status
biorempp_db.csv 10,869 SHA-256 validated
hadeg_db.csv 867 SHA-256 validated
kegg_db.csv 855 SHA-256 validated
toxcsm_db.csv 370 SHA-256 validated

7.2 Deterministic Output Verification

The Profiling Suite ensures reproducibility through:

  1. Deterministic targets: Each target function produces identical outputs given identical inputs
  2. Timestamped reports: All outputs include generation timestamps
  3. Structured JSON: Enables programmatic comparison across profiling runs
  4. Stable call patterns: Function call counts remain consistent across runs with identical database versions

7.3 Audit Trail Support

Profiling data contributes to computational auditability by documenting:

  • Execution time stability across runs (evidence of algorithmic consistency)
  • Memory usage patterns (evidence of expected resource consumption)
  • Function call counts (evidence of deterministic execution paths)
  • Export file sizes (evidence of reproducible serialization)

8. Scope and Limitations

8.1 What Profiling Documents

The Profiling Suite characterizes:

  • Computational cost: Time and memory for each pipeline stage under controlled conditions
  • Resource allocation: Memory footprint of data structures
  • I/O throughput: Serialization performance for supported export formats
  • Function call patterns: Internal execution traces for reproducibility verification

8.2 What Profiling Does Not Validate

The Profiling Suite explicitly excludes:

  • Biological accuracy: Correctness of KO annotations, pathway mappings, or toxicity predictions is validated separately through biological validation procedures
  • Experimental validation: Profiling does not replace wet-lab validation of bioremediation predictions
  • Production-scale concurrency: Profiling measurements are collected under single-user, sequential execution conditions
  • Comparative benchmarking: No comparisons with other bioinformatics tools are performed; profiling is internal-only
  • Predictive accuracy: Machine learning model performance is outside the scope of computational profiling

8.3 Snapshot-Based Nature

Profiling results represent a computational snapshot under specific conditions:

  • Single-threaded execution (no parallelism)
  • Controlled environment (development machine, not production server)
  • Cold-start conditions (no warm caches)
  • Standard database sizes (12,961 records total)

Results establish expected computational behavior baselines but do not guarantee performance under all deployment scenarios.

8.4 Interpretation Guidelines

Profiling results should be interpreted as:

  • Baseline characterization: Establishing normal performance expectations
  • Regression detection: Identifying performance changes across software versions
  • Transparency documentation: Providing reviewers with computational behavior evidence

Profiling results should not be interpreted as:

  • Validation of scientific correctness
  • Guarantee of performance under production load
  • Comparative advantage claims against alternative tools

References

Profiling Infrastructure

Component Version Purpose
Python cProfile stdlib CPU profiling
tracemalloc stdlib Memory allocation tracking
psutil >= 5.9.0 Process memory measurement
pstats stdlib Statistics analysis and reporting

Report Artifacts

All profiling outputs are available in profiling_biorempp/reports/:

  • profiling_summary_20260117_012112.json – Structured metrics (JSON)
  • profiling_report_20260117_012112.md – Human-readable report
  • database_load.txt – Function-level analysis for database loading
  • biorempp_operations.txt – Core operations analysis
  • io_operations.txt – Export operations analysis
  • batch_export.txt – Multi-format export analysis
  • data_transforms.txt – Transformation analysis