BioRemPP Unit Test Suite: Internal Validation and Quality Assurance

Version: 1.0.0
Test Framework: pytest
Total Test Modules: 53


1. Unit Testing Strategy Overview

1.1 Role of Unit Testing

As a web service that processes biological data, BioRemPP faces several software quality challenges:

  • Data integrity requirements: Biological data must be processed without corruption or silent transformation errors
  • Reproducibility expectations: Identical inputs must produce identical outputs across software versions
  • Reliability under varied inputs: The system must handle diverse input formats and edge cases gracefully
  • Long-term maintainability: Codebases must evolve while preserving existing functionality

Unit tests address these challenges by automatically verifying component behavior, so developers can detect regressions as soon as the suite runs against modified code.

1.2 Integration with Development Lifecycle

Unit tests in BioRemPP function as continuous validation mechanisms throughout the development lifecycle:

  • Pre-commit verification: Tests execute before code integration to prevent defect introduction
  • Regression detection: Existing tests identify unintended behavioral changes during refactoring
  • Documentation: Test cases serve as executable specifications of expected component behavior
  • Confidence building: Comprehensive test coverage supports confident deployment of updates

2. Scope of the BioRemPP Unit Test Suite

2.1 Suite Composition

BioRemPP implements a structured unit test suite comprising 53 test modules organized by architectural responsibility. The suite provides coverage across multiple layers of the application architecture, ensuring that both business logic and infrastructure components operate correctly.

The test organization follows the application's layered architecture:

tests/unit/
├── application/           # Application layer tests
│   ├── core/             # Core processing components
│   ├── dto/              # Data transfer objects
│   ├── mappers/          # Data mapping utilities
│   ├── plot_services/    # Visualization services
│   └── services/         # Application services
├── domain/               # Domain layer tests
│   ├── entities/         # Domain entities
│   ├── value_objects/    # Value objects
│   ├── services/         # Domain services
│   └── plot_strategies/  # Visualization strategies
└── infrastructure/       # Infrastructure layer tests
    ├── cache/            # Caching components
    ├── config/           # Configuration management
    └── persistence/      # Data access repositories

2.2 Test Scope by Layer

The unit test suite employs a layer-based testing strategy where each architectural layer receives dedicated test coverage:

| Layer          | Responsibility              | Test Focus                               |
|----------------|-----------------------------|------------------------------------------|
| Domain         | Business rules and entities | Validation, invariants, behavior         |
| Application    | Use cases and orchestration | Coordination, data flow, transformations |
| Infrastructure | Technical concerns          | Persistence, caching, configuration      |

This separation ensures that tests remain focused, maintainable, and aligned with the Single Responsibility Principle.


3. Test Coverage by Architectural Layer

3.1 Domain Layer

The domain layer encapsulates the core business logic of BioRemPP, including entities representing biological concepts, value objects enforcing data constraints, and services implementing domain rules.

Entities

Domain entity tests validate:

  • Dataset: Collection management for biological samples, including addition, removal, retrieval, and validation of samples containing KEGG Orthology annotations
  • Sample: Individual biological sample representation with KO list management and sample-level operations

Entity tests ensure that business rules are enforced at the object level, preventing invalid states from propagating through the system.

Value Objects

Value object tests verify:

  • KEGG Orthology (KO): Validation of KO identifier format (K followed by 5 digits), immutability guarantees, and equality semantics
  • SampleId: Sample identifier validation, format enforcement, and identity operations

Value objects serve as the foundation of type safety in the domain model, and their tests verify that invalid identifiers are rejected at construction, before they can enter the system.
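
The three properties listed for KO identifiers (format, immutability, equality) can be illustrated with a minimal sketch. The class below is hypothetical and is not BioRemPP's actual implementation; it assumes a frozen dataclass and a regular expression for "K followed by 5 digits". The test functions use plain assertions so they run with or without pytest.

```python
import re
from dataclasses import FrozenInstanceError, dataclass

# Assumed pattern: the letter "K" followed by exactly five ASCII digits.
_KO_PATTERN = re.compile(r"K[0-9]{5}")


@dataclass(frozen=True)
class KO:
    """Illustrative value object (hypothetical, not the project's class)."""

    value: str

    def __post_init__(self) -> None:
        # Reject anything that is not a string matching the full pattern.
        if not isinstance(self.value, str) or not _KO_PATTERN.fullmatch(self.value):
            raise ValueError(f"Invalid KO identifier: {self.value!r}")


def test_valid_ko_is_accepted():
    assert KO("K00001").value == "K00001"


def test_invalid_formats_are_rejected():
    for raw in ("K0001", "k00001", "K000011", "KO0001", ""):
        try:
            KO(raw)
        except ValueError:
            continue
        raise AssertionError(f"{raw!r} should have been rejected")


def test_ko_is_immutable():
    ko = KO("K00001")
    try:
        ko.value = "K00002"
    except FrozenInstanceError:
        return
    raise AssertionError("assignment should fail on a frozen dataclass")


def test_equality_is_by_value():
    assert KO("K00001") == KO("K00001")
    # Equal value objects hash alike, so a set keeps only one of them.
    assert len({KO("K00001"), KO("K00001")}) == 1
```

Validating in the constructor means no code path can hold a malformed identifier, which is why a handful of small tests is enough to cover the whole contract.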

Domain Services

Domain service tests cover:

  • ValidationService: Input validation rules for uploaded data, format verification, and constraint enforcement

These tests verify that domain services correctly implement business rules independently of infrastructure concerns.

Visualization Strategies

A comprehensive suite of 19 visualization strategy tests validates the correctness of chart generation logic:

  • Statistical visualizations (Heatmap, Correlogram, PCA, Hierarchical Clustering)
  • Distribution visualizations (Bar Chart, Stacked Bar, Box-Scatter, Density Plot)
  • Relationship visualizations (Network, Chord, Sankey)
  • Hierarchical visualizations (Treemap, Sunburst)
  • Comparative visualizations (Radar Chart, UpSet Plot, Dot Plot)
  • Matrix visualizations (Faceted Heatmap, Heatmap Scored, FrozenSet)

Each strategy test validates data processing logic, figure generation, configuration handling, and edge case behavior.

3.2 Infrastructure Layer: Configuration and Dependency Management

Configuration and dependency management tests prevent silent failures that could compromise system reliability.

Configuration Tests

  • Settings: Environment-specific configuration loading, default value handling, and production mode enforcement
  • DatabaseConfig: Database path resolution and configuration consistency across environments

Dependency Injection Tests

  • DIContainer: Singleton registration and resolution, factory pattern support, type registration, and dependency chain resolution
  • AnalysisRegistry: Service registration, analysis type mapping, and resolution correctness

These tests ensure that the application initializes correctly and that dependencies are wired properly, preventing runtime failures due to misconfiguration.
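
As a rough illustration of what singleton, factory, and dependency-chain tests check, consider the minimal container below. It is a sketch under assumptions: the `register` and `resolve` method names and the string keys are hypothetical and may differ from BioRemPP's DIContainer.

```python
from typing import Any, Callable, Dict

Factory = Callable[["DIContainer"], Any]


class DIContainer:
    """Hypothetical minimal container used only for illustration."""

    def __init__(self) -> None:
        self._factories: Dict[str, Factory] = {}
        self._shared: Dict[str, bool] = {}
        self._instances: Dict[str, Any] = {}

    def register(self, name: str, factory: Factory, singleton: bool = False) -> None:
        self._factories[name] = factory
        self._shared[name] = singleton
        self._instances.pop(name, None)  # drop any stale singleton instance

    def resolve(self, name: str) -> Any:
        if name not in self._factories:
            raise KeyError(f"No registration for {name!r}")
        if not self._shared[name]:
            return self._factories[name](self)  # new object on every call
        if name not in self._instances:
            self._instances[name] = self._factories[name](self)
        return self._instances[name]


def test_singleton_is_created_once():
    container = DIContainer()
    container.register("settings", lambda c: {"env": "test"}, singleton=True)
    assert container.resolve("settings") is container.resolve("settings")


def test_factory_creates_a_new_instance_each_time():
    container = DIContainer()
    container.register("parser", lambda c: object())
    first, second = container.resolve("parser"), container.resolve("parser")
    assert first is not second


def test_dependency_chain_is_resolved():
    container = DIContainer()
    container.register("settings", lambda c: {"db_path": "data/"}, singleton=True)
    container.register("repository", lambda c: ("repo", c.resolve("settings")["db_path"]))
    assert container.resolve("repository") == ("repo", "data/")


def test_unknown_name_fails_loudly():
    try:
        DIContainer().resolve("missing")
    except KeyError:
        return
    raise AssertionError("resolving an unregistered name should raise KeyError")
```

The last test is the one that guards against silent failures: a missing registration surfaces as an explicit error instead of a `None` that fails later.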

3.3 Infrastructure Layer: Persistence and Repositories

Repository tests validate data access contracts without requiring external database connections.

Repository Tests

  • BioRemPPRepository: Access to the primary bioremediation potential database
  • KEGGRepository: KEGG pathway database access and data transformation
  • HADEGRepository: Hydrocarbon Aerobic Degradation database operations
  • ToxCSMRepository: Toxicity prediction database access
  • CSVDatabaseRepository: Base repository behavior for CSV-based data sources

Repository tests verify:

  • Correct loading of database files
  • Data transformation and type optimization
  • Error handling for missing or malformed data
  • Consistency of returned data structures

Tests employ isolation techniques to avoid dependencies on external systems, ensuring fast and reliable execution.
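
One possible isolation technique is to inject the function that opens the file, so a test can supply in-memory text instead of touching disk. The repository class, its `opener` parameter, and the column names below are assumptions made for this sketch, not BioRemPP's actual API.

```python
import csv
import io
from typing import Callable, Dict, List, Sequence, TextIO


class CSVRepository:
    """Hypothetical CSV-backed repository with an injectable opener."""

    def __init__(
        self,
        path: str,
        required_columns: Sequence[str],
        opener: Callable[[str], TextIO] = open,
    ) -> None:
        self._path = path
        self._required = list(required_columns)
        self._opener = opener

    def load(self) -> List[Dict[str, str]]:
        with self._opener(self._path) as handle:
            reader = csv.DictReader(handle)
            present = reader.fieldnames or []
            missing = [col for col in self._required if col not in present]
            if missing:
                raise ValueError(f"Missing columns: {missing}")
            return list(reader)


def make_opener(text: str) -> Callable[[str], TextIO]:
    """Return an opener that ignores the path and serves in-memory text."""
    return lambda path: io.StringIO(text)


def test_rows_are_loaded_from_in_memory_csv():
    repo = CSVRepository("unused.csv", ["ko", "compound"],
                         opener=make_opener("ko,compound\nK00001,benzene\n"))
    assert repo.load() == [{"ko": "K00001", "compound": "benzene"}]


def test_malformed_file_is_reported():
    repo = CSVRepository("unused.csv", ["ko", "compound"],
                         opener=make_opener("ko\nK00001\n"))
    try:
        repo.load()
    except ValueError:
        return
    raise AssertionError("a missing column should raise ValueError")


def test_missing_file_error_propagates():
    def failing_opener(path: str) -> TextIO:
        raise FileNotFoundError(path)

    repo = CSVRepository("absent.csv", ["ko"], opener=failing_opener)
    try:
        repo.load()
    except FileNotFoundError:
        return
    raise AssertionError("a missing file should raise FileNotFoundError")
```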

3.4 Infrastructure Layer: Cache and Performance Support

Cache tests validate the correctness of performance optimization components.

Cache Component Tests

  • MemoryCache: In-memory caching with TTL support, LRU eviction policy, size limits, and statistics tracking
  • DataFrameCache: Specialized caching for pandas DataFrame objects with serialization handling
  • GraphCache: Caching support for network graph structures and computed visualizations

Cache tests ensure that:

  • Cached values are stored and retrieved correctly
  • Expiration policies function as specified
  • Eviction occurs correctly when size limits are reached
  • Cache statistics accurately reflect operations

These components directly impact web service performance and user experience, making their correctness critical.
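
The four checks above can be sketched against a small cache with an injectable clock, which keeps expiration tests deterministic. This is an illustrative stand-in: the constructor parameters, the `hits` and `misses` counters, and the `FakeClock` helper are assumptions, not the interface of BioRemPP's MemoryCache.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable


class MemoryCache:
    """Hypothetical TTL + LRU cache used only for illustration."""

    def __init__(self, max_size: int, ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._max_size, self._ttl, self._clock = max_size, ttl, clock
        self._data: "OrderedDict[Hashable, tuple]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def set(self, key: Hashable, value: Any) -> None:
        if key in self._data:
            self._data.pop(key)
        elif len(self._data) >= self._max_size:
            self._data.popitem(last=False)  # evict the least recently used entry
        self._data[key] = (self._clock() + self._ttl, value)

    def get(self, key: Hashable, default: Any = None) -> Any:
        entry = self._data.get(key)
        if entry is None:
            self.misses += 1
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries count as misses
            self.misses += 1
            return default
        self._data.move_to_end(key)  # mark as most recently used
        self.hits += 1
        return value


class FakeClock:
    """Manually advanced clock so tests never sleep."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def test_entry_expires_after_ttl_and_stats_are_tracked():
    clock = FakeClock()
    cache = MemoryCache(max_size=2, ttl=10.0, clock=clock)
    cache.set("a", 1)
    assert cache.get("a") == 1
    clock.advance(10.0)
    assert cache.get("a") is None
    assert (cache.hits, cache.misses) == (1, 1)


def test_least_recently_used_entry_is_evicted():
    cache = MemoryCache(max_size=2, ttl=100.0, clock=FakeClock())
    cache.set("a", 1)
    cache.set("b", 2)
    cache.get("a")       # "b" is now the least recently used key
    cache.set("c", 3)    # exceeds max_size, so "b" is evicted
    assert cache.get("b") is None
    assert cache.get("a") == 1
    assert cache.get("c") == 3
```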

3.5 Application Layer

Application layer tests validate use case implementations and service orchestration.

Core Processing Tests

  • DataProcessor: Pipeline orchestration including cache checking, database merging, progress tracking, and result preparation
  • SampleParser: Input parsing logic for various file formats
  • UploadHandler: File upload processing and validation
  • ResultExporter: Export functionality for CSV, Excel, and JSON formats

Data Transfer Object Tests

  • MergedDataDTO: Validation and consistency of merged analysis results
  • UploadResultDTO: Upload operation result representation
  • ValidationResultDTO: Validation outcome representation

Mapper Tests

  • MergedDataMapper: Transformation between domain objects and DTOs
  • SampleMapper: Sample data mapping and conversion

Service Tests

  • AnalysisOrchestrator: End-to-end analysis workflow coordination
  • CacheService: Application-level caching operations
  • ProgressTracker: Progress reporting for long-running operations

Plot Service Tests

  • PlotService: Visualization generation orchestration
  • PlotFactory: Strategy selection and instantiation
  • PlotConfigLoader: Visualization configuration management
  • Singleton Pattern: Memory-efficient service instantiation
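
Strategy selection and instantiation, as tested for PlotFactory, might look roughly like the sketch below. The registry design, the method names, and the plain-dict "figure" are assumptions for illustration; the real strategies presumably return actual figure objects.

```python
from typing import Callable, Dict, List


class BarChartStrategy:
    """Hypothetical strategy that turns per-sample counts into a bar spec."""

    def build(self, counts: Dict[str, int]) -> Dict[str, object]:
        labels: List[str] = sorted(counts)
        return {"type": "bar", "x": labels, "y": [counts[k] for k in labels]}


class PlotFactory:
    """Hypothetical factory mapping plot type names to strategy classes."""

    def __init__(self) -> None:
        self._registry: Dict[str, Callable[[], object]] = {}

    def register(self, plot_type: str, strategy_cls: Callable[[], object]) -> None:
        self._registry[plot_type] = strategy_cls

    def create(self, plot_type: str) -> object:
        try:
            strategy_cls = self._registry[plot_type]
        except KeyError:
            raise ValueError(f"Unknown plot type: {plot_type!r}") from None
        return strategy_cls()


def test_factory_returns_registered_strategy():
    factory = PlotFactory()
    factory.register("bar", BarChartStrategy)
    strategy = factory.create("bar")
    assert isinstance(strategy, BarChartStrategy)
    assert strategy.build({"S2": 1, "S1": 3}) == {
        "type": "bar", "x": ["S1", "S2"], "y": [3, 1],
    }


def test_unknown_plot_type_is_rejected():
    try:
        PlotFactory().create("pie")
    except ValueError:
        return
    raise AssertionError("an unregistered plot type should raise ValueError")
```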

4. Test Design Principles

The BioRemPP unit test suite adheres to established testing principles that ensure reliability and maintainability.

4.1 Determinism

All tests produce consistent results across executions. Tests avoid:

  • Time-dependent assertions without mocking
  • Random data without fixed seeds
  • External service dependencies

Deterministic tests enable confident interpretation of test results and reliable CI/CD integration.
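
Two common ways to meet these constraints are a seeded random generator and an injected clock. The helper functions below are hypothetical and exist only to demonstrate the pattern; passing the clock as an argument is one alternative to patching the system time with a mocking library.

```python
import random
from datetime import datetime, timezone
from typing import Callable, List


def sample_ko_ids(n: int, rng: random.Random) -> List[str]:
    """Generate pseudo-random KO-shaped identifiers from a supplied generator."""
    return [f"K{rng.randrange(100000):05d}" for _ in range(n)]


def build_report_header(now: Callable[[], datetime]) -> str:
    """Format a header using a clock passed in by the caller."""
    return f"Generated {now().isoformat()}"


def test_seeded_generator_is_repeatable():
    first = sample_ko_ids(5, random.Random(42))
    second = sample_ko_ids(5, random.Random(42))
    assert first == second
    assert all(len(ko) == 6 and ko.startswith("K") for ko in first)


def test_header_uses_injected_clock():
    fixed = datetime(2024, 1, 1, tzinfo=timezone.utc)
    assert build_report_header(lambda: fixed) == "Generated 2024-01-01T00:00:00+00:00"
```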

4.2 Isolation

Each test executes independently without relying on state from other tests:

  • Fresh fixtures are created for each test method
  • Shared state is avoided or explicitly reset
  • Tests can execute in any order

Isolation prevents cascading failures and simplifies debugging.

4.3 Dependency Substitution

Tests employ mocks and stubs to isolate components from their dependencies:

  • Repository tests mock file system operations
  • Service tests mock repository dependencies
  • Integration points use test doubles

This approach enables fast execution and focused assertions.
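
A service test with a mocked repository can be sketched with the standard library's `unittest.mock`. The `AnnotationService` class and its `load_ko_ids` dependency are invented for this example and do not correspond to named BioRemPP components.

```python
from typing import Iterable
from unittest.mock import Mock


class AnnotationService:
    """Hypothetical service that depends on a repository."""

    def __init__(self, repository) -> None:
        self._repository = repository

    def count_matches(self, ko_ids: Iterable[str]) -> int:
        known = set(self._repository.load_ko_ids())
        return sum(1 for ko in ko_ids if ko in known)


def test_service_counts_only_known_kos():
    # The mock stands in for the repository, so no file or database is read.
    repository = Mock()
    repository.load_ko_ids.return_value = ["K00001", "K00002"]

    service = AnnotationService(repository)

    assert service.count_matches(["K00001", "K99999"]) == 1
    repository.load_ko_ids.assert_called_once_with()
```

Because the repository is replaced, a failure in this test points at the service logic alone, which is what keeps the assertions focused.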

4.4 Fast Execution

The test suite prioritizes execution speed to enable frequent testing:

  • In-memory operations where possible
  • Minimal I/O operations
  • Efficient fixture setup

Fast tests encourage developers to run the suite frequently during development.

4.5 Clarity

Tests serve as documentation through clear naming and structure:

  • Descriptive test method names indicate expected behavior
  • Test classes group related scenarios
  • Docstrings explain test purpose and coverage

5. Shared Test Fixtures and Data Management

The BioRemPP test suite employs a centralized fixture infrastructure defined in tests/conftest.py to ensure consistency, reproducibility, and maintainability across all test modules.

5.1 Fixture Architecture

Fixtures are organized by responsibility and scope:

| Category        | Purpose                             | Examples                                                     |
|-----------------|-------------------------------------|--------------------------------------------------------------|
| Domain Entities | Provide valid domain objects        | sample_with_kos, empty_dataset, sample_ko_list               |
| Value Objects   | Supply valid and invalid identifiers | valid_ko_ids, invalid_ko_ids, sample_id_instance             |
| Infrastructure  | Temporary files and cache instances | temp_dir, temp_csv_file, memory_cache                        |
| DataFrames      | Test data at various scales         | small_dataframe, large_dataframe, realistic_biorempp_dataframe |
| Edge Cases      | Boundary conditions                 | edge_case_whitespace_string, edge_case_duplicate_samples     |
| Cross-Database  | Multi-database scenarios            | linked_ko_data, common_kos_all_databases                     |

5.2 Realistic Test Data

The fixture system utilizes representative data extracted from actual databases, ensuring that tests operate on realistic data patterns rather than synthetic examples:

  • Session-scoped analysis data: Pre-computed database statistics loaded once per test session
  • Representative samples: Real KO identifiers, pathway names, and data structures from production databases
  • Cross-database linkages: Test data reflecting actual relationships between BioRemPP, KEGG, HADEG, and ToxCSM databases

This approach ensures that tests validate behavior against data patterns that users will encounter in production.

5.3 Fixture Scopes for Reproducibility

Fixtures employ appropriate scopes to balance reproducibility with performance:

  • Session scope: Expensive data loading operations (database analysis) execute once per test session
  • Function scope: Most fixtures create fresh instances per test, ensuring isolation
  • Automatic cleanup: Temporary files and directories are automatically removed after tests complete
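
The three behaviors above map onto standard pytest features: `scope="session"`, the default function scope, and teardown after `yield`. The sketch below shows how a conftest.py might combine them. The fixture names `temp_dir` and `temp_csv_file` come from the table in 5.1, but their bodies, the `reference_stats` fixture, and both helper functions are assumptions.

```python
import tempfile
from pathlib import Path

import pytest


def load_reference_stats() -> dict:
    """Stand-in for an expensive database analysis (hypothetical values)."""
    return {"ko_count": 2, "databases": ("BioRemPP", "KEGG")}


def write_sample_csv(directory: Path) -> Path:
    """Write a tiny CSV file into the given directory and return its path."""
    path = directory / "samples.csv"
    path.write_text("sample,ko\nS1,K00001\n", encoding="utf-8")
    return path


@pytest.fixture(scope="session")
def reference_stats():
    # Runs once per test session; tests should treat the result as read-only.
    return load_reference_stats()


@pytest.fixture
def temp_dir():
    # Function scope (the default): each test gets its own directory,
    # which is deleted when the with-block exits after the test finishes.
    with tempfile.TemporaryDirectory() as name:
        yield Path(name)


@pytest.fixture
def temp_csv_file(temp_dir):
    return write_sample_csv(temp_dir)


def test_csv_fixture_has_header(temp_csv_file):
    assert temp_csv_file.read_text(encoding="utf-8").startswith("sample,ko")


def test_session_data_is_available(reference_stats):
    assert reference_stats["ko_count"] == 2
```

Keeping the data-building logic in plain functions and wrapping them in thin fixtures makes the helpers reusable outside pytest as well.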

5.4 Edge Case Coverage

Dedicated fixtures systematically test boundary conditions:

  • Empty strings and whitespace handling
  • Invalid identifier formats
  • Duplicate sample identifiers
  • NULL value propagation
  • Large datasets for performance validation

6. Role of Unit Tests in Validation and Reproducibility

6.1 Scope of Validation

Unit tests in BioRemPP provide functional and structural validation of software components. It is important to distinguish this from biological validation:

| Validation Type        | Scope                                    | Provided by Unit Tests |
|------------------------|------------------------------------------|------------------------|
| Functional Correctness | Component behavior matches specification | Yes                    |
| Structural Integrity   | Data structures maintain invariants      | Yes                    |
| Biological Accuracy    | Predictions match experimental results   | No                     |
| Scientific Validity    | Methodology is sound                     | No                     |

Unit tests verify that the software correctly implements its intended functionality, not that the underlying scientific methodology is valid.

6.2 Contribution to Reproducibility

Unit tests support reproducibility by ensuring:

  • Consistent data processing: Input parsing and transformation behave identically across runs
  • Stable outputs: Deterministic operations produce the same results
  • Version stability: Regression tests detect behavioral changes between versions
  • Configuration consistency: Settings and dependencies resolve correctly

6.3 Integration with Internal Validation

The unit test suite complements other validation mechanisms in BioRemPP:

  • Database Validation Suite: Verifies database content integrity and consistency
  • Profiling Suite: Characterizes computational performance
  • Unit Tests: Ensure functional correctness of software components

Together, these mechanisms provide comprehensive internal validation of the web service.

6.4 Regression Testing

Unit tests serve as the foundation for regression testing:

  • Existing tests must pass before new code is integrated
  • Behavioral changes require explicit test updates
  • Test failures indicate potential regressions

This approach ensures that the system maintains its validated behavior as it evolves.


7. Limitations and Scope Boundaries

7.1 What Unit Tests Cover

The BioRemPP unit test suite validates:

  • Component initialization and configuration
  • Input validation and error handling
  • Data transformation correctness
  • Business rule enforcement
  • Service coordination and orchestration
  • Correct behavior of caching and other performance-optimization components
  • Export format generation

7.2 What Unit Tests Do Not Cover

  • Biological validation: Tests do not verify that bioremediation potential predictions are experimentally accurate
  • External benchmarking: Tests do not compare BioRemPP performance or accuracy with other tools
  • User interface testing: Dash component rendering and user interaction are not covered
  • Performance benchmarking: Tests verify correctness, not execution time targets
  • Network operations: Tests do not validate actual HTTP request handling