Internal Validation Results (v1.0)

Executive Summary

This document reports the results of running the BioRemPP Internal Validation Suite. The results provide evidence that the released data resources and analytical transformations behave consistently, remain traceable to a defined data snapshot, and support reproducible analyses.

  • Validation Date: 2026-02-18
  • Run ID: 20260218T220038Z
  • Checkpoint: biorempp_full_validation
  • Output Version: v2.0.0-gx-first
  • Overall Status: All validation components passed successfully.

Scope note: these checks evaluate internal consistency and analytical coherence of database integration and derived outputs. They do not establish biological activity, in situ degradation performance, predictive accuracy, or regulatory compliance.


1. Provenance Snapshot

Validation Results

The provenance snapshot successfully characterized all four integrated databases:

| Database | Records | Columns | Data Completeness |
| --- | --- | --- | --- |
| BioRemPP | 10,869 | 8 | 100% (0 nulls) |
| KEGG | 855 | 3 | 100% (0 nulls) |
| HADEG | 867 | 4 | 100% (0 nulls) |
| toxCSM | 370 | 66 | 100% (0 nulls) |

| Database | SHA256 Checksum | Size (bytes) |
| --- | --- | --- |
| BioRemPP | 216cf113400161d6eee8d4eefb13bab23f60f9286874fa41ae8d00f3fc4637c0 | 1,125,913 |
| KEGG | f3df93d3bc5492043d2f6a9ea087b6687757e4757057ba1ab19c1a0d53fcd619 | 21,612 |
| HADEG | d546c01be1cf05866b18aa25fd1edb23e4d90f9ab4e65fb5e37911c1e57ce938 | 35,380 |
| toxCSM | 0d4616930b438964d9e007b20c9ffb9c414879b775a3b89d660bfc6278fe5f38 | 224,997 |

Key Findings:

  • All databases exhibit complete data coverage with zero null values across all fields.
  • SHA256 checksums were computed for each database, so any later modification can be detected by recomputing the digest and comparing it with the archived value.
  • Schema structures were documented, including column names and data types.

Interpretation

  • The zero-null snapshot supports downstream joins without imputation.
  • Archived checksums provide a verifiable reference for subsequent releases and for reproducing published analyses.
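
The snapshot fields reported above (records, columns, null count, SHA256, size) can be recomputed for a local copy of a database and compared with the archived values. The following is a minimal sketch, not the suite's actual implementation. It assumes each database is a delimited text file with one header row and that an empty cell counts as a null; the function names, the file name `toy_db.csv`, and the toy rows are hypothetical.

```python
import csv
import hashlib
import os
import tempfile


def snapshot(path, delimiter=","):
    """Summarize one delimited file: records, columns, nulls, SHA256, size."""
    # Hash the raw bytes in chunks so large files need not fit in memory.
    sha256 = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            sha256.update(chunk)

    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh, delimiter=delimiter)
        header = next(reader)
        records = 0
        nulls = 0
        for row in reader:
            records += 1
            # Assumption: an empty or whitespace-only cell is a null.
            nulls += sum(1 for cell in row if cell.strip() == "")

    return {
        "records": records,
        "columns": len(header),
        "nulls": nulls,
        "sha256": sha256.hexdigest(),
        "size_bytes": os.path.getsize(path),
    }


def matches_archive(path, archived_sha256):
    """True when the file's SHA256 equals the archived hex digest."""
    return snapshot(path)["sha256"] == archived_sha256.lower()


# Toy demonstration on a throwaway file (not real BioRemPP data).
toy_bytes = b"ko,compound\nK_toy_1,cpd_a\nK_toy_2,cpd_b\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "toy_db.csv")
    with open(demo_path, "wb") as fh:
        fh.write(toy_bytes)
    demo = snapshot(demo_path)
    demo_ok = matches_archive(demo_path, hashlib.sha256(toy_bytes).hexdigest())

print(demo["records"], demo["columns"], demo["nulls"], demo_ok)
```

For a real check, `matches_archive` would be called with the local path of a database file and the corresponding digest from the checksum table.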

2. Schema Integrity

Validation Results

Schema integrity validation passed for all four databases:

| Schema Suite | Expectations Evaluated | Passed | Failed | Status |
| --- | --- | --- | --- | --- |
| biorempp_db_schema_integrity_suite | 23 | 23 | 0 | PASS |
| kegg_degradation_db_schema_integrity_suite | 10 | 10 | 0 | PASS |
| hadeg_db_schema_integrity_suite | 12 | 12 | 0 | PASS |
| toxcsm_db_schema_integrity_suite | 10 | 10 | 0 | PASS |

Schema Totals:

  • Evaluated expectations: 55
  • Successful expectations: 55
  • Unsuccessful expectations: 0

Interpretation

  • Required structural constraints were satisfied for all validated assets.
  • No schema-level regression was detected in this execution.

3. Cross-Database Overlap

Validation Results

Database Coverage:

| Database | Unique KO Identifiers |
| --- | --- |
| BioRemPP | 1,541 |
| KEGG | 517 |
| HADEG | 337 |

Pairwise Overlap Analysis:

| Database Pair | Shared KOs | Jaccard Index | Coverage A | Coverage B |
| --- | --- | --- | --- | --- |
| BioRemPP & KEGG | 269 | 0.1504 | 17.46% | 52.03% |
| BioRemPP & HADEG | 128 | 0.0731 | 8.31% | 37.98% |
| KEGG & HADEG | 169 | 0.2467 | 32.69% | 50.15% |
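
The pairwise metrics can be recomputed from the set sizes alone. The sketch below assumes the standard Jaccard definition (shared / union) and takes Coverage A and Coverage B to be the shared KOs as a fraction of the first and second database in each pair. These definitions are not stated in the report, but they reproduce the tabulated values. The function name is hypothetical.

```python
def overlap_metrics(size_a, size_b, shared):
    """Jaccard index and per-database coverage from set sizes."""
    union = size_a + size_b - shared  # |A ∪ B| = |A| + |B| - |A ∩ B|
    return {
        "jaccard": shared / union,
        "coverage_a": shared / size_a,
        "coverage_b": shared / size_b,
    }


# Counts transcribed from the coverage and overlap tables.
unique_kos = {"BioRemPP": 1541, "KEGG": 517, "HADEG": 337}
shared_kos = {
    ("BioRemPP", "KEGG"): 269,
    ("BioRemPP", "HADEG"): 128,
    ("KEGG", "HADEG"): 169,
}

for (db_a, db_b), shared in shared_kos.items():
    m = overlap_metrics(unique_kos[db_a], unique_kos[db_b], shared)
    print(
        f"{db_a} & {db_b}: Jaccard={m['jaccard']:.4f}, "
        f"A={m['coverage_a']:.2%}, B={m['coverage_b']:.2%}"
    )
```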

Core Enzymes:

  • 102 KO identifiers are shared across all three databases.

Exclusive Content:

| Database | Exclusive KOs | Percentage |
| --- | --- | --- |
| BioRemPP | 1,246 | 80.9% |
| KEGG | 181 | 35.0% |
| HADEG | 142 | 42.1% |
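
The exclusive counts follow from the unique, pairwise, and three-way counts by inclusion-exclusion: the KOs found only in A number |A| - |A∩B| - |A∩C| + |A∩B∩C|. The sketch below applies that identity to the reported figures as a consistency check; the function name is hypothetical, and the report does not state how the suite derives these values.

```python
def exclusive_count(size, shared_with_first, shared_with_second, shared_by_all):
    """KOs present in one database and in neither of the other two."""
    # The triple intersection is subtracted twice by the pairwise terms,
    # so it is added back once.
    return size - shared_with_first - shared_with_second + shared_by_all


core = 102  # KOs shared across all three databases

exclusive = {
    "BioRemPP": exclusive_count(1541, 269, 128, core),
    "KEGG": exclusive_count(517, 269, 169, core),
    "HADEG": exclusive_count(337, 128, 169, core),
}
sizes = {"BioRemPP": 1541, "KEGG": 517, "HADEG": 337}

for name, count in exclusive.items():
    print(f"{name}: {count} exclusive KOs ({count / sizes[name]:.1%})")
```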

Interpretation

  • Pairwise Jaccard indices between 0.07 and 0.25 indicate partial concordance across resources, and each database contributes KOs found in neither of the others (35.0% to 80.9% of its identifiers).
  • The 102 KOs shared by all three databases provide a stable anchor for cross-resource comparisons.

4. Mapping Consistency

Validation Results

Mapping consistency validations passed for both mapping suites:

| Mapping Suite | Expectations Evaluated | Passed | Failed | Status |
| --- | --- | --- | --- | --- |
| biorempp_mapping_consistency_suite | 3 | 3 | 0 | PASS |
| toxcsm_mapping_linkage_suite | 3 | 3 | 0 | PASS |

Mapping Totals:

  • Evaluated expectations: 6
  • Successful expectations: 6
  • Unsuccessful expectations: 0

Interpretation

  • KO-compound and compound-toxicity linkage constraints remained stable in this run.
  • No mapping expectation failure was observed.

5. Example Roundtrip Regression

Validation Results

Status: PASS

Datasets Processed: 5 standardized example datasets

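| Dataset | Input KOs | Unique KOs | BioRemPP Matches | KEGG Matches | HADEG Matches | toxCSM Matches |
| --- | --- | --- | --- | --- | --- | --- |
| Example_A | 15 | 15 | 199 | 34 | 35 | 199 |
| Example_B | 12 | 12 | 94 | 19 | 0 | 94 |
| Example_C | 12 | 12 | 65 | 0 | 38 | 65 |
| Example_D | 12 | 12 | 13 | 0 | 0 | 0 |
| Example_E | 13 | 13 | 106 | 19 | 34 | 103 |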

Key Findings:

  • All 5 datasets were processed successfully through the complete analytical pipeline.
  • Cryptographic checksums (SHA256) were generated for each input and output file.
  • Content hashes were computed independently of file ordering to verify logical equivalence.
  • Output checksums are archived for future regression testing.
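
A content hash that ignores ordering can be built by canonicalizing the table before hashing. The sketch below is one way to do this, not the suite's actual method. It assumes "file ordering" means row order, keeps the header first, sorts the data rows, and hashes the result with SHA256. The function name and the toy rows are hypothetical.

```python
import csv
import hashlib
import io


def content_hash(csv_text, delimiter=","):
    """SHA256 over the header plus the data rows in sorted order."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    digest = hashlib.sha256()
    # The ASCII unit separator joins cells so that commas inside values
    # cannot make two different rows serialize identically.
    for row in [header] + sorted(body):
        digest.update("\x1f".join(row).encode("utf-8") + b"\n")
    return digest.hexdigest()


# Two toy outputs with the same rows in a different order.
first = "ko,compound\nK_toy_1,cpd_a\nK_toy_2,cpd_b\n"
second = "ko,compound\nK_toy_2,cpd_b\nK_toy_1,cpd_a\n"

same_content = content_hash(first) == content_hash(second)
same_bytes = (
    hashlib.sha256(first.encode()).hexdigest()
    == hashlib.sha256(second.encode()).hexdigest()
)
print(same_content, same_bytes)
```

Keeping both the file checksum and the content hash distinguishes a harmless reordering (file checksum changes, content hash does not) from a change in the rows themselves (both change).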

Interpretation

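  • Match counts differ across databases because each resource covers a different set of KOs, which is consistent with the overlap results in Section 3. Example_D, for instance, has 13 BioRemPP matches but no KEGG, HADEG, or toxCSM matches.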
  • Archived checksums provide a concrete reference for verifying that subsequent releases preserve expected behavior on the example suite.

6. Use Case Invariants

Validation Results

Status: PASS

Checks Validated: 10/10 (all passed with empty fail reasons)

| Dataset | Output Type | Total Rows | Invariant Status |
| --- | --- | --- | --- |
| Example_A | merged_biorempp | 199 | PASS |
| Example_A | merged_toxcsm | 199 | PASS |
| Example_B | merged_biorempp | 94 | PASS |
| Example_B | merged_toxcsm | 94 | PASS |
| Example_C | merged_biorempp | 65 | PASS |
| Example_C | merged_toxcsm | 65 | PASS |
| Example_D | merged_biorempp | 13 | PASS |
| Example_D | merged_toxcsm | 0 | PASS |
| Example_E | merged_biorempp | 106 | PASS |
| Example_E | merged_toxcsm | 103 | PASS |

Interpretation

These invariants provide a final consistency check over representative merged outputs used in documentation and regression testing.
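
The individual invariants are not listed in this report. One relationship that can be checked from the published tables is that each merged output's row count equals the corresponding match count in the Section 5 roundtrip table. The sketch below performs that cross-check. It illustrates a consistency test on the reported numbers and is not a reconstruction of the suite's own checks; the function and variable names are hypothetical.

```python
# Match counts transcribed from the Section 5 table.
roundtrip_matches = {
    "Example_A": {"biorempp": 199, "toxcsm": 199},
    "Example_B": {"biorempp": 94, "toxcsm": 94},
    "Example_C": {"biorempp": 65, "toxcsm": 65},
    "Example_D": {"biorempp": 13, "toxcsm": 0},
    "Example_E": {"biorempp": 106, "toxcsm": 103},
}

# Total rows transcribed from the Section 6 table.
merged_rows = {
    ("Example_A", "merged_biorempp"): 199,
    ("Example_A", "merged_toxcsm"): 199,
    ("Example_B", "merged_biorempp"): 94,
    ("Example_B", "merged_toxcsm"): 94,
    ("Example_C", "merged_biorempp"): 65,
    ("Example_C", "merged_toxcsm"): 65,
    ("Example_D", "merged_biorempp"): 13,
    ("Example_D", "merged_toxcsm"): 0,
    ("Example_E", "merged_biorempp"): 106,
    ("Example_E", "merged_toxcsm"): 103,
}


def find_mismatches(matches, rows):
    """Return (dataset, output_type, rows, matches) for every disagreement."""
    mismatches = []
    for (dataset, output_type), total_rows in rows.items():
        source = output_type[len("merged_"):]  # "merged_toxcsm" -> "toxcsm"
        expected = matches[dataset][source]
        if total_rows != expected:
            mismatches.append((dataset, output_type, total_rows, expected))
    return mismatches


mismatches = find_mismatches(roundtrip_matches, merged_rows)
print("consistent" if not mismatches else mismatches)
```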


7. Controlled Vocabulary Audit

Validation Results

Compound Class Distribution:

| Class | Records | Percentage |
| --- | --- | --- |
| Aromatic | 2,249 | 20.69% |
| Nitrogen-containing | 2,161 | 19.88% |
| Chlorinated | 1,816 | 16.71% |
| Aliphatic | 1,693 | 15.58% |
| Polyaromatic | 1,471 | 13.53% |
| Inorganic | 356 | 3.28% |
| Metal | 340 | 3.13% |
| Organophosphorus | 269 | 2.47% |
| Sulfur-containing | 209 | 1.92% |
| Organometallic | 171 | 1.57% |
| Halogenated | 130 | 1.20% |
| Organosulfur | 4 | 0.04% |

Total unique compound classes: 12

Regulatory Agency Distribution:

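| Agency | Records | Percentage |
| --- | --- | --- |
| ATSDR | 2,459 | 22.62% |
| IARC2B | 1,855 | 17.07% |
| EPC | 1,349 | 12.41% |
| PSL | 1,308 | 12.03% |
| WFD | 1,074 | 9.88% |
| IARC1 | 1,039 | 9.56% |
| EPA | 912 | 8.39% |
| CONAMA | 536 | 4.93% |
| IARC2A | 337 | 3.10% |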

Total unique regulatory agencies: 9
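
Both distributions can be checked arithmetically. The record counts in each table sum to 10,869, the BioRemPP record count in the provenance snapshot, which is consistent with every record carrying exactly one compound class and one agency label. The sketch below recomputes the totals and percentages, assuming each percentage is the count divided by the total and rounded to two decimals. The helper name is hypothetical.

```python
# Counts transcribed from the two distribution tables.
compound_classes = {
    "Aromatic": 2249, "Nitrogen-containing": 2161, "Chlorinated": 1816,
    "Aliphatic": 1693, "Polyaromatic": 1471, "Inorganic": 356,
    "Metal": 340, "Organophosphorus": 269, "Sulfur-containing": 209,
    "Organometallic": 171, "Halogenated": 130, "Organosulfur": 4,
}
agencies = {
    "ATSDR": 2459, "IARC2B": 1855, "EPC": 1349, "PSL": 1308, "WFD": 1074,
    "IARC1": 1039, "EPA": 912, "CONAMA": 536, "IARC2A": 337,
}

BIOREMPP_RECORDS = 10869  # from the provenance snapshot table


def percentages(counts):
    """Share of each term as a percentage of all records, two decimals."""
    total = sum(counts.values())
    return {term: round(100 * n / total, 2) for term, n in counts.items()}


class_pct = percentages(compound_classes)
agency_pct = percentages(agencies)

for label, counts in (("classes", compound_classes), ("agencies", agencies)):
    total = sum(counts.values())
    print(f"{label}: {len(counts)} terms, {total} records, "
          f"matches snapshot: {total == BIOREMPP_RECORDS}")
```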

Enzyme Activity Distribution:

  • Total unique enzyme activities: 205
  • Most frequent: cytochrome P450 (19.93%), dioxygenase (10.06%), monooxygenase (8.02%)
  • Zero null values across all vocabulary fields

Interpretation

  • This audit provides a stable reference for how controlled terms are used in the current snapshot.
  • Future releases can compare against these baselines to identify reclassification or expansion.
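
A baseline comparison of this kind can be as simple as diffing two term-to-count mappings. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the baseline uses three counts from the compound class table, and the "next" snapshot is invented purely to show the output shape (including the term `Fluorinated`, which does not appear in this report).

```python
def diff_vocabulary(baseline, current):
    """Report terms added, removed, or changed in count between snapshots."""
    added = sorted(set(current) - set(baseline))
    removed = sorted(set(baseline) - set(current))
    changed = {
        term: (baseline[term], current[term])
        for term in sorted(set(baseline) & set(current))
        if baseline[term] != current[term]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Subset of the v1.0 baseline counts reported above.
baseline = {"Halogenated": 130, "Organosulfur": 4, "Sulfur-containing": 209}

# Invented future snapshot, for illustration only.
hypothetical_next = {
    "Halogenated": 130,
    "Sulfur-containing": 213,
    "Fluorinated": 12,
}

report = diff_vocabulary(baseline, hypothetical_next)
print(report)
```

In this invented case, a term that disappears while a related term's count grows would suggest reclassification, and a new term would suggest vocabulary expansion.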

Summary of Validation Status

| Component | Status | Key Metric |
| --- | --- | --- |
| Provenance Snapshot | PASS | 4 databases, 100% data completeness |
| Schema Integrity | PASS | 55/55 schema expectations passed |
| Cross-Database Overlap | PASS | 102 core shared KOs |
| Mapping Consistency | PASS | 6/6 mapping expectations passed |
| Example Roundtrip Regression | PASS | 5 datasets, checksums archived |
| Use Case Invariants | PASS | 10/10 checks passed |
| Controlled Vocabulary Audit | PASS | 12 compound classes, 9 agencies, 205 enzyme activities |

Global GX Execution Totals (Checkpoint):

  • Validation definitions executed: 9
  • Expectations evaluated: 102
  • Expectations successful: 102
  • Expectations unsuccessful: 0

Limitations

These results support internal consistency and reproducible behavior for the validated snapshot, but they do not imply:

  • Biological activity or in situ degradation (gene presence and database linkage are not evidence of activity).
  • Predictive accuracy (no gold-standard dataset exists for bioremediation potential).
  • Regulatory compliance or approval (regulatory annotations are provided for contextualization).

Results are snapshot-based and may change as curated resources are updated; provenance checksums and suite versioning are therefore reported alongside this document.