Persistence Layer
The Persistence Layer provides repository implementations for accessing external databases containing biological, pathway, and toxicity data.
Repository Implementations
KEGGRepository
KEGGRepository(filepath: Path = Path('data/databases/kegg_degradation_db.csv'), encoding: str = 'utf-8', separator: str = ';')
Bases: CSVDatabaseRepository
Repository for KEGG degradation pathways database.
Provides access to KEGG pathway data for degradation processes. Database file: data/databases/kegg_degradation_db.csv
Attributes:
| Name | Type | Description |
|---|---|---|
| filepath | Path | Path to KEGG database CSV file |
| encoding | str | File encoding (default: 'utf-8') |
| separator | str | CSV separator (default: ';') |
| required_columns | list[str] | Required columns: ['ko', 'pathname'] |
Initialize KEGG repository.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepath | Path | Path to KEGG database CSV file. | Path('data/databases/kegg_degradation_db.csv') |
| encoding | str | File encoding. | 'utf-8' |
| separator | str | CSV separator. | ';' |
Source code in src/infrastructure/persistence/kegg_repository.py
Functions
load_data
Load CSV database into DataFrame with caching.
Returns:
| Type | Description |
|---|---|
| DataFrame | Database data with optimized dtypes |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If CSV file doesn't exist |
| ValueError | If CSV format is invalid or required columns missing |
Source code in src/infrastructure/persistence/csv_database_repository.py
reload_data
Force reload database from file.
Clears cache and reloads data from CSV file.
Returns:
| Type | Description |
|---|---|
| DataFrame | Freshly loaded database data |
Source code in src/infrastructure/persistence/csv_database_repository.py
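The caching behaviour described for load_data and reload_data can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the class name `CachedCSV`, the `reads` counter, and the sample rows are made up, and plain lists of dicts stand in for the pandas DataFrame the real methods return.

```python
import csv
import tempfile
from pathlib import Path
from typing import Optional


class CachedCSV:
    """Hypothetical stand-in that parses a CSV once and reuses the rows."""

    def __init__(self, filepath: Path, encoding: str = "utf-8", separator: str = ";"):
        self.filepath = filepath
        self.encoding = encoding
        self.separator = separator
        self._data: Optional[list] = None  # nothing is read until first use
        self.reads = 0  # counts disk reads, for demonstration only

    def load_data(self) -> list:
        if self._data is None:  # cache miss: read and parse the file
            if not self.filepath.exists():
                raise FileNotFoundError(self.filepath)
            with self.filepath.open(encoding=self.encoding, newline="") as fh:
                self._data = list(csv.DictReader(fh, delimiter=self.separator))
            self.reads += 1
        return self._data

    def reload_data(self) -> list:
        self._data = None  # drop the cache, then load again
        return self.load_data()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_db.csv"
    path.write_text("ko;pathname\nK00001;Example pathway\n", encoding="utf-8")

    repo = CachedCSV(path)
    first = repo.load_data()
    second = repo.load_data()  # served from the cache, no second disk read
    reads_after_two_loads = repo.reads

    # Change the file on disk: the cache does not notice until reload_data().
    path.write_text(
        "ko;pathname\nK00001;Example pathway\nK00002;Another pathway\n",
        encoding="utf-8",
    )
    stale_count = len(repo.load_data())
    fresh_count = len(repo.reload_data())
```

In this sketch a cached repository keeps returning the old rows after the file changes, which is the situation reload_data is meant for.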
merge_with_dataset
Merge dataset with database.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_df | DataFrame | Input dataset (must have join column) | required |
| on | str | Column name to join on | 'ko' |
| how | str | Join type ('inner', 'left', 'right', 'outer') | 'inner' |
Returns:
| Type | Description |
|---|---|
| DataFrame | Merged DataFrame |
Raises:
| Type | Description |
|---|---|
| ValueError | If join column missing in either DataFrame |
Source code in src/infrastructure/persistence/csv_database_repository.py
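To make the `on` and `how` parameters concrete, here is a small stdlib sketch of the join semantics. It is an illustration under assumptions, not the repository's code: `merge_rows` is a hypothetical helper, it covers only 'inner' and 'left', the sample rows are invented, and it works on lists of dicts rather than DataFrames.

```python
def merge_rows(dataset, database, on="ko", how="inner"):
    """Join two lists of dicts on one key; this sketch supports 'inner' and 'left'."""
    if how not in ("inner", "left"):
        raise ValueError(f"unsupported join type in this sketch: {how!r}")
    # Mirror the documented error: the join column must exist on both sides.
    for name, rows in (("dataset", dataset), ("database", database)):
        if any(on not in row for row in rows):
            raise ValueError(f"join column {on!r} missing in {name}")

    # Index database rows by key so each dataset row finds its matches directly.
    index = {}
    for row in database:
        index.setdefault(row[on], []).append(row)

    merged = []
    for row in dataset:
        matches = index.get(row[on], [])
        if matches:
            merged.extend({**row, **match} for match in matches)
        elif how == "left":
            merged.append(dict(row))  # keep the unmatched dataset row
    return merged


samples = [{"sample": "S1", "ko": "K00001"}, {"sample": "S2", "ko": "K99999"}]
pathways = [{"ko": "K00001", "pathname": "Example pathway"}]

inner = merge_rows(samples, pathways, how="inner")  # only S1 has a match
left = merge_rows(samples, pathways, how="left")    # S2 is kept without pathway data
```

One difference from a DataFrame merge: an unmatched row here simply lacks the database columns, whereas pandas would fill them with missing values.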
get_column_names
Get column names from database.
Returns:
| Type | Description |
|---|---|
| list[str] | List of column names |
validate_schema
Validate database schema.
Checks if all required columns are present in DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | Optional[DataFrame] | DataFrame to validate (if None, loads from file) | None |
Returns:
| Type | Description |
|---|---|
| bool | True if all required columns present, False otherwise |
Source code in src/infrastructure/persistence/csv_database_repository.py
get_stats
Get database statistics.
Returns:
| Type | Description |
|---|---|
| dict | Dictionary containing: 'rows' (number of rows), 'columns' (number of columns), 'memory_mb' (memory usage in MB), 'column_names' (list of column names), 'dtypes' (dictionary of column datatypes) |
Source code in src/infrastructure/persistence/csv_database_repository.py
BioRemPPRepository
BioRemPPRepository(filepath: Path = Path('data/databases/biorempp_db.csv'), encoding: str = 'utf-8', separator: str = ';')
Bases: CSVDatabaseRepository
Repository for BioRemPP bioremediation database.
Provides access to bioremediation data mapped to KEGG Orthology IDs. Database file: data/databases/biorempp_db.csv
Attributes:
| Name | Type | Description |
|---|---|---|
| filepath | Path | Path to BioRemPP database CSV file |
| encoding | str | File encoding (default: 'utf-8') |
| separator | str | CSV separator (default: ';') |
| required_columns | list[str] | Required columns: ['ko'] |
Initialize BioRemPP repository.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepath | Path | Path to BioRemPP database CSV file. | Path('data/databases/biorempp_db.csv') |
| encoding | str | File encoding. | 'utf-8' |
| separator | str | CSV separator. | ';' |
Source code in src/infrastructure/persistence/biorempp_repository.py
Functions
load_data
Load CSV database into DataFrame with caching.
Returns:
| Type | Description |
|---|---|
| DataFrame | Database data with optimized dtypes |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If CSV file doesn't exist |
| ValueError | If CSV format is invalid or required columns missing |
Source code in src/infrastructure/persistence/csv_database_repository.py
reload_data
Force reload database from file.
Clears cache and reloads data from CSV file.
Returns:
| Type | Description |
|---|---|
| DataFrame | Freshly loaded database data |
Source code in src/infrastructure/persistence/csv_database_repository.py
merge_with_dataset
Merge dataset with database.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_df | DataFrame | Input dataset (must have join column) | required |
| on | str | Column name to join on | 'ko' |
| how | str | Join type ('inner', 'left', 'right', 'outer') | 'inner' |
Returns:
| Type | Description |
|---|---|
| DataFrame | Merged DataFrame |
Raises:
| Type | Description |
|---|---|
| ValueError | If join column missing in either DataFrame |
Source code in src/infrastructure/persistence/csv_database_repository.py
get_column_names
Get column names from database.
Returns:
| Type | Description |
|---|---|
| list[str] | List of column names |
validate_schema
Validate database schema.
Checks if all required columns are present in DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | Optional[DataFrame] | DataFrame to validate (if None, loads from file) | None |
Returns:
| Type | Description |
|---|---|
| bool | True if all required columns present, False otherwise |
Source code in src/infrastructure/persistence/csv_database_repository.py
get_stats
Get database statistics.
Returns:
| Type | Description |
|---|---|
| dict | Dictionary containing: 'rows' (number of rows), 'columns' (number of columns), 'memory_mb' (memory usage in MB), 'column_names' (list of column names), 'dtypes' (dictionary of column datatypes) |
Source code in src/infrastructure/persistence/csv_database_repository.py
HADEGRepository
HADEGRepository(filepath: Path = Path('data/databases/hadeg_db.csv'), encoding: str = 'utf-8', separator: str = ';')
Bases: CSVDatabaseRepository
Repository for HADEG enzyme database.
Provides access to enzyme data for hydrocarbon degradation pathways. Database file: data/databases/hadeg_db.csv
Attributes:
| Name | Type | Description |
|---|---|---|
| filepath | Path | Path to HADEG database CSV file |
| encoding | str | File encoding (default: 'utf-8') |
| separator | str | CSV separator (default: ';') |
| required_columns | list[str] | Required columns: ['ko', 'Gene', 'Pathway'] |
Initialize HADEG repository.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepath | Path | Path to HADEG database CSV file. | Path('data/databases/hadeg_db.csv') |
| encoding | str | File encoding. | 'utf-8' |
| separator | str | CSV separator. | ';' |
Source code in src/infrastructure/persistence/hadeg_repository.py
Functions
load_data
Load CSV database into DataFrame with caching.
Returns:
| Type | Description |
|---|---|
| DataFrame | Database data with optimized dtypes |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If CSV file doesn't exist |
| ValueError | If CSV format is invalid or required columns missing |
Source code in src/infrastructure/persistence/csv_database_repository.py
reload_data
Force reload database from file.
Clears cache and reloads data from CSV file.
Returns:
| Type | Description |
|---|---|
| DataFrame | Freshly loaded database data |
Source code in src/infrastructure/persistence/csv_database_repository.py
merge_with_dataset
Merge dataset with database.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_df | DataFrame | Input dataset (must have join column) | required |
| on | str | Column name to join on | 'ko' |
| how | str | Join type ('inner', 'left', 'right', 'outer') | 'inner' |
Returns:
| Type | Description |
|---|---|
| DataFrame | Merged DataFrame |
Raises:
| Type | Description |
|---|---|
| ValueError | If join column missing in either DataFrame |
Source code in src/infrastructure/persistence/csv_database_repository.py
get_column_names
Get column names from database.
Returns:
| Type | Description |
|---|---|
| list[str] | List of column names |
validate_schema
Validate database schema.
Checks if all required columns are present in DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | Optional[DataFrame] | DataFrame to validate (if None, loads from file) | None |
Returns:
| Type | Description |
|---|---|
| bool | True if all required columns present, False otherwise |
Source code in src/infrastructure/persistence/csv_database_repository.py
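The required-column check that validate_schema describes amounts to a subset test on the header. The sketch below assumes that reading and uses the HADEG required columns listed in the attributes table; the helper names and sample headers are hypothetical, and the real method operates on a DataFrame rather than a list of names.

```python
HADEG_REQUIRED = ["ko", "Gene", "Pathway"]  # taken from the attributes table


def has_required_columns(columns, required):
    """Return True when every required name appears among the columns."""
    return set(required).issubset(columns)


def missing_columns(columns, required):
    """List the required names that are absent, in their declared order."""
    present = set(columns)
    return [name for name in required if name not in present]


good_header = ["ko", "Gene", "Pathway", "extra_column"]  # extra columns are fine
bad_header = ["ko", "gene"]  # wrong case for 'Gene', no 'Pathway'

ok = has_required_columns(good_header, HADEG_REQUIRED)
not_ok = has_required_columns(bad_header, HADEG_REQUIRED)
missing = missing_columns(bad_header, HADEG_REQUIRED)
```

The comparison in this sketch is case-sensitive, so 'gene' does not satisfy 'Gene'. Whether the project normalizes column names before validating is not stated in this reference.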
get_stats
Get database statistics.
Returns:
| Type | Description |
|---|---|
| dict | Dictionary containing: 'rows' (number of rows), 'columns' (number of columns), 'memory_mb' (memory usage in MB), 'column_names' (list of column names), 'dtypes' (dictionary of column datatypes) |
Source code in src/infrastructure/persistence/csv_database_repository.py
ToxCSMRepository
ToxCSMRepository(filepath: Path = Path('data/databases/toxcsm_db.csv'), encoding: str = 'utf-8', separator: str = ';')
Bases: CSVDatabaseRepository
Repository for ToxCSM toxicity prediction database.
Provides access to compound-level toxicity predictions. Database file: data/databases/toxcsm_db.csv
Attributes:
| Name | Type | Description |
|---|---|---|
| filepath | Path | Path to ToxCSM database CSV file |
| encoding | str | File encoding (default: 'utf-8') |
| separator | str | CSV separator (default: ';') |
| required_columns | list[str] | Required columns: ['cpd'] |
Methods:
| Name | Description |
|---|---|
| merge_with_compound_data | Merge compound data with toxicity predictions |
Notes
- Merges on 'cpd' column instead of 'ko' (compound-level data)
Initialize ToxCSM repository.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepath | Path | Path to ToxCSM database CSV file. | Path('data/databases/toxcsm_db.csv') |
| encoding | str | File encoding. | 'utf-8' |
| separator | str | CSV separator. | ';' |
Source code in src/infrastructure/persistence/toxcsm_repository.py
Functions
merge_with_compound_data
merge_with_compound_data(compound_df: DataFrame, on: str = 'cpd', how: str = 'left') -> pd.DataFrame
Merge compound data with toxicity predictions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| compound_df | DataFrame | DataFrame containing compound information (must have join column) | required |
| on | str | Column to join on | 'cpd' |
| how | str | Join type (default 'left' keeps all compounds) | 'left' |
Returns:
| Type | Description |
|---|---|
| DataFrame | Merged DataFrame with toxicity predictions |
Source code in src/infrastructure/persistence/toxcsm_repository.py
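The default `how='left'` means compounds without a toxicity prediction stay in the result. A rough stdlib sketch of that behaviour follows. It assumes at most one prediction row per compound, and the helper `left_join`, the `tox_label` column, and the compound IDs are all made up for illustration. The real method returns a DataFrame.

```python
def left_join(compounds, predictions, on="cpd"):
    """Keep every compound row; fill prediction fields with None when absent."""
    by_key = {row[on]: row for row in predictions}  # assumes one row per compound
    # Collect the prediction columns so unmatched rows get explicit None values.
    extra = sorted({col for row in predictions for col in row} - {on})
    merged = []
    for row in compounds:
        match = by_key.get(row[on], {})
        merged.append({**row, **{col: match.get(col) for col in extra}})
    return merged


compounds = [
    {"cpd": "C00001", "name": "compound A"},
    {"cpd": "C99999", "name": "compound B"},  # no prediction available
]
predictions = [{"cpd": "C00001", "tox_label": "low"}]

result = left_join(compounds, predictions)
```

Both compounds appear in `result`; the second carries `None` for `tox_label`, which plays the role that a missing value would play in a DataFrame left join.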
load_data
Load CSV database into DataFrame with caching.
Returns:
| Type | Description |
|---|---|
| DataFrame | Database data with optimized dtypes |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If CSV file doesn't exist |
| ValueError | If CSV format is invalid or required columns missing |
Source code in src/infrastructure/persistence/csv_database_repository.py
reload_data
Force reload database from file.
Clears cache and reloads data from CSV file.
Returns:
| Type | Description |
|---|---|
| DataFrame | Freshly loaded database data |
Source code in src/infrastructure/persistence/csv_database_repository.py
merge_with_dataset
Merge dataset with database.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_df | DataFrame | Input dataset (must have join column) | required |
| on | str | Column name to join on | 'ko' |
| how | str | Join type ('inner', 'left', 'right', 'outer') | 'inner' |
Returns:
| Type | Description |
|---|---|
| DataFrame | Merged DataFrame |
Raises:
| Type | Description |
|---|---|
| ValueError | If join column missing in either DataFrame |
Source code in src/infrastructure/persistence/csv_database_repository.py
get_column_names
Get column names from database.
Returns:
| Type | Description |
|---|---|
| list[str] | List of column names |
validate_schema
Validate database schema.
Checks if all required columns are present in DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | Optional[DataFrame] | DataFrame to validate (if None, loads from file) | None |
Returns:
| Type | Description |
|---|---|
| bool | True if all required columns present, False otherwise |
Source code in src/infrastructure/persistence/csv_database_repository.py
get_stats
Get database statistics.
Returns:
| Type | Description |
|---|---|
| dict | Dictionary containing: 'rows' (number of rows), 'columns' (number of columns), 'memory_mb' (memory usage in MB), 'column_names' (list of column names), 'dtypes' (dictionary of column datatypes) |
Source code in src/infrastructure/persistence/csv_database_repository.py
CSVDatabaseRepository
CSVDatabaseRepository(filepath: Path, encoding: str = 'utf-8', separator: str = ';', required_columns: Optional[list[str]] = None)
Base implementation for CSV-based database repositories.
Provides common functionality for loading, caching, validating, and merging CSV databases. Specific database repositories inherit from this class.
Attributes:
| Name | Type | Description |
|---|---|---|
| filepath | Path | Path to CSV database file |
| encoding | str | File encoding (default: 'utf-8') |
| separator | str | CSV separator (default: ';') |
| required_columns | list[str] | List of required column names for validation |
| _data | Optional[DataFrame] | Cached database data (lazy loaded) |
Methods:
| Name | Description |
|---|---|
| load_data | Load CSV database with caching |
| reload_data | Force reload database from file |
| merge_with_dataset | Merge dataset with database |
| get_column_names | Get column names from database |
| validate_schema | Validate database schema |
| get_stats | Get database statistics |
Notes
- Implements lazy loading with caching for performance
- Optimizes dtypes to reduce memory usage
Initialize CSV database repository.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepath | Path | Path to CSV file. | required |
| encoding | str | File encoding. | 'utf-8' |
| separator | str | CSV separator. | ';' |
| required_columns | Optional[list[str]] | List of required column names for validation. | None |
Source code in src/infrastructure/persistence/csv_database_repository.py
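The relationship between this base class and the specific repositories can be pictured as a subclass that supplies a default path and a fixed set of required columns. The sketch below is an assumption about that shape, not the actual source: `BaseRepo` and `KeggLikeRepo` are hypothetical names, the fixture path is invented, and loading is omitted entirely.

```python
from pathlib import Path
from typing import Optional


class BaseRepo:
    """Hypothetical base: stores file settings and the columns to validate."""

    def __init__(
        self,
        filepath: Path,
        encoding: str = "utf-8",
        separator: str = ";",
        required_columns: Optional[list] = None,
    ):
        self.filepath = filepath
        self.encoding = encoding
        self.separator = separator
        # Copy into a fresh list so instances never share one mutable object.
        self.required_columns = list(required_columns or [])


class KeggLikeRepo(BaseRepo):
    """Hypothetical subclass: default path plus fixed required columns."""

    def __init__(
        self,
        filepath: Path = Path("data/databases/kegg_degradation_db.csv"),
        encoding: str = "utf-8",
        separator: str = ";",
    ):
        super().__init__(
            filepath, encoding, separator, required_columns=["ko", "pathname"]
        )


default_repo = KeggLikeRepo()  # uses the documented default path
custom_repo = KeggLikeRepo(
    filepath=Path("tests/fixtures/kegg_small.csv"),  # made-up path for illustration
    separator=",",
)
plain_repo = BaseRepo(Path("any.csv"))  # no required columns supplied
```

Keeping the path overridable is convenient for tests, which can point a repository at a small fixture file without touching the default database.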
Functions
load_data
Load CSV database into DataFrame with caching.
Returns:
| Type | Description |
|---|---|
| DataFrame | Database data with optimized dtypes |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If CSV file doesn't exist |
| ValueError | If CSV format is invalid or required columns missing |
Source code in src/infrastructure/persistence/csv_database_repository.py
reload_data
Force reload database from file.
Clears cache and reloads data from CSV file.
Returns:
| Type | Description |
|---|---|
| DataFrame | Freshly loaded database data |
Source code in src/infrastructure/persistence/csv_database_repository.py
merge_with_dataset
Merge dataset with database.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_df | DataFrame | Input dataset (must have join column) | required |
| on | str | Column name to join on | 'ko' |
| how | str | Join type ('inner', 'left', 'right', 'outer') | 'inner' |
Returns:
| Type | Description |
|---|---|
| DataFrame | Merged DataFrame |
Raises:
| Type | Description |
|---|---|
| ValueError | If join column missing in either DataFrame |
Source code in src/infrastructure/persistence/csv_database_repository.py
get_column_names
Get column names from database.
Returns:
| Type | Description |
|---|---|
| list[str] | List of column names |
validate_schema
Validate database schema.
Checks if all required columns are present in DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | Optional[DataFrame] | DataFrame to validate (if None, loads from file) | None |
Returns:
| Type | Description |
|---|---|
| bool | True if all required columns present, False otherwise |
Source code in src/infrastructure/persistence/csv_database_repository.py
get_stats
Get database statistics.
Returns:
| Type | Description |
|---|---|
| dict | Dictionary containing: 'rows' (number of rows), 'columns' (number of columns), 'memory_mb' (memory usage in MB), 'column_names' (list of column names), 'dtypes' (dictionary of column datatypes) |