ToxCSM Repository
toxcsm_repository
ToxCSM Repository - Toxicity Prediction Database Access.
Provides a repository implementation for accessing the ToxCSM (toxicity prediction) database, which contains toxicity predictions for environmental compounds.
Classes:
| Name | Description |
|---|---|
| ToxCSMRepository | Repository for ToxCSM toxicity prediction database |
Classes
ToxCSMRepository
ToxCSMRepository(filepath: Path = Path('data/databases/toxcsm_db.csv'), encoding: str = 'utf-8', separator: str = ';')
Bases: CSVDatabaseRepository
Repository for ToxCSM toxicity prediction database.
Provides access to compound-level toxicity predictions. Database file: data/databases/toxcsm_db.csv
Attributes:
| Name | Type | Description |
|---|---|---|
| filepath | Path | Path to ToxCSM database CSV file |
| encoding | str | File encoding (default: 'utf-8') |
| separator | str | CSV separator (default: ';') |
| required_columns | list[str] | Required columns: ['cpd'] |
Methods:
| Name | Description |
|---|---|
| merge_with_compound_data | Merge compound data with toxicity predictions |
Notes
- Merges on the 'cpd' column rather than 'ko' (the default join column of the inherited merge_with_dataset), because the data is compound-level.
Initialize ToxCSM repository.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepath | Path | Path to ToxCSM database CSV file. | Path('data/databases/toxcsm_db.csv') |
| encoding | str | File encoding. | 'utf-8' |
| separator | str | CSV separator. | ';' |
Source code in src/infrastructure/persistence/toxcsm_repository.py
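The constructor parameters above map naturally onto a semicolon-separated CSV read. The following is a minimal sketch, not the repository's actual implementation: `read_toxcsm_csv`, the `toxicity_score` column, and the sample rows are hypothetical, and only the 'cpd' column requirement comes from the documentation.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample content; only the 'cpd' column is documented as required.
SAMPLE_CSV = "cpd;toxicity_score\nC00001;0.12\nC00002;0.87\n"


def read_toxcsm_csv(filepath: Path, encoding: str = "utf-8", separator: str = ";") -> pd.DataFrame:
    """Read a ToxCSM-style CSV and check that the required 'cpd' column exists."""
    df = pd.read_csv(filepath, sep=separator, encoding=encoding)
    if "cpd" not in df.columns:
        raise ValueError("required column 'cpd' is missing")
    return df


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "toxcsm_db.csv"
    csv_path.write_text(SAMPLE_CSV, encoding="utf-8")
    toxcsm_df = read_toxcsm_csv(csv_path)

print(toxcsm_df.columns.tolist())
```

Reading with the wrong separator (for example a comma) would leave a single fused column, so the 'cpd' check also catches a mismatched `separator` argument.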
Functions
merge_with_compound_data
merge_with_compound_data(compound_df: DataFrame, on: str = 'cpd', how: str = 'left') -> pd.DataFrame
Merge compound data with toxicity predictions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| compound_df | DataFrame | DataFrame containing compound information (must have join column) | required |
| on | str | Column to join on | 'cpd' |
| how | str | Join type (default 'left' keeps all compounds) | 'left' |
Returns:
| Type | Description |
|---|---|
| DataFrame | Merged DataFrame with toxicity predictions |
Source code in src/infrastructure/persistence/toxcsm_repository.py
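To illustrate what the default `how='left'` join on 'cpd' implies, here is a small sketch using plain `pandas.DataFrame.merge`. It is assumed, not confirmed, that the method behaves like this internally; the `sample` and `toxicity_score` columns and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical compound table: C00003 has no toxicity prediction.
compound_df = pd.DataFrame(
    {"cpd": ["C00001", "C00002", "C00003"], "sample": ["S1", "S1", "S2"]}
)

# Hypothetical stand-in for the loaded ToxCSM database.
toxcsm_df = pd.DataFrame(
    {"cpd": ["C00001", "C00002"], "toxicity_score": [0.12, 0.87]}
)

# A left join keeps every row of compound_df; unmatched compounds get NaN.
merged = compound_df.merge(toxcsm_df, on="cpd", how="left")
print(merged)
```

With `how='inner'` the row for C00003 would be dropped instead of being kept with a missing prediction.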
load_data
Load CSV database into DataFrame with caching.
Returns:
| Type | Description |
|---|---|
| DataFrame | Database data with optimized dtypes |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If CSV file doesn't exist |
| ValueError | If CSV format is invalid or required columns are missing |
Source code in src/infrastructure/persistence/csv_database_repository.py
reload_data
Force reload database from file.
Clears cache and reloads data from CSV file.
Returns:
| Type | Description |
|---|---|
| DataFrame | Freshly loaded database data |
Source code in src/infrastructure/persistence/csv_database_repository.py
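The relationship between the cached `load_data` and the cache-clearing `reload_data` can be sketched as follows. `CachedCSVLoader` is a hypothetical stand-in written for this example, not the real `CSVDatabaseRepository`; dtype optimization and schema checks are omitted.

```python
import tempfile
from pathlib import Path
from typing import Optional

import pandas as pd


class CachedCSVLoader:
    """Hypothetical illustration of load/reload caching; not the real repository class."""

    def __init__(self, filepath: Path, separator: str = ";") -> None:
        self.filepath = filepath
        self.separator = separator
        self._cache: Optional[pd.DataFrame] = None

    def load_data(self) -> pd.DataFrame:
        # Read the file only on the first call; later calls return the cached frame.
        if self._cache is None:
            if not self.filepath.exists():
                raise FileNotFoundError(f"CSV file not found: {self.filepath}")
            self._cache = pd.read_csv(self.filepath, sep=self.separator)
        return self._cache

    def reload_data(self) -> pd.DataFrame:
        # Drop the cached frame, then read the file again.
        self._cache = None
        return self.load_data()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "db.csv"
    path.write_text("cpd;score\nC00001;0.1\n", encoding="utf-8")
    loader = CachedCSVLoader(path)
    first = loader.load_data()

    # The file changes on disk after the first load.
    path.write_text("cpd;score\nC00001;0.1\nC00002;0.9\n", encoding="utf-8")
    cached = loader.load_data()   # still the frame held in memory
    fresh = loader.reload_data()  # re-reads the file

print(len(cached), len(fresh))
```

The point of the sketch is that edits to the CSV on disk are only picked up after an explicit reload.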
merge_with_dataset
Merge dataset with database.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_df | DataFrame | Input dataset (must have join column) | required |
| on | str | Column name to join on | 'ko' |
| how | str | Join type ('inner', 'left', 'right', 'outer') | 'inner' |
Returns:
| Type | Description |
|---|---|
| DataFrame | Merged DataFrame |
Raises:
| Type | Description |
|---|---|
| ValueError | If join column is missing in either DataFrame |
Source code in src/infrastructure/persistence/csv_database_repository.py
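The documented contract (inner join on 'ko' by default, `ValueError` when the join column is absent from either side) might look roughly like the sketch below. `merge_on_column` is a hypothetical helper and the sample values are invented; the real method's error messages and internals are not shown in this page.

```python
import pandas as pd


def merge_on_column(
    dataset_df: pd.DataFrame, database_df: pd.DataFrame, on: str = "ko", how: str = "inner"
) -> pd.DataFrame:
    """Hypothetical helper: validate the join column on both sides, then merge."""
    for name, frame in (("dataset", dataset_df), ("database", database_df)):
        if on not in frame.columns:
            raise ValueError(f"join column '{on}' missing in {name} DataFrame")
    return dataset_df.merge(database_df, on=on, how=how)


# Invented sample data for illustration.
dataset_df = pd.DataFrame({"ko": ["K00001", "K00002", "K00003"], "sample": ["S1", "S1", "S2"]})
database_df = pd.DataFrame({"ko": ["K00001", "K00003"], "annotation": ["a", "b"]})

# Inner join: only keys present in both frames survive.
merged = merge_on_column(dataset_df, database_df)

# A dataset without the join column triggers the ValueError path.
try:
    merge_on_column(dataset_df.rename(columns={"ko": "gene"}), database_df)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(merged)
print(error_message)
```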
get_column_names
Get column names from database.
Returns:
| Type | Description |
|---|---|
| list[str] | List of column names |
validate_schema
Validate database schema.
Checks if all required columns are present in DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | Optional[DataFrame] | DataFrame to validate (if None, loads from file) | None |
Returns:
| Type | Description |
|---|---|
| bool | True if all required columns are present, False otherwise |
Source code in src/infrastructure/persistence/csv_database_repository.py
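The required-column check described above reduces to a membership test over the frame's columns. This is a sketch under the assumption that validation only looks at column names; `has_required_columns` is a hypothetical name, and the `['cpd']` list mirrors the `required_columns` attribute documented for `ToxCSMRepository`.

```python
import pandas as pd

# Mirrors the documented required_columns of ToxCSMRepository.
REQUIRED_COLUMNS = ["cpd"]


def has_required_columns(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> bool:
    """Hypothetical helper: True only if every required column is present."""
    return all(column in df.columns for column in required)


# Invented frames: one with the 'cpd' column, one without.
valid = has_required_columns(pd.DataFrame({"cpd": ["C00001"], "score": [0.1]}))
invalid = has_required_columns(pd.DataFrame({"ko": ["K00001"]}))

print(valid, invalid)
```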
get_stats
Get database statistics.
Returns:
| Type | Description |
|---|---|
| dict | Dictionary containing: 'rows' (number of rows), 'columns' (number of columns), 'memory_mb' (memory usage in MB), 'column_names' (list of column names), 'dtypes' (dictionary of column datatypes) |
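A dictionary with the keys listed above can be assembled from standard pandas attributes. The sketch below is an assumption about how such statistics could be computed, not the repository's code: `frame_stats` is a hypothetical name, and the choice of deep memory accounting and 1 MB = 1024² bytes is a guess.

```python
import pandas as pd


def frame_stats(df: pd.DataFrame) -> dict:
    """Hypothetical helper that builds a stats dictionary with the documented keys."""
    return {
        "rows": len(df),
        "columns": len(df.columns),
        # Assumption: deep accounting, with 1 MB taken as 1024 * 1024 bytes.
        "memory_mb": df.memory_usage(deep=True).sum() / (1024 ** 2),
        "column_names": df.columns.tolist(),
        "dtypes": {column: str(dtype) for column, dtype in df.dtypes.items()},
    }


# Invented sample frame for illustration.
stats = frame_stats(
    pd.DataFrame({"cpd": ["C00001", "C00002"], "toxicity_score": [0.12, 0.87]})
)
print(stats["rows"], stats["columns"], stats["column_names"])
```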