ToxCSM Database Schema¶

Formal specification of the ToxCSM Database v1.0.0 schema

This document specifies the schema of the ToxCSM (Toxicity Computational Structural Model) Database, which contains computational toxicology predictions for environmental compounds.

Overview¶

The ToxCSM Database uses a wide-format schema optimized for multi-endpoint toxicological analysis: one row per compound and one column per attribute. The schema consists of a single table with 66 columns: 4 compound identifier columns (including the SMILES structure) and, for each of the 31 toxicological endpoints, one categorical label column and one numeric value column (4 + 31 + 31 = 66).

Design rationale:

Comprehensive — 31 toxicological endpoints with both categorical labels and numeric scores
Multi-database integration — KEGG, ChEBI, and SMILES identifiers enable cross-referencing
Risk assessment — Categories (High Safety, Medium Safety, Low Safety, etc.) enable rapid risk classification

Schema Definition¶

Table: ToxCSM Database v1.0.0¶

Primary format: CSV-style delimited text (fields are separated by semicolons, not commas)

Rows: 370 (as of v1.0.0)

Columns: 66

Character encoding: UTF-8

Field delimiter: Semicolon (;)

Text qualifier: Double quote (")

Header row: Yes (column names in first row)
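
The format parameters above map directly onto a delimited-text reader. The following is a minimal sketch using Python's standard `csv` module; the helper name `read_toxcsm` and the one-row sample are illustrative assumptions, not part of the database distribution.

```python
import csv
import io


def read_toxcsm(handle):
    """Parse semicolon-delimited, double-quote-qualified text that has a header row."""
    reader = csv.DictReader(handle, delimiter=";", quotechar='"')
    return list(reader)


# Made-up sample in the documented format (placeholder values, not real rows).
sample = io.StringIO(
    'SMILES;cpd;ChEBI;compoundname\n'
    '"CCO";"C00000";"CHEBI:0";"Example; compound"\n'
)
rows = read_toxcsm(sample)
print(rows[0]["compoundname"])  # the quoted semicolon stays inside the field
```

For the real file, a handle opened with `open(path, newline="", encoding="utf-8")` can be passed instead; for v1.0.0 the result should then hold 370 rows with 66 keys each.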

Column Organization¶

The 66 columns are organized into three categories:

1. Compound Identifiers (4 columns)¶

#	Column Name	Data Type	Description
1	`SMILES`	Character	Simplified Molecular Input Line Entry System
2	`cpd`	Character	KEGG Compound ID
3	`ChEBI`	Character	Chemical Entities of Biological Interest ID
4	`compoundname`	Character	Human-readable compound name

2. Toxicity Labels (31 columns)¶

Categorical predictions with 6 possible values: High Safety, Medium Safety, Low Safety, Low Toxicity, Medium Toxicity, High Toxicity

3. Toxicity Values (31 columns)¶

Numeric scores (0.0 - 1.0) representing prediction confidence/probability
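
Because the label and value columns share the `label_` and `value_` prefixes, the three groups can be recovered from the header alone. A small sketch, assuming a hypothetical helper `partition_columns` and a shortened header for illustration:

```python
def partition_columns(columns):
    """Split column names into identifier, label, and value groups by prefix."""
    labels = [c for c in columns if c.startswith("label_")]
    values = [c for c in columns if c.startswith("value_")]
    identifiers = [c for c in columns
                   if not c.startswith(("label_", "value_"))]
    return identifiers, labels, values


# Shortened header for illustration; the full header has 66 names.
header = ["SMILES", "cpd", "ChEBI", "compoundname",
          "label_Avian", "value_Avian", "label_NR_AhR", "value_NR_AhR"]
identifiers, labels, values = partition_columns(header)
print(len(identifiers), len(labels), len(values))  # 4 2 2
```

Applied to the full header, the group sizes should come out as 4, 31, and 31.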

Identifier Columns¶

Column 1: `SMILES`¶

Column name: SMILES

Data type: Character (string)

Format: SMILES notation (Simplified Molecular Input Line Entry System)

Example: [Ni+2], C[C@@H](Oc1ccc(Cl)cc1Cl)C(=O)O

Controlled vocabulary: No — Chemical structure notation

Uniqueness: Unique — 370 unique SMILES (one per compound)

Purpose: Machine-readable molecular structure representation for cheminformatics

Validation rules:

✅ Must be valid SMILES notation
✅ Must be non-empty string

Cardinality: 370 unique SMILES in v1.0.0

Column 2: `cpd`¶

Column name: cpd

Data type: Character (string)

Format: KEGG Compound identifier

Pattern: C##### (capital C followed by exactly 5 digits)

Example: C19609, C19610, C20685

Controlled vocabulary: Yes — Values must exist in KEGG Compound database

Uniqueness: Unique — 370 unique compound IDs

Purpose: Links to KEGG Compound database for cross-reference

Cross-references:

KEGG Compound database: https://www.kegg.jp/entry/C#####
BioRemPP Database (via cpd column)

Validation rules:

✅ Must match regex: ^C\d{5}$
✅ Must exist in KEGG Compound database

Cardinality: 370 unique compounds in v1.0.0
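
The pattern rule can be checked locally with the regular expression given above. This sketch covers only the syntactic rule; confirming that an ID exists in the KEGG Compound database requires a lookup against KEGG and is not shown. The helper name `invalid_cpd_ids` is an assumption made for illustration.

```python
import re

# Same pattern as in the validation rules; re.ASCII restricts \d to 0-9.
KEGG_CPD = re.compile(r"^C\d{5}$", re.ASCII)


def invalid_cpd_ids(ids):
    """Return the IDs that do not match the C##### pattern."""
    return [i for i in ids if KEGG_CPD.fullmatch(i) is None]


print(invalid_cpd_ids(["C19609", "c19609", "C1960", "C196099"]))
# ['c19609', 'C1960', 'C196099']
```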

Column 3: `ChEBI`¶

Column name: ChEBI

Data type: Character (string)

Format: ChEBI identifier

Pattern: CHEBI:<digits> (the prefix CHEBI: followed by a numeric ID; the number of digits is not fixed)

Example: CHEBI:29035, CHEBI:49786, CHEBI:75288

Controlled vocabulary: Yes — Values must exist in ChEBI database

Uniqueness: Unique — 370 unique ChEBI IDs

Purpose: Links to ChEBI (Chemical Entities of Biological Interest) database

Cross-references:

ChEBI database: https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:#####

Cardinality: 370 unique ChEBI IDs in v1.0.0
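
The cross-reference URL templates listed for the `cpd` and `ChEBI` columns can be filled in from a row's identifiers. A minimal sketch; the function names are assumptions, and the URLs are simply the templates from this document with the placeholder replaced.

```python
def kegg_url(cpd):
    """Build the KEGG entry URL for a KEGG Compound ID such as 'C19609'."""
    return f"https://www.kegg.jp/entry/{cpd}"


def chebi_url(chebi_id):
    """Build the ChEBI lookup URL for an ID such as 'CHEBI:29035'."""
    return f"https://www.ebi.ac.uk/chebi/searchId.do?chebiId={chebi_id}"


print(kegg_url("C19609"))
print(chebi_url("CHEBI:29035"))
```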

Column 4: `compoundname`¶

Column name: compoundname

Data type: Character (string)

Format: Free text (chemical nomenclature)

Example: Nickel(2+), Manganese(2+), (R)-(2,4-Dichlorophenoxy)propanoic acid

Controlled vocabulary: No — Free text from KEGG

Uniqueness: Unique — 370 unique names

Purpose: Human-readable compound name for display and search

Cardinality: 370 unique names in v1.0.0

Toxicity Endpoint Columns¶

Label Columns (Categorical)¶

All label columns use a 6-level categorical scale:

Category	Interpretation
`High Safety`	Minimal toxicity concern
`Medium Safety`	Low toxicity concern
`Low Safety`	Moderate toxicity concern
`Low Toxicity`	Moderate toxicity observed
`Medium Toxicity`	Significant toxicity
`High Toxicity`	Severe toxicity concern
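
For sorting or comparing labels it can help to treat the six categories as an ordinal scale. The sketch below assumes that the order of the table above (from `High Safety` to `High Toxicity`) is the intended severity order; the names `LABEL_ORDER`, `LABEL_RANK`, and `most_severe` are illustrative.

```python
# Assumed severity order, taken from the order of the category table.
LABEL_ORDER = [
    "High Safety", "Medium Safety", "Low Safety",
    "Low Toxicity", "Medium Toxicity", "High Toxicity",
]
LABEL_RANK = {label: rank for rank, label in enumerate(LABEL_ORDER)}


def most_severe(labels):
    """Return the label with the highest assumed severity rank."""
    return max(labels, key=LABEL_RANK.__getitem__)


print(most_severe(["Low Safety", "High Safety", "Medium Toxicity"]))
# Medium Toxicity
```

An unknown label raises `KeyError`, which doubles as a check that a column contains only the six documented values.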

Toxicological Endpoints (31 total)¶

Endpoint Category	Endpoint	Label Column	Value Column	Samples	Source
Environmental
	Avian	`label_Avian`	`value_Avian`	591 (501/90)	Zhang et al., 2015
	Biodegradation	`label_Biodegradation`	`value_Biodegradation`	1592 (1007/585)	Cheng et al., 2012
	Crustacean	`label_Crustacean`	`value_Crustacean`	1020 (487/533)	Cao et al., 2018
	Fathead Minnow	`label_Fathead_Minnow`	`value_Fathead_Minnow`	554 (188/366)	Fei-xiong et al., 2010
	Honey Bee	`label_Honey_Bee`	`value_Honey_Bee`	195 (96/99)	Fei-xiong et al., 2010
	T. Pyriformis	`label_T._Pyriformis`	`value_T._Pyriformis`	1571 (354/1217)	Cheng et al., 2011
Genomic
	AMES Mutagenesis	`label_Genomic_AMES_Mutagenesis`	`value_Genomic_AMES_Mutagenesis`	8102 (3470/4632)	Xu et al., 2012
	Carcinogenesis	`label_Genomic_Carcinogenesis`	`value_Genomic_Carcinogenesis`	278 (220/58)	Li et al., 2015
	Micronucleus	`label_Genomic_Micronucleus`	`value_Genomic_Micronucleus`	641 (377/264)	Fan et al., 2018
Organic
	Eye Corrosion	`label_Eye_Corrosion`	`value_Eye_Corrosion`	2299 (1412/887)	Wang et al., 2017
	Eye Irritation	`label_Eye_Irritation`	`value_Eye_Irritation`	5220 (1346/3874)	Wang et al., 2017
	hERG I Inhibitor	`label_hERG_I_Inhibitor`	`value_hERG_I_Inhibitor`	368 (289/79)	Marchese et al., 2011
	hERG II Inhibitor	`label_hERG_II_Inhibitor`	`value_hERG_II_Inhibitor`	806 (373/433)	Wang et al., 2012
	Liver Injury I	`label_Liver_Injury_I`	`value_Liver_Injury_I`	515 (277/238)	Fourches et al., 2010
	Liver Injury II	`label_Liver_Injury_II`	`value_Liver_Injury_II`	960 (670/290)	Mulliner et al., 2016
	Respiratory Disease	`label_Respiratory_Disease`	`value_Respiratory_Disease`	2529 (1089/1440)	Wang et al., 2021
	Skin Sensitisation	`label_Skin_Sensitisation`	`value_Skin_Sensitisation`	404 (130/274)	Alves et al., 2015
Nuclear Response
	NR-AhR	`label_NR_AhR`	`value_NR_AhR`	6876 (6088/788)	Huang et al., 2016
	NR-AR	`label_NR_AR`	`value_NR_AR`	7583 (7307/276)	Huang et al., 2016
	NR-AR-LBD	`label_NR_AR_LBD`	`value_NR_AR_LBD`	6922 (6698/224)	Huang et al., 2016
	NR-Aromatase	`label_NR_Aromatase`	`value_NR_Aromatase`	5961 (5664/297)	Huang et al., 2016
	NR-ER	`label_NR_ER`	`value_NR_ER`	6437 (5725/712)	Huang et al., 2016
	NR-ER-LBD	`label_NR_ER_LBD`	`value_NR_ER_LBD`	7133 (6813/320)	Huang et al., 2016
	NR-GR	`label_NR_GR`	`value_NR_GR`	6055 (5454/601)	Sun et al., 2019
	NR-PPAR-gamma	`label_NR_PPAR_gamma`	`value_NR_PPAR_gamma`	6602 (6418/184)	Huang et al., 2016
	NR-TR	`label_NR_TR`	`value_NR_TR`	5372 (4999/373)	Sun et al., 2019
Stress Response
	SR-ARE	`label_SR_ARE`	`value_SR_ARE`	5956 (5014/942)	Huang et al., 2016
	SR-ATAD5	`label_SR_ATAD5`	`value_SR_ATAD5`	7251 (6999/252)	Huang et al., 2016
	SR-HSE	`label_SR_HSE`	`value_SR_HSE`	6613 (6257/356)	Huang et al., 2016
	SR-MMP	`label_SR_MMP`	`value_SR_MMP`	6094 (5163/931)	Huang et al., 2016
	SR-p53	`label_SR_p53`	`value_SR_p53`	7112 (6686/426)	Huang et al., 2016
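
Each endpoint in the table contributes a `label_<endpoint>` and a `value_<endpoint>` column, so a prediction can be read as a pair. A short sketch, assuming rows are dictionaries of strings (as produced by a CSV reader); the helper names and the sample values are made up for illustration and do not come from the database.

```python
def endpoint_names(columns):
    """List endpoint suffixes found in the label_ column names."""
    prefix = "label_"
    return sorted(c[len(prefix):] for c in columns if c.startswith(prefix))


def endpoint_prediction(row, endpoint):
    """Return (label, numeric value) for one endpoint of one row."""
    return row[f"label_{endpoint}"], float(row[f"value_{endpoint}"])


# Illustrative row fragment with placeholder values.
row = {"cpd": "C00000",
       "label_Avian": "High Safety", "value_Avian": "0.12",
       "label_T._Pyriformis": "Low Toxicity", "value_T._Pyriformis": "0.61"}
print(endpoint_names(row))                        # ['Avian', 'T._Pyriformis']
print(endpoint_prediction(row, "T._Pyriformis"))  # ('Low Toxicity', 0.61)
```

Note that the suffix must match the column name exactly, including the `Genomic_` prefix and the period in `T._Pyriformis`.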

Value Columns (Numeric)¶

All value columns contain:

Data type: float64
Range: 0.0 - 1.0
Interpretation: Higher values indicate stronger prediction confidence
Usage: Threshold-based classification, ranking, quantitative risk assessment
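
The range rule and the threshold-based usage can be sketched with plain Python. The cutoff of 0.5 is an arbitrary assumption for illustration, not a value recommended by the database, and the rows are made-up placeholders.

```python
def out_of_range(rows, column):
    """Return names of compounds whose value falls outside 0.0 to 1.0."""
    return [r["compoundname"] for r in rows
            if not 0.0 <= float(r[column]) <= 1.0]


def rank_above(rows, column, threshold=0.5):
    """Keep rows at or above the threshold, highest value first."""
    hits = [r for r in rows if float(r[column]) >= threshold]
    return sorted(hits, key=lambda r: float(r[column]), reverse=True)


# Placeholder rows; values are strings, as read from a delimited file.
rows = [
    {"compoundname": "compound A", "value_Biodegradation": "0.91"},
    {"compoundname": "compound B", "value_Biodegradation": "0.35"},
    {"compoundname": "compound C", "value_Biodegradation": "0.67"},
]
ranked = rank_above(rows, "value_Biodegradation")
print([r["compoundname"] for r in ranked])          # ['compound A', 'compound C']
print(out_of_range(rows, "value_Biodegradation"))   # []
```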

Schema Constraints and Relationships¶

Primary Key¶

Conceptual primary key: cpd (KEGG Compound ID) — unique per row
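
Since the file format itself cannot enforce the key, uniqueness has to be checked after loading. A minimal sketch, assuming rows are dictionaries keyed by column name; `duplicate_keys` is a hypothetical helper and the IDs are placeholders.

```python
from collections import Counter


def duplicate_keys(rows, key="cpd"):
    """Return key values that occur in more than one row."""
    counts = Counter(r[key] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


# Placeholder rows: the second and third share an ID on purpose.
rows = [{"cpd": "C00000"}, {"cpd": "C99999"}, {"cpd": "C99999"}]
print(duplicate_keys(rows))  # ['C99999']
```

An empty list means the column satisfies the uniqueness constraint.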

Foreign Key Relationships¶

Conceptual foreign keys:

cpd → KEGG Compound database
cpd → BioRemPP Database
ChEBI → ChEBI database

Data Quality Specifications¶

Completeness¶

100% field completeness — Zero missing values across all 66 columns

Verification¶

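# Treat empty fields as missing too; by default only the string "NA" becomes NA in character columns
db <- read.csv("data/databases/toxcsm_db.csv", sep = ";", na.strings = c("NA", ""))
colSums(is.na(db))  # Should return all zeros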
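
The completeness check above can also be sketched in Python with the standard library only. This version counts empty strings, the literal `NA`, and fields absent from a short row as missing; that definition of "missing" is an assumption, and `missing_counts` is a hypothetical helper.

```python
import csv
import io


def missing_counts(rows):
    """Count missing fields per column (None, empty, or the literal 'NA')."""
    counts = {}
    for row in rows:
        for col, val in row.items():
            missing = val is None or val.strip() in ("", "NA")
            counts[col] = counts.get(col, 0) + int(missing)
    return counts


# Made-up sample: the second data row has an empty ChEBI field.
sample = io.StringIO(
    "cpd;ChEBI;compoundname\n"
    "C00000;CHEBI:0;Example A\n"
    "C99999;;Example B\n"
)
rows = list(csv.DictReader(sample, delimiter=";", quotechar='"'))
print(missing_counts(rows))  # {'cpd': 0, 'ChEBI': 1, 'compoundname': 0}
```

On the v1.0.0 file every count should be zero.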

Data Provenance¶

SMILES, cpd, compoundname — From KEGG Compound database
ChEBI — Cross-reference to ChEBI database
Toxicity predictions — Generated by ToxCSM QSAR models

Source: ToxCSM (Toxicity Computational Structural Model) web server predictions

Usage Examples¶

Loading the Database¶

R

library(readr)
db <- read_delim("data/databases/toxcsm_db.csv", delim=";")

# View identifier columns
head(db[, c("cpd", "compoundname", "ChEBI")])

# View ecotoxicity labels
head(db[, c("compoundname", "label_Avian", "label_Crustacean")])

Python

import pandas as pd

db = pd.read_csv("data/databases/toxcsm_db.csv", sep=";")

# Filter high toxicity compounds
high_tox = db[db['label_hERG_I_Inhibitor'] == 'High Toxicity']

# Get average toxicity scores
value_cols = [c for c in db.columns if c.startswith('value_')]
db[value_cols].mean()

Common Queries¶

# Find compounds with high cardiotoxicity risk
high_herg <- db[db$label_hERG_I_Inhibitor == "High Toxicity", ]

# Get all toxicity labels (every label_ column) for a compound
compound <- db[db$compoundname == "Benzene",
               grep("^label_", names(db))]

# Rank compounds by biodegradation
ranked <- db[order(-db$value_Biodegradation), 
             c("compoundname", "value_Biodegradation")]

Questions?¶

GitHub Issues: https://github.com/BioRemPP/biorempp_db/issues
Email: biorempp@gmail.com
