KEGG Degradation Database Schema¶
Formal specification of the KEGG Degradation Database v1.0.0 schema
This document provides a comprehensive technical specification of the KEGG Degradation Database schema, containing genes involved in xenobiotic and pollutant degradation pathways extracted from KEGG.
Overview¶
The KEGG Degradation Database uses a flat, denormalized schema optimized for gene-pathway queries. The schema consists of a single table with 3 columns representing KO-pathway-genesymbol associations for degradation pathways.
Design rationale:
- Simplicity — Minimal 3-column structure for focused pathway analysis
- KEGG-sourced — All data extracted from official KEGG pathways
- Interoperability — Standard CSV format ensures compatibility with bioinformatics pipelines
- Cross-reference ready — KO identifiers enable integration with BioRemPP and HADEG databases
Schema Definition¶
Table: KEGG Degradation Database v1.0.0¶
Primary format: CSV (delimited text; fields are separated by semicolons, not commas)
Rows: 855 (as of v1.0.0)
Columns: 3
Character encoding: UTF-8
Field delimiter: Semicolon (;)
Text qualifier: Double quote (")
Header row: Yes (column names in first row)
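The file properties above map directly onto parser settings. The following is a minimal sketch using only Python's standard library; the function name `read_rows` and the in-memory sample are illustrative assumptions, and the sample row reuses the example values from the column specification rather than quoting the real file.

```python
import csv
import io

# Column names as documented in this schema.
EXPECTED_COLUMNS = ["ko", "pathname", "genesymbol"]

def read_rows(handle):
    """Parse semicolon-delimited, double-quote-qualified text into dict rows."""
    reader = csv.DictReader(handle, delimiter=";", quotechar='"')
    # The first row must be the header with exactly the documented columns.
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return list(reader)

# Illustrative in-memory sample (assumption: not copied from the real file).
sample = io.StringIO('ko;pathname;genesymbol\n"K00001";"Naphthalene";"E1.1.1.1"\n')
rows = read_rows(sample)
print(rows[0]["ko"])
```

For the real file, the handle would typically be opened with `open(path, encoding="utf-8", newline="")` so that the documented UTF-8 encoding is applied and the `csv` module handles line endings itself.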
Column Specifications¶
| # | Column Name | Data Type | Nullable | Controlled Vocabulary | Example Value |
|---|---|---|---|---|---|
| 1 | ko | Character | No | KEGG Orthology ID | K00001 |
| 2 | pathname | Character | No | Pathway name | Naphthalene |
| 3 | genesymbol | Character | No | Gene symbol | E1.1.1.1 |
Data completeness: 100% — Zero missing values across all fields
Column 1: ko¶
Column name: ko
Data type: Character (string)
Format: KEGG Orthology identifier
Pattern: K##### (capital K followed by exactly 5 digits)
Example: K00001, K00002, K00022
Controlled vocabulary: Yes — Values must exist in KEGG Orthology database
Uniqueness: Not unique — 517 unique KO entries, some appear in multiple pathways
Purpose: Links genes to functional orthologs in KEGG for cross-database integration
Cross-references:
- KEGG Orthology database: `https://www.kegg.jp/entry/K#####`
- KEGG Pathway maps
- BioRemPP Database (via `ko` column)
- HADEG Database (via `ko` column)
Validation rules:
- ✅ Must match regex: `^K\d{5}$`
- ✅ Must exist in KEGG Orthology database
- ✅ Case-sensitive (uppercase K required)
Cardinality: 517 unique KO entries in v1.0.0
Column 2: pathname¶
Column name: pathname
Data type: Character (string)
Format: Pathway name (human-readable)
Example: Naphthalene, Cytochrome P450, Aromatic, Chloroalkane
Controlled vocabulary: Partially — 20 standardized pathway names
Uniqueness: Not unique — Multiple genes belong to each pathway
Purpose: Categorizes genes by their KEGG degradation pathway
Valid pathway names:
| Pathway | Description |
|---|---|
| Naphthalene | Naphthalene degradation |
| Cytochrome P450 | Cytochrome P450 metabolism |
| Aromatic | General aromatic degradation |
| Benzoate | Benzoate degradation (map00362) |
| Aminobenzoate | Aminobenzoate degradation |
| Chloroalkane | Chloroalkane/chloroalkene degradation |
| Chlorocyclohexane | Chlorocyclohexane/chlorobenzene |
| Toluene | Toluene degradation |
| Xylene | Xylene degradation |
| Styrene | Styrene degradation |
| Fluorobenzoate | Fluorobenzoate degradation |
| Dioxin | Dioxin degradation |
| Nitrotoluene | Nitrotoluene degradation |
| Ethylbenzene | Ethylbenzene degradation |
| Caprolactam | Caprolactam degradation |
| Atrazine | Atrazine degradation |
| Bisphenol | Bisphenol degradation |
| Polycyclic | Polycyclic aromatic degradation |
| Furfural | Furfural degradation |
Validation rules:
- ✅ Must be non-empty string
- ✅ Should match one of 20 documented pathways
Cardinality: 20 unique pathways in v1.0.0
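Because the second rule is a "should" rather than a "must", a check on `pathname` is better reported than enforced. The sketch below returns the offending values instead of raising; the function name `unknown_pathnames` is hypothetical, and the `allowed` set shown is a deliberately small subset of the documented names, used only to keep the example short.

```python
def unknown_pathnames(rows, allowed):
    """Return the distinct pathname values that are not in `allowed`, sorted.

    Empty strings are reported too, since they can never be an allowed name.
    """
    return sorted({row["pathname"] for row in rows if row["pathname"] not in allowed})

# Assumption: a three-name subset of the documented vocabulary, for illustration.
allowed = {"Naphthalene", "Toluene", "Xylene"}

# Illustrative rows: one valid name, one misspelling, one empty value.
rows = [
    {"pathname": "Naphthalene"},
    {"pathname": "Tolune"},
    {"pathname": ""},
]

print(unknown_pathnames(rows, allowed))
```

In practice the `allowed` set would be built from the full list of valid pathway names for the database version being checked.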
Column 3: genesymbol¶
Column name: genesymbol
Data type: Character (string)
Format: Standard gene nomenclature (free text)
Example: E1.1.1.1, AKR1A1, HADH, nahAc
Controlled vocabulary: No — Free text, sourced from KEGG KO annotations
Uniqueness: Not unique — 513 unique symbols, some shared across pathways
Purpose: Provides short gene abbreviation for the associated KO group
Naming conventions:
- EC number format for enzymes (e.g., `E1.1.1.1`)
- HUGO nomenclature for human genes (e.g., `AKR1A1`)
- Organism-specific symbols for microbial genes (e.g., `nahAc`)
Validation rules:
- ✅ Must be non-empty string
- ✅ Should correspond to KEGG KO annotation
Cardinality: 513 unique gene symbols in v1.0.0
Schema Constraints and Relationships¶
Primary Key¶
None defined — Database uses a flat structure without explicit primary key
Conceptual primary key: The combination (ko, pathname) approximates a unique key, but uniqueness is not enforced
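Since the (ko, pathname) key is only conceptual, it can be worth checking how closely a given file follows it. This is a minimal sketch with the standard library; `duplicate_keys` is a hypothetical helper and the rows are invented for illustration.

```python
from collections import Counter

def duplicate_keys(rows):
    """Return {(ko, pathname): count} for every pair that occurs more than once."""
    counts = Counter((row["ko"], row["pathname"]) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

# Illustrative rows only; not real database content.
rows = [
    {"ko": "K00001", "pathname": "Naphthalene"},
    {"ko": "K00001", "pathname": "Chloroalkane"},
    {"ko": "K00001", "pathname": "Naphthalene"},  # repeats the first pair
]

print(duplicate_keys(rows))
```

An empty result would mean the pair behaves as a true key for that file; a non-empty one lists the pairs that would need a third column to disambiguate.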
Foreign Key Relationships¶
Conceptual foreign keys (not enforced):
- `ko` → KEGG Orthology database
- `ko` → BioRemPP Database
- `ko` → HADEG Database
Cross-table joins:
Users can join KEGG Degradation Database with:
- BioRemPP Database, via `ko` column (compound-gene relationships)
- HADEG Database, via `ko` column (pathway details)
- KEGG Orthology, via `ko` column
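A join on `ko` can be sketched without any third-party dependency. In the example below, `join_on_ko` is a hypothetical helper, and the right-hand table with its `compound` column is an assumption made for illustration; the actual column names of the other databases are not specified in this document.

```python
from collections import defaultdict

def join_on_ko(left_rows, right_rows):
    """Inner join of two lists of dict rows on the shared 'ko' column."""
    # Index the right-hand rows by KO so each left row is matched in one lookup.
    index = defaultdict(list)
    for right in right_rows:
        index[right["ko"]].append(right)

    joined = []
    for left in left_rows:
        for right in index.get(left["ko"], []):
            # On a column-name clash, the left-hand value is kept.
            joined.append({**right, **left})
    return joined

# Illustrative rows only; pathway and symbol values are placeholders.
kegg_rows = [
    {"ko": "K00001", "pathname": "Naphthalene", "genesymbol": "E1.1.1.1"},
    {"ko": "K00002", "pathname": "placeholder pathway", "genesymbol": "placeholder"},
]
# Hypothetical second table; the 'compound' column name is an assumption.
other_rows = [{"ko": "K00001", "compound": "example compound"}]

result = join_on_ko(kegg_rows, other_rows)
print(len(result), result[0]["compound"])
```

Because the KO-to-pathway relationship is many-to-many, a join like this can return more rows than either input, so row counts after joining should not be read as gene counts.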
Cardinality Relationships¶
KO-to-Pathway¶
Many-to-Many — One KO can participate in multiple pathways, and one pathway contains multiple KOs
Example:
- `K00001` → 3 different pathways
- `Naphthalene` → 50+ KO entries
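Both directions of the relationship can be derived with the same grouping step. The sketch below uses a hypothetical helper, `group_values`, and invented rows; the counts it produces are not the database's real figures.

```python
from collections import defaultdict

def group_values(rows, key, value):
    """Map each distinct `key` value to the set of `value` values seen with it."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key]].add(row[value])
    return dict(groups)

# Illustrative rows only; not real database content.
rows = [
    {"ko": "K00001", "pathname": "Naphthalene"},
    {"ko": "K00001", "pathname": "Chloroalkane"},
    {"ko": "K00002", "pathname": "Naphthalene"},
]

pathways_by_ko = group_values(rows, "ko", "pathname")
kos_by_pathway = group_values(rows, "pathname", "ko")

# KOs that appear in more than one pathway.
multi_pathway_kos = sorted(k for k, p in pathways_by_ko.items() if len(p) > 1)
print(multi_pathway_kos)
```

A KO with more than one pathway, together with a pathway holding more than one KO, is what makes the relationship many-to-many rather than one-to-many.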
Data Quality Specifications¶
Completeness¶
100% field completeness — Zero missing values across all 3 columns
Verification:
db <- read.csv("data/databases/kegg_degradation_db.csv", sep=";")
colSums(is.na(db)) # Should return all zeros
Consistency¶
Identifier format consistency:
- All `ko` values match `^K\d{5}$`
Verification:
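One way to run this check is sketched here in Python with only the standard library. The helper name `invalid_kos` is an assumption, and the test values are invented to exercise the rule.

```python
import re

# Capital K followed by exactly five ASCII digits.
# re.ASCII keeps \d from matching non-ASCII digit characters.
KO_PATTERN = re.compile(r"^K\d{5}$", re.ASCII)

def invalid_kos(values):
    """Return the values that do not match the KO identifier pattern."""
    # fullmatch is used so that a trailing newline is rejected as well.
    return [v for v in values if not KO_PATTERN.fullmatch(v)]

# Invented test values: one valid identifier and four malformed ones.
candidates = ["K00001", "k00001", "K0001", "K000011", "K00001\n"]
print(invalid_kos(candidates))
```

An empty list for the full `ko` column would confirm the format rule. It does not confirm that each identifier exists in KEGG Orthology, which needs a lookup against KEGG itself.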
Accuracy¶
Data provenance:
- `ko`: Sourced from KEGG Orthology database
- `pathname`: Extracted from KEGG Pathway database
- `genesymbol`: Sourced from KEGG KO annotations
Source: KEGG Pathway Database — Metabolism → Xenobiotics biodegradation and metabolism
Usage Examples¶
Loading the Database¶
import pandas as pd
# Load database
db = pd.read_csv("data/databases/kegg_degradation_db.csv", sep=";")
# Inspect schema
db.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 855 entries, 0 to 854
# Data columns (total 3 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 ko 855 non-null object
# 1 pathname 855 non-null object
# 2 genesymbol 855 non-null object
Common Queries¶
# R examples; db is the data frame loaded with read.csv() as in the Completeness check
# Find all genes in naphthalene degradation
naphthalene_genes <- db[db$pathname == "Naphthalene", ]
# Count genes per pathway
table(db$pathname)
# Find pathways for a specific KO
ko_pathways <- unique(db[db$ko == "K00001", "pathname"])
Questions?¶
GitHub Issues: https://github.com/BioRemPP/biorempp_db/issues
Email: biorempp@gmail.com