KEGG Degradation Database Schema¶

Formal specification of the KEGG Degradation Database v1.0.0 schema

This document provides a comprehensive technical specification of the KEGG Degradation Database schema, containing genes involved in xenobiotic and pollutant degradation pathways extracted from KEGG.

Overview¶

The KEGG Degradation Database uses a flat, denormalized schema optimized for gene-pathway queries. The schema consists of a single table with 3 columns representing KO-pathway-genesymbol associations for degradation pathways.

Design rationale:

Simplicity — Minimal 3-column structure for focused pathway analysis
KEGG-sourced — All data extracted from official KEGG pathways
Interoperability — Standard CSV format ensures compatibility with bioinformatics pipelines
Cross-reference ready — KO identifiers enable integration with BioRemPP and HADEG databases

Schema Definition¶

Table: KEGG Degradation Database v1.0.0¶

Primary format: CSV (Comma-Separated Values)

Rows: 855 (as of v1.0.0)

Columns: 3

Character encoding: UTF-8

Field delimiter: Semicolon (;)

Text qualifier: Double quote (")

Header row: Yes (column names in first row)

Column Specifications¶

#	Column Name	Data Type	Nullable	Controlled Vocabulary	Example Value
1	`ko`	Character	No	KEGG Orthology ID	`K00001`
2	`pathname`	Character	No	Pathway name	`Naphthalene`
3	`genesymbol`	Character	No	Gene symbol	`E1.1.1.1`

Data completeness: 100% — Zero missing values across all fields

Column 1: `ko`¶

Column name: ko

Data type: Character (string)

Format: KEGG Orthology identifier

Pattern: K##### (capital K followed by exactly 5 digits)

Example: K00001, K00002, K00022

Controlled vocabulary: Yes — Values must exist in KEGG Orthology database

Uniqueness: Not unique — 517 unique KO entries, some appear in multiple pathways

Purpose: Links genes to functional orthologs in KEGG for cross-database integration

Cross-references:

KEGG Orthology database: https://www.kegg.jp/entry/K#####
KEGG Pathway maps
BioRemPP Database (via ko column)
HADEG Database (via ko column)

Validation rules:

✅ Must match regex: ^K\d{5}$
✅ Must exist in KEGG Orthology database
✅ Case-sensitive (uppercase K required)

Cardinality: 517 unique KO entries in v1.0.0

Column 2: `pathname`¶

Column name: pathname

Data type: Character (string)

Format: Pathway name (human-readable)

Example: Naphthalene, Cytochrome P450, Aromatic, Chloroalkane

Controlled vocabulary: Partially — 20 standardized pathway names

Uniqueness: Not unique — Multiple genes belong to each pathway

Purpose: Categorizes genes by their KEGG degradation pathway

Valid pathway names:

Pathway	Description
`Naphthalene`	Naphthalene degradation
`Cytochrome P450`	Cytochrome P450 metabolism
`Aromatic`	General aromatic degradation
`Benzoate`	Benzoate degradation (map00362)
`Aminobenzoate`	Aminobenzoate degradation
`Chloroalkane`	Chloroalkane/chloroalkene degradation
`Chlorocyclohexane`	Chlorocyclohexane/chlorobenzene
`Toluene`	Toluene degradation
`Xylene`	Xylene degradation
`Styrene`	Styrene degradation
`Fluorobenzoate`	Fluorobenzoate degradation
`Dioxin`	Dioxin degradation
`Nitrotoluene`	Nitrotoluene degradation
`Ethylbenzene`	Ethylbenzene degradation
`Caprolactam`	Caprolactam degradation
`Atrazine`	Atrazine degradation
`Bisphenol`	Bisphenol degradation
`Polycyclic`	Polycyclic aromatic degradation
`Furfural`	Furfural degradation

Validation rules:

✅ Must be non-empty string
✅ Should match one of 20 documented pathways

Cardinality: 20 unique pathways in v1.0.0

Column 3: `genesymbol`¶

Column name: genesymbol

Data type: Character (string)

Format: Standard gene nomenclature (free text)

Example: E1.1.1.1, AKR1A1, HADH, nahAc

Controlled vocabulary: No — Free text, sourced from KEGG KO annotations

Uniqueness: Not unique — 513 unique symbols, some shared across pathways

Purpose: Provides short gene abbreviation for the associated KO group

Naming conventions:

EC number format for enzymes (e.g., E1.1.1.1)
HUGO nomenclature for human genes (e.g., AKR1A1)
Organism-specific symbols for microbial genes (e.g., nahAc)

Validation rules:

✅ Must be non-empty string
✅ Should correspond to KEGG KO annotation

Cardinality: 513 unique gene symbols in v1.0.0

Schema Constraints and Relationships¶

Primary Key¶

None defined — Database uses a flat structure without explicit primary key

Conceptual primary key: Combination of (ko, pathname) approximates uniqueness

Foreign Key Relationships¶

Conceptual foreign keys (not enforced):

ko → KEGG Orthology database
ko → BioRemPP Database
ko → HADEG Database

Cross-table joins:

Users can join KEGG Degradation Database with:

BioRemPP Database — via ko column (compound-gene relationships)
HADEG Database — via ko column (pathway details)
KEGG Orthology — via ko column

Cardinality Relationships¶

KO-to-Pathway¶

Many-to-Many — One KO can participate in multiple pathways, and one pathway contains multiple KOs

Example:

K00001 → 3 different pathways
Naphthalene → 50+ KO entries

Data Quality Specifications¶

Completeness¶

100% field completeness — Zero missing values across all 3 columns

Verification:

db <- read.csv("data/databases/kegg_degradation_db.csv", sep=";")
colSums(is.na(db))  # Should return all zeros

Consistency¶

Identifier format consistency:

All ko values match ^K\d{5}$

Verification:

# Verify KO ID format
all(grepl("^K\\\\d{5}$", db$ko))  # Should return TRUE

Accuracy¶

Data provenance:

ko — Sourced from KEGG Orthology database
pathname — Extracted from KEGG Pathway database
genesymbol — Sourced from KEGG KO annotations

Source: KEGG Pathway Database — Metabolism → Xenobiotics biodegradation and metabolism

Usage Examples¶

Loading the Database¶

RPython

# Load database
library(readr)
db <- read_delim("data/databases/kegg_degradation_db.csv", delim=";")

# Inspect schema
str(db)
# tibble [855 × 3]
#  $ ko         : chr  "K00001" "K00002" "K00022" ...
#  $ pathname   : chr  "Naphthalene" "Cytochrome P450" ...
#  $ genesymbol : chr  "E1.1.1.1" "AKR1A1" ...

import pandas as pd

# Load database
db = pd.read_csv("data/databases/kegg_degradation_db.csv", sep=";")

# Inspect schema
db.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 855 entries, 0 to 854
# Data columns (total 3 columns):
#  #   Column      Non-Null Count  Dtype 
# ---  ------      --------------  ----- 
#  0   ko          855 non-null    object
#  1   pathname    855 non-null    object
#  2   genesymbol  855 non-null    object

Common Queries¶

# Find all genes in naphthalene degradation
naphthalene_genes <- db[db$pathname == "Naphthalene", ]

# Count genes per pathway
table(db$pathname)

# Find pathways for a specific KO
ko_pathways <- unique(db[db$ko == "K00001", "pathname"])

Questions?¶

GitHub Issues: https://github.com/BioRemPP/biorempp_db/issues
Email: biorempp@gmail.com

KEGG Degradation Database Schema¶

Overview¶

Schema Definition¶

Table: KEGG Degradation Database v1.0.0¶

Column Specifications¶

Column 1: ko¶

Column 2: pathname¶

Column 3: genesymbol¶

Schema Constraints and Relationships¶

Primary Key¶

Foreign Key Relationships¶

Cardinality Relationships¶

KO-to-Pathway¶

Data Quality Specifications¶

Completeness¶

Consistency¶

Accuracy¶

Usage Examples¶

Loading the Database¶

Common Queries¶

Questions?¶

Column 1: `ko`¶

Column 2: `pathname`¶

Column 3: `genesymbol`¶