
UC-3.4 — Sample Similarity (Based on KO Profiles)

Module: 3 – System Structure: Clustering, Similarity, and Co-occurrence
Visualization type: Correlogram (sample × sample similarity heatmap in KO space)
Primary inputs: BioRemPP results table with sample and ko columns
Primary outputs: Pairwise similarity matrix of samples based on KO presence/absence


Scientific Question and Rationale

Question: How similar are the samples to one another in terms of their shared KEGG Orthology (KO) annotation profiles?

This use case quantifies pairwise similarity between all biological samples based on their KO annotation profiles. A correlogram (a heatmap of a correlation matrix) is constructed from the binary presence/absence KO profile of each sample. The resulting visualization gives a compact, quantitative view of how similar any two samples are in their KO annotation patterns, which can reveal coherent annotation-based groups, transitions, and outliers within the dataset.


Data and Inputs

  • Primary data source: BioRemPP_Results.xlsx or BioRemPP_Results.csv
  • Key columns:
    • sample – identifier for each biological sample
    • ko – KEGG Orthology identifier associated with the sample
  • Accepted format: semicolon-delimited text table (.txt or .csv)
  • Derived structure: binary presence/absence matrix with:
    • rows = samples
    • columns = unique KOs
    • cell = 1 if the sample possesses that KO, 0 otherwise
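
The derived structure described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the BioRemPP implementation: the function name `build_presence_matrix` and the inline example rows are hypothetical, and only the `sample` and `ko` column names and the semicolon delimiter come from this page.

```python
import csv
import io

# Hypothetical example rows in the semicolon-delimited layout described above.
EXAMPLE = """sample;ko
S1;K00001
S1;K00002
S1;K00002
S2;K00002
S2;K00003
S3;K00001
"""

def build_presence_matrix(handle):
    """Return (samples, kos, matrix); matrix[i][j] is 1 if samples[i] has kos[j]."""
    reader = csv.DictReader(handle, delimiter=";")
    kos_by_sample = {}
    for row in reader:
        # A set collapses repeated KOs within a sample into a single presence.
        kos_by_sample.setdefault(row["sample"], set()).add(row["ko"])
    samples = sorted(kos_by_sample)
    kos = sorted(set().union(*kos_by_sample.values()))
    matrix = [[1 if ko in kos_by_sample[s] else 0 for ko in kos] for s in samples]
    return samples, kos, matrix

samples, kos, matrix = build_presence_matrix(io.StringIO(EXAMPLE))
for name, row in zip(samples, matrix):
    print(name, row)
```

If pandas is available, `pandas.crosstab` over the two columns gives a comparable table of counts, which would then need to be capped at 1 to obtain presence/absence.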

Analytical Workflow

  1. Data Loading
    The primary results table (BioRemPP_Results.xlsx or BioRemPP_Results.csv) is loaded into memory.

  2. Matrix Construction
    A binary presence/absence matrix is constructed where:

    • rows correspond to Samples,
    • columns correspond to unique KOs, and
    • each cell is 1 if the sample possesses the KO and 0 otherwise.

  3. Correlation Calculation
    A pairwise similarity matrix is computed by correlating the presence/absence vectors (rows) for every pair of samples. Typically:

    • the Pearson correlation coefficient is calculated between each pair of sample vectors,
    • this yields a square matrix where each cell (i, j) represents the similarity score between Sample i and Sample j based on their KO profiles.

  4. Rendering
    The resulting sample-by-sample correlation matrix is rendered as a heatmap (correlogram):

    • both axes list the same set of samples,
    • cell colors encode correlation values, and
    • a color bar indicates the numerical range of correlation coefficients.
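
The correlation step of the workflow above can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names `pearson` and `correlation_matrix` and the four example rows are hypothetical, and the handling of constant vectors (returning NaN) is one possible choice, not necessarily what BioRemPP does.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length vectors; NaN if either is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A sample with every KO or no KO has zero variance; correlation is undefined.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(matrix):
    """Square sample-by-sample matrix of pairwise Pearson correlations between rows."""
    return [[pearson(r1, r2) for r2 in matrix] for r1 in matrix]

# Hypothetical presence/absence rows (one row per sample, one column per KO).
presence = [
    [1, 1, 0, 0],  # sample A
    [1, 1, 0, 0],  # sample B: same KOs as A
    [0, 0, 1, 1],  # sample C: exactly the KOs that A lacks
    [1, 0, 1, 0],  # sample D: shares one KO with A and one with C
]
corr = correlation_matrix(presence)
for row in corr:
    print([round(v, 2) for v in row])
```

The resulting square matrix is what the rendering step draws as a heatmap; the plotting itself depends on the library in use and is omitted here.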

How to Read the Plot

  • X-axis and Y-axis (Samples)
    Both axes represent the same set of Samples. The cell at row i, column j shows the similarity between those two samples.

  • Cell Color
    The color at each cell encodes the correlation coefficient between the KO presence/absence profiles of the two samples:

    • higher positive correlation (stronger similarity) is shown by warmer or more intense colors,
    • lower correlation (weaker similarity) is shown by cooler or more neutral colors.

  • Color Scale
    A diverging color scale is typically used:

    • warm colors (e.g., reds) indicate high positive similarity,
    • neutral colors indicate intermediate similarity,
    • cool colors (e.g., blues) indicate low or potentially negative correlation.

    The main diagonal is always at the maximum value (correlation of a sample with itself).

Representative Output

The image below illustrates a representative output generated by this use case using the example dataset.


Representative output for UC-3.4


Interpretation and Key Messages

  • KO Annotation Clusters
    Blocks or patches of warm colors off the main diagonal may indicate clusters of samples with highly similar KO annotation profiles. Such clusters could correspond to samples sharing comparable KO annotation repertoires and may warrant joint investigation.

  • Distinct Annotation Groups
    The large-scale pattern of the heatmap can reveal annotation-based distinct groups:

    • tight, warm-colored submatrices may signal sets of samples more similar to each other than to the rest,
    • boundaries between warm and cooler regions could suggest transitions between major annotation-based groups.

  • Unique Annotation Profiles
    Samples that show generally low similarity (cooler colors) across their row/column may have unique or distinct KO annotation profiles in the context of this dataset, making them potential candidates for focused investigation.
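
One way to make "generally low similarity across a row" concrete is to average each sample's off-diagonal similarities and rank the samples by that value. The sketch below is only an aid to reading the heatmap, not part of the documented workflow: the function name `mean_offdiagonal`, the labels, and the similarity values are all hypothetical, and it assumes at least two samples.

```python
def mean_offdiagonal(sim, labels):
    """Map each label to the mean of its similarities to all other samples."""
    n = len(labels)
    result = {}
    for i, label in enumerate(labels):
        # Skip the diagonal, which only records a sample's similarity to itself.
        others = [sim[i][j] for j in range(n) if j != i]
        result[label] = sum(others) / len(others)
    return result

# Hypothetical similarity matrix: S1-S3 form a warm block, S4 stands apart.
labels = ["S1", "S2", "S3", "S4"]
sim = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.0],
    [0.8, 0.7, 1.0, -0.1],
    [0.1, 0.0, -0.1, 1.0],
]
scores = mean_offdiagonal(sim, labels)
# Lowest mean similarity first: candidates for a unique annotation profile.
ranked = sorted(scores, key=scores.get)
print(ranked)
```

A low average is a prompt for closer inspection rather than a conclusion, since it depends on which other samples happen to be in the dataset.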


Reproducibility and Assumptions

  • Input Format
    The analysis assumes a semicolon-delimited table containing at least the columns sample and ko.

  • Binary Representation
    The similarity calculation is based on binary presence/absence of KOs. Multiple occurrences of the same KO within a sample are collapsed into a single presence (1).

  • Similarity Metric
    The default similarity metric is the Pearson correlation coefficient applied to binary vectors, which measures the linear relationship between KO repertoires. While suitable for many use cases, alternative metrics (e.g., Jaccard similarity) may be considered in complementary analyses.

  • Interpretation Scope
    The correlogram reflects similarity in KO annotation profiles, not expression levels, gene copy numbers, or kinetic parameters. These aspects require additional layers of experimental data and analysis beyond the presence/absence of KO identifiers.
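
As a sketch of the alternative metric mentioned under Similarity Metric above, Jaccard similarity on binary vectors is the number of KOs present in both samples divided by the number present in at least one. The function name `jaccard` and the example vectors are hypothetical. Unlike Pearson correlation, Jaccard ignores KOs that are absent from both samples, so the two metrics can rank sample pairs differently.

```python
def jaccard(x, y):
    """Jaccard similarity of two binary vectors; NaN if both are all zeros."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else float("nan")

# Hypothetical profiles: one shared KO, two KOs unique to one sample,
# and three KOs absent from both.
a = [1, 1, 0, 0, 0, 0]
b = [1, 0, 1, 0, 0, 0]

print(jaccard(a, b))          # one shared KO out of three present in either
print(jaccard(a[:3], b[:3]))  # same value once the shared absences are dropped
```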


Activity diagram of the use case

