UC-3.5 — Sample Similarity (Based on Chemical Profiles)

Module: 3 – System Structure: Clustering, Similarity, and Co-occurrence
Visualization type: Correlogram (sample × sample similarity heatmap in compound space)
Primary inputs: BioRemPP results table with sample and compoundname columns
Primary outputs: Pairwise similarity matrix of samples based on compound interaction profiles


Scientific Question and Rationale

Question: How similar are the samples to one another, based on the shared repertoire of chemical compounds they are co-annotated with?

This use case quantifies pairwise similarity between all biological samples using their compound co-annotation profiles. A correlogram (heatmap of a correlation matrix) is constructed from binary presence/absence profiles of compounds for each sample. The resulting visualization provides a compact, quantitative overview of how similar any two samples are in terms of the compounds they are co-annotated with in the database, which can help identify annotation-based sample groups and unique annotation profiles within the dataset.


Data and Inputs

  • Primary data source: BioRemPP_Results.xlsx or BioRemPP_Results.csv
  • Key columns:
    • sample – identifier for each biological sample
    • compoundname – name (or identifier) of the chemical compound associated with the sample
  • Accepted format: semicolon-delimited text table (.txt or .csv)
  • Derived structure: binary presence/absence matrix with:
    • rows = samples
    • columns = unique compound names
    • cell = 1 if the sample is associated with that compound, 0 otherwise
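
As a minimal sketch of the derived structure described above, the following assumes pandas is available and uses a small made-up table in place of a real BioRemPP export. Only the column names sample and compoundname and the semicolon delimiter come from this page; the sample IDs, compound names, and the variable names `raw`, `df`, and `presence` are illustrative assumptions.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for a BioRemPP results table.
# Only the `sample` and `compoundname` column names are taken from this page.
raw = io.StringIO(
    "sample;compoundname\n"
    "S1;Benzene\n"
    "S1;Toluene\n"
    "S1;Benzene\n"  # repeated association, collapsed to a single presence below
    "S2;Benzene\n"
    "S3;Phenol\n"
)

# The accepted format is semicolon-delimited, hence sep=";".
df = pd.read_csv(raw, sep=";")

# Count sample/compound pairs, then clip counts above 1 down to presence (1).
presence = pd.crosstab(df["sample"], df["compoundname"]).clip(upper=1)

print(presence)
```

For a real file, `raw` would be replaced by the path to the exported table; an .xlsx export would need a different reader (for example `pd.read_excel`), which is not shown here.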

Analytical Workflow

  1. Data Loading
    The primary results table (BioRemPP_Results.xlsx or BioRemPP_Results.csv) is loaded into memory.

  2. Matrix Construction
    A binary presence/absence matrix is constructed where:
    • rows correspond to samples,
    • columns correspond to unique compound names, and
    • each cell is 1 if the sample is associated with that compound and 0 otherwise.

  3. Correlation Calculation
    A pairwise similarity matrix is computed by correlating the compound presence/absence vectors (rows) for every pair of samples. Typically:
    • the Pearson correlation coefficient is calculated between each pair of sample vectors,
    • this yields a square matrix where each cell (i, j) represents the similarity score between Sample i and Sample j based on their compound profiles.

  4. Rendering
    The resulting sample-by-sample correlation matrix is rendered as a heatmap (correlogram):
    • both axes list the same set of samples,
    • cell colors encode correlation values, and
    • a color bar indicates the numerical range of correlation coefficients.
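
The correlation step of the workflow can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual implementation: the matrix values and the sample and compound labels are made up, and NumPy and pandas are assumed to be available.

```python
import numpy as np
import pandas as pd

# Hypothetical binary presence/absence matrix (rows = samples, columns = compounds).
presence = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 1]],
    index=["S1", "S2", "S3"],
    columns=["C1", "C2", "C3", "C4"],
)

# np.corrcoef treats each row as one variable, so samples (rows) are correlated
# with one another across the compound columns.
corr = pd.DataFrame(
    np.corrcoef(presence.to_numpy()),
    index=presence.index,
    columns=presence.index,
)

print(corr.round(2))
```

The same matrix can be obtained with `presence.T.corr()`, since pandas correlates columns by default. Note that a sample whose vector is constant (for example, associated with every compound in the table) has zero variance, so its Pearson correlations are undefined and appear as NaN.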

How to Read the Plot

  • X-axis and Y-axis (Samples)
    Both axes represent the same set of samples. The cell at row i, column j shows the similarity between Sample i and Sample j.

  • Cell Color
    The color at each cell encodes the correlation coefficient between the compound co-annotation profiles of the two samples:
    • warm colors (e.g., reds) indicate high positive similarity (samples are co-annotated with very similar sets of compounds),
    • cooler or neutral colors indicate lower similarity.

  • Color Scale
    A diverging color scale is typically used. When the scale is centered at zero, neutral colors mark correlations near zero and cool colors mark negative correlations, where compounds present in one sample tend to be absent in the other.
    The main diagonal is always at the maximum value (1), as each sample is perfectly correlated with itself.
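
The reading guide above can be made concrete with a small rendering sketch. It assumes Matplotlib and NumPy are installed, uses made-up correlation values, and picks the `RdBu_r` colormap as one possible diverging scale; the actual palette used by the tool is not stated on this page.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend for this sketch; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample labels and correlation values, for illustration only.
samples = ["S1", "S2", "S3"]
corr = np.array(
    [[1.00, 0.58, -1.00],
     [0.58, 1.00, -0.58],
     [-1.00, -0.58, 1.00]]
)

fig, ax = plt.subplots(figsize=(4, 3.5))

# Fixing vmin/vmax at -1 and 1 centers the diverging scale on zero,
# so neutral colors correspond to correlations near zero.
im = ax.imshow(corr, cmap="RdBu_r", vmin=-1, vmax=1)

# Both axes list the same set of samples.
ax.set_xticks(range(len(samples)))
ax.set_xticklabels(samples, rotation=90)
ax.set_yticks(range(len(samples)))
ax.set_yticklabels(samples)

# The color bar shows the numerical range of the coefficients.
cbar = fig.colorbar(im, ax=ax)
cbar.set_label("Pearson correlation")
fig.tight_layout()
```

Calling `fig.savefig(...)` or `plt.show()` afterwards writes or displays the figure. Pinning the color limits to the full range [-1, 1] keeps colors comparable across datasets, at the cost of less contrast when all correlations fall in a narrow band.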

Representative Output

The image below illustrates a representative output generated by this use case using the example dataset.

Representative output for UC-3.5


Interpretation and Key Messages

  • Compound Co-annotation Clusters
    Brightly colored blocks or patches off the main diagonal may identify clusters of samples with highly similar compound co-annotation profiles. These clusters represent groups of samples that share overlapping compound co-annotations in the database.

  • Distinct Annotation Groups
    The overall structure of the heatmap can reveal distinct groups of samples with different compound co-annotation patterns:
    • separate warm-colored regions could indicate sets of samples primarily co-annotated with different compound subsets or classes,
    • transitions between regions may suggest differences in compound annotation coverage.

  • Unique Annotation Profiles
    Samples whose row/column is dominated by neutral or cool colors (low correlations with most other samples) may exhibit unique or rare compound co-annotation profiles within the dataset. These may warrant focused investigation.
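
One way to complement visual inspection for unique annotation profiles is to rank samples by their mean correlation with all other samples. The sketch below is an assumption-laden illustration rather than part of the documented workflow: the correlation values are invented, and the mean off-diagonal correlation is just one possible summary.

```python
import numpy as np
import pandas as pd

# Hypothetical sample-by-sample correlation matrix, for illustration only.
samples = ["S1", "S2", "S3", "S4"]
corr = pd.DataFrame(
    [[1.0, 0.9, 0.8, 0.1],
     [0.9, 1.0, 0.7, 0.0],
     [0.8, 0.7, 1.0, -0.1],
     [0.1, 0.0, -0.1, 1.0]],
    index=samples,
    columns=samples,
)

# Mask the diagonal so self-correlation (always 1) does not inflate the averages.
off_diag = corr.mask(np.eye(len(corr), dtype=bool))

# Mean correlation of each sample with all others, lowest first.
mean_similarity = off_diag.mean(axis=1).sort_values()

print(mean_similarity)
```

In this toy matrix S4 ranks first, matching a row dominated by neutral colors in the heatmap. A low mean does not by itself show that a profile is biologically unusual; it only flags samples for closer inspection.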


Reproducibility and Assumptions

  • Input Format
    The analysis assumes a semicolon-delimited table containing at least the columns sample and compoundname.

  • Binary Representation
    The similarity calculation is based on binary presence/absence of compounds. Multiple occurrences of the same compound for a given sample (e.g., through different genes or pathways) are collapsed into a single presence (1).

  • Similarity Metric
    Similarity is quantified using the Pearson correlation coefficient applied to binary vectors (for binary data this is equivalent to the phi coefficient). The metric captures linear co-variation in compound repertoires, and shared absences contribute to it as well as shared presences. Alternative metrics (e.g., Jaccard similarity) may be explored in complementary analyses.

  • Interpretation Scope
    The correlogram reflects similarity in compound co-annotation profiles, not interaction strength, kinetic efficiency, or environmental abundance of compounds. These aspects require additional experimental data and analyses beyond the presence/absence of compoundname.
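
The Jaccard similarity mentioned above as an alternative metric can be computed from the same binary matrix. The following is a minimal sketch with made-up data, assuming NumPy and pandas; it is offered for comparison and is not described on this page as part of the tool. Returning 0 for two empty profiles is a convention chosen here.

```python
import numpy as np
import pandas as pd

# Hypothetical binary presence/absence matrix (rows = samples, columns = compounds).
presence = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 1]],
    index=["S1", "S2", "S3"],
    columns=["C1", "C2", "C3", "C4"],
)

X = presence.to_numpy()

# For binary rows, the dot product counts compounds shared by two samples.
intersection = X @ X.T

# |A ∪ B| = |A| + |B| - |A ∩ B|
row_sums = X.sum(axis=1)
union = row_sums[:, None] + row_sums[None, :] - intersection

# Divide only where the union is non-empty; empty-vs-empty pairs stay at 0.
jaccard_values = np.divide(
    intersection,
    union,
    out=np.zeros(union.shape, dtype=float),
    where=union > 0,
)

jaccard = pd.DataFrame(jaccard_values, index=presence.index, columns=presence.index)

print(jaccard.round(2))
```

Unlike Pearson correlation, Jaccard similarity ignores compounds absent from both samples and ranges from 0 to 1, so a diverging color scale is less natural for it than a sequential one.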


Activity diagram of the use case

Activity diagram of the use case