3 - System Structure: Clustering, Similarity, and Co-occurrence

This module moves from describing individual entities to characterizing the emergent organization of the system as a whole. After ranking samples and compounds by their KO annotation counts and compound co-annotation breadth, the next step is to understand how these entities group, relate, and co-vary in a higher-dimensional space. Here, we adopt a systems-level perspective to reveal clusters, similarity gradients, and co-occurrence patterns that define the annotation structure of the dataset. These structural insights are useful for identifying annotation-based sample groups, shared co-annotation patterns, and recurring molecular co-occurrence structures that may be worth investigating experimentally.

3.1. How do the samples organize into KO annotation and compound co-annotation clusters?

We first address the global organization of the samples in terms of both KO annotations and compound co-annotations. To do so, we apply multivariate dimensionality-reduction and clustering methods. Principal Component Analysis (PCA) is applied to the KO annotation and compound co-annotation profiles to visualize the dominant axes of variation and to highlight groups of samples with similar annotation patterns. We then complement this with hierarchical clustering, generating dendrograms that reveal nested, fine-grained relationships among sample groups. Together, these approaches can identify distinct sample clusters: groups of samples that appear to share comparable KO annotation profiles or compound co-annotation repertoires.
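As a minimal sketch of this step, the example below runs PCA and average-linkage hierarchical clustering on a small presence/absence matrix. The toy data, the sample names, the choice of Jaccard distance, and the helper `pca_scores` are all assumptions made for illustration; they are not taken from the dataset or from the module's actual implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical toy data: rows are samples, columns are KOs,
# 1 = the KO is annotated in the sample, 0 = it is not.
samples = ["S1", "S2", "S3", "S4", "S5", "S6"]
profiles = np.array([
    [1, 1, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 1],
], dtype=float)


def pca_scores(matrix, n_components=2):
    """Project rows onto the leading principal components via SVD."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)  # fraction of variance per component
    return scores, explained[:n_components]


scores, explained = pca_scores(profiles)

# Hierarchical clustering on pairwise Jaccard distances, which is one
# reasonable (assumed) choice for binary annotation profiles.
distances = pdist(profiles.astype(bool), metric="jaccard")
tree = linkage(distances, method="average")

# Cut the tree into two flat clusters for inspection.
labels = fcluster(tree, t=2, criterion="maxclust")

for name, (pc1, pc2), label in zip(samples, scores, labels):
    print(f"{name}: PC1={pc1:+.2f} PC2={pc2:+.2f} cluster={label}")
```

The `tree` array can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the dendrogram. The same code applies unchanged to a compound co-annotation matrix with samples as rows.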

3.2. What is the quantitative similarity between any two samples?

Once distinct clusters have been defined, we quantify the strength and structure of the relationships among samples. We construct correlograms that assign a numerical similarity score to every pair of samples. This analysis is carried out from two complementary perspectives: one based on shared KO annotations (KO Richness) and another based on shared compound co-annotations (Compound Richness). The result is a set of similarity matrices that can provide quantitative support for the sample groups identified in the clustering step, while also revealing intermediate degrees of relatedness that may not be apparent from visual inspection alone.
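The text above does not name the similarity metric behind the correlograms, so the sketch below shows two plausible candidates on hypothetical toy data: the Jaccard index (shared annotations over total annotations) and the Pearson correlation between sample profiles. The data and the helper `jaccard_similarity` are illustrative assumptions only.

```python
import numpy as np

# Hypothetical presence/absence profiles: rows are samples, columns are KOs.
samples = ["S1", "S2", "S3", "S4"]
profiles = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 0],
])


def jaccard_similarity(binary):
    """Pairwise Jaccard index between the rows of a 0/1 matrix."""
    b = binary.astype(int)
    intersection = b @ b.T                     # shared annotations per pair
    sizes = b.sum(axis=1)                      # annotations per sample
    union = sizes[:, None] + sizes[None, :] - intersection
    with np.errstate(divide="ignore", invalid="ignore"):
        # Convention assumed here: two empty profiles get similarity 0.
        return np.where(union > 0, intersection / union, 0.0)


jaccard = jaccard_similarity(profiles)

# Pearson correlation between sample profiles (rows are the variables).
# Samples with constant profiles would yield NaN and need separate handling.
pearson = np.corrcoef(profiles)

print(np.round(jaccard, 2))
print(np.round(pearson, 2))
```

Either matrix can be rendered as a heatmap to obtain a correlogram. The Jaccard index ignores features absent from both samples, whereas the Pearson correlation treats shared absences as agreement, so the two can rank pairs differently.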

3.3. What are the underlying molecular and chemical co-occurrence patterns that drive these sample similarities?

To examine the annotation basis of the observed sample groups, we investigate co-occurrence patterns among the core molecular and chemical features. Using correlograms and related association metrics, we ask two key questions: which genes (KOs) tend to co-occur across samples, and which compounds are frequently co-annotated? From this, we can characterize potential co-annotation clusters (sets of correlated genes or KOs) and compound co-annotation sets (sets of compounds annotated together) that recur across samples. These patterns can provide an annotation-level rationale for the sample groupings, linking the overall dataset structure to specific co-annotation patterns that warrant experimental follow-up.
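A feature-level co-occurrence analysis can be sketched by correlating the columns of the presence/absence matrix rather than its rows. The example below is illustrative only: the data, the placeholder feature names, the 0.8 threshold, and the helper `cooccurring_pairs` are assumptions, and the Pearson correlation of binary columns (the phi coefficient) stands in for whichever association metric the analysis actually uses.

```python
import numpy as np

# Hypothetical data: rows are samples, columns are features (KOs or compounds).
features = ["KO_A", "KO_B", "KO_C", "KO_D", "KO_E"]  # placeholder names
presence = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 1, 1],
])


def cooccurring_pairs(matrix, names, threshold=0.8):
    """Return feature pairs whose column correlation is >= threshold."""
    # Features present in every sample (or in none) have zero variance and
    # an undefined correlation, so they are dropped first.
    varying = matrix.std(axis=0) > 0
    kept = [name for name, keep in zip(names, varying) if keep]
    if len(kept) < 2:
        return []
    corr = np.corrcoef(matrix[:, varying], rowvar=False)
    pairs = []
    for i in range(len(kept)):
        for j in range(i + 1, len(kept)):
            if corr[i, j] >= threshold:
                pairs.append((kept[i], kept[j], float(corr[i, j])))
    # Strongest associations first.
    return sorted(pairs, key=lambda pair: -pair[2])


pairs = cooccurring_pairs(presence, features)
for first, second, r in pairs:
    print(f"{first} ~ {second}: r = {r:.2f}")
```

In this toy matrix, `KO_A` and `KO_B` are present in exactly the same samples and are therefore reported as a co-occurring pair, while `KO_E` is excluded because it is present everywhere. Applying the same function to a compound matrix would yield candidate compound co-annotation sets.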