USTAR-CR: Efficient and Compact Compression of k-Mer Sets Through Colored de Bruijn Graphs

Abstract

A core task in computational genomics is transforming input sequences into their constituent k-mers. Efficiently storing these k-mer collections is crucial for scaling bioinformatics workflows. A common strategy involves representing the k-mers as a de Bruijn graph (dBG) and deriving a compact plain text form through a minimum path cover. In this article, we introduce USTAR-CR (Unitig STitch Advanced constRuction with Colors Reordering), a fast and space-efficient algorithm for compressing multiple k-mer sets. USTAR-CR exploits the structural properties of colored dBGs to construct a succinct plain text representation while also incorporating an effective scheme for encoding k-mer color information. We evaluate USTAR-CR on real sequencing datasets and benchmark it against the state-of-the-art tool GGCAT. USTAR-CR achieves superior compression ratios, significantly reduces memory usage, and offers substantial speed improvements—up to 64× faster—highlighting its effectiveness for large-scale genomic data processing.

Keywords

colored de Bruijn graphs; compression; plain text representation

1. INTRODUCTION

k-mer-based algorithms have become central tools in bioinformatics, offering scalable and efficient alternatives to traditional sequence alignment approaches. By operating directly on sets of k-mer substrings, these methods bypass the need for full read alignment and enable fast, memory-efficient analysis pipelines. Over the past decade, they have seen widespread adoption due to their conceptual simplicity and ability to handle large-scale data. These approaches have demonstrated exceptional performance across a range of applications. In genome assembly, tools such as SPAdes (Bankevich et al., 2012) leverage k-mer-based strategies to reconstruct complete genomes with high accuracy. In metagenomics, a variety of tools (Wood and Salzberg, 2014; Andreace et al., 2021; Qian and Comin, 2019; Cavattoni and Comin, 2023; Storato and Comin, 2022) employ k-mers to classify microbial content in complex samples, achieving speedups of up to 900× compared to traditional methods like MegaBLAST. Similarly, in genotyping, k-mer-centric tools (Denti et al., 2019; Sun and Medvedev, 2019; Marcolin et al., 2022; Monsu and Comin, 2021) identify genetic variants across individuals and populations without reliance on full-sequence alignment. In phylogenomics, Mash (Ondov et al., 2016) uses k-mers to estimate genomic distances, facilitating fast evolutionary analysis. Numerous other tools (Harris and Medvedev, 2020; Marchet et al., 2020) also apply k-mer techniques to enable rapid sequence search over large genomic databases.

The scalability of k-mer methods lies in their ability to handle datasets containing billions of k-mers. A key factor in their performance is how these sets are represented. Depending on the application, this choice balances between time-efficient representations with good cache locality and compressed formats that reduce memory footprint. As sequencing data grows, minimizing the space required to store and query k-mer sets has become an active area of research. Conway and Bromage (2011) showed that storing n k-mers requires at least $\log_{2} \binom{4^{k}}{n}$ bits in the worst case, for the DNA alphabet. However, real-world datasets typically contain redundant or overlapping k-mers, enabling more compact representations (Chikhi et al., 2022).
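As a rough illustration of this bound, the following Python sketch evaluates $\log_{2} \binom{4^{k}}{n}$ exactly and compares it with the naive cost of 2 bits per nucleotide. The function names and the toy parameters (k = 3, n = 5) are assumptions made here for illustration; exact binomials are only practical for modest sizes.

```python
import math

def lower_bound_bits(k: int, n: int) -> float:
    """Minimum number of bits needed to store n distinct k-mers over {A, C, G, T}."""
    return math.log2(math.comb(4 ** k, n))

def naive_bits(k: int, n: int) -> int:
    """Cost of storing every k-mer separately at 2 bits per nucleotide."""
    return 2 * k * n

k, n = 3, 5  # toy values, not taken from the paper
print(f"lower bound: {lower_bound_bits(k, n):.2f} bits, naive: {naive_bits(k, n)} bits")
```

Even on this tiny input the bound lies below the naive encoding, and the gap that practical methods exploit comes from the overlaps among real k-mers rather than from this worst-case argument.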

A widely adopted method to reduce redundancy is to organize the k-mer set K into maximal unitigs, derived from the de Bruijn graph (dBG). In this graph, nodes represent k-mers and edges indicate $(k - 1)$ -length overlaps. A unitig corresponds to a non-branching path in this graph. Each unitig u is encoded as a string $spell (u)$ of length $| u | + k - 1$ , where |u| is the number of k-mers it contains. For example, the unitig (AAC, ACG, CGT) is represented as the string AACGT. This approach reduces space by overlapping the shared regions between adjacent k-mers. The full k-mer set K can thus be compactly encoded as a set of such unitigs U, where every k-mer in K appears as a substring of some $spell (u)$ , $u \in U$ .
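The spelling of a unitig can be sketched in a few lines of Python. The helper name `spell` mirrors the notation above, but the code is only an illustration: it assumes the k-mers are already given in path order and on the same strand, and it ignores reverse complements.

```python
def spell(unitig):
    """Spell a path of k-mers whose consecutive members overlap by k - 1 characters."""
    first, *rest = unitig
    for prev, cur in zip(unitig, rest):
        if prev[1:] != cur[:-1]:
            raise ValueError(f"{prev} and {cur} do not overlap by k - 1 characters")
    # Keep the first k-mer whole, then append one new character per following k-mer.
    return first + "".join(kmer[-1] for kmer in rest)

unitig = ["AAC", "ACG", "CGT"]
print(spell(unitig))  # AACGT
```

The result has length $|u| + k - 1 = 3 + 3 - 1 = 5$, compared with the 9 characters needed to store the three k-mers separately.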

An important extension of the dBG is the colored dBG, which allows for the representation of multiple datasets simultaneously. Proposed for applications like de novo assembly and genotyping (Iqbal et al., 2012; Andreace et al., 2023), this structure annotates each k-mer with the dataset identifiers (colors) in which it appears. Colored dBGs enable compact joint representations while retaining dataset-specific provenance. This model is widely used in pangenomics (Zekic et al., 2018), RNA sequencing quantification (Bray et al., 2016), microbial classification (Luhmann et al., 2021), and related domains.

The current state-of-the-art for colored k-mer compression is GGCAT (Cracco and Tomescu, 2023), which builds compacted colored dBGs by combining k-mer counting and unitig generation with efficient color encoding. GGCAT outperforms earlier tools like Cuttlefish (Khan and Patro, 2021) and BiFrost (Luhmann et al., 2021), offering significant improvements in both compression and query speed. GGCAT uses external memory, like Cuttlefish, while BiFrost does not. However, for large datasets, GGCAT still requires several hours of computation and a large amount of memory.

In this work, we introduce USTAR-CR¹ (Unitig STitch Advanced constRuction with Colors Reordering), a fast and memory-efficient algorithm for compressing multiple k-mer sets using a plain text representation of colored dBGs. USTAR-CR generates a compact spectrum-preserving string set (SPSS) while supporting efficient color storage, enabling scalable compression of large genomic datasets with minimal computational overhead.

1.1. Related works

Plain text representations of k-mer sets have become a widely used strategy for practical and efficient data compression. Formally, such representations are defined as SPSSs—collections of strings that include all k-mers from the input data (including reverse complements), while excluding extraneous k-mers. This concept ensures that the original k-mer spectrum is preserved without redundancy.

Initially, Rahman and Medvedev (2020) and Břinda et al. (2021) independently proposed methods for constructing such representations without repeated k-mers. Rahman and Medvedev introduced the SPSS framework, while Břinda et al. coined the term simplitigs. To avoid ambiguity with the later, broader definition of SPSS (which allows repeated k-mers), we adopt simplitigs to specifically refer to representations where each k-mer appears only once.

Both UST (from Rahman and Medvedev) and ProphAsm (from Břinda et al.) apply greedy heuristics to merge k-mers into longer strings. UST builds a node-centric dBG to extend unitigs, whereas ProphAsm uses a hash-based extension strategy without explicit graph construction. These techniques aim to reduce two key metrics: the cumulative length (CL)—the total number of characters in the representation—and the string count (SC)—the number of separate strings. Lowering CL decreases the memory needed to store the strings, while reducing SC simplifies indexing, leading to overall storage savings.

Building on these ideas, the USTAR heuristic (Rossignolo and Comin, 2023, 2024a) was recently introduced. USTAR improves the traversal of dBGs by leveraging graph connectivity and local density to guide the construction of longer and fewer paths. This connectivity-aware strategy often leads to better compression than earlier greedy approaches, particularly on dense graphs.

An important breakthrough came with matchtigs (Schmidt et al., 2023), which introduced the first algorithm to compute an SPSS of minimum CL, while allowing repeated k-mers. Matchtigs formulates the problem as a min-cost path cover with a many-to-many path matching strategy, solvable in polynomial time. However, its high computational complexity— $O (n^{3} m)$ for n nodes and m arcs—makes it impractical for large graphs. To address this, the authors also proposed a faster, heuristic version called greedy matchtigs, which trades optimality for efficiency and produces comparably compact representations at a fraction of the cost. The recent USTAR2 algorithm (Rossignolo and Comin, 2024b, 2025a) further advances SPSS construction by introducing an efficient path cover heuristic that strategically reuses previously visited nodes. USTAR2 achieves compression ratios similar to greedy matchtigs, with significantly better runtime and memory usage.

While these methods focus on compressing single k-mer sets, GGCAT remains the leading tool for handling multiple datasets through compacted colored dBGs. GGCAT integrates k-mer counting and unitig construction, using contextual information to construct globally valid unitigs across datasets. It also improves color encoding by mapping color sets to compact indices, storing differences between consecutive color sets via run-length encoding (RLE), both at the individual k-mer level and across unitigs. GGCAT can also incorporate greedy matchtigs to further compress unitigs before final storage.

In the following sections, we introduce USTAR-C (Unitig STitch Advanced constRuction with Colors) and USTAR-CR (Unitig STitch Advanced constRuction with Colors Reordering), two new greedy heuristics built upon the USTAR2 paradigm, designed specifically for compressing multiple k-mer sets. USTAR-C and USTAR-CR extend these principles to colored dBGs, enabling fast and memory-efficient construction of compact SPSS representations while also efficiently encoding color information.

2. METHOD

2.1. Preliminaries

We consider strings composed of characters from the DNA alphabet $\Sigma = \{A, C, G, T\}$. A substring of length k is called a k-mer, and we denote its reverse complement by $rc(\cdot)$. Because the originating strand of DNA is often unknown, we treat a k-mer and its reverse complement as equivalent.

To compress a set of k-mers K, we aim to represent it using a set of longer strings S such that the complete collection of k-mers (and their reverse complements) contained in S exactly recovers K. The spectrum of a string set S, denoted $spec_{k}(S)$, is defined as the set of all k-mers (and their reverse complements) that appear as substrings in any string $s \in S$:

$$spec_{k}(S) = \{ t \in \Sigma^{k} \mid \exists s \in S \text{ such that } t \text{ or } rc(t) \text{ is a substring of } s \}.$$

The goal is to find a minimal set of strings S such that its spectrum exactly matches that of the original k-mer set K. We formalize this as follows:

Definition 1. A Spectrum-Preserving String Set (SPSS) for a given k-mer set K is a set of strings S, each of length at least k, such that $spec_{k}(S) = spec_{k}(K)$.
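The spectrum and Definition 1 translate directly into a small Python check. This is a minimal sketch for toy inputs; the function names (`rc`, `spectrum`, `is_spss`) are chosen here for illustration, and the set-based approach is not meant to scale to real datasets.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def rc(s):
    """Reverse complement of a DNA string."""
    return s.translate(COMPLEMENT)[::-1]

def spectrum(strings, k):
    """All k-mers occurring in the strings, closed under reverse complement."""
    spec = set()
    for s in strings:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            spec.add(kmer)
            spec.add(rc(kmer))
    return spec

def is_spss(strings, kmers, k):
    """Check Definition 1: every string has length >= k and the spectra coincide."""
    return all(len(s) >= k for s in strings) and spectrum(strings, k) == spectrum(kmers, k)

print(is_spss(["AACGT"], ["AAC", "ACG", "CGT"], 3))   # True
print(is_spss(["AACGTA"], ["AAC", "ACG", "CGT"], 3))  # False: GTA is extraneous
```

Because a k-mer and its reverse complement are treated as equivalent, `is_spss(["AACGT"], ["GTT", "ACG"], 3)` also holds: GTT is the reverse complement of AAC, and ACG and CGT are reverse complements of each other.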

A central property of an SPSS is that it compactly encodes the same set of k-mers as the original input while allowing for flexible string lengths. To assess the efficiency of such a representation, we consider the CL:

$$CL(S) = \sum_{s \in S} |s|,$$

where |s| denotes the length of string s, and SC is the number of strings in S. The following optimization problem thus arises:

Problem 1. Given a k-mer set K, compute a Spectrum-Preserving String Set S that minimizes the cumulative length $CL(S)$.
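To make the two metrics concrete, the sketch below computes CL and SC for a toy k-mer set stored one k-mer at a time and as a single string, and then for the five sequences that appear later in Table 1. The helper names are illustrative. Since a string containing m k-mer occurrences has length m + k - 1, CL always equals the number of k-mer occurrences plus SC · (k - 1), which the last lines check.

```python
def cumulative_length(strings):
    """CL: total number of characters over all strings."""
    return sum(len(s) for s in strings)

def string_count(strings):
    """SC: number of separate strings."""
    return len(strings)

k = 3
kmers = ["AAC", "ACG", "CGT"]  # each k-mer stored on its own
merged = ["AACGT"]             # the same k-mers spelled as one string
print(cumulative_length(kmers), string_count(kmers))    # 9 3
print(cumulative_length(merged), string_count(merged))  # 5 1

table1 = ["AATAGA", "ACTTCG", "CCAGGC", "CCTCTG", "CTTGAA"]
occurrences = sum(len(s) - k + 1 for s in table1)  # k-mer positions across all strings
print(cumulative_length(table1) == occurrences + string_count(table1) * (k - 1))  # True
```

The identity shows why fewer strings help: every additional string costs k - 1 extra characters.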

This problem can be addressed using graph-based techniques. A k-mer set can be represented as a dBG, where nodes represent k-mers and edges connect overlapping k-mers. It was shown in Schmidt et al. (2023) that this minimization problem can be solved exactly in polynomial time using an algorithm based on many-to-many minimum-cost path queries combined with minimum-cost perfect matching. The resulting algorithm, matchtigs, guarantees optimal solutions but has a time complexity of $O(n^{3} m)$, where n is the number of nodes and m is the number of arcs. Although the authors also proposed a greedy heuristic to reduce this complexity, the approach may still be poorly suited to large-scale datasets.

To overcome these limitations, USTAR2 (Rossignolo and Comin, 2024b) was introduced as a fast and memory-efficient heuristic. USTAR2 approximates a minimal SPSS by leveraging the connectivity structure of the compacted dBG produced by BCALM2 (Chikhi et al., 2016) and runs in linear time with respect to the number of nodes/unitigs.

The algorithm proceeds by iteratively selecting seed nodes and extending them into paths as far as possible through unvisited neighboring nodes. This path extension continues until all nodes are included in some path. USTAR2’s key innovation lies in its seed and extension heuristics: it selects seed nodes with the highest imbalance (difference between in-degree and out-degree) and prioritizes extending into nodes with fewer connections. This strategy preserves highly connected nodes for future paths, reducing fragmentation and promoting longer paths.

By doing so, USTAR2 significantly reduces CL by producing fewer, longer strings, thus improving both compression efficiency and downstream query performance. Overall, USTAR2 achieves better compression than UST and delivers performance comparable to that of greedy matchtigs (Rossignolo and Comin, 2025a).

2.2. USTAR-CR: SPSS with colors reordering

USTAR2 compresses individual k-mer sets by generating a plain text representation of the dBG. In this article, we present USTAR-C, which extends the method to handle multiple k-mer sets, each linked to a distinct color. To detect the k-mer colors, we use GGCAT as a preprocessing step. USTAR2 begins by constructing the dBG for a single k-mer set and compresses the sequences by finding a path cover on the graph. To manage multiple k-mer sets, USTAR-C simply adds a color set to each node in the graph, which tracks the colors of all k-mers represented by that node. The compression process in USTAR-C follows the same path cover approach as USTAR2, with the added step of merging these color sets to maintain color information across different sets. An example of the resulting sequences and their color sets is provided in Table 1.

Table 1.
Example Spectrum-Preserving String Set with Color Set Indices for $k = 3$

Sequences	Color set indices (plain)	Color set indices (RLE)
AATAGA	2 2 1 1	2–2 1–2
ACTTCG	4 4 3 3	4–2 3–2
CCAGGC	2 1 1 1	2–1 1–3
CCTCTG	3 3 3 2	3–3 2–1
CTTGAA	3 2 2 2	3–1 2–3

Each sequence is associated with a series of color set indices, which can be compacted using RLE.

RLE, run-length encoding.

Encoding colors for each k-mer in a dBG presents two main challenges: (1) efficiently recording all colors associated with each k-mer and (2) storing this information in a way that minimizes space usage and speeds up processing. To address this, for colored dBGs, we use GGCAT as a preprocessing step to detect and manage the colors. Instead of storing a separate compressed color bitmap for every k-mer, GGCAT groups colors into color set indices, following a method similar to that of Almodaresi et al. (2017). Moreover, each color set is encoded by capturing the differences between consecutive colors and applying RLE. When writing to disk, the color set indices of consecutive k-mers within each unitig are also run-length encoded. This is highly effective because unitigs tend to be “variation-free,” meaning they usually contain only a few distinct color set indices associated with their k-mers.

Consider that USTAR-C has produced a representation of a set of colored k-mers with $k = 3$. The colors are grouped into sets identified by an index (Table 2). These color set indices serve as pointers into the color table, which lists the distinct color sets. Table 1 shows a minimal example of an SPSS representation consisting of five sequences, each of length 6, along with their corresponding color set indices. These sequences cannot be further merged or compressed because they do not share overlaps and, equivalently, are not connected within the dBG. However, the color set indices of each sequence can be compressed individually using RLE, resulting in two runs per sequence and a total of 10 runs, as detailed in the last column of Table 1. The final output of USTAR-C consists of the list of sequences representing the k-mers and a color set (compressed with RLE) representing the colors associated with all the k-mers.
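The run counts in Table 1 can be reproduced with a short Python sketch. The `rle` helper and the (value, run length) pair layout are illustrative choices that mirror the "value–length" notation of the table, not the tool's on-disk format.

```python
from itertools import groupby

def rle(indices):
    """Run-length encode a sequence as (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(indices)]

# Per-k-mer color set indices from Table 1 (k = 3, four k-mers per sequence).
table1 = {
    "AATAGA": [2, 2, 1, 1],
    "ACTTCG": [4, 4, 3, 3],
    "CCAGGC": [2, 1, 1, 1],
    "CCTCTG": [3, 3, 3, 2],
    "CTTGAA": [3, 2, 2, 2],
}

encoded = {seq: rle(indices) for seq, indices in table1.items()}
total_runs = sum(len(runs) for runs in encoded.values())
print(encoded["AATAGA"])  # [(2, 2), (1, 2)]
print(total_runs)         # 10
```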

Table 2.

Example of Color Table, a Lookup Table for Color Set Indices

Color set indices	Color sets
1	1, 2
2	1, 4, 5
3	2, 3
4	3, 4, 5

GGCAT, USTAR-C, and USTAR-CR associate each k-mer to a color set index to save space.

USTAR-CR, Unitig STitch Advanced constRuction with Colors Reordering.
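The lookup through the color table, and the difference-based encoding of an individual color set mentioned earlier, can be sketched as follows. The dictionary layout and the helper names `to_gaps` and `from_gaps` are assumptions made for illustration and do not reproduce GGCAT's actual format.

```python
from itertools import accumulate

# Color table from Table 2: color set index -> sorted tuple of colors.
color_table = {1: (1, 2), 2: (1, 4, 5), 3: (2, 3), 4: (3, 4, 5)}

def to_gaps(color_set):
    """Delta-encode a sorted color set: first color, then successive differences."""
    return [cur - prev for prev, cur in zip((0,) + color_set, color_set)]

def from_gaps(gaps):
    """Invert to_gaps with a running sum."""
    return tuple(accumulate(gaps))

# Color set indices of the four 3-mers of AATAGA (Table 1), resolved via the table.
indices = [2, 2, 1, 1]
print([color_table[i] for i in indices])  # [(1, 4, 5), (1, 4, 5), (1, 2), (1, 2)]
print(to_gaps(color_table[4]))            # [3, 1, 1]
```

Storing one small index per k-mer, instead of the full color set, is what keeps the per-k-mer cost low; the gaps of consecutive colors tend to repeat, which is why RLE is then applied to them.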

The k-mers in the previous example cannot be further compressed at the sequence level, as the resulting sequences do not overlap and thus cannot be concatenated. However, some sequences do share the same color patterns, which presents an opportunity to further compress the color information through improved RLE.

Specifically, if two adjacent sequences in the output share the same color at their junction (that is, the last color of one matches the first color of the next), we can merge their color sets into a single run, enhancing RLE efficiency. For instance, in Table 1, the sequences CTTGAA and AATAGA have color sets $(3, 2, 2, 2)$ and $(2, 2, 1, 1)$, respectively. Since the last color of the first vector and the first color of the second are both 2, we can merge them into a single vector: $(3, 2, 2, 2, 2, 2, 1, 1)$, which can be encoded as 3–1 2–5 1–2 using RLE. However, this compression is only possible if the sequences are adjacent in the output. In the example shown, this condition is not met, so the color sets remain unmerged.
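The junction merge can be expressed directly on run-length-encoded vectors. In this minimal sketch, runs are (value, run length) pairs and `merge_rle` is a hypothetical helper name introduced here.

```python
def merge_rle(left, right):
    """Concatenate two RLE vectors, fusing the boundary runs when their values match."""
    if left and right and left[-1][0] == right[0][0]:
        fused = (left[-1][0], left[-1][1] + right[0][1])
        return left[:-1] + [fused] + right[1:]
    return left + right

cttgaa = [(3, 1), (2, 3)]  # color vector (3, 2, 2, 2)
aataga = [(2, 2), (1, 2)]  # color vector (2, 2, 1, 1)
print(merge_rle(cttgaa, aataga))  # [(3, 1), (2, 5), (1, 2)]
```

Each matching junction removes exactly one run, so the two vectors above go from four runs in total to three.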

To exploit this compression opportunity, we can reorder the sequences to maximize consecutive matching color sets. This idea was introduced in Pibiri (2023) for k-mer counts. Building on it, we introduce USTAR-CR, an extension of USTAR-C that incorporates an optimized reordering of sequences and colors.

This enhancement introduces the construction of an end-point weight graph (ewG) as part of the compression pipeline. Following the formalism introduced in Pibiri (2023), in the ewG, each sequence is represented as a node with two labeled sides—its first and last color sets. Edges are drawn between node sides that share the same color set, allowing merged color runs for any connected path. This structure enables USTAR-CR to systematically reorder sequences to minimize the number of runs in the final color encoding.

This configuration enables the traversal of paths in the ewG to reorder sequences and their corresponding color sets, thereby allowing adjacent runs to be merged. To maximize such merging while ensuring that each sequence is included exactly once, the goal is to construct long, vertex-disjoint paths. As a result, minimizing the number of color runs translates to solving the problem of finding a minimum vertex-disjoint path cover.

It is worth recalling that computing a minimum-cardinality path cover in a directed graph is, in general, an NP-hard problem. However, in the specific context of k-mers with associated counts, it has been demonstrated that the problem can be solved optimally in polynomial time (Pibiri, 2023). Although an exact polynomial solution is described in Pibiri (2023), it requires additional preprocessing steps that may slow down the compression pipeline. To avoid this, we adopt an efficient greedy approximation strategy, inspired by USTAR (Rossignolo and Comin, 2024a). This method builds a path cover by always selecting the least connected node as the next extension point.

The pseudo code for USTAR-CR is shown in Algorithm 1. The process starts by sorting nodes in increasing order based on their degree, thereby giving priority to nodes with a smaller degree. A non-visited node is then selected as a seed to begin path construction. The path is grown in both directions from this seed by iteratively choosing the least connected, non-visited neighbor as the next node in the path. This approach helps preserve the more connected nodes for later use, reducing the chance of leaving isolated nodes. As nodes are added to the path, they are marked as visited to prevent reuse. Additionally, the orientation (forward or reverse) of each node relative to the path is tracked during construction.
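The greedy strategy described above can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' implementation: node degrees are computed once and kept static, ties are broken by input order, and a reversed node is assumed to contribute its color vector in reverse order. All function names are hypothetical.

```python
from collections import defaultdict

def greedy_reorder(color_vectors):
    """Greedy path cover of the end-point graph.

    Returns a list of paths; each path is a list of (sequence_index, is_reversed).
    """
    ends = [(vec[0], vec[-1]) for vec in color_vectors]  # side 0 = first, side 1 = last
    sides_by_color = defaultdict(list)
    for node, (first, last) in enumerate(ends):
        sides_by_color[first].append((node, 0))
        sides_by_color[last].append((node, 1))

    # Static degree: sides of other nodes sharing a color index with either side.
    degree = [
        sum(other != node for side in (0, 1) for other, _ in sides_by_color[ends[node][side]])
        for node in range(len(ends))
    ]
    visited = [False] * len(ends)

    def extend(color):
        """Walk from an open end labeled `color`; return the (node, entry_side) steps."""
        steps = []
        while True:
            candidates = [(degree[o], o, s) for o, s in sides_by_color[color] if not visited[o]]
            if not candidates:
                return steps
            _, node, side = min(candidates)  # least connected neighbor, ties by index
            visited[node] = True
            steps.append((node, side))
            color = ends[node][1 - side]     # leave through the opposite side

    paths = []
    for seed in sorted(range(len(ends)), key=degree.__getitem__):
        if visited[seed]:
            continue
        visited[seed] = True
        # Growing rightward, entering a node through its last side means it is reversed.
        right = [(n, s == 1) for n, s in extend(ends[seed][1])]
        # Growing leftward, entering a node through its first side means it is reversed.
        left = [(n, s == 0) for n, s in extend(ends[seed][0])]
        paths.append(left[::-1] + [(seed, False)] + right)
    return paths

def count_runs(color_vectors, paths):
    """Total number of runs when each path's color vectors are concatenated."""
    total = 0
    for path in paths:
        merged = []
        for node, is_reversed in path:
            vec = color_vectors[node]
            merged.extend(vec[::-1] if is_reversed else vec)
        total += 1 + sum(a != b for a, b in zip(merged, merged[1:]))
    return total

# Color vectors of Table 1, in table order.
table1 = [[2, 2, 1, 1], [4, 4, 3, 3], [2, 1, 1, 1], [3, 3, 3, 2], [3, 2, 2, 2]]
paths = greedy_reorder(table1)
print(len(paths), count_runs(table1, paths))  # 1 6
```

On the Table 1 vectors the sketch seeds at ACTTCG, the least connected node, and covers all five sequences with one path. The visiting order may differ from the one reported by the tool, since several orders reach the same number of runs.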

Figure 1 illustrates an example of an ewG built from the sequences in Table 1. The red-highlighted path cover represents an arbitrary solution consisting of two paths, resulting in a total of 7 runs. In contrast, USTAR-CR constructs a more efficient path cover (shown in green) by repeatedly extending paths through nodes with the lowest degree. This approach yields a single path that visits all nodes, reducing the total number of runs to 6.

FIG. 1.

End-point weight graph from sequences in Table 1. Red: Arbitrary path cover (seven runs). Green: Optimized path cover by USTAR-CR (six runs). USTAR-CR, Unitig STitch Advanced constRuction with Colors Reordering.

This example demonstrates how leveraging the ewG can significantly enhance color compression. Starting from 10 individual runs (as shown in Table 1), the application of an arbitrary path cover reduces this number to 7, while the optimized path cover produced by USTAR-CR brings it further down to 6. The final output, presented in Table 3, reflects this reordering. Because a single path covers all nodes, all color sets can be concatenated and encoded as one unified run-length-encoded vector, achieving the minimal number of runs.

Table 3.

Output of USTAR-CR After Sequence Reordering

Sequence	Direction	Merged color set (RLE)
ACTTCG	Forward
CCTCTG	Forward
CCAGGC	Forward	4–2 3–5 2–2 1–5 2–5 3–1
AATAGA	Reverse
CTTGAA	Reverse

Color sets are merged and then compressed with RLE.

RLE, run-length encoding; USTAR-CR, Unitig STitch Advanced constRuction with Colors Reordering.

In the next section, we evaluate USTAR-C and USTAR-CR on real sequencing datasets, benchmarking their compression efficiency and runtime performance against state-of-the-art tools.

3. RESULTS

In this section, we compare our proposed methods with the state-of-the-art colored k-mer compression tool, GGCAT. Specifically, we evaluate both GGCAT configured for maximal unitigs and its variant GGCAT GM (which uses greedy matchtigs), alongside our approaches: USTAR-C, which applies plain RLE to color sets, and USTAR-CR, which further improves compression through optimized reordering of color runs.

The benchmarking was performed on a collection of 20 sequencing datasets (detailed in Table A1), previously used in related studies (Pandey et al., 2018; Rizk et al., 2013; Kokot et al., 2017; Chikhi et al., 2016; Břinda et al., 2021). These datasets span a range of sequencing properties—including paired-end and single-end reads, varying read lengths, and different coverage levels—offering a diverse and representative testing ground for evaluating compression performance under real-world conditions.

To simulate a colored k-mer setting, we merged all datasets and assigned colors to each k-mer based on the input files in which it appears. Table A2 reports the number of distinct k-mers considered in the evaluation, ranging from 400 million to over 2 billion, depending on the selected k-mer length.

To assess the performance of the evaluated tools, we focused on three primary metrics: CL, SC, and the number of runs.

CL: The total length of all compressed k-mer sequences across the datasets, reflecting the effectiveness of sequence-level compression.

SC: The number of separate sequences in the output, which serves as an indicator of fragmentation in the compressed representation.

Number of color runs (# runs): The total count of contiguous color segments after encoding, which directly correlates with the efficiency of color compression.

Together, these metrics provide a detailed evaluation of how compact and well-structured the k-mer representations are after compression.

To further compare overall compression performance, we measured the total compressed size, which includes the sequence file compressed using MFCompress (Pinho and Pratas, 2014), the encoded color table, and—for USTAR-C and USTAR-CR—the color sets compressed with bzip3. Note that for GGCAT, the most effective compression is obtained when the sequences and their colors are compressed together in a single file, which corresponds to the tool’s default output format (see Table A3).

In the following analysis, we present a detailed comparison of all tools based on compression ratio, runtime, and memory consumption.

3.1. Compression of k-mer sets

In our first experiment, we used the commonly adopted k-mer length of $k = 31$ . Table 4 reports the results for all tools prior to compression.

Table 4.
Comparison of Tools Before Compression Using Cumulative Length, Sequence Count, and Number of Runs

k = 31	GGCAT	GGCAT GM	USTAR-C	USTAR-CR
CL	6,266,509,634	3,290,519,704	3,681,600,490	3,681,600,490
SC	135,191,765	41,341,022	43,520,096	43,520,096
No. of runs	468,952,986	78,730,727	92,771,770	54,026,538

GGCAT GM achieved the lowest CL and SC, while USTAR-CR yielded the fewest runs.

CL, cumulative length; SC, sequence count; USTAR-CR, Unitig STitch Advanced constRuction with Colors Reordering.

As expected, GGCAT (which generates maximal unitigs) performed the worst across all metrics, yielding the highest CL, SC, and number of runs—indicating significant redundancy in both sequences and associated color sets.

GGCAT GM, which applies a greedy matchtig strategy, significantly improved compression performance—reducing CL by 47%, SC by 69%, and the number of runs by 83%, with respect to GGCAT. USTAR-C also delivered substantial improvements, with reductions of 41%, 67%, and 80% in CL, SC, and the number of runs, respectively.

While USTAR-C and USTAR-CR produced identical results for CL and SC, USTAR-CR dramatically reduced the number of runs by 88% with respect to GGCAT and by 31% with respect to GGCAT GM, demonstrating the benefit of sequence reordering in compressing color runs.

In summary, GGCAT GM achieved the smallest CL and SC, indicating strong sequence compression, whereas USTAR-CR attained the lowest number of runs, showcasing the effectiveness of RLE-based color compression through path-based reordering.

The compression results for each method are presented in Table 5. As anticipated, GGCAT generated the largest FASTA file. GGCAT GM reduced the sequence file size to approximately one-third of that produced by GGCAT. In contrast, USTAR-C and USTAR-CR achieved further reductions by off-loading color information to a separate file and compressing it independently from the sequences. Thanks to its lower number of runs, USTAR-CR yielded a significantly smaller color file, resulting in the most efficient overall compression.

Table 5.

Tools Comparison: Considering the Sequences File Compressed with MFCompress (Sequences), the Colors File Compressed with bzip3 (Colors), the Colors Table, and the Total Compression Size

k = 31	GGCAT	GGCAT GM	USTAR-C	USTAR-CR
Sequences	3,198,171,033	1,028,520,445	915,769,659	908,641,367
Colors	—	—	98,742,763	50,959,415
Colors table	50,281	50,318	50,281	50,281
Total compression	3,198,221,314	1,028,570,763	1,014,562,703	959,651,063

All the measures are expressed in bytes. USTAR-CR excelled in all the metrics, followed by USTAR-C and GGCAT GM.

USTAR-CR, Unitig STitch Advanced constRuction with Colors Reordering.

In total, USTAR-CR reduced the final file size by 70% compared to GGCAT and by 6.7% compared to GGCAT GM, demonstrating the effectiveness of colors reordering and RLE in optimizing compression.
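These percentages follow directly from Table 5, as the short check below shows. The dictionary layout is an illustrative choice; the missing "Colors" entries for GGCAT and GGCAT GM are entered as 0 because those tools store colors together with the sequences.

```python
# Sizes in bytes, copied from Table 5 (k = 31).
table5 = {
    "GGCAT":    {"sequences": 3_198_171_033, "colors": 0,          "colors_table": 50_281},
    "GGCAT GM": {"sequences": 1_028_520_445, "colors": 0,          "colors_table": 50_318},
    "USTAR-C":  {"sequences":   915_769_659, "colors": 98_742_763, "colors_table": 50_281},
    "USTAR-CR": {"sequences":   908_641_367, "colors": 50_959_415, "colors_table": 50_281},
}

totals = {tool: sum(parts.values()) for tool, parts in table5.items()}

def reduction(baseline, value):
    """Percentage by which `value` is smaller than `baseline`."""
    return 100 * (1 - value / baseline)

print(totals["USTAR-CR"])                                         # 959651063
print(f"{reduction(totals['GGCAT'], totals['USTAR-CR']):.1f}%")     # 70.0%
print(f"{reduction(totals['GGCAT GM'], totals['USTAR-CR']):.1f}%")  # 6.7%
```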

3.1.1. Compression: Different k-mer lengths

In this section, we examine how varying the k-mer length influences the compression of colored k-mers. Changing the value of k affects both the total number of k-mers and the connectivity between them. In general, smaller k values lead to a denser graph structure due to an increased number of overlaps (refer to Table A2).

Since GGCAT produces only unitigs, its compression performance is relatively limited, as shown in the previous section. In this experiment, we instead evaluate GGCAT using its simplitigs option, which employs the greedy simplitig construction method proposed by Břinda et al. (2021) to achieve improved compression. The tools were tested using $k \in \{15, 21, 31, 41\}$, and the resulting compression performance is shown in Figure 2. Across all tested k values, GGCAT simplitigs consistently achieved the lowest compression. While the use of simplitigs slightly improves GGCAT’s compression, their construction inherently excludes repeated k-mers, which limits the overall compression. For $k = 15$, USTAR-CR achieved the best compression, followed by USTAR-C and then GGCAT GM. A similar trend is observed for $k = 21$, with USTAR-CR again providing the most compact representation. Notably, the gap between USTAR-CR and USTAR-C widens, emphasizing the importance of RLE reordering for optimal compression. At $k = 41$, USTAR-CR once again led in performance, followed by USTAR-C and GGCAT GM. In this case, USTAR-CR delivered the largest improvement over GGCAT GM, reducing the total size by 8.67%.

FIG. 2.

Comparison of the total compressed file size (in bytes) obtained using the tools GGCAT simplitigs, GGCAT GM, USTAR-C, and USTAR-CR. The figure illustrates how the total compressed size varies for different k-mer lengths. USTAR-CR, Unitig STitch Advanced constRuction with Colors Reordering.

3.1.2. Compression: Different number of colors

In the previous section, we identified GGCAT GM as the primary competitor to USTAR-CR. We now evaluate how both tools perform as the number of colors (that is, the number of input files) increases. To do this, we subsampled the dataset with color counts $c \in \{10, 15, 20\}$ using $k = 31$ (refer to Table A4). As c increases, the number of k-mers also grows, and the diversity of color subsets rises exponentially, making colored k-mer compression increasingly difficult. The results of this evaluation are presented in Table 6.

Table 6.
Number of Runs and Total Compression Size Varying the Number of Colors

k = 31	GGCAT GM (c = 10)	USTAR-CR (c = 10)	GGCAT GM (c = 15)	USTAR-CR (c = 15)	GGCAT GM (c = 20)	USTAR-CR (c = 20)
No. of color runs	36,587,632	26,418,625	40,832,766	29,116,036	78,730,727	54,026,538
Total compression	433,708,310	410,570,603	495,357,773	464,890,012	1,028,570,763	959,651,063

USTAR-CR, Unitig STitch Advanced constRuction with Colors Reordering.

USTAR-CR consistently achieved the lowest number of runs, resulting in the best overall compression across all tested numbers of colors. A complete comparison with USTAR-C can also be found in Table A5. The greatest difference between GGCAT GM and USTAR-CR occurred at $c = 20$ , where USTAR-CR reduced the number of color runs by 31.4% and improved total compression by 6.7%.

3.1.3. Time and memory usage

In this section, we evaluate and compare the execution time and memory consumption of the compression tools. Since USTAR-CR relies on GGCAT for preprocessing, we include both the preprocessing time (CPU time) and memory consumption as part of the overall USTAR-CR pipeline evaluation.

In Figure 3a and b, we examine execution performance across different k-mer lengths used for compression.

FIG. 3.

CPU time (seconds), speedup, and memory requirement of USTAR-CR with respect to GGCAT GM. USTAR-CR, Unitig STitch Advanced constRuction with Colors Reordering.

Overall, USTAR-CR consistently achieves the fastest execution times, followed closely by GGCAT simplitigs, while GGCAT GM is the slowest. In terms of memory usage, GGCAT GM requires the most RAM, whereas USTAR-CR and GGCAT simplitigs (not shown) use the same amount, namely the memory needed for the GGCAT preprocessing step. In summary, GGCAT simplitigs does not offer any clear advantage over USTAR-CR, either in compression efficiency or computational resources. Therefore, in the following comparisons, we focus on the two best-performing compression methods: GGCAT GM and USTAR-CR.

The speedup of USTAR-CR over GGCAT GM ranges from $3.92\times$ to $51.89\times$. The highest speedup occurs at $k = 15$, where USTAR-CR is $51.89\times$ faster than GGCAT GM while using 157 GB of memory, about $19.2\%$ less than GGCAT GM. A similar pattern is seen in memory usage, with USTAR-CR consistently requiring fewer resources than GGCAT GM. Figure 3c and d show performance comparisons as the number of colors (i.e., k-mer sets) varies. We observe that USTAR-CR’s speed advantage over GGCAT GM grows with an increasing number of colors. Memory usage is generally lower for USTAR-CR as well, except when handling 10 colors, where memory requirements are roughly the same.

Overall, despite relying on GGCAT for preprocessing, USTAR-CR consistently achieves faster execution times and lower resource consumption compared to GGCAT GM. Notably, USTAR-CR is currently single-threaded, whereas GGCAT GM employs multithreading, indicating potential for further reductions in the actual running time of our tool.

3.2. Human reads dataset

In this section, we evaluate the compression quality and performance of GGCAT GM and USTAR-CR on large-scale datasets. For this purpose, we selected the human read dataset from the Genome in a Bottle Consortium (HG004_NA24143_mother),² which includes 35 files/colors and contains over 7.2 billion 31-mers.

Following the approach in Section 3.1.2, we fixed $k = 31$ and varied the number of colors from 20 to 35. Figure 4 illustrates the speedup and compression ratio of USTAR-CR with respect to GGCAT GM. We observe that as the number of colors $c$ increases, USTAR-CR’s speedup rises from $9.6\times$ to $64.2\times$. Meanwhile, the compression ratio gradually decreases from 0.99 to 0.90, indicating a modest increase in file size by USTAR-CR with respect to GGCAT GM, in exchange for substantially faster processing. Detailed absolute values for all tools can be found in Tables A6 and A7.

FIG. 4.

Results related to the human reads dataset. In the figure, the speedup and the compression ratio change with the number of colors while $k = 31$ .

3.3. Salmonella whole genomes dataset

Finally, we evaluated the tools using complete genomes. Unlike the read sequences analyzed in the previous sections—short fragments with high coverage—genomes consist of long, contiguous sequences. For this experiment, we used a dataset of 10,000 Salmonella genomes, previously employed in GGCAT (Cracco and Tomescu, 2023). This dataset comprises approximately 168 million k-mers, which is significantly smaller than the 7.2 billion k-mers contained in the human read dataset.

Table 7 summarizes the results obtained when varying the number of genomes (colors). For 100 genomes, USTAR-CR achieves better overall compression than GGCAT GM. However, as the number of genomes increases, this advantage gradually diminishes, and for larger datasets, the compression performance becomes comparable. For both tools, on these large datasets, the total compression size is largely influenced by the efficiency of the color encoding.

Table 7.
Tools Comparison Varying the Number of Colors: Considering the Sequences File Compressed with MFCompress (Sequences), the Colors File Compressed with bzip3 (Colors), the Colors Table, and the Total Compression Size

k = 31	GGCAT GM (c = 100)	USTAR-CR (c = 100)	GGCAT GM (c = 1000)	USTAR-CR (c = 1000)	GGCAT GM (c = 10,000)	USTAR-CR (c = 10,000)
Sequences	11,098,362	7,231,466	28,356,488	16,170,653	74,638,296	44,283,254
Colors	-	3,398,385	-	12,210,134	-	41,277,956
Colors table	4,047,496	4,046,475	75,376,271	75,393,160	1,179,428,589	1,179,359,265
Total compression	15,145,858	14,676,326	103,732,759	103,773,947	1,254,066,885	1,264,920,475
Time (seconds)	19,728	14,411 (11)	35,904	27,394 (34)	104,240	103,676 (124)

All sizes are expressed in bytes and all times in seconds. The time taken by the USTAR-CR step alone is given in parentheses.

USTAR-CR, Unitig STitch Advanced constRuction with Colors Reordering.

It is worth noting that both USTAR-CR and GGCAT GM rely on the same GGCAT preprocessing phase, which constructs the unitigs and the compression table. In these experiments, it becomes evident that, for large datasets, the color table dominates the total storage cost, and this inefficiency is inherited by both tools. Recent studies have proposed improved color encoding schemes (Campanelli et al., 2024), which could potentially enhance USTAR-CR’s performance as well.
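To make the weight of the colors table concrete, the sketch below recomputes its share of the total compressed size from the byte counts in Table 7. It is an illustrative calculation only: the tuple layout and the function name `colors_table_share` are assumptions, and the dash entries in the Colors row for GGCAT GM are treated as 0 bytes, which is consistent with the reported totals.

```python
# Byte counts copied from Table 7 (k = 31), as
# (sequences, colors, colors table, reported total).
# Assumption: the dash entries for GGCAT GM are treated as 0 bytes.
TABLE7 = {
    100: {
        "GGCAT GM": (11_098_362, 0, 4_047_496, 15_145_858),
        "USTAR-CR": (7_231_466, 3_398_385, 4_046_475, 14_676_326),
    },
    1000: {
        "GGCAT GM": (28_356_488, 0, 75_376_271, 103_732_759),
        "USTAR-CR": (16_170_653, 12_210_134, 75_393_160, 103_773_947),
    },
    10_000: {
        "GGCAT GM": (74_638_296, 0, 1_179_428_589, 1_254_066_885),
        "USTAR-CR": (44_283_254, 41_277_956, 1_179_359_265, 1_264_920_475),
    },
}


def colors_table_share(sequences, colors, colors_table, reported_total):
    """Return the fraction of the total size taken by the colors table."""
    total = sequences + colors + colors_table
    if total != reported_total:
        raise ValueError("components do not add up to the reported total")
    return colors_table / total


if __name__ == "__main__":
    for c, tools in TABLE7.items():
        for tool, row in tools.items():
            share = colors_table_share(*row)
            print(f"c = {c:>6}  {tool:<9} colors table share = {share:.1%}")
```

Under this reading, the colors table accounts for roughly 27% to 28% of the total at $c = 100$ and for more than 93% at $c = 10{,}000$ for both tools, while the components of every column add up to the reported total.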

Regarding execution time, USTAR-CR is substantially faster than GGCAT GM for datasets with 100 genomes, though the performance gap narrows as the dataset size grows. Interestingly, for the largest dataset, the time spent on GGCAT’s preprocessing far exceeds that of USTAR-CR (shown in parentheses), which typically completes within seconds.

Overall, when working with large datasets, the main bottleneck remains the construction of the colored dBG. Although color encoding can become challenging as the number of colors increases, USTAR-CR can take advantage of recent advances in this area (Campanelli et al., 2024).

4. CONCLUSIONS

In this article, we present USTAR-CR, a novel algorithm designed for efficient compression of multiple k-mer sets. USTAR-CR leverages node connectivity in the colored dBG to produce a more compact plain text representation and employs an optimized encoding scheme for k-mer colors.

Our comparative analysis against GGCAT and GGCAT GM demonstrates that USTAR-CR outperforms these tools in both compression effectiveness and resource usage. For the widely used k-mer length of 31, USTAR-CR achieved the lowest number of runs ($\#$ runs) through colors reordering, significantly reducing color redundancy and resulting in compressed files that are 70% smaller than those from GGCAT and 6.7% smaller than GGCAT GM. This advantage holds across different k-mer lengths, with USTAR-CR showing particular strength at lower k values where graph density increases. Furthermore, as the number of colors grows, USTAR-CR consistently delivers the fewest runs. However, as the number of colors increases, GGCAT’s color encoding becomes a major bottleneck. As future work, we plan to integrate into USTAR-CR more efficient color representation schemes, such as those proposed in Campanelli et al. (2024).
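As a reminder of what the number of runs measures, the sketch below run-length encodes the color set indices of the example in Table 1 ($k = 3$) and counts the resulting runs. It is a toy illustration under the assumption that a run is a maximal block of consecutive k-mers sharing the same color set index; the helper names `run_length_encode` and `count_runs` are hypothetical and do not come from the USTAR-CR code.

```python
from itertools import groupby

# Color set indices per sequence, copied from Table 1 (k = 3):
# one index per k-mer, in the order the k-mers appear in the sequence.
TABLE1 = {
    "AATAGA": [2, 2, 1, 1],
    "ACTTCG": [4, 4, 3, 3],
    "CCAGGC": [2, 1, 1, 1],
    "CCTCTG": [3, 3, 3, 2],
    "CTTGAA": [3, 2, 2, 2],
}


def run_length_encode(indices):
    """Collapse consecutive equal indices into (index, run length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(indices)]


def count_runs(index_lists):
    """Total number of runs over all sequences."""
    return sum(len(run_length_encode(indices)) for indices in index_lists)


if __name__ == "__main__":
    for sequence, indices in TABLE1.items():
        encoded = " ".join(f"{value}-{length}" for value, length in run_length_encode(indices))
        print(f"{sequence}: {encoded}")
    print("total runs:", count_runs(TABLE1.values()))
```

Each sequence in this example yields two runs, so the total is 10. Fewer runs mean fewer (index, length) pairs to store, which is why a lower run count translates into a smaller colors file.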

Regarding speed and memory consumption, USTAR-CR proved to be highly efficient, achieving up to a $51.89\times$ speedup over GGCAT GM at $k = 15$, while maintaining significant performance gains across all tested values of k. Its memory footprint is also consistently lower, especially for smaller k values, making it a resource-friendly choice. In summary, USTAR-CR surpasses existing state-of-the-art methods by providing a fast, memory-efficient, and highly compressed solution for representing colored k-mer sets.

AUTHORS’ CONTRIBUTIONS

All authors conceived the study and drafted the article. E.R. implemented the software and performed the experiments. All authors have read and approved the article for publication.

ACKNOWLEDGMENTS

The authors would like to thank the reviewers for their constructive comments.

AUTHOR DISCLOSURE STATEMENT

The authors declare they have no conflicting financial interests.

FUNDING INFORMATION

Authors are supported by the Project funded under the National Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.4—Call for tender No. 3138 of December 16, 2021, rectified by Decree No. 3175 of December 18, 2021, of Italian Ministry of University and Research funded by the European Union—NextGenerationEU, and by the EUAqua Project funded by the European Union under Grant Agreement 101181589.

Appendix

Table A7.

Human Reads Dataset CPU Time (Seconds) and Speedup of USTAR-CR with Respect to GGCAT GM

k	No. of colors	GGCAT	GGCAT GM	USTAR-CR	Speedup
31	35	28,397	4,302,924	67,034	64.19
31	30	15,769	1,059,323	36,101	29.34
31	25	7,241	211,815	15,650	13.53
31	20	5,334	128,924	13,426	9.60

USTAR-CR, Unitig STitch Advanced constRuction with Colors Reordering.
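The Speedup column of Table A7 is consistent with the ratio between the GGCAT GM and USTAR-CR CPU times. The following is a minimal check, assuming exactly that definition; the dictionary layout and the function name `speedup` are illustrative.

```python
# CPU times in seconds copied from Table A7 (k = 31), keyed by number of colors.
# Each pair is (GGCAT GM, USTAR-CR).
TABLE_A7 = {
    35: (4_302_924, 67_034),
    30: (1_059_323, 36_101),
    25: (211_815, 15_650),
    20: (128_924, 13_426),
}


def speedup(baseline_seconds: float, tool_seconds: float) -> float:
    """Assumed definition: baseline CPU time divided by the tool's CPU time."""
    return baseline_seconds / tool_seconds


if __name__ == "__main__":
    for colors, (ggcat_gm, ustar_cr) in TABLE_A7.items():
        print(f"{colors} colors: {speedup(ggcat_gm, ustar_cr):.2f}x")
```

Rounded to two decimals, the ratios match the reported values of 64.19, 29.34, 13.53, and 9.60.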

1. A preliminary version of this work has been presented at ICCABS 2025 (Rossignolo and Comin, 2025b).

2

References

1. Almodaresi, Pandey, Patro. Rainbowfish: A Succinct Colored de Bruijn Graph Representation. In: 17th International Workshop on Algorithms in Bioinformatics (WABI 2017), volume 88 of Leibniz International Proceedings in Informatics (LIPIcs). (Schwartz, Reinert, eds.) Schloss Dagstuhl – Leibniz-Zentrum für Informatik: Dagstuhl, Germany; 2017; pp. 18:1–18:15.

2. Andreace, Lechat, Dufresne, et al. Comparing methods for constructing and representing human pangenome graphs. Genome Biol, 2023; 24(1):274.

3. Andreace, Pizzi, Comin. Metaprob 2: Metagenomic reads binning based on assembly using minimizers and k-mers statistics. J Comput Biol, 2021; 28(11):1052–1062.

4. Bankevich, Nurk, Antipov, et al. Spades: A new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol, 2012; 19(5):455–477.

5. Bray, Pimentel, Melsted, et al. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol, 2016; 34(5):525–527.

6. Břinda, Baym, Kucherov. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biol, 2021; 22(1):96.

7. Campanelli, Pibiri, Fan, et al. Where the patterns are: Repetition-aware compression for colored de Bruijn graphs. J Comput Biol, 2024; 31(10):1022–1044; doi: 10.1089/cmb.2024.0714

8. Cavattoni, Comin. Classgraph: Improving metagenomic read classification with overlap graphs. J Comput Biol, 2023; 30(6):633–647.

9. Chikhi, Holub, Medvedev. Data structures to represent a set of k-long DNA sequences. ACM Comput Surv, 2022; 54(1):1–22.

10. Chikhi, Limasset, Medvedev. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics, 2016; 32(12):i201–i208.

11. Conway, Bromage. Succinct data structures for assembling large genomes. Bioinformatics, 2011; 27(4):479–486.

12. Cracco, Tomescu. Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT. Genome Res, 2023; 33(7):1198–1207.

13. Denti, Previtali, Bernardini, et al. Malva: Genotyping by mapping-free allele detection of known variants. iScience, 2019; 18:20–27.

14. Harris, Medvedev. Improved representation of sequence bloom trees. Bioinformatics, 2020; 36(3):721–727.

15. Iqbal, Caccamo, Turner, et al. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet, 2012; 44(2):226–232.

16. Khan, Patro. Cuttlefish: Fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections. Bioinformatics, 2021; 37(Suppl_1):i177–i186.

17. Kokot, Dlugosz, Deorowicz. Kmc 3: Counting and manipulating k-mer statistics. Bioinformatics, 2017; 33(17):2759–2761.

18. Luhmann, Holley, Achtman. Blastfrost: Fast querying of 100,000s of bacterial genomes in bifrost graphs. Genome Biol, 2021; 22(1):30.

19. Marchet, Iqbal, Gautheret, et al. Reindeer: Efficient indexing of k-mer presence and abundance in sequencing datasets. Bioinformatics, 2020; 36(Suppl_1):i177–i185.

20. Marcolin, Andreace, Comin. Efficient k-mer indexing with application to mapping-free SNP genotyping. In: Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies, BIOSTEC 2022, Volume 3: BIOINFORMATICS. (Lorenz, Fred ALN, Gamboa, eds.) BIOSTEC; 2022; pp. 62–70.

21. Monsu, Comin. Fast alignment of reads to a variation graph with application to snp detection. J Integr Bioinform, 2021; 18(4):20210032.

22. Ondov, Treangen, Melsted, et al. Mash: Fast genome and metagenome distance estimation using minhash. Genome Biol, 2016; 17(1):132–114.

23. Pandey, Bender, Johnson, et al. Squeakr: An exact and approximate k-mer counting system. Bioinformatics, 2018; 34(4):568–575.

24. Pibiri. On weighted k-mer dictionaries. Algorithms Mol Biol, 2023; 18(1):3.

25. Pinho, Pratas. Mfcompress: A compression tool for fasta and multi-fasta data. Bioinformatics, 2014; 30(1):117–118.

26. Qian, Comin. Metacon: Unsupervised clustering of metagenomic contigs with probabilistic k-mers statistics and coverage. BMC Bioinformatics, 2019; 20(Suppl 9):367.

27. Rahman, Medvedev. Representation of k-mer sets using spectrum-preserving string sets. In: International Conference on Research in Computational Molecular Biology. Springer; 2020; pp. 152–168.

28. Rizk, Lavenier, Chikhi. DSK: k-mer counting with very low memory usage. Bioinformatics, 2013; 29(5):652–653.

29. Rossignolo, Comin. A linear algorithm for efficient representation of k-mer sets using de Bruijn graphs. Communications in Computer and Information Science, 2025a; 2546:167–191.

30. Rossignolo, Comin. Enhanced compression of k-mer sets with counters via de Bruijn graphs. J Comput Biol, 2024a; 31(6):524–538.

31. Rossignolo, Comin. Fast and succinct compression of k-mer sets with plain text representation of colored de Bruijn graphs. In: Computational Advances in Bio and Medical Sciences: 13th International Conference, ICCABS 2025, ICCABS; 2025b; pp. 54–65.

32. Rossignolo, Comin. Ustar: Improved compression of k-mer sets with counters using de Bruijn graphs. In: Bioinformatics Research and Applications. (Guo, Mangul, Patterson, Zelikovsky, eds.) Springer Nature: Singapore; 2023; pp. 202–213.

33. Rossignolo, Comin. Ustar2: Fast and succinct representation of k-mer sets using de Bruijn graphs. In: Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 1: BIOINFORMATICS. INSTICC, SciTePress; 2024b; pp. 368–378.

34. Schmidt, Khan, Alanko, et al. Matchtigs: Minimum plain text representation of k-mer sets. Genome Biol, 2023; 24(1):136.

35. Storato, Comin. K2mem: Discovering discriminative k-mers from sequencing data for metagenomic reads classification. IEEE/ACM Trans Comput Biol Bioinform, 2022; 19(1):220–229.

36. Sun, Medvedev. Toward fast and accurate snp genotyping from whole genome sequencing data for bedside diagnostics. Bioinformatics, 2019; 35(3):415–420.

37. Wood, Salzberg. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biol, 2014; 15(3):R46–R12.

38. Zekic, Holley, Stoye. Pan-Genome Storage and Analysis Techniques. Springer: New York, NY; 2018; pp. 29–53.

