Representing de Bruijn graphs
According to the formal definition of a de Bruijn graph, every pair of k-mers (short contiguous subsequences of length k) that overlap by k − 1 characters is connected with a directed edge. Consequently, the edges of a de Bruijn graph can always be derived from the nodes alone, and hence it is sufficient to store only the k-mer set. These k-mers are used as elementary indexing tokens in MetaGraph.
MetaGraph provides several data structures for storing k-mer sets, which are used as a basis to implement different representations of the de Bruijn graph abstraction. In addition to the simple hash table, the k-mers may be stored in an indicator bitmap61 (a binary vector represented as a succinct bitmap of size \(|\varSigma {|}^{k}\) indicating which k-mers are present in the set) or in the BOSS table1 (a data structure proposed by Bowe, Onodera, Sadakane and Shibuya for storing a set of k-mers succinctly; see Supplementary Fig. 3 and the corresponding Supplementary Table 2 for an example BOSS graph and table). We call our de Bruijn graph implementations based on these data structures HashDBG, BitmapDBG and SuccinctDBG, respectively. For detailed descriptions and the properties of these representations, refer to Supplementary Section A.3.1 and Supplementary Table 1. All these data structures support exact membership queries, and they map k-mers to positive indexes from 1 to n, where n is the number of k-mers in the represented set (or zero if the queried k-mer does not belong to the set). While HashDBG is mostly used internally (for example, for batched sequence search), SuccinctDBG typically exhibits the best compression performance and, therefore, serves as the default compressed representation.
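To illustrate how edges can be derived from the k-mer set alone, the following minimal Python sketch emulates a HashDBG-like representation; the class and method names are ours for illustration and do not mirror MetaGraph's C++ internals.

```python
# Minimal sketch (not MetaGraph's C++ implementation): a hash-based
# de Bruijn graph where edges are derived from the k-mer set alone.
class HashDBG:
    def __init__(self, kmers):
        # Map each k-mer to a positive index 1..n; 0 means "not present".
        self.index = {km: i + 1 for i, km in enumerate(sorted(set(kmers)))}

    def node_index(self, kmer):
        return self.index.get(kmer, 0)

    def outgoing(self, kmer):
        # An edge exists to every present k-mer overlapping k-1 characters.
        return [kmer[1:] + c for c in "ACGT" if kmer[1:] + c in self.index]

g = HashDBG(["ACG", "CGT", "CGA"])
assert g.outgoing("ACG") == ["CGA", "CGT"]  # order follows the loop over "ACGT"
```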
Scalable k-mer enumeration and counting
To count and de-duplicate k-mers, MetaGraph uses the following approach to generate the so-called k-mer spectrum: the input k-mers are appended to a list, which is sorted and de-duplicated every time it reaches the allocated space limit or after all input k-mers have been processed. During de-duplication, the k-mer counts are summed up to maintain the total count of each unique k-mer. We call this approach SortedSet. Moreover, MetaGraph offers the SortedSetDisk approach, which implements a similar algorithm in external memory. Using a pre-allocated fixed-size buffer limits memory usage and allows for constructing virtually arbitrarily large k-mer spectra but requires a larger amount of disk I/O. Lastly, MetaGraph supports passing precomputed outputs from the KMC3 (ref. 62) k-mer counting tool (fork karasikov/KMC commit b163688) as an input to make use of its exceptionally efficient counting algorithm and filters. Once the entire k-mer spectrum is obtained and all k-mers are sorted, they are converted into the final data structure to construct the target graph representation.
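The SortedSet idea can be sketched as follows; this toy Python version (with a deliberately tiny buffer) only illustrates the buffer-sort-deduplicate cycle and the summing of counts, not the parallel C++ implementation or its external-memory SortedSetDisk variant.

```python
# A minimal sketch of the SortedSet approach: append k-mers to a buffer,
# then sort, de-duplicate and sum counts whenever the buffer is full.
def kmer_spectrum(kmer_stream, buffer_limit=4):
    buf = []  # list of [kmer, count] pairs

    def compact(pairs):
        pairs.sort()
        merged = []
        for km, c in pairs:
            if merged and merged[-1][0] == km:
                merged[-1][1] += c       # sum counts of duplicates
            else:
                merged.append([km, c])
        return merged

    for km in kmer_stream:
        buf.append([km, 1])
        if len(buf) >= buffer_limit:     # allocated space limit reached
            buf = compact(buf)
    return compact(buf)                  # final sort + de-duplication

print(kmer_spectrum(["ACG", "CGT", "ACG", "GTA", "ACG"]))
# [['ACG', 3], ['CGT', 1], ['GTA', 1]]
```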
Extracting contigs and unitigs from graphs
All sequences encoded in the graph (or any defined subgraph) can be extracted from it and stored in FASTA format through graph traversal63,64. The graph is fully traversed and its paths, formed by consecutive overlapping k-mers, are converted into sequences (contigs) that are returned as a result of this operation. Each k-mer of the graph (or subgraph) appears in the assembled contigs exactly once. Thus, the resulting set of sequences is a disjoint node cover of the traversed graph.
MetaGraph provides efficient parallel algorithms for sequence extraction and distinguishes two main types of traversal: (1) traversal in contig mode extends a traversed path until no further outgoing edge is present or all outgoing edges have already been traversed; (2) traversal in unitig mode extends a path only if its last node has a single outgoing edge and this edge is the single edge incoming to its target node. This definition of a unitig matches the one described previously65.
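The unitig-mode stopping rule can be sketched as follows; for simplicity, this toy Python version starts paths at arbitrary nodes (so a single unitig may be emitted in pieces) and omits the parallelism of the actual implementation.

```python
# Minimal sketch of unitig-mode traversal: extend a path only while the
# last node has a single outgoing edge that is also the single edge
# incoming to its target.
def neighbors(kmers, km, direction):
    if direction == "out":
        return [km[1:] + c for c in "ACGT" if km[1:] + c in kmers]
    return [c + km[:-1] for c in "ACGT" if c + km[:-1] in kmers]

def unitigs(kmers):
    kmers, visited, result = set(kmers), set(), []
    for start in sorted(kmers):
        if start in visited:
            continue
        path = [start]
        visited.add(start)
        while True:
            out = neighbors(kmers, path[-1], "out")
            if len(out) != 1 or out[0] in visited:
                break
            if len(neighbors(kmers, out[0], "in")) != 1:
                break  # target has another incoming edge: unitig ends
            path.append(out[0])
            visited.add(out[0])
        # spell the sequence: first k-mer plus last character of each next one
        result.append(path[0] + "".join(p[-1] for p in path[1:]))
    return result

print(unitigs(["ACG", "CGT", "GTT"]))  # ['ACGTT']
```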
Basic, canonical and primary graphs
When indexing raw reads sequenced from unknown strands, we supplement each sequence with its reverse complement, which is then indexed along with the original sequence. As a result, the de Bruijn graph accumulates each k-mer in both orientations. Such graphs (which we call canonical) can be represented by storing only one orientation of each k-mer and simulating the full canonical graph on-the-fly (for example, for querying outgoing edges, return not only the edges outgoing from the source k-mer, but also all edges incoming to its reverse complement k-mer).
Storing only canonical k-mers (that is, the lexicographically smaller of the k-mer and its reverse complement) effectively reduces the size of the graph by up to a factor of two. However, this approach cannot be used efficiently with the succinct graph representation based on the BOSS table. The BOSS table, by design, requires that each k-mer in it has other k-mers overlapping its prefix and suffix of length k − 1 (at least one incoming and one outgoing edge in the de Bruijn graph). However, it is often the case that among two consecutive k-mers in a read, only one of them is canonical. Thus, storing only canonical k-mers in the BOSS table would often require adding several extra dummy k-mers for each real k-mer (Supplementary Fig. 4), which makes this approach memory inefficient. We overcome this issue by constructing primary graphs, where the word primary reflects the traversal order, as described in the next paragraph.
When traversing a canonical de Bruijn graph, we can additionally apply the constraint that only one of the orientations of a given k-mer is called. More precisely, the traversal algorithm works as usual, but never visits a k-mer if its reverse complement has already been visited. Whichever of the two orientations (forward or reverse complement) is visited first is considered to be the primary k-mer of the pair (an example illustration is shown in Supplementary Fig. 6). This results in a set of sequences, which we call primary (primary contigs or primary unitigs). Note that the traversal order of the graph may change the set of primary sequences extracted from it, but it may never change the total number of k-mers in these sequences (primary k-mers). This is relevant when extracting primary contigs with multiple threads, since the node traversal order may differ between runs. We call graphs constructed from primary sequences primary graphs. In contrast to the common approach in which only canonical k-mers are stored, primary de Bruijn graphs can be efficiently represented succinctly using the BOSS table, and effectively enable us to reduce the size of the graph part of the MetaGraph index by up to a factor of two.
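A minimal sketch of the primary-k-mer rule, assuming a canonical graph given as a collection holding both orientations; the helper names are ours for illustration.

```python
# Minimal sketch of primary-k-mer extraction: never visit a k-mer whose
# reverse complement has already been visited.
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    return s.translate(COMP)[::-1]

def primary_kmers(kmers):
    # `kmers` holds both orientations (a canonical graph); whichever
    # orientation is reached first becomes the primary k-mer of the pair.
    primary, seen = [], set()
    for km in kmers:  # the traversal order determines the primary set
        if km in seen or revcomp(km) in seen:
            continue
        seen.add(km)
        primary.append(km)
    return primary

both = ["ACG", "CGT", "AAA", "TTT"]  # CGT = revcomp(ACG), TTT = revcomp(AAA)
print(primary_kmers(both))  # ['ACG', 'AAA'] -- one orientation per pair
```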
Graph cleaning
When a graph is constructed from raw sequencing data, it might contain a considerable number of k-mers resulting from sequencing errors (erroneous k-mers). These k-mers do not occur in the underlying biological sequences and form spurious paths in the graph, which one may wish to prune. K-mers representing true signal may nevertheless be undesirable if they originate from contaminant organisms in the biological sample. Pruning the graph to discard either or both of these classes of undesirable k-mers is called graph cleaning.
MetaGraph provides routines for graph cleaning and k-mer filtering based on the assumption that k-mers with relatively low abundance (low k-mer counts) in the input data were probably generated due to sequencing errors or contamination and should therefore be dropped. To identify potentially undesirable k-mers, we use an algorithm proposed previously65. In MetaGraph, we adapted and scaled up this algorithm to work not only for small but also for very large graphs (up to trillions of nodes).
In brief, the decision to filter out a k-mer is based on the median abundance of the unitig to which it belongs. That is, k-mers with low abundance are preserved if they are situated in a unitig with sufficiently many (more precisely, at least 50%) highly abundant (solid) k-mers; in that case, the entire unitig is considered solid and is kept in the graph. All solid unitigs (which may also be concatenated into contigs called clean contigs) are extracted from the graph and output in FASTA format. Connected unitigs (those with non-zero degree) that are discarded due to low abundance typically originate from sequencing errors, and their removal is traditionally called bubble popping66. Optionally, all tips (that is, unitigs where the last node has no outgoing edges) that are shorter than a given cut-off (typically 2k) are discarded as well. Afterwards, a new graph can be constructed from these clean contigs, which we call a cleaned graph.
The abundance threshold for solid unitigs can be set either manually or computed automatically from the full k-mer spectrum. It is assumed that k-mers with an abundance of at most 3 are likely to be generated by sequencing errors and that all erroneous k-mers follow a negative binomial distribution. After fitting a negative binomial distribution to these low-abundance k-mers, the abundance threshold is set to the 99.9th percentile of this distribution65. Finally, if the chosen threshold would preserve less than 20% of the total coverage, the automatic estimation procedure is deemed unsuccessful and a pre-defined fallback value (typically 2) is used as the threshold instead.
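The resulting per-unitig decision can be summarized in a few lines; this sketch assumes unitigs are given with their per-k-mer counts and a tip flag, and it omits the negative binomial threshold fitting. Keeping a unitig when its median k-mer count reaches the threshold is equivalent to requiring that at least 50% of its k-mers are solid.

```python
# Minimal sketch of the cleaning rule described above: a unitig is kept
# if its median k-mer abundance reaches the solid threshold, and short
# tips are pruned.
from statistics import median

def clean_unitigs(unitigs, k, threshold, tip_cutoff_factor=2):
    clean = []
    for seq, counts, is_tip in unitigs:  # (sequence, per-k-mer counts, tip?)
        if is_tip and len(seq) < tip_cutoff_factor * k:
            continue                     # prune short tips
        if median(counts) >= threshold:  # >=50% of k-mers are solid: keep
            clean.append(seq)
    return clean
```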
Constructing a joint graph from multiple samples
When indexing multiple read sets (especially vast collections of raw sequencing data), the recommended workflow for constructing a joint de Bruijn graph from the input samples consists of the following three steps. First, we independently construct a de Bruijn graph from each input read set. As each graph is constructed from a single read set (or sample), we call these graphs sample graphs. If desired, these sample graphs are independently cleaned with the graph cleaning procedure described above. Then, each sample graph is decomposed into a set of (clean) contigs, either by extracting the contigs directly or as a result of the graph cleaning procedure. Finally, a new de Bruijn graph is constructed from all these contigs, which is then annotated to represent the relation between the k-mers and the input samples. As this graph represents the result of merging all sample graphs, we refer to it as the joint de Bruijn graph. In practice, the size of the contigs extracted from sample graphs is up to 100 times smaller than the raw input, which makes the construction of the joint de Bruijn graph by this workflow much more efficient than constructing it directly from the original raw read sets.
Graph annotations
Once a de Bruijn graph is constructed, it can already be used to answer k-mer membership queries, that is, to check whether a certain k-mer belongs to the graph or not. However, the de Bruijn graph alone cannot encode additional metadata (such as sample ID, organism, chromosome number, expression level or geographical location). Thus, we supplement the de Bruijn graph with another data structure called an annotation matrix. Each column of the annotation matrix A ∈ {0,1}n×m, where n is the number of k-mers in the graph and m is the number of annotation labels, is a bit vector indicating which k-mers possess a particular property:
$${A}_{j}^{i}:=\begin{cases}1 & \text{if }k\text{-mer }i\text{ is in relation with attribute }j,\\ 0 & \text{otherwise.}\end{cases}$$
Without loss of generality, we will assume that the annotation matrix encodes the membership of k-mers to different samples, that is, encodes sample IDs. In this case, \({A}_{j}^{i}=1\) indicates that k-mer i appears in sample j (note that the same k-mer may also appear in multiple samples).
The number of rows of the annotation matrix corresponds to the number of k-mers indexed in the de Bruijn graph, and the annotation matrix can therefore be of enormous size, containing up to \(10^{12}\) rows and \(10^{9}\) columns. However, this matrix is typically extremely sparse and can therefore be efficiently compressed.
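Conceptually, a column-major layout of this matrix can be sketched as follows; real MetaGraph columns are succinct compressed bitmaps rather than Python sets, and the class here is ours for illustration.

```python
# Minimal sketch of a column-major sparse annotation matrix: each label
# (e.g. a sample ID) is a sparse bit column over k-mer indices 1..n.
class ColumnAnnotation:
    def __init__(self, num_kmers, labels):
        self.n = num_kmers
        self.columns = {label: set() for label in labels}  # sparse bitmaps

    def set_bit(self, kmer_index, label):
        self.columns[label].add(kmer_index)

    def get_row(self, kmer_index):
        # which labels (samples) contain k-mer `kmer_index`
        return [l for l, col in self.columns.items() if kmer_index in col]

anno = ColumnAnnotation(5, ["sample_A", "sample_B"])
anno.set_bit(2, "sample_A")
anno.set_bit(2, "sample_B")
print(anno.get_row(2))  # ['sample_A', 'sample_B']
```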
Representing graph annotations in MetaGraph
Independent of the choice of graph representation, a variety of methods are provided in MetaGraph for compressing annotation matrices to accommodate different query types. These matrix representation schemes can be split into row-major and column-major. The row-major representations (such as RowSparse4, RowFlat67,68) enable fast row queries but perform poorly on column queries. By contrast, the column-major representations (such as ColumnCompressed, Multi-BRWT3) provide fast access to individual columns of the annotation matrix but typically have poorer row query performance. Although these schemes are mathematically equivalent up to transposition of the represented matrix, they are algorithmically different in practice because the number of rows of the annotation matrix is typically orders of magnitude larger than the number of columns. A detailed description of these and other representation schemes is provided in Supplementary Information A.4. The size and query time benchmarks, respectively, of the compression methods used in our representations are provided in Supplementary Figs. 1 and 2.
The RowDiff compression technique
Due to the nature of de Bruijn graphs and the fact that adjacent nodes (k-mers) usually originate from the same sequences, it turns out that, in practice, adjacent nodes in the graph are likely to carry identical or similar annotations. The RowDiff compression technique4,5 exploits this regularity by replacing the annotations at nodes with their relative differences. This enables us to substantially sparsify the annotation matrix and, therefore, considerably improve its compressibility. Notably, the transformed annotation matrix can still be represented with any other available scheme, including those described above. MetaGraph provides a scalable implementation of this technique with efficient construction algorithms, which allow applying it to virtually arbitrarily large annotation matrices. The algorithm essentially consists of two parts. First, for each node with at least one outgoing edge, it picks one of the edges and marks its target node as a successor. Second, it replaces the original annotations at nodes with their differences from the annotations at their assigned successor nodes. This delta-like transform is applied to all nodes in the graph except for a small subset of them (called anchors), which keep their original annotation unchanged and serve to end every path composed of successors and break the recursion when reconstructing the original annotations (for example, in cycles).
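A minimal sketch of the RowDiff idea on set-valued rows, with the symmetric difference playing the role of the sparse delta; the successor assignment and anchor selection are taken as given here.

```python
# Minimal sketch of RowDiff: store annotation rows as symmetric
# differences to a designated successor, keep full rows only at anchors,
# and reconstruct a row by walking the successor path to the next anchor.
def rowdiff_transform(rows, successor, anchors):
    # rows: node -> set of labels; successor: node -> next node on its path
    return {v: rows[v] if v in anchors
               else rows[v] ^ rows[successor[v]]      # sparse delta
            for v in rows}

def reconstruct(diffs, successor, anchors, v):
    row = set()
    while v not in anchors:
        row ^= diffs[v]
        v = successor[v]
    return row ^ diffs[v]  # apply the anchor's full row last

rows = {1: {"A"}, 2: {"A"}, 3: {"A", "B"}}
succ = {1: 2, 2: 3}
diffs = rowdiff_transform(rows, succ, anchors={3})
assert diffs[1] == set()               # identical to its successor's row
assert reconstruct(diffs, succ, {3}, 1) == {"A"}
```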
Counting de Bruijn graphs
Finally, MetaGraph supports generalized graph annotations for representing quantitative information such as k-mer positions and their abundances in input sources, encoded with non-binary matrices5.
Annotation construction
The typical workflow for constructing an annotation matrix for a large input set consists of the following steps. After the joint de Bruijn graph has been constructed from the input sequences, we iterate over the different samples (corresponding to the different annotation labels) in parallel and map all k-mers of each sample to the joint graph, generating a single annotation column. To avoid mapping identical k-mers multiple times and to prevent the processing of erroneous k-mers (k-mers containing sequencing errors), we use the unitigs extracted from the cleaned sample graphs instead of the raw sequences when annotating the graph. This substantially reduces the annotation construction time, especially when the joint graph is represented with SuccinctDBG, for which the traversal to an adjacent k-mer is several times faster than a k-mer lookup performed from scratch (Supplementary Information A.3.2).
Once an annotation matrix has been constructed (typically in the ColumnCompressed representation), it can be transformed to any other representation to achieve the desired trade-offs between the representation size and the performance of the required operations. In particular, for sequence search, the recommended workflow is to apply the RowDiff transform on the annotation matrix and then convert the sparsified columns to the Multi-BRWT or RowSparse representation, depending on the desired speed versus memory trade-off.
Dynamic index augmentation and batch updates
Generally, there are three strategies for extending a fully constructed MetaGraph index (a joint de Bruijn graph and its corresponding annotation matrix). First, the batch of new sequences can be indexed separately, and this second MetaGraph index can be hosted on the same or a different server. The two indexes can then be queried simultaneously, as is done for a distributed MetaGraph index.
Second, the graph can be updated directly if it is represented using a dynamic data structure that supports dynamic updates (for example, SuccinctDBG (dynamic)). Then, the annotation matrix needs to be updated accordingly. This approach allows making instant changes. However, it does not enable large updates because of the limited performance of dynamic data structures69.
Finally, for large updates, the existing index can be reconstructed entirely. For the reconstruction, the index is first decomposed into contig buckets, where each bucket stores contigs extracted from the subgraph induced by the respective annotation column. Then, these buckets are augmented with the new data (either by adding the new sequences directly or by pre-constructing sample graphs from the new sequences and adding contigs extracted from them), and a new MetaGraph index is constructed from these augmented buckets. Notably, this approach uses a non-redundant set of contigs and does not require processing raw data from scratch again. Furthermore, instead of extracting contigs from the old index, it is also possible to use the inputs initially used to construct the old index (for example, the contigs extracted from sample graphs), which can substantially simplify the process.
Speeding up k-mer matching
For higher k-mer matching throughput, we implemented several techniques to speed up this procedure. First, when mapping k-mers to a primary graph (defined above), each k-mer may generally have to be searched twice (first the k-mer itself and then its reverse complement). However, if a k-mer has been found, there is no need to search for its reverse complement: in that case, the reverse complement k-mer is guaranteed to be missing from the graph. Conversely, if a certain k-mer from the query is missing from the graph but its reverse complement is found, the same likely applies to the next k-mer of the query sequence, which is adjacent to the current one. Thus, in such cases, we directly start the search for the next k-mer by querying the graph with its reverse complement and check for the original k-mer only if that reverse complement is not found in the graph.
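This orientation heuristic can be sketched as follows, assuming a membership predicate for the primary graph and a reverse-complement helper; the function is illustrative only.

```python
# Minimal sketch of the orientation heuristic for primary graphs: remember
# which orientation matched last and try that one first for the next k-mer.
def match_kmers(graph_contains, kmers, revcomp):
    hits, try_rc_first = [], False
    for km in kmers:
        first, second = (revcomp(km), km) if try_rc_first else (km, revcomp(km))
        if graph_contains(first):
            hits.append(km)
            # in a primary graph at most one orientation can be present,
            # so a hit on `first` makes checking `second` unnecessary
        elif graph_contains(second):
            hits.append(km)
            try_rc_first = not try_rc_first  # the other orientation matched
        # if neither orientation is found, keep the current preference
    return hits
```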
When the graph is represented with the BOSS table, indexing k-mer ranges in the BOSS table (as described in Supplementary Information A.3.2) greatly speeds up k-mer lookups, especially relevant when querying short sequences or arbitrary sequences against a primary graph.
Another optimization consists of querying the annotation matrix in batches, which improves cache locality and removes possible row duplications. To go further and speed up k-mer mapping as well, we developed the batch query algorithm described below in detail.
Batched sequence search
To increase the throughput of sequence search for large queries (for example, sets of sequencing reads or long sequences), we have designed an additional batch query algorithm, schematically shown in Extended Data Fig. 1e. The algorithm exploits possible query set redundancy: the presence of k-mers shared between individual queries. More precisely, query sequences are processed in batches and an intermediate batch graph is constructed from each batch. This batch graph is then effectively intersected with the large joint graph from the MetaGraph index. The result of this intersection operation forms a relatively small subgraph of the joint graph, which we call a query graph. It is represented in a fast-to-query uncompressed format (HashDBG). In practice, this intersection is performed as follows. First, the batch graph is traversed (step 2 in Extended Data Fig. 1e) to extract a non-redundant set of contigs. These contigs are then mapped against the joint graph through exact k-mer matching (step 3), and the respective annotations are extracted from the compressed index to construct the query graph with its respective annotations, representing the intersection of the batch graph with the full MetaGraph index (step 4). All of the query sequences from the current batch are then queried against this query graph (step 5). Depending on the structure of the query data, this algorithm achieves a 10- to 100-fold speedup compared to unbatched queries.
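Stripped of the graph machinery, the core idea can be sketched as follows: each distinct k-mer of the batch is looked up in the compressed index exactly once, and the individual queries are then answered from the resulting in-memory map (the real implementation builds a HashDBG from batch contigs rather than a plain dictionary).

```python
# Minimal sketch of the batch query idea: de-duplicate the k-mers of a
# query batch, intersect them with the joint index once, and answer all
# queries in the batch against the small resulting "query graph".
def batch_query(joint_lookup, queries, k):
    batch_kmers = {q[i:i + k] for q in queries for i in range(len(q) - k + 1)}
    # one lookup per distinct k-mer: this is the intersection step
    query_graph = {km: joint_lookup(km) for km in batch_kmers}
    results = []
    for q in queries:  # now answered from the in-memory query graph only
        labels = [query_graph[q[i:i + k]] for i in range(len(q) - k + 1)]
        results.append(labels)
    return results
```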
Sequence search with alignment
For cases in which the sensitivity of sequence search through exact k-mer matching is insufficient, we developed several approaches for aligning sequences to the MetaGraph index, a process known as sequence-to-graph alignment5,6,23,70,71,72,73,74,75,76. Note that each approach has its target use cases and the choice should be made based on the particular application and the problem setting.
Each alignment algorithm takes a classical seed-and-extend approach27,71,72,73,74. Given an input sequence, the seeds are composed by joining consecutive k-mer matches within the graph’s unitigs (called unitig maximal exact matches, or Uni-MEMs72,77). Although, by default, this restricts seeds to have length at least k, representing the graph with the BOSS table allows relaxing this restriction by mapping arbitrarily short sequences to suffixes of the k-mers indexed in the graph (as described in Supplementary Information A.3.2).
Each seed is extended in the graph forwards and backwards to produce a complete local alignment. Similarly to how GraphAligner78,79 builds on Myers’ algorithm80, our extension algorithm is a generalization of the Smith–Waterman–Gotoh local alignment algorithm with affine gap penalties81. The user can choose to report multiple alignments for each query, which may be found if seeds to multiple locations in the graph are discovered.
We now describe the extension algorithm in more detail. Given a seed, let \(s={s}_{1}\cdots {s}_{\ell }\) denote the suffix of the query sequence starting from the first character of the seed. We use a dynamic programming table to represent the scores of the best partial alignments. More precisely, each node v has three corresponding integer score vectors Sv, Ev and Fv of size equal to the length \({\ell }\) of s. Sv[i] stores the best alignment score of the prefix \({s}_{1}\cdots {s}_{i}\) ending at node v. Ev[i] and Fv[i] represent the best alignment scores of \({s}_{1}\cdots {s}_{i}\) ending with an insertion and a deletion at node v, respectively.
Let vS be the first node of the seed S. We define an alignment tree TS = (VS,ES) rooted at vS encoding all walks traversed during the search starting from vS, where VS ⊂ V × ℕ contains all the nodes of the paths originating at vS and ES ⊂ VS × VS contains all the edges within these paths. TS is constructed on-the-fly during the seed extension process by extending it with new nodes and edges after each graph traversal step.
As the size of TS can grow exponentially if all paths are explored (and is, in fact, infinite if the graph is cyclic), we traverse the graph and update TS in a score-guided manner. For this, we maintain a priority queue of graph nodes and their corresponding score vectors to be traversed, prioritizing nodes whose traversal led to the best local score update78. We use several heuristics to restrict the alignment search space. First, we use the X-drop criterion82,83, skipping an element if it is more than X units lower than the current best-computed alignment score. Moreover, we maintain an aggregated score column for each graph node, storing the element-wise maximum score achieved among the score columns of each node in TS. Using this, we discard nodes in TS from further consideration if their traversal did not update the aggregated score column. Finally, we restrict the total number of nodes that can be explored to a constant factor of \({\ell }\).
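For concreteness, the following sketch shows a Gotoh-style update of the three score vectors when traversing from a node u to a child v; the scoring parameters are illustrative, the vectors here carry one extra sentinel entry (indices 0 to ℓ), and the X-drop and other pruning heuristics are omitted.

```python
# Minimal sketch of the affine-gap (Gotoh-style) column update used during
# seed extension: given the score vectors of a predecessor node u, compute
# those of a child node v labelled with character `c`.
NEG = float("-inf")

def extend_node(s, Su, Fu, c, match=2, mismatch=-1, gap_open=3, gap_ext=1):
    L = len(s)
    Sv, Ev, Fv = [NEG] * (L + 1), [NEG] * (L + 1), [NEG] * (L + 1)
    for i in range(L + 1):
        # deletion: consume the graph character c without a query character
        Fv[i] = max(Su[i] - gap_open, Fu[i] - gap_ext)
        best = Fv[i]
        if i > 0:
            # insertion: consume query character s[i-1] without moving in graph
            Ev[i] = max(Sv[i - 1] - gap_open, Ev[i - 1] - gap_ext)
            # (mis)match of s[i-1] against the node character c
            diag = Su[i - 1] + (match if s[i - 1] == c else mismatch)
            best = max(best, Ev[i], diag)
        Sv[i] = best
    return Sv, Ev, Fv
```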
To find seeds of length k′ < k (by default, we use a seed length of 19) matching the suffixes of nodes in the canonical graph, a three-step approach is taken. First, seeds corresponding to the forward orientation of the query are found, which correspond to contiguous node ranges in the BOSS representation of the graph (a description of the node range matching algorithm is provided in Supplementary Information A.3.2). The next two steps then retrieve suffix matches which are represented in their reverse complement form in the graph. In the second step, the reverse complements of the query k-mers are searched to find node ranges corresponding to suffix matches of length k′. Finally, these ranges are traversed forwards k − k′ steps in the graph to make the prefixes of these nodes correspond to the sequence matched. The reverse complements of these nodes are then returned as the remaining suffix matches.
While primary graphs act as an efficient representation of canonical de Bruijn graphs, special considerations need to be made when aligning to these graphs to ensure that all paths that are present in the corresponding canonical graph are still reachable. For this, we introduce a further extension of the alignment algorithm to allow for alignment to an implicit canonical graph while only keeping a primary graph in memory. During seed extension, the children of a given node are determined simply by finding the children of that node in the primary graph, along with the parents of its reverse complement node. Finding exact matching seeds of length k′ ≥ k can be achieved in a similar manner, searching for both the forward and reverse complement of each k-mer in the primary graph.
MetaGraph maintains three different alignment approaches that determine how the graph is traversed during seed extension, called MetaGraph-Align, SCA and TCG-Aligner, each applying different restrictions to traversal.
MetaGraph-Align
In this approach, the sequences are aligned against the joint de Bruijn graph to compute their respective closest walks in the graph. After computing a set of alignments, they are used in place of the original sequences to fetch their corresponding annotations. This approach allows for aligning to paths representing recombinations of sequences across annotation labels.
Label-consistent graph alignment
When label recombination is not desired, we support an alternative approach in which queries are aligned to subgraphs of the joint graph induced by single annotation labels (columns of the annotation matrix). We call this approach label-consistent graph alignment (or alignment to columns), and it is implemented by the SCA algorithm6. However, instead of aligning to all the subgraphs independently, we perform the alignment with a single search procedure while keeping track of the annotations corresponding to the alignments.
Trace-consistent graph alignment (TCG-Aligner)
Finally, when input sequences are losslessly encoded in a MetaGraph index using the methodology introduced and evaluated previously5, the alignment can be performed against the original input sequences, whose respective walks in the graph are called traces. This method is called the trace-consistent graph aligner (TCG-Aligner)5.
Column transformations
In addition to the operations mentioned above, MetaGraph supports operations aggregating multiple annotation columns to compute statistics for the k-mers and their counts (abundances). In general, the following formula is used in the aggregation to compute the ith bit of the new annotation column:
$${a}_{\min }\le \mathop{\sum }\limits_{j=1}^{m}1[{v}_{\min }\le {c}_{{ij}}\le {v}_{\max }]\le {a}_{\max },$$
where cij is the count (abundance) of the ith k-mer in the jth label, and 1[A] is a Boolean predicate function that evaluates to 1 if the statement A is true and to 0 otherwise. If no counts are associated with the column, we assume that cij = 1 for every set bit in the jth annotation column and 0 otherwise. If the sum \({\sum }_{j=1}^{m}1[{v}_{\min }\le {c}_{{ij}}\le {v}_{\max }]\) falls within the specified minimum and maximum abundance thresholds amin and amax, the bit in the aggregated column for this k-mer is set to 1, and the value of the sum is written as the count associated with that bit. In other words, the resulting aggregated column is always supplemented with a count vector representing the number of original annotation columns with k-mer counts between vmin and vmax, which can be used in downstream analyses as an ordinary count vector.
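A direct transcription of this aggregation rule, with the per-k-mer counts given as dense rows for illustration:

```python
# Minimal sketch of the column aggregation formula: the i-th bit of the
# aggregated column is set iff the number of labels whose count c_ij lies
# in [v_min, v_max] falls within [a_min, a_max]; that number is also kept
# as the count associated with the bit.
def aggregate_columns(count_rows, v_min, v_max, a_min, a_max):
    bits, counts = [], []
    for row in count_rows:  # row: counts c_i1 .. c_im of k-mer i
        s = sum(1 for c in row if v_min <= c <= v_max)
        bits.append(1 if a_min <= s <= a_max else 0)
        counts.append(s if bits[-1] else 0)
    return bits, counts

bits, counts = aggregate_columns([[0, 3, 5], [1, 1, 0]], 1, 4, 2, 3)
print(bits, counts)  # [0, 1] [0, 2]
```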
Seamless distribution and interactive use
In addition to the single-machine use case, where the index is constructed and queried locally, MetaGraph also supports querying indexes provided on a remote server through a client-server architecture. In this approach, a set of graphs and annotations can easily be distributed across multiple machines. Each machine runs MetaGraph in server mode, hosting one or multiple indexes and awaiting queries on a pre-defined port (Extended Data Fig. 1b). This setup makes it straightforward to execute user queries across all indexes hosted on multiple servers. For easy integration of results and coordination of different MetaGraph instances, we provide client interfaces in Python (Extended Data Fig. 1a). Notably, our distribution approach can be used not only for hosting multiple indexes of distinct sources but also when indexing a single dataset of extremely large size, such as SRA. This distribution approach enables virtually unlimited scalability.
MetaGraph API and Python client
For querying large graph indexes interactively, MetaGraph offers an API that allows clients to send requests to one or multiple MetaGraph servers. When started in server mode, MetaGraph keeps the index persistently in memory and accepts HTTP requests on a pre-defined port. To make querying more convenient, we have also implemented a Python API client as a Python package available at GitHub (https://github.com/ratschlab/metagraph/tree/master/metagraph/api/python).
MetaGraph Online
The search engine MetaGraph Online has a clean and intuitive graphical web user interface (UI; Supplementary Fig. 11), enabling the user to paste an arbitrary sequence and search it against a selected index. Restrictions on searching multiple sequences at once are in place only to limit hosting costs. By default, the search is performed through basic k-mer matching. For greater sensitivity, all indexes additionally support aligning the searched sequence to the annotated graph. If k-mer coordinates or counts are represented in the queried index, the web UI allows retrieving them for the query sequence as well. In addition to user interaction with the web interface, MetaGraph Online provides a web API that allows connecting to the respective servers via their endpoints. That is, any of the hosted indexes can be queried through the Python API by connecting to the respective endpoint of the server hosting that index (Extended Data Fig. 1a).
Indexing public read sets from the NCBI SRA
We split the set of all read sets from the SRA (excluding sequencing technologies with high read error rates; see more details in the sections below) into different groups of related samples and constructed a separate MetaGraph index for each group. The groups were defined either using dataset definitions of previous work32 or using the metadata provided by NCBI SRA. As a result, we constructed the following six datasets: SRA-MetaGut, SRA-Microbe, SRA-Fungi, SRA-Plants, SRA-Human and SRA-Metazoa. All of these datasets are listed in the supplementary tables available at GitHub (https://github.com/ratschlab/metagraph_paper_resources) and make up a total of 4.4 Pbp and 2.3 PB of gzip-compressed input sequences, while the indexes make up only 11.6 TB, corresponding to an overall compression ratio of 193×, or 376 bp per byte.
For constructing the SRA-Microbe index, we used cleaned contigs downloaded from the European Bioinformatics Institute FTP file server provided as supplementary data to BIGSI32. Thus, no additional data preprocessing was needed for this dataset.
For all other datasets, each sample was either transferred and decompressed from NCBI’s mirror on the Google Cloud Platform or, if not available on Google Cloud, downloaded from the ENA onto one of our cloud-compute servers and subjected to k-mer counting with KMC3 (ref. 62) to generate the full k-mer spectrum. If the median k-mer count of the spectrum was less than 2, the sample was processed further without any cleaning. Otherwise, the sample was cleaned with the standard graph cleaning procedure implemented in MetaGraph, pruning tips shorter than 2k (for all these datasets, k was set to 31) and using an automatically computed k-mer abundance threshold for pruning low-coverage unitigs, with a fallback threshold value of 3. This cleaning procedure was applied for SRA-Fungi, SRA-Plants, SRA-Human and SRA-Metazoa.
For read sets of the SRA-MetaGut dataset, the sequencing depth was typically low, and we therefore applied a more lenient cleaning strategy. Namely, we switched off the singleton filtering (that is, we initially kept all k-mers that appear only once) on the k-mer spectrum and used a constant cleaning threshold of 2 during graph cleaning to remove all unitigs with a median k-mer abundance of 1.
For each dataset, we first constructed a joint canonical graph with k = 31 (including for each indexed k-mer its reverse complement) from the cleaned contigs and then transformed it into a primary graph (storing only one form of each k-mer and representing the other implicitly). Finally, using the same cleaned contigs, we annotated the joint primary graph with sample IDs to construct the annotation matrix. Each input sample thereby formed an individual column of the annotation matrix. The annotation matrix was then sparsified with the RowDiff transform and compressed in the Multi-BRWT representation.
SRA subset composition
Here we provide a detailed description of each of the six datasets.
SRA-Microbe
This dataset was first used to construct the BIGSI index32. Consisting of 446,506 microbial genome sequences, this dataset once constituted the largest indexed set of raw sequencing data. However, at the time of performing our experiments, it represented only an outdated snapshot of the corresponding part of the SRA. Nevertheless, we decided to keep the same sequence set for this work to enable direct comparison and benchmarking. A complete list of SRA IDs contained in this set is available as file TableS1_SRA_Microbe.tsv.gz (with further information available in TableS10_SRA_Microbe_McCortex_logs.tsv.gz and TableS11_SRA_Microbe_no_logs.tsv) at GitHub (https://github.com/ratschlab/metagraph_paper_resources). For details on how the set of genomes was selected, we refer to the original publication32.
SRA-Fungi
This dataset contains all samples from the SRA assigned to the taxonomic ID 4751 (Fungi) specifying the library sources GENOMIC and METAGENOMIC and excluding samples using platforms PACBIO_SMRT or OXFORD_NANOPORE. In total, this amounts to 149,607 samples processed for cleaning. Out of these, 138,158 (92.3%) could be successfully cleaned and were used to assemble the final MetaGraph index. All sample metadata were requested from NCBI SRA on 25 September 2020 using the BigQuery tool on the Google Cloud Platform.
SRA-Plants
This dataset contains all samples from the SRA assigned to the taxonomic ID 33090 (Viridiplantae), specifying the library source GENOMIC and excluding samples using platforms PACBIO_SMRT or OXFORD_NANOPORE. In total, this amounts to 576,226 samples processed for cleaning. Out of these, 531,736 (92.3%) could be successfully cleaned and were used to assemble the final MetaGraph index. All sample metadata were requested from NCBI SRA on 17 August 2020 using the BigQuery tool on the Google Cloud Platform.
SRA-Human
This dataset contains all samples of assay type WGS, AMPLICON, WXS, WGA, WCS, CLONE, POOLCLONE, or FINISHING from the SRA assigned to the taxonomic ID 9606 (Homo sapiens) specifying the library source GENOMIC and excluding samples using platforms PACBIO_SMRT or OXFORD_NANOPORE. In total, this amounts to 454,252 samples processed for cleaning. Out of these, 436,502 (96.1%) could be successfully cleaned and were used to assemble the final MetaGraph index. All sample metadata were requested from NCBI SRA on 12 December 2020 using the BigQuery tool on the Google Cloud Platform.
SRA-Metazoa
This dataset contains all samples from the SRA assigned to the taxonomic ID 33208 (Metazoa) specifying the library source GENOMIC and excluding samples using platforms PACBIO_SMRT or OXFORD_NANOPORE. In total, this amounts to 906,401 samples processed for cleaning. Out of these, 805,239 (88.8%) could be successfully cleaned and were used to assemble the final MetaGraph index. All sample metadata were requested from NCBI SRA on 17 September 2020 using the BigQuery tool on the Google Cloud Platform.
SRA-MetaGut (human gut microbiome)
This group contains all sequencing samples of assay types WGS and AMPLICON from the SRA assigned to the taxonomic ID 408170 (human gut metagenome), excluding samples using the platforms PACBIO_SMRT and OXFORD_NANOPORE. In total, this amounts to 242,619 samples, of which 177,759 (73.3%) were AMPLICON and 64,860 (26.7%) were WGS samples. All these samples were successfully cleaned and used to assemble the final MetaGraph index. All sample metadata were requested from NCBI SRA on 1 October 2020 using the BigQuery tool on the Google Cloud Platform.
The complete lists of all samples (including the list of successfully cleaned ones) for each subset are available at GitHub (TableS5_SRA_MetaGut.tsv.gz; https://github.com/ratschlab/metagraph_paper_resources).
Indexing GTEx data
The 9,759 raw RNA-seq samples of the GTEx project have become a de facto reference set for the study of human transcriptomics43. All available RNA-seq samples that were part of the version 7 release of GTEx were downloaded through dbGaP to our compute cluster at ETH Zurich. A list of all of the samples used is available at GitHub (TableS7_GTEX.txt; https://github.com/ratschlab/metagraph_paper_resources).
Each sample was individually transformed into a graph using k = 31 and then cleaned with the standard graph cleaning algorithm implemented in MetaGraph, trimming tips shorter than 2k and using an automatically computed coverage threshold with a fallback value of 2 for removing unitigs with low median k-mer abundance. All resulting cleaned contigs were assembled into a joint canonical de Bruijn graph and then transformed to the final primary graph. Using the typical workflow, the primary joint graph was annotated using the cleaned contigs extracted from each sample, generating one label per sample. All individual annotation columns were finally collected into one matrix, sparsified with the RowDiff transform and compressed in the Multi-BRWT representation.
When performing the indexing with k-mer counts (row ‘GTEx with counts’ in Table 1), we applied an additional smoothing of k-mer counts within cleaned unitigs to facilitate compression, using a smoothing window of size 60. That is, the count of each k-mer in a cleaned unitig was replaced with the median abundance of the 30 k-mers before it and the 30 k-mers after it in that unitig. This smoothing window is much smaller than the expected transcript length. However, it was sufficient to considerably reduce the annotation size (from 184 GB when indexing the original counts to 76 GB).
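The smoothing step amounts to a sliding-window median; in this sketch the window includes the k-mer itself and is truncated at the unitig boundaries, details that may differ slightly from the exact implementation.

```python
# Minimal sketch of the count smoothing: each k-mer count in a unitig is
# replaced by the median over a window of 30 k-mers before and 30 after.
from statistics import median

def smooth_counts(counts, half_window=30):
    return [median(counts[max(0, i - half_window):i + half_window + 1])
            for i in range(len(counts))]
```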
Indexing the TCGA RNA-seq cohort
TCGA has collected RNA-seq samples from primary tumours on a similar scale, spanning more than 30 cancer types, and constitutes a central resource for cancer research42. We downloaded the data from the Genomic Data Commons Portal of the NCI. A list containing all processed samples is available as file TableS8_TCGA.tsv.gz at GitHub (https://github.com/ratschlab/metagraph_paper_resources). In total, the index contains 11,095 individual records spanning all available TCGA cancer types. We used the same indexing workflow as for GTEx. Similarly to GTEx, we also constructed MetaGraph indexes with k-mer counts for TCGA (Table 1).
Indexing environmental metagenome samples (MetaSUB)
This dataset contains 4,220 WMGS samples (the pilot dataset) collected from the environment through the MetaSUB consortium8. The swabs were collected at different locations and from different objects; we also contributed to data collection by taking swabs from benches, ticket machines and various other objects at tram stops and train stations in Zurich. When sampling, each swab was annotated with additional data, including the location of sampling, the type of object from which the swab was collected, the material of that object, the elevation above or below sea level, and the station or line where the sample was collected. The swabs were then sent for further processing to the sequencing team. For more details about DNA extraction and sequencing, refer to the original publication8.
The raw data (read sets) can be downloaded using the MetaSUB utils84. A list of all sample IDs used in this study is available as file TableS6_MetaSUB.csv.gz at GitHub (https://github.com/ratschlab/metagraph_paper_resources).
All input samples were directly assembled into canonical de Bruijn graphs (sample graphs) with k = 41. All graphs were then cleaned with the standard graph cleaning procedure implemented in MetaGraph, pruning tips shorter than 2k and removing unitigs depending on coverage (with the threshold automatically computed from the k-mer spectrum). If no threshold could be computed by the algorithm, we used 3 as a fallback value (an evaluation of this cleaning strategy is shown in Supplementary Fig. 15). The cleaned graphs were transformed into primary contigs, which were then used to assemble a joint graph and annotate it. We annotated the graph with sample IDs, which is the most fine-grained annotation we could construct. Thus, each sample was transformed into a single annotation column in the final MetaGraph index. All the annotation columns were finally aggregated into a joint annotation matrix, sparsified with the RowDiff transform and compressed in the Multi-BRWT representation.
Indexing the RefSeq and UniParc collections
The NCBI RefSeq database9 contains a non-redundant collection of genomic DNA sequences (all assembled reference genome sequences), transcripts and proteins.
We indexed all 32,881,422 nucleotide sequences from release 97 of the RefSeq collection, a total of 1.7 Tbp, which takes 483 GB when compressed with gzip -9, using a de Bruijn graph k-mer index with k = 31. We annotated k-mer coordinates within buckets split by Taxonomy IDs (85,375 annotation columns with tuples of k-mer coordinates). The graph was constructed in basic mode (non-canonical, non-primary), as all the sequences of the collection are assemblies and are therefore of a determined orientation.
As expected, the compression ratio (the ratio between the compressed input and the index size) is lower than for the raw sequencing read sets, at 3.3 bp per byte (Table 1). Our index forms an alternative to the commonly used BLAST database27,28,85 for competitive high-throughput search5.
For amino acids, we indexed the UniParc collection of non-redundant sequences (release 2023_04), containing 210 gigaresidues, using a basic-mode graph and k-mer coordinate annotations to ensure lossless encoding. As expected, the compression ratio is low, at 1.7 amino acids per byte.
Indexing global ocean microbiome (Tara Oceans) data
This collection (v.1.0) contains 34,815 genomes reconstructed from metagenomic datasets from major oceanographical surveys and time-series studies with high coverage of global ocean microbial communities across ocean basins, depth layers and time11. In addition to metagenome-assembled genomes (MAGs) constructed from 1,038 publicly available metagenomes extracted from ocean water samples collected at 215 globally distributed sampling sites, the collection includes a set of single amplified genomes and reference genome sequences of marine bacteria and archaea from other existing databases. For more details on the data composition, refer to the original publication11.
We constructed an index for this collection (summary information is provided in Table 1) using a de Bruijn graph of order k = 31 constructed in the basic mode. This index encodes the coordinates of the k-mers within individual assembled genomes and therefore losslessly represents the input sequences. Notably, it still achieves a compression ratio of 4.2 bp per byte. We also indexed the raw assembled scaffolded contigs (360 gigabases) in an annotated graph with 318 million annotation labels. Owing to the very large number of annotation columns, we represent the RowDiff-transformed annotation matrix of this index in the RowFlat format for fast row queries, in contrast to the Multi-BRWT format typically used for the annotation matrices of our other indexes.
Indexing a subset of the Logan dataset
Building on the publicly available Logan resource13, we grouped samples available in Logan (v.1.0) into subsets based on their phylogenetic relation. On the basis of the organism field in the metadata, we grouped samples together if at least 10,000 samples shared the same label. For the remainder, we mapped all samples to the NCBI taxonomy. Beginning at the species level, we aggregated samples into the same group if they shared the same taxonomic assignment and formed a group of at least 1,000 samples. For the samples still ungrouped, we repeated the taxonomic grouping with the next-higher taxonomic levels (genus, family, order, superclass, kingdom, superkingdom) in the same manner. Samples that remained ungrouped were assigned to a special group, other. We performed the above procedure separately for DNA and RNA samples. Once a group was formed, we split it into subgroups based on the number of cumulative unique contig-level k-mers reported in the Logan metadata. We processed the samples in reverse order of release date and started a new subgroup whenever the cumulative k-mer count exceeded 500 billion, except for the groups metagenome and other, and groups above the superclass level. We indexed each subgroup independently, directly downloading the Logan contig-level samples from the cloud repository. If a contig-level sample was not available, we used the unitig-level sample. We tried downloading each sample up to ten times and marked the sample as not found if none of the attempts were successful. After download, we built a joint canonical graph from all inputs, without cleaning, and then transformed the canonical graph into its primary representation. We then built a joint annotation on the primary graph from the same input samples and subsequently computed the RowDiff sparsification of all annotation columns. Finally, all annotation columns were compressed with Multi-BRWT compression using the default settings. Optionally, we transformed the primary graph from its default stat representation into small.
Indexing the UHGG
We built two indexes using the assemblies from v.1.0 of the dataset. The first index, the UHGG catalogue, contains 4,644 reference genomes, while the other index contains 286,997 non-redundant genomes.
Experiments
This section summarizes the experimental setup for the different results presented in this work.
Benchmarking Mantis, Themisto, Bifrost and Fulgor
For indexing and querying k-mer sets with these lossless indexing tools, we used Mantis (v.0.2.0), Themisto (v.3.2.2), Bifrost (v.1.3.5) and Fulgor (v.3.0.0).
Benchmarking COBS and kmindex
When indexing subsets of the collection of bacterial and viral genomic read sets32 in our evaluation experiments, we used COBS33 (commit 1cd6df2) with four hash functions and the target false-positive rate of 5%. For kmindex (v.0.5.3), we partitioned the samples into groups based on their k-mer counts, with one group per order of magnitude. We then constructed 28-mer indexes with false-positive rates of 5% for each group to leverage the findere algorithm86 for querying 31-mers at a reduced false-positive rate.
Experiment discovery on SRA graphs
We evaluated each graph using 300 randomly selected samples from their respective input samples. To generate a query file for a graph, we randomly selected 100 reads (or the entire read set if fewer than 100 reads were available) from each of the 300 selected samples, resulting in query files of 30,000 reads per graph. To generate auxiliary reads with errors, we selected subsets of the original query sets such that 10 random reads were selected from each sample. We then introduced substitution errors to these reads with probabilities 0.1%, 1%, 2%, 5% and 10%, and insertions–deletions at 10% of the substitution probability, using Mutation-Simulator87 (v.3.0.1). Given these read sets, we then discarded all reads whose median k-mer multiplicities were below the unitig cleaning thresholds determined by the cleaning procedure (see the ‘Graph cleaning’ section of the Methods).
We evaluated experiment discovery at two different levels of granularity: (1) mapping to individual sample graphs (Extended Data Fig. 3a); and (2) mapping to joint annotated graphs (Fig. 3a,b). When mapping to joint graphs, we considered only mapping results that retrieved the ground-truth label of each query read. At both levels of granularity, we mapped the reads through both exact k-mer matching and label-consistent sequence-to-graph alignment using SCA6. We measured how well the reads aligned as the percentage of characters in the query covered by at least one reported mapping.
Human gut resistome and phageome exploration
We queried all AMR genes from the Comprehensive Antibiotic Resistance Database (CARD; v.3.2.7)44 and all bacteriophages from RefSeq release 218 (ref. 45). Bacteriophages were selected as all viral sequences with the term phage in their header. We mapped these sequences to all accessions in the SRA-MetaGut index representing WMGS samples. We recorded an accession as a match to a query if at least 80% of the query’s k-mers exactly matched a k-mer in the accession. We then reduced the pools of AMR genes, SRA accessions and RefSeq bacteriophages to those for which at least one match was found.
To measure the degree of association for each AMR gene family–bacteriophage pair, we computed two binary vectors in which each index represents a gut microbiome sample. The first vector indicates the presence of at least one gene match from the gene family and the second vector indicates a match to the phage. We then measured the association using the Matthews correlation coefficient (denoted by CorrMCC) if both vectors indicated at least 5 present matches (value 1) and at least 5 absent matches (value 0).
When measuring the growth of resistance to antibiotics over time in each continent, we normalized the counts in the confusion matrix before computing CorrMCC to correct for differing numbers of samples deposited in each year and from each continent. If we denote the number of accessions from a continent C by nC and the number of accessions from a year Y by nY, we normalized the counts by letting each accession from continent C and year Y contribute a count of \(c=\frac{{n}_{N}}{{n}_{C}\cdot {n}_{Y}}\) instead of 1, where nN is a scaling factor applied to the four counts in the confusion matrix so that their sum equals the total number of accessions considered. We used the same scaling factors c for the counts when fitting a linear regression model to the match counts to determine drug-resistance growth over time for each continent.
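Under the stated definitions, the per-accession weights can be computed as follows; this is a sketch assuming records of the form (continent, year, match).

```python
# Minimal sketch of the confusion-matrix normalization: each accession
# contributes c = n_N / (n_C * n_Y), with n_N chosen so that the weighted
# counts sum to the total number of accessions considered.
from collections import Counter

def accession_weights(records):  # records: (continent, year, is_match)
    n_C = Counter(r[0] for r in records)
    n_Y = Counter(r[1] for r in records)
    raw = [1.0 / (n_C[c] * n_Y[y]) for c, y, _ in records]
    n_N = len(records) / sum(raw)            # scaling factor
    return [w * n_N for w in raw]            # per-accession contributions c
```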
We compute P values for each gene-to-phage correlation CorrMCC and each drug resistance growth linear regression slope s through permutation testing. For each analysis, we compute 100 permutations of the antibiotic and gene family indicator vectors, respectively, and compute P values relative to the resulting null distributions.
In Fig. 4a, we only plot a gene family or a phage if it has at least one significant correlation with CorrMCC > 0.25. In Fig. 4b, we report all antibiotics for which we measured statistically significant growth in at least one continent (modelled through a binomial GLM using the Python statsmodels package v.0.14.0). All P values are corrected using the Benjamini–Yekutieli procedure to a family-wise error rate of 0.05 and are considered significant if P < 0.05 after correction (using the Python scipy package v.1.11.3).
Survey of BSJs
To generate the list of candidate trans-junctions, we iterated over all genes present in the GENCODE annotation (v.38) and generated a list of hypothetical BSJs for all transcripts. Assuming that a transcript consists of exons e1, e2, e3 and e4, we would connect the donor site of each exon, starting from exon e2, with the acceptor sites of all previous exons. The example transcript above would generate the following BSJ candidates: e2-e1, e3-e2, e3-e1, e4-e3, e4-e2, e4-e1. We included only exons with a length of at least 30 bp (which equals k − 1 for k = 31). For all junctions, we extracted the genomic sequences in a ±1.5 × k window around the junction, resulting in query sequences of 90 bp in length. We aligned all 4,052,768 candidate queries against the TCGA and GTEx MetaGraph indexes using the default metagraph align regime. Moreover, we also aligned all queries to the GENCODE (v.38) reference transcriptome using bwa-mem (v.0.7.17-r1188)88 and against the hg38 human reference genome (GRCh38.p13, packaged with GENCODE v.38) using STAR (v.2.7.0f)89. We conservatively retained only full-length matches against the graph, which removed most of our candidates, retaining 10,257 and 10,369 candidates for TCGA and GTEx, respectively. After making the candidate list unique over sequences and removing genome and transcriptome hits from the previous STAR and bwa-mem alignments, we retained 2,536 and 1,248 candidates for TCGA and GTEx, respectively. For our validation set, we downloaded accession GSE141693 from the Gene Expression Omnibus, containing experimentally confirmed circular RNAs48. We lifted over all annotations from hg19 to hg38 using the UCSC LiftOver tool90, which could map all but 70 of the 87,555 circular RNA annotations. For all junctions, we extracted queries as windows of ±1.5 × k around the junctions, resulting in 90-bp query sequences. Alignment against the MetaGraph indexes, the reference genome and the transcriptome, as well as subsequent filtering, was analogous to the in silico generated BSJ queries.
Theoretical model for exact k-mer matching
To derive a theoretically expected number of random matches to our 100-study subset (Fig. 5c), we assume that a random sequence containing 100 k-mers (the average length in the query set used for Fig. 5a) is matched against a sample containing 252,631,773 k-mers (the average sample size in our 100-study subset). We use a hypergeometric distribution (that is, random sampling without replacement) to compute the probability of the query sequence having at least one matching k-mer in a sample. Using this as the success probability of a Bernoulli random variable, we multiply it by the number of samples in our 100-study subset to obtain the expected number of matches to the subset. We then scale this up to the entire SRA.
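A worked sketch of this computation; the k-mer universe size (here, the number of canonical 31-mers) and the subset size are our placeholder assumptions, not values stated in the text.

```python
# Worked sketch of the random-match model using the hypergeometric
# probability of at least one shared k-mer between query and sample.
def p_at_least_one_match(universe, sample_kmers, query_kmers):
    # P(no query k-mer falls among the sample's k-mers), computed exactly
    p_none = 1.0
    for i in range(query_kmers):
        p_none *= (universe - sample_kmers - i) / (universe - i)
    return 1.0 - p_none

universe = 4 ** 31 // 2      # number of canonical 31-mers (our assumption)
p = p_at_least_one_match(universe, 252_631_773, 100)
num_samples = 10_000         # hypothetical subset size; use the actual count
expected_matches = p * num_samples
```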
Cloud-based query experiments
Our index files computed from the Logan resource are available on AWS S3 storage. Analogous to the indexes formed from the taxonomic grouping of input samples, we also generated 50 disjoint input sets, each aggregating the raw data of 100 randomly selected SRA studies. For 47 of the groups, we were able to construct a MetaGraph index using the same strategy as for the other Logan-based indexes. For the remaining three groups, no index could be constructed because exceptionally large studies were selected, preventing the assembly of a single joint index. In addition to these sets, we also selected a subset of 21 index groups from the Logan set that contained only metagenome samples, as they represent the largest input diversity. Specifically, we selected metagenome groups 30, 50, 51, 57, 77, 85, 87, 110, 125, 133, 169, 184, 189, 192, 248, 253, 273, 316, 338, 376 and 385. Based on the total number of unique k-mers in each index group, we extrapolated our query results to the entire index set.
We selected reads from the samples used to construct the 100-studies index described in Table 1 to construct query sets of varying sizes. First, we selected 300 random samples from this set. Then, given an integer N, we selected N random reads from each of these 300 samples and filtered these by selecting reads of length at least 100 bp and at most 250 bp. We generated query sets for N ∈ {1, 50, 500, 5,000, 50,000, 5,000,000}. Moreover, we selected 1 out of every 20 reads from the N = 1 query set to generate our smallest query set.
We used an AWS CloudFormation template (https://github.com/ratschlab/metagraph-open-data) to deploy the querying infrastructure, including AWS Batch compute environments and job queues on r6in, hpc7g and hpc7a EC2 instances, a Step Function and Lambdas for job scheduling and cost estimation. On this setup, we searched all query files in all 47 random study indexes. Smaller queries, for which the execution time is bounded by downloading the index, were processed on r6in instances, while the larger queries, for which the execution time is bounded by the actual runtime, were processed on hpc7g instances or, when an index did not fit in the available RAM, on hpc7a. After processing all indexes, the total execution time of active EC2 instances was computed to estimate the EC2-related costs of the query. Mountpoint for Amazon S3 was used to avoid staging data on disk and instead load it directly into RAM. Below are the exact MetaGraph CLI parameters that were used for running the queries. Exact match queries: --batch-size 8000000 -p $(nproc) --threads-each 1, 32 or 48 --query-mode labels --min-kmers-fraction-label 0.7 --min-kmers-fraction-graph 0.0 --num-top-labels inf. Fast alignment queries: --batch-size 200000 -p $(nproc) --align-min-exact-match 0.85 --align-min-seed-length 27. Sensitive alignment queries: --batch-size 200000 -p $(nproc) --align-min-exact-match 0.0 --align-min-seed-length 27. The --threads-each parameter was chosen based on the instance type for r6in, hpc7g and hpc7a instances, and $(nproc) depends on the specific instance on which the query was processed. The GitHub repository provides further details on a reproducible deployment of the AWS querying infrastructure.
The architecture and the implementation of MetaGraph Online
The architecture of the MetaGraph Online service is schematically shown in Supplementary Fig. 10. The backend server (Supplementary Fig. 10 (middle)) is implemented as a Flask (v.2.3.0) application deployed in a Docker container (v.1.13.1; API v.1.26) behind an Nginx (v.1.16.1) server. It provides a web interface and generates dynamic web pages for submitting search queries and viewing the query results. Each user query is processed and transformed into a search request, which is sent through the Python (v.3.9) API client to the remote servers hosting the MetaGraph indexes (Supplementary Fig. 10 (right)). Once the query has been executed by the respective MetaGraph server, the results are sent back to the backend server and an aggregated summary is rendered to the user. Each MetaGraph index is hosted on a remote server by running the MetaGraph tool in server mode; these server applications run independently, in Docker containers on the same or other machines, and are distributed across the available hardware in a federated manner. Each MetaGraph server receives the HTTP requests formed by the central backend server upon user search requests. For seamless compatibility, the backend server also redirects user requests and provides the same web API for querying MetaGraph servers (for example, from Python or as a simple HTTP request) as if the MetaGraph tool were running locally in server mode.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.