Queries
Fulgor supports two types of queries: pseudoalignment and k-mer conservation.
Given a query string Q, a pseudoalignment query returns the set of colors (i.e. the references) containing all of its k-mers.
If the threshold parameter -r tau is set, the query instead returns the colors that contain at least tau percent of the k-mers of Q.
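The difference between the two modes can be illustrated with a toy model. The sketch below is not Fulgor's implementation. It assumes a hypothetical in-memory dictionary kmer_colors that maps each k-mer to its set of colors, treats tau as a fraction in (0, 1] (the unit expected by -r is not specified here), and counts k-mers missing from the dictionary as misses. All three choices are assumptions made for illustration only.

```python
from collections import Counter


def kmers(query, k):
    """Return the overlapping substrings of length k of `query`."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def pseudoalign(query, k, kmer_colors, tau=None):
    """Toy pseudoalignment over a hypothetical k-mer -> color-set dictionary.

    tau=None: keep the colors that contain every k-mer of the query.
    tau=f:    keep the colors that contain at least a fraction f of them
              (assumption: tau is a fraction, not a percentage).
    """
    query_kmers = kmers(query, k)
    if not query_kmers:
        return []
    # Count, for every color, how many k-mers of the query it contains.
    hits = Counter()
    for kmer in query_kmers:
        for color in kmer_colors.get(kmer, ()):  # absent k-mers count as misses
            hits[color] += 1
    needed = len(query_kmers) if tau is None else tau * len(query_kmers)
    return sorted(color for color, n in hits.items() if n >= needed)
```

With the toy dictionary {"ACG": {0, 1}, "CGT": {0}, "GTA": {0, 1}} and k = 3, the query ACGTA yields [0] without a threshold and [0, 1] with tau=0.6, because color 1 contains two of the three k-mers.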
Warning
ASCII output can be very large. If your disk space is limited, consider using --format binary or --format compressed to reduce the size of the output.
The result of a single pseudoalignment query is a line containing the following tab-separated values:
query_id num_colors color_0 color_1 ... color_n
where colors are sorted in increasing order.
query_id is the index of the query inside the .fastq file provided with the parameter -q. To retrieve its name, run the following bash command, replacing query_id and query_filename with the actual values:
awk -v n=query_id 'NR == (n-1)*4+1' query_filename
Example
21 1 0
949 3 0 3 7
203 1 0
953 2 0 8
42 0
This means that:
- the k-mers (all, or at least tau) of query 21 are found only in reference 0. The same is true for query 203.
- the k-mers (all, or at least tau) of query 949 are found in references 0, 3, 7.
- the k-mers (all, or at least tau) of query 953 are found in references 0, 8.
- the k-mers (all, or at least tau) of query 42 were not found in any reference.
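For downstream processing, the ASCII layout described above can be read back with a few lines of Python. This is a minimal sketch, not part of Fulgor: the function names parse_ascii_line and parse_ascii are hypothetical, and the only assumption about the file is the tab-separated layout given above.

```python
def parse_ascii_line(line):
    """Parse one result line: query_id, num_colors, then num_colors colors."""
    fields = line.split()  # splits on tabs (and any other whitespace)
    query_id, num_colors = int(fields[0]), int(fields[1])
    colors = [int(x) for x in fields[2:]]
    if len(colors) != num_colors:
        raise ValueError(
            f"query {query_id}: expected {num_colors} colors, got {len(colors)}"
        )
    return query_id, colors


def parse_ascii(lines):
    """Map each query_id to its list of colors, skipping blank lines."""
    return dict(parse_ascii_line(line) for line in lines if line.strip())


# The five result lines of the example, written with tab separators.
example = ["21\t1\t0", "949\t3\t0\t3\t7", "203\t1\t0", "953\t2\t0\t8", "42\t0"]
results = parse_ascii(example)
```

Here results[949] is [0, 3, 7] and results[42] is an empty list, matching the interpretation given above.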
Binary output follows the same structure as the ASCII output. Every value is encoded as an unsigned 32-bit integer, which removes the need for tabs and line breaks.
Example
The previous example would be encoded as follows, using hexadecimal representation.
00000015 00000001 00000000
000003b5 00000003 00000000 00000003 00000007
000000cb 00000001 00000000
000003b9 00000002 00000000 00000008
0000002a 00000000
Note that spaces and line breaks are used only to visually separate the values.
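Since each record announces its own length through num_colors, the binary stream can be decoded sequentially. The sketch below is an illustration, not code from Fulgor: the function name decode_binary is hypothetical, and the byte order is an assumption (little-endian by default) because it is not specified here.

```python
import struct


def decode_binary(data, byteorder="<"):
    """Decode a stream of unsigned 32-bit integers into (query_id, colors) records.

    byteorder is an assumption: "<" for little-endian, ">" for big-endian.
    """
    if len(data) % 4 != 0:
        raise ValueError("length is not a multiple of 4 bytes")
    values = struct.unpack(f"{byteorder}{len(data) // 4}I", data)
    records = []
    i = 0
    while i < len(values):
        if i + 2 > len(values):
            raise ValueError("truncated record header")
        query_id, num_colors = values[i], values[i + 1]
        colors = list(values[i + 2:i + 2 + num_colors])
        if len(colors) != num_colors:
            raise ValueError(f"query {query_id}: truncated color list")
        records.append((query_id, colors))
        i += 2 + num_colors  # jump to the next record
    return records


# Re-encode the hexadecimal example to exercise the decoder.
example_values = [0x15, 1, 0, 0x3B5, 3, 0, 3, 7, 0xCB, 1, 0, 0x3B9, 2, 0, 8, 0x2A, 0]
example_bytes = struct.pack(f"<{len(example_values)}I", *example_values)
decoded = decode_binary(example_bytes)
```

The decoded list contains the same five records as the ASCII example: queries 21, 949, 203, 953 and 42, the last one with no colors.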
The compressed format starts with two unsigned 32-bit values, sparse_threshold and very_dense_threshold, which are later used to decompress the color lists.
In particular:
- if the number of colors in a result is strictly less than sparse_threshold, it is written as the Elias-delta code representation of the gaps between consecutive values;
- if the number of colors in a result is greater than or equal to very_dense_threshold, the complement of the result set (all integers not part of the result) is written as the Elias-delta code representation of the gaps between consecutive values;
- otherwise, the color set is represented as a binary string of length N (the total number of colors in the index), where the i-th bit is set to 1 if and only if i is part of the result.
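The three cases above can be sketched as follows. This is an illustration under stated assumptions, not Fulgor's encoder: the function names are hypothetical, the Elias-delta routine is the textbook code for a single positive integer, and the exact bit layout of the compressed file (bit order, how gaps are offset so that they are never zero, how records are delimited) is not specified here and is deliberately left out.

```python
def elias_delta(x):
    """Textbook Elias-delta code of a positive integer, as a string of bits."""
    if x < 1:
        raise ValueError("Elias-delta encodes integers >= 1")
    n = x.bit_length() - 1             # floor(log2(x))
    length = (n + 1).bit_length() - 1  # floor(log2(n + 1))
    low_bits = format(x, "b")[1:]      # x without its leading 1 bit (n bits)
    return "0" * length + format(n + 1, "b") + low_bits


def choose_representation(colors, num_total_colors,
                          sparse_threshold, very_dense_threshold):
    """Return (mode, payload) following the three cases described above.

    The payload is the sorted list to be gap-encoded ("sparse" and
    "very_dense") or the bitmap as a string of "0" and "1" ("bitmap").
    """
    result = set(colors)
    if len(result) < sparse_threshold:
        return "sparse", sorted(result)
    if len(result) >= very_dense_threshold:
        # Store the colors that are NOT part of the result.
        complement = [c for c in range(num_total_colors) if c not in result]
        return "very_dense", complement
    bitmap = "".join("1" if c in result else "0" for c in range(num_total_colors))
    return "bitmap", bitmap
```

For instance, with N = 8, sparse_threshold = 2 and very_dense_threshold = 6 (values chosen only for this example), the result {0, 3, 7} falls in the middle case and becomes the bitmap 10010001, while a result holding every color except 6 is stored as the complement [6].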
TODO example
TODO
TODO