Revise recombination, chain, and alignment scoring #4887
Conversation
…eds_fix_recombination
```cpp
int matches = 0;
int mismatches = 0;
int gap_opens = 0;
vector<size_t> gap_lengths;

enum class EditType { MATCH, MISMATCH, INS, DEL, COMPLEX, NONE };
EditType prev_type = EditType::NONE;
size_t current_gap_length = 0;

auto finish_gap = [&]() {
    if (current_gap_length > 0) {
        gap_opens++;
        gap_lengths.push_back(current_gap_length);
        current_gap_length = 0;
    }
};

for (size_t i = 0; i < alignments[alignment_index].path().mapping_size(); ++i) {
    auto& mapping = alignments[alignment_index].path().mapping(i);
    for (size_t j = 0; j < mapping.edit_size(); ++j) {
        auto& edit = mapping.edit(j);
        if (edit.from_length() == edit.to_length() && edit.from_length() > 0) {
            finish_gap();
            if (edit.sequence().empty()) {
                matches += edit.from_length();
                prev_type = EditType::MATCH;
            } else {
                mismatches += edit.from_length();
                prev_type = EditType::MISMATCH;
            }
        } else if (edit.from_length() == 0 && edit.to_length() > 0) {
            if (prev_type != EditType::INS) finish_gap();
            current_gap_length += edit.to_length();
            prev_type = EditType::INS;
        } else if (edit.from_length() > 0 && edit.to_length() == 0) {
            if (prev_type != EditType::DEL) finish_gap();
            current_gap_length += edit.from_length();
            prev_type = EditType::DEL;
        } else {
            finish_gap();
            mismatches += max(edit.from_length(), edit.to_length());
            prev_type = EditType::COMPLEX;
        }
    }
}
finish_gap();

if (matches + mismatches + gap_opens == 0) {
    continue;
}

double d = max(0.02, static_cast<double>(mismatches + gap_opens) / static_cast<double>(matches + mismatches + gap_opens));
double non_match_penalty = static_cast<double>(mismatches + gap_opens) / (2.0 * d);

double indel_penalty = 0;
for (auto& gap_length : gap_lengths) {
    indel_penalty += log2(1.0 + gap_length);
}
int adjusted_score = std::round(matches - non_match_penalty - indel_penalty);
alignments[alignment_index].set_score(adjusted_score);
if (show_work) {
    #pragma omp critical (cerr)
    {
        cerr << log_name() << "Matches: " << matches << " Mismatches: " << mismatches << " Gap opens: " << gap_opens << " New score: " << adjusted_score << endl;
    }
}
```
This probably wants to be a function to compute the score of an alignment by minimap2 rules.
Also, is it supposed to penalize a read like 50 points for a 1 base mismatch and a 1 base deletion? There's a test for Giraffe that makes sure we get the "right" score for a nearly perfect match read, and now we don't. The log has:
```
T2: Matches: 7999 Mismatches: 1 Gap opens: 1 New score: 7948
T2: alignment 0 accepted because 1 of it is from nodes not already used
T2: Picked best alignment 1274M1I3683M1X3042M@98299+ score 7948
```
But the "A long read can be correctly aligned" test still thinks the score ought to be 7999.
We're also going to want to plug the new scoring into surject (when it's in long read mode?) by finding all the […] And we need to figure out what the new scoring method's match and mismatch score values are (1 and -1?) so we can use them to get a […]

I checked the QQ plots for this and they aren't universally better, but they're good enough to merge, given the mapping and calling improvements.
```cpp
// Track eval bonus for heuristic comparison (path conservation bonus, used for
// selection only without affecting the actual stored score).
// Starting from nowhere means full path conservation, so bonus = recomb_penalty.
std::vector<int> eval_bonuses(to_chain.size(), std::max(0, recomb_penalty));
```
Why are we doing all this maxing when the recombination penalty will never be negative?
```cpp
// Track eval bonus for heuristic comparison (path conservation bonus, used for
// selection only without affecting the actual stored score).
```
It seems like we should be able to pick predecessors based on one score and carry through a different score without completely breaking dynamic programming, but it's not really officially allowed under dynamic programming, right? We might need to note more explicitly that we're doing an exciting thing here, with fewer parentheticals and a more explicit description of what the numbers in this vector represent and what they belong to.
```cpp
auto& current_best = chain_scores[transition.to_anchor];
int eval_from = from_source_score.score + eval_bonus_from;
int eval_best = current_best.score + eval_bonuses[transition.to_anchor];

if (eval_from > eval_best) {
    current_best = from_source_score;
    eval_bonuses[transition.to_anchor] = eval_bonus_from;
```
This is the main codepath, but it doesn't have any explanation. "eval" here doesn't mean anything to me by itself. I guess it's meant to suggest "the score we actually use to evaluate the possible alternatives"?
```cpp
// Implement the logic for minimap2 long indels penalty adjustment.
// The new alignment score penalize long continous indels less, using the formula:
// score = matches - (mismatches + gap_opens)/2d - sum_{i=1}^{gap_opens} (log_2(1 + gap_length_i))
// with d = max{0.02, (mismatches + gap_opens)/(matches + mismatches + gap_opens)}
```
I probably wanted to cut this comment here, since it's now in the function's doc comment.
Suggested change:
```diff
-// Implement the logic for minimap2 long indels penalty adjustment.
-// The new alignment score penalize long continous indels less, using the formula:
-// score = matches - (mismatches + gap_opens)/2d - sum_{i=1}^{gap_opens} (log_2(1 + gap_length_i))
-// with d = max{0.02, (mismatches + gap_opens)/(matches + mismatches + gap_opens)}
+// Rescore all the alignments using minimap2 logged-gap-length, read-identity-based scoring
```
```cpp
if (chain_index != std::numeric_limits<size_t>::max() && chain_index < chain_rec_counts.size()) {
    set_annotation(alignments[alignment_index], "chain.rec_count", (double) chain_rec_counts[chain_index]);
    if (rec_penalty_chain != 0) {
        //int64_t penalty = min(static_cast<int64_t>(1), static_cast<int64_t>(rec_penalty_chain)/5) * static_cast<int64_t>(chain_rec_counts[chain_index]);
```
We should cut this commented-out code before merging.
Suggested change:
```diff
-//int64_t penalty = min(static_cast<int64_t>(1), static_cast<int64_t>(rec_penalty_chain)/5) * static_cast<int64_t>(chain_rec_counts[chain_index]);
```
```cpp
if (rec_penalty_chain != 0) {
    //int64_t penalty = min(static_cast<int64_t>(1), static_cast<int64_t>(rec_penalty_chain)/5) * static_cast<int64_t>(chain_rec_counts[chain_index]);
    int64_t penalty = static_cast<int64_t>(rec_penalty_chain) * static_cast<int64_t>(chain_rec_counts[chain_index]);
    int64_t adjusted_score = static_cast<int64_t>(alignments[alignment_index].score()) - penalty;
```
This is the second thing we're calling `adjusted_score`. And the last one was an `int` and this one is an `int64_t`.
I have code to solve all my complaints.
Changelog Entry
To be copied to the draft changelog by merger:
Description
This includes @dcmonti's changes to recombination penalty computation, and to chain scoring even when recombination-aware mode is not used. I don't completely know what's in here; I think it includes the change to re-score reads after DP and penalize them for recombinations required by the chains, and I think it also adopts more scoring functions from minimap2.
Evaluated as commit 90fd84 against a2eb9e in mainline vg, it improves nanopore-based read calling with DeepVariant as measured with https://github.com/vgteam/recombination-aware-giraffe-experiments. And it also improves HiFi calling.
This should overwhelm the nanopore calling accuracy regression introduced in #4862.