Skip to content

Revise recombination, chain, and alignment scoring#4887

Open
adamnovak wants to merge 15 commits intomasterfrom
score_seeds_fix_recombination
Open

Revise recombination, chain, and alignment scoring#4887
adamnovak wants to merge 15 commits intomasterfrom
score_seeds_fix_recombination

Conversation

@adamnovak
Copy link
Copy Markdown
Member

Changelog Entry

To be copied to the draft changelog by merger:

  • Revised Giraffe chain and recombination scoring.
  • Calling results from nanopore reads are now better than v1.73.0 again.

Description

This includes @dcmonti's changes to recombination penalty computation, and to chain scoring even when recombination-aware mode is not used. I don't completely know what's in here; I think it includes the change to re-score reads after DP and penalize them for recombinations required by the chains, and I think it also adopts more scoring functions from minimap2.

Evaluated as commit 90fd84 against a2eb9e in mainline vg, it improves nanopore-based read calling with DeepVariant as measured with https://github.com/vgteam/recombination-aware-giraffe-experiments:

==> output/experiments/r10y2025_real_full_call_chm13/results/snp_errors.tsv <==
giraffe-k31.w50.W-90fd84-noflags	52247
giraffe-k31.w50.W-a2eb9e-noflags	54997
giraffe-k31.w50.W-v1.73.0-noflags	54722
giraffe-k31.w50.W.path-90fd84-rpc10	51989
giraffe-k31.w50.W.path-a2eb9e-rpc10	54944
giraffe-k31.w50.W.path-v1.73.0-rpc10	54784

==> output/experiments/r10y2025_real_full_call_chm13/results/indel_errors.tsv <==
giraffe-k31.w50.W-90fd84-noflags	308328
giraffe-k31.w50.W-a2eb9e-noflags	309117
giraffe-k31.w50.W-v1.73.0-noflags	309050
giraffe-k31.w50.W.path-90fd84-rpc10	308283
giraffe-k31.w50.W.path-a2eb9e-rpc10	309075
giraffe-k31.w50.W.path-v1.73.0-rpc10	309041

==> output/experiments/r10y2025_real_full_call_chm13/results/total_errors.tsv <==
giraffe-k31.w50.W-90fd84-noflags	360575
giraffe-k31.w50.W-a2eb9e-noflags	364114
giraffe-k31.w50.W-v1.73.0-noflags	363772
giraffe-k31.w50.W.path-90fd84-rpc10	360272
giraffe-k31.w50.W.path-a2eb9e-rpc10	364019
giraffe-k31.w50.W.path-v1.73.0-rpc10	363825

And it also improves HiFi calling:

==> output/experiments/hifi_real_full_call_chm13/results/snp_errors.tsv <==
giraffe-90fd84-rpc0	23482
giraffe-90fd84-rpc10	23171
giraffe-90fd84-rpc3	23342
giraffe-a2eb9e-rpc10	23758

==> output/experiments/hifi_real_full_call_chm13/results/indel_errors.tsv <==
giraffe-90fd84-rpc0	61168
giraffe-90fd84-rpc10	61092
giraffe-90fd84-rpc3	61106
giraffe-a2eb9e-rpc10	61457

==> output/experiments/hifi_real_full_call_chm13/results/total_errors.tsv <==
giraffe-90fd84-rpc0	84650
giraffe-90fd84-rpc10	84263
giraffe-90fd84-rpc3	84448
giraffe-a2eb9e-rpc10	85215

This should overwhelm the nanopore calling accuracy regression introduced in #4862.

@adamnovak adamnovak changed the title Score seeds fix recombination Revise recombination and chain scoring Apr 23, 2026
Comment thread src/minimizer_mapper_from_chains.cpp Outdated
Comment on lines +858 to +923
int matches = 0;
int mismatches = 0;
int gap_opens = 0;
vector<size_t> gap_lengths;

enum class EditType { MATCH, MISMATCH, INS, DEL, COMPLEX, NONE };
EditType prev_type = EditType::NONE;
size_t current_gap_length = 0;

auto finish_gap = [&]() {
if (current_gap_length > 0) {
gap_opens++;
gap_lengths.push_back(current_gap_length);
current_gap_length = 0;
}
};

for (size_t i = 0; i < alignments[alignment_index].path().mapping_size(); ++i) {
auto& mapping = alignments[alignment_index].path().mapping(i);
for (size_t j = 0; j < mapping.edit_size(); ++j) {
auto& edit = mapping.edit(j);
if (edit.from_length() == edit.to_length() && edit.from_length() > 0) {
finish_gap();
if (edit.sequence().empty()) {
matches += edit.from_length();
prev_type = EditType::MATCH;
} else {
mismatches += edit.from_length();
prev_type = EditType::MISMATCH;
}
} else if (edit.from_length() == 0 && edit.to_length() > 0) {
if (prev_type != EditType::INS) finish_gap();
current_gap_length += edit.to_length();
prev_type = EditType::INS;
} else if (edit.from_length() > 0 && edit.to_length() == 0) {
if (prev_type != EditType::DEL) finish_gap();
current_gap_length += edit.from_length();
prev_type = EditType::DEL;
} else {
finish_gap();
mismatches += max(edit.from_length(), edit.to_length());
prev_type = EditType::COMPLEX;
}
}
}
finish_gap();

if (matches + mismatches + gap_opens == 0) {
continue;
}

double d = max(0.02, static_cast<double>(mismatches + gap_opens) / static_cast<double>(matches + mismatches + gap_opens));
double non_match_penalty = static_cast<double>(mismatches + gap_opens) / (2.0 * d);

double indel_penalty = 0;
for (auto& gap_length : gap_lengths) {
indel_penalty += log2(1.0 + gap_length);
}
int adjusted_score = std::round(matches - non_match_penalty - indel_penalty);
alignments[alignment_index].set_score(adjusted_score);
if (show_work) {
#pragma omp critical (cerr)
{
cerr << log_name() << "Matches: " << matches << " Mismatches: " << mismatches << " Gap opens: " << gap_opens << " New score: " << adjusted_score << endl;
}
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably wants to be a function to compute the score of an alignment by minimap2 rules.

Also, is it supposed to penalize a read like 50 points for a 1 base mismatch and a 1 base deletion? There's a test for Giraffe that makes sure we get the "right" score for a nearly perfect match read, and now we don't. The log has:

T2:	Matches: 7999 Mismatches: 1 Gap opens: 1 New score: 7948
T2:	alignment 0 accepted because 1 of it is from nodes not already used
T2:	Picked best alignment 1274M1I3683M1X3042M@98299+ score 7948

But the A long read can be correctly aligned test still thinks the score ought to be 7999.

@adamnovak adamnovak changed the title Revise recombination and chain scoring Revise recombination, chain, and alignment scoring Apr 23, 2026
@adamnovak
Copy link
Copy Markdown
Member Author

We're also going to want to plug the new scoring into surject (when it's in long read mode?) by finding all the score_contiguous_alignment() calls there and changing them.

And we need to figure out what the new scoring method's match and mismatch score values are (1 and -1?) so we can use them to get a log_base for MAPQ computation. Or else give up on that bit of theory and just empirically learn a good scaling to get calibrated MAPQs.

@adamnovak
Copy link
Copy Markdown
Member Author

I checked the QQ plots for this and they aren't universally better, but they're good enough to merge, given the mapping and calling improvements.

Comment thread src/algorithms/chain_items.cpp Outdated
// Track eval bonus for heuristic comparison (path conservation bonus, used for
// selection only without affecting the actual stored score).
// Starting from nowhere means full path conservation, so bonus = recomb_penalty.
std::vector<int> eval_bonuses(to_chain.size(), std::max(0, recomb_penalty));
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we doing all this maxing when the recombination penalty will never be negative?

Comment thread src/algorithms/chain_items.cpp Outdated
Comment on lines +560 to +561
// Track eval bonus for heuristic comparison (path conservation bonus, used for
// selection only without affecting the actual stored score).
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we should be able to pick predecessors based on one score and carry through a different score without completely breaking dynamic programming, but it's not really officially allowed under dynamic programming, right? We might need to introduce more that we're doing an exciting thing here, with less parentheticals and more explicit description about what the numbers in this vector represent and belong to.

Comment on lines +697 to +703
auto& current_best = chain_scores[transition.to_anchor];
int eval_from = from_source_score.score + eval_bonus_from;
int eval_best = current_best.score + eval_bonuses[transition.to_anchor];

if (eval_from > eval_best) {
current_best = from_source_score;
eval_bonuses[transition.to_anchor] = eval_bonus_from;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main codepath but it doesn't have any explanation. eval here doesn't mean anything to me by itself. I guess it's meant to suggest "the score we actually use to evaluate the possible alternatives"?

Comment thread src/minimizer_mapper_from_chains.cpp Outdated
Comment on lines +849 to +852
// Implement the logic for minimap2 long indels penalty adjustment.
// The new alignment score penalize long continous indels less, using the formula:
// score = matches - (mismatches + gap_opens)/2d - sum_{i=1}^{gap_opens} (log_2(1 + gap_length_i))
// with d = max{0.02, (mismatches + gap_opens)/(matches + mismatches + gap_opens)}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably wanted to cut this comment here since it's now function doc comments.

Suggested change
// Implement the logic for minimap2 long indels penalty adjustment.
// The new alignment score penalize long continous indels less, using the formula:
// score = matches - (mismatches + gap_opens)/2d - sum_{i=1}^{gap_opens} (log_2(1 + gap_length_i))
// with d = max{0.02, (mismatches + gap_opens)/(matches + mismatches + gap_opens)}
// Rescore all the alignments using minimap2 logged-gap-length, read-identity-based scoring

Comment thread src/minimizer_mapper_from_chains.cpp Outdated
if (chain_index != std::numeric_limits<size_t>::max() && chain_index < chain_rec_counts.size()) {
set_annotation(alignments[alignment_index], "chain.rec_count", (double) chain_rec_counts[chain_index]);
if (rec_penalty_chain != 0) {
//int64_t penalty = min(static_cast<int64_t>(1), static_cast<int64_t>(rec_penalty_chain)/5) * static_cast<int64_t>(chain_rec_counts[chain_index]);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should cut this commented-out code before merging.

Suggested change
//int64_t penalty = min(static_cast<int64_t>(1), static_cast<int64_t>(rec_penalty_chain)/5) * static_cast<int64_t>(chain_rec_counts[chain_index]);

Comment thread src/minimizer_mapper_from_chains.cpp Outdated
if (rec_penalty_chain != 0) {
//int64_t penalty = min(static_cast<int64_t>(1), static_cast<int64_t>(rec_penalty_chain)/5) * static_cast<int64_t>(chain_rec_counts[chain_index]);
int64_t penalty = static_cast<int64_t>(rec_penalty_chain) * static_cast<int64_t>(chain_rec_counts[chain_index]);
int64_t adjusted_score = static_cast<int64_t>(alignments[alignment_index].score()) - penalty;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the second thing we're calling adjusted_score. And the last one was an int and this one is an int64_t.

@adamnovak
Copy link
Copy Markdown
Member Author

I have code to solve all my complaints.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants