Revise recombination, chain, and alignment scoring #4887
Conversation
…eds_fix_recombination
```cpp
int matches = 0;
int mismatches = 0;
int gap_opens = 0;
vector<size_t> gap_lengths;

enum class EditType { MATCH, MISMATCH, INS, DEL, COMPLEX, NONE };
EditType prev_type = EditType::NONE;
size_t current_gap_length = 0;

auto finish_gap = [&]() {
    if (current_gap_length > 0) {
        gap_opens++;
        gap_lengths.push_back(current_gap_length);
        current_gap_length = 0;
    }
};

for (size_t i = 0; i < alignments[alignment_index].path().mapping_size(); ++i) {
    auto& mapping = alignments[alignment_index].path().mapping(i);
    for (size_t j = 0; j < mapping.edit_size(); ++j) {
        auto& edit = mapping.edit(j);
        if (edit.from_length() == edit.to_length() && edit.from_length() > 0) {
            finish_gap();
            if (edit.sequence().empty()) {
                matches += edit.from_length();
                prev_type = EditType::MATCH;
            } else {
                mismatches += edit.from_length();
                prev_type = EditType::MISMATCH;
            }
        } else if (edit.from_length() == 0 && edit.to_length() > 0) {
            if (prev_type != EditType::INS) finish_gap();
            current_gap_length += edit.to_length();
            prev_type = EditType::INS;
        } else if (edit.from_length() > 0 && edit.to_length() == 0) {
            if (prev_type != EditType::DEL) finish_gap();
            current_gap_length += edit.from_length();
            prev_type = EditType::DEL;
        } else {
            finish_gap();
            mismatches += max(edit.from_length(), edit.to_length());
            prev_type = EditType::COMPLEX;
        }
    }
}
finish_gap();

if (matches + mismatches + gap_opens == 0) {
    continue;
}

double d = max(0.02, static_cast<double>(mismatches + gap_opens) / static_cast<double>(matches + mismatches + gap_opens));
double non_match_penalty = static_cast<double>(mismatches + gap_opens) / (2.0 * d);

double indel_penalty = 0;
for (auto& gap_length : gap_lengths) {
    indel_penalty += log2(1.0 + gap_length);
}
int adjusted_score = std::round(matches - non_match_penalty - indel_penalty);
alignments[alignment_index].set_score(adjusted_score);
if (show_work) {
    #pragma omp critical (cerr)
    {
        cerr << log_name() << "Matches: " << matches << " Mismatches: " << mismatches << " Gap opens: " << gap_opens << " New score: " << adjusted_score << endl;
    }
}
```
This probably wants to be a function to compute the score of an alignment by minimap2 rules.
Also, is it supposed to penalize a read like 50 points for a 1 base mismatch and a 1 base deletion? There's a test for Giraffe that makes sure we get the "right" score for a nearly perfect match read, and now we don't. The log has:
```
T2: Matches: 7999 Mismatches: 1 Gap opens: 1 New score: 7948
T2: alignment 0 accepted because 1 of it is from nodes not already used
T2: Picked best alignment 1274M1I3683M1X3042M@98299+ score 7948
```
But the "A long read can be correctly aligned" test still thinks the score ought to be 7999.
We're also going to want to plug the new scoring into surject (when it's in long read mode?) by finding all the […] And we need to figure out what the new scoring method's match and mismatch score values are (1 and -1?) so we can use them to get a […]

I checked the QQ plots for this and they aren't universally better, but they're good enough to merge, given the mapping and calling improvements.
```cpp
// Track eval bonus for heuristic comparison (path conservation bonus, used for
// selection only without affecting the actual stored score).
// Starting from nowhere means full path conservation, so bonus = recomb_penalty.
std::vector<int> eval_bonuses(to_chain.size(), std::max(0, recomb_penalty));
```
Why are we doing all this maxing when the recombination penalty will never be negative?
```cpp
// Track eval bonus for heuristic comparison (path conservation bonus, used for
// selection only without affecting the actual stored score).
```
It seems like we should be able to pick predecessors based on one score and carry through a different score without completely breaking dynamic programming, but it's not really officially allowed under dynamic programming, right? We might need to note more explicitly that we're doing an exciting thing here, with fewer parentheticals and a more explicit description of what the numbers in this vector represent and what they belong to.
```cpp
auto& current_best = chain_scores[transition.to_anchor];
int eval_from = from_source_score.score + eval_bonus_from;
int eval_best = current_best.score + eval_bonuses[transition.to_anchor];

if (eval_from > eval_best) {
    current_best = from_source_score;
    eval_bonuses[transition.to_anchor] = eval_bonus_from;
```
This is the main codepath, but it doesn't have any explanation. "eval" here doesn't mean anything to me by itself. I guess it's meant to suggest "the score we actually use to evaluate the possible alternatives"?
```cpp
// Implement the logic for minimap2 long indels penalty adjustment.
// The new alignment score penalize long continous indels less, using the formula:
// score = matches - (mismatches + gap_opens)/2d - sum_{i=1}^{gap_opens} (log_2(1 + gap_length_i))
// with d = max{0.02, (mismatches + gap_opens)/(matches + mismatches + gap_opens)}
```
I probably wanted to cut this comment here, since it's now in the function's doc comment.
Suggested change:
```diff
-// Implement the logic for minimap2 long indels penalty adjustment.
-// The new alignment score penalize long continous indels less, using the formula:
-// score = matches - (mismatches + gap_opens)/2d - sum_{i=1}^{gap_opens} (log_2(1 + gap_length_i))
-// with d = max{0.02, (mismatches + gap_opens)/(matches + mismatches + gap_opens)}
+// Rescore all the alignments using minimap2 logged-gap-length, read-identity-based scoring
```
```cpp
if (chain_index != std::numeric_limits<size_t>::max() && chain_index < chain_rec_counts.size()) {
    set_annotation(alignments[alignment_index], "chain.rec_count", (double) chain_rec_counts[chain_index]);
    if (rec_penalty_chain != 0) {
        //int64_t penalty = min(static_cast<int64_t>(1), static_cast<int64_t>(rec_penalty_chain)/5) * static_cast<int64_t>(chain_rec_counts[chain_index]);
```
We should cut this commented-out code before merging.
Suggested change:
```diff
-//int64_t penalty = min(static_cast<int64_t>(1), static_cast<int64_t>(rec_penalty_chain)/5) * static_cast<int64_t>(chain_rec_counts[chain_index]);
```
```cpp
if (rec_penalty_chain != 0) {
    //int64_t penalty = min(static_cast<int64_t>(1), static_cast<int64_t>(rec_penalty_chain)/5) * static_cast<int64_t>(chain_rec_counts[chain_index]);
    int64_t penalty = static_cast<int64_t>(rec_penalty_chain) * static_cast<int64_t>(chain_rec_counts[chain_index]);
    int64_t adjusted_score = static_cast<int64_t>(alignments[alignment_index].score()) - penalty;
```
This is the second thing we're calling `adjusted_score`. And the last one was an `int` and this one is an `int64_t`.
I have code to solve all my complaints.
Changelog Entry
To be copied to the draft changelog by merger:
Description
This includes @dcmonti's changes to recombination penalty computation, and to chain scoring even when recombination-aware mode is not used. I don't completely know what's in here; I think it includes the change to re-score reads after DP and penalize them for recombinations required by the chains, and I think it also adopts more scoring functions from minimap2.
Evaluated as commit 90fd84 against a2eb9e in mainline vg, it improves nanopore-based read calling with DeepVariant as measured with https://github.com/vgteam/recombination-aware-giraffe-experiments. And it also improves HiFi calling.
This should overwhelm the nanopore calling accuracy regression introduced in #4862.