converting bcf 1st phase data as in v44#1938
Conversation
455a96e to
aebd6ee
Compare
daviesrob
left a comment
There was a problem hiding this comment.
If updatephasing() is changed as suggested, all of HTSlib's tests still pass.
Most of bcftools' do as well, apart from bcftools convert --haplegendsample which starts incorrectly marking lots of genotypes as partially phased. This is because the process_gt_to_hap() and process_gt_to_hap2() functions currently assume the first phasing bit is always zero. As they're likely broken for VCF4.4, they'll need to be fixed irrespective of this change. Adjusting them to ignore the first phasing bit gets them working again, and after that all bcftools tests pass.
vcf.c
Outdated
| * If the version in header is >= 4.4, no change is made. Otherwise 1st phasing | ||
| * is set if there are no other unphased ones. | ||
| */ | ||
| HTSLIB_EXPORT int update44phasing(bcf_hdr_t *h, bcf1_t *b, int setreset) |
There was a problem hiding this comment.
If we only allow the phasing to be set, we can drop setreset here too.
f3a5f06 to
07c0c5e
Compare
|
The reset made on bcf write is removed. |
fce7ac9 to
8af52e2
Compare
synced_bcf_reader.c
Outdated
| if ( ret < -1 ) files->errnum = bcf_read_error; | ||
| if ( ret < 0 ) break; // no more lines or an error | ||
| //set phasing of 1st value as in vcf v44 | ||
| if ((ret = update44phasing(reader->header, reader->buffer[reader->nbuffer+1]))) { |
There was a problem hiding this comment.
I am curious how this affects speed?
There was a problem hiding this comment.
Initially it was observed to make ~15% overhead. On recheck, last update had more overhead and tweaked further. It shows ~13% overhead with bcftools view with the latest changes
.
e94005e to
ad77218
Compare
|
The BCF conversion is removed, to avoid the performance overhead, based on discussions. With this, BCF and VCF will have different internal binary values/representations. As conversion is made for vcf, merge issue mentioned is fixed for vcf/vcf.gz. When input is bcf, the issue still exists and conversion may still be required. Performance issue will be looked further that it can be used for BCF as well. |
ad77218 to
0335a05
Compare
|
updated to have BCF and VCF output same binary values (updatephasing only for bcf with version <v44) |
Don't try to revert it while writing BCF, even for VCF < 4.4, as doing so may erase data read from an input BCF file. This does change the historic HTSlib behaviour where this bit was always written as zero, however any code relying on that will need updating anyway as VCFv4.4 and 4.5 become more common. Use hts_endian interfaces to read and write phasing values as they may be stored unaligned.
The motivation for this is to enable passing of a pointer to a bcf_hdr_t structure to bcf_readrec(), which currently does not get one. It does always get a pointer for the BGZF handle, so a header struct could be passed in via that if it can be stored somewhere. To enable this while not changing the bgzf API or ABI, extra fields are added to the opaque bgzf_cache_t field. The BGZF_CACHE macro that could be use to disable addition of the cache feature removed as it was always turned on anyway. The cache struct now has to be created for files open for write, although the cache part is not used. The hash type used by the cache is renamed from "cache" to "bgzf_cache" to improve its name-spacing. The interfaces to add, get, and remove private data are put in a new bgzf_internal.h header. The bgzf_cache_t struct definition is also moved there so that the get function can be inlined for faster access to the private data field. The bgzf_cache_t definition is rewritten slightly so that it's not necessary to invoke KHASH_MAP_INIT_INT64() before it in the header file, as doing that would require struct cache_t to be moved from bgzf.c to the new header as well. Instead, typedef kh_bgzf_cache_t is used in place of khash(bgzf_cache), and unsigned int instead of khint_t.
For bcf files, the header pointer hasn't always been passed into bcf_read(), especially when using iterators. As having it available would be useful for VCF 4.4+ support, this works around its absence by attaching a pointer to the header in BGZF private data, which was previously unused for vcf/bcf. It also adds reference counting to the header so that it can be cleaned up safely irrespective of whether hts_close() or bcf_hdr_destroy() was called first. To avoid ABI breakage, the reference count is stored in the bcf_hdr_aux_t struct.
BCF saved by versions of HTSlib before 1.22 will always store the first phasing bit as 0. For consistency with the VCF reader, update this bit when reading BCF so that is is set if all other phasing bits are also set.
Phasing should now be fixed up in bcf_read()/vcf_read(), so there's no need to try again in bcf_get_format_values().
By noting that we're only interested in the least-significant bit of each GT value, it's possible to reduce the number of branches in this function by doing bit manipulations on the first byte of each stored value. The common haploid and diploid cases are also specialised so the inner loop on ploidy can be avoided for those cases.
68657ef to
9b9a005
Compare
|
Rebased and tidied up... |
Fix for #1932
It was discussed earlier to keep the phasing data in VCF44 format internally as that easily resolves the issue mentioned (will necessitate some changes in bcftools as there are a few checks with v4.3 type phase values - in convert tests) with VCF checks.
Data needs a consistent binary representation, irrespective of VCF/BCF source, so the same conversion is needed for BCF data as well, from v4.x to v4.4 in terms of phasing values. This conversion is made on read, for bcf with version < v44. While writing, it is again converted that data is stored without the change made, if version < v44. This adds ~15% overhead it seems.
--removed the mention of a few changes that seemed required in bcftools, it worked fine without any such change with later checks--