Description
I'm experiencing a puzzling issue while using join to filter out common variants. I wonder if I'm perhaps missing an environment variable or something that might explain and correct this behavior.
For a minimal reproducible example, my input file looks like this:
$ cat input.gor
CHROM POS ID REF ALT QUAL FILTER CALLER pn de_identified_subject GT AD AF DP GQ Allele
chr1 22375 chr1:22375 T C 10.61 PASS SNP SUBJECT_XYZ_G38 SUBJECT_XYZ 0/1 4,4 0.500 8 9 C
chr8 73281635 chr8:73281635 C G 50.0 PASS SNP SUBJECT_XYZ_G38 SUBJECT_XYZ 1|0 14,14 0.500 28 47 G
chrY 12666523 chrY:12666523 CT C 62.03 PASS SNP SUBJECT_XYZ_G38 SUBJECT_XYZ 1 0,4 1.000 4 62 C
The aim is to remove the chr8:73281635 C>G variant, which is relatively common (AF of approximately 0.5 in gnomAD). I have a file of common variants that looks like this:
$ head -5 common_variants.gor
CHROM POS STOP REF ALT _1 _2 AF
chr1 10067 10067 T TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC 2 119758 1.67003e-05
chr1 10108 10108 C CAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT 2 10090 0.000198216
chr1 10111 10111 C A 1 44330 2.25581e-05
chr1 10114 10115 TA T 4 22240 0.000179856
...
(The entire file is ~15 GB.)
Indeed, the variant above is in that file:
$ grep -P "chr8\t73281635\t73281635\tC\tG" common_variants.gor
chr8 73281635 73281635 C G 59729 119550 0.499615
However, if I run gorpipe with the following command, the chr8 variant remains in the output:
<path to software>/gor-5.7.0/gor/gorscripts/build/install/gorscripts/bin/gorpipe 'gor input.gor | sort genome | join -n -snpsnp common_variants.gor -rprefix common -xl REF,Allele -xr REF,ALT' > with_full_variants.gor
Now, if I filter my large file of common variants to only those on chr8:
awk 'NR==1 || /^chr8/' common_variants.gor > chr8.gor
Then run the same command with this chr8.gor:
<path to software>/gor-5.7.0/gor/gorscripts/build/install/gorscripts/bin/gorpipe 'gor input.gor | sort genome | join -n -snpsnp chr8.gor -rprefix common -xl REF,Allele -xr REF,ALT' > with_chr8_variants.gor
then it works as expected, and the chr8 variant is removed.
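The difference between the two outputs can be confirmed with a small check like the one below. This is only a sketch: the helper name `has_variant` is hypothetical, and it assumes the output files are tab-delimited with the chromosome in column 1 and the position in column 2, as in input.gor.

```shell
# Return success (0) if a tab-delimited GOR file has a row at the given CHROM/POS.
# Assumption: column 1 is the chromosome and column 2 the position.
has_variant() {
    awk -F'\t' -v c="$2" -v p="$3" \
        '$1 == c && $2 == p { found = 1; exit } END { exit !found }' "$1"
}

# Report on both outputs produced above; files that do not exist are skipped.
for f in with_full_variants.gor with_chr8_variants.gor; do
    [ -f "$f" ] || continue
    if has_variant "$f" chr8 73281635; then
        echo "$f: chr8:73281635 still present"
    else
        echo "$f: chr8:73281635 removed"
    fi
done
```

Given the behavior described above, I would expect this to report the variant as still present in with_full_variants.gor and removed in with_chr8_variants.gor.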
I tried pointing gorpipe at a dedicated tmp directory, in case the large common_variants.gor file was causing issues, by setting: export GOR_GORPIPE_OPTS="-Djava.io.tmpdir=<path to tmp>/gor_tmp" but that did not change the result. I did not receive any warnings. The join logic clearly works with the smaller file, so I suspect a secondary issue related to file size.
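One thing I have not ruled out is row ordering, which could look like a file-size problem. The check below is a rough sketch and not a confirmed diagnosis. The helper name `check_order` is hypothetical, and it assumes tab-delimited input with a single header row. It does not assume any particular chromosome order. It only verifies that each chromosome forms one contiguous block and that positions never decrease within a block, then prints the chromosome order it saw.

```shell
# Check that a tab-delimited GOR-style file is grouped by chromosome and
# position-sorted within each chromosome. Exits non-zero if a problem is found.
# Assumption: line 1 is a header, column 1 is CHROM, column 2 is POS.
check_order() {
    awk -F'\t' '
        NR == 1 { next }                                  # skip the header row
        $1 != chrom {                                     # a new chromosome block starts
            if ($1 in seen) {
                printf "chromosome %s reappears at line %d\n", $1, NR
                bad = 1
            }
            seen[$1] = 1
            chrom = $1
            last = -1
            order = order (order == "" ? "" : " ") $1
        }
        ($2 + 0) < last {                                 # position went backwards
            printf "position decreases at line %d (%s:%s)\n", NR, $1, $2
            bad = 1
        }
        { last = $2 + 0 }
        END { print "chromosome order: " order; exit bad }
    ' "$1"
}
```

Running `check_order common_variants.gor` and `check_order chr8.gor` would show whether the full file and the chr8 subset differ in this respect. It is a single pass, so it should be usable on the full file, though it may take a while.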