Description
We are transitioning from our custom sacCer3_cegr reference genome to the standard sacCer3 genome from SGD for yeast analysis. The standard genome names chromosomes with Roman numerals, while our custom reference genome uses Arabic numerals. This tool will help users with the transition by converting any file back and forth between the two chromosome naming systems.
Arabic --> Roman
chr1 --> chrI
chr2 --> chrII
...
chr16 --> chrXVI
chrM --> chrmt
Roman --> Arabic
...
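As a rough illustration of the two lookup tables, here is a minimal Python sketch. It assumes the elided entries follow ordinary Roman numerals for chromosomes 1 through 16, and that the Roman --> Arabic direction is the exact inverse of the Arabic --> Roman one. The names `ROMAN`, `ARABIC_TO_ROMAN`, and `ROMAN_TO_ARABIC` are hypothetical and not part of any existing tool.

```python
# Roman numerals for the 16 nuclear chromosomes (assumed standard numerals).
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII",
         "IX", "X", "XI", "XII", "XIII", "XIV", "XV", "XVI"]

# chr1 -> chrI, chr2 -> chrII, ..., chr16 -> chrXVI
ARABIC_TO_ROMAN = {f"chr{i}": f"chr{r}" for i, r in enumerate(ROMAN, start=1)}

# Mitochondrial chromosome, as listed in this issue.
ARABIC_TO_ROMAN["chrM"] = "chrmt"

# Assumption: the reverse direction is simply the inverse mapping.
ROMAN_TO_ARABIC = {roman: arabic for arabic, roman in ARABIC_TO_ROMAN.items()}
```

Because the mapping is one-to-one, inverting the dictionary is enough to get the second table and keeps both directions in sync.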
- Should the user be expected to specify the format of the input file to be converted? (GFF/BED/BAM/TAB)
- A user option to specify a custom delimiter may be useful. Should this feature be added?
GFF/BED/BAM/TAB-formatted files can be converted using a HashMap on each of the tokens. This assumes all instances of chromosome names occupy their own column. However, some file formats have a comments column that can contain chromosome information, like interaction info with coordinates on other chromosomes. These instances of chromosome names do not exist as their own token.
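The per-token approach might look like the following Python sketch, where a `dict` plays the role of the HashMap. The function name `convert_tokens`, its parameters, and the three-entry example mapping are all hypothetical. The sketch only handles text lines, so a binary format such as BAM would need to be decoded first.

```python
def convert_tokens(line, mapping, delimiter="\t"):
    """Replace each delimiter-separated token that is exactly a chromosome name.

    Tokens that merely contain a chromosome name (for example inside a
    comments column) are left untouched.
    """
    tokens = line.rstrip("\n").split(delimiter)
    return delimiter.join(mapping.get(token, token) for token in tokens)


# Small illustrative mapping (Arabic -> Roman); a real table would hold all entries.
example_mapping = {"chr1": "chrI", "chr2": "chrII", "chrM": "chrmt"}

converted = convert_tokens("chr2\t100\t200\tinteracts=chr1:500-600", example_mapping)
```

In this example the first column becomes `chrII`, while the `chr1` embedded in the last column is not converted, which is the limitation described above.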
- Should we simply implement a global replace for each line? Keep in mind that order of conversions is important if a global replace is done (for example, chrII needs to be replaced before chrI). There may be some edge cases that are mis-converted if we do a simple global replace.
- It might be useful for the user to optionally indicate if they wish for the information in the comments column to also be converted. (GFF col9, BED col7+, maybe indicate certain column range of TAB-format file)
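One possible way to sidestep the ordering problem is to do the whole-line replacement in a single regular-expression pass instead of a series of chained replacements. The Python sketch below is only an illustration under stated assumptions: `build_converter` is a hypothetical helper, and the rule that a chromosome name must not be adjacent to a letter, digit, or underscore is an assumed heuristic that may not cover every edge case.

```python
import re


def build_converter(mapping):
    """Return a function that converts every standalone chromosome name in a string."""
    # Longest names first, so chrIII is tried before chrII and chrI.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])("
        + "|".join(re.escape(name) for name in names)
        + r")(?![A-Za-z0-9_])"
    )

    def convert(text):
        # Each match is replaced once, so converted output is never re-scanned.
        return pattern.sub(lambda match: mapping[match.group(1)], text)

    return convert


# Small illustrative mapping (Roman -> Arabic); a real table would hold all entries.
roman_to_arabic = {"chrI": "chr1", "chrII": "chr2", "chrIII": "chr3"}
to_arabic = build_converter(roman_to_arabic)

line_out = to_arabic("chrII\t5\t9\tlink=chrIII:40-50;chrI")
```

Here the names embedded in the last column are converted along with the first column. To make comment-column conversion optional, the same `convert` function could be applied only to the columns the user selects, with the per-token lookup used for the rest.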