Generates a monomeric annotation of higher-order repeats (HORs) of human alpha satellites.
The pipeline requires the following components:
- awk, sed, and other standard Unix command-line programs;
- nhmmer from HMMER;
- HMM profiles of alpha satellite HOR monomers, for example AS-SF-HORs-SF1-divergent-hmmer3.0.hmm;
- bedmap;
- hmmertblout2bed.
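Before running the pipeline, it can help to confirm that the tools listed above are reachable. The following is a minimal sketch, not part of the pipeline itself; the function name check_tools is hypothetical, and it assumes hmmertblout2bed is on PATH rather than referenced by an explicit path.

```shell
#!/usr/bin/env bash
# Print every command from the argument list that is not found on PATH.
# Returns non-zero if at least one command is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Assumption: hmmertblout2bed is installed on PATH. If the scripts call it
# by an explicit path instead, check that path with `test -x`.
if ! missing=$(check_tools awk sed nhmmer bedmap hmmertblout2bed); then
    echo "missing from PATH:" >&2
    echo "$missing" >&2
fi
```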
The input is any sequence in FASTA format in which you would like to analyze alpha satellite HORs:
- genomic assemblies or long contigs — to view structural HOR variants, or
- short read runs from SRA — to count monomers.
To run this job with Slurm, first modify any required Slurm settings in the associated slurm file.
Next, you will probably need to change the match pattern files=( $(find "$dir" -name "*.fasta" -print) )
in both hmmer-array.slurm and slurm-hmmer-submit.sh, as well as the path to the hmmertblout2bed script.
Then run sbatch with the Slurm submission script:
sbatch slurm-hmmer-submit.sh /path/to/input/directory /path/and/name/of/the/profile.hmm
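As an illustration of changing the match pattern, the sketch below widens it to several FASTA extensions. The extensions .fa and .fna and the function name collect_fasta are assumptions chosen for the example; adjust them to your data. This is one possible variant, not the pattern used in the shipped scripts.

```shell
#!/usr/bin/env bash
# List input files under a directory that have one of several FASTA extensions.
# Assumption: .fasta, .fa and .fna are example extensions only.
collect_fasta() {
    local dir=$1
    find "$dir" -type f \( -name "*.fasta" -o -name "*.fa" -o -name "*.fna" \) -print | LC_ALL=C sort
}

dir=${1:-.}

# Reading line by line keeps file names containing spaces intact,
# unlike the word-splitting form files=( $(find ...) ).
files=()
while IFS= read -r f; do
    files+=("$f")
done < <(collect_fasta "$dir")

echo "found ${#files[@]} input file(s) in $dir" >&2
```

If you change the pattern, apply the same change in every script that contains it, so that the submit script and the array script see the same list of files.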
To run this job without Slurm, you will probably need to change the match pattern
files=( $(find "$dir" -name "*.fasta" -print) ) in hmmer-run.sh, as well as the path
to the hmmertblout2bed script. Then run the script:
hmmer-run.sh /path/to/input/directory /path/and/name/of/the/profile.hmm
The pipeline produces a tab-delimited BED file containing one feature of interest per line. If you are processing a genome assembly, the annotation can be viewed as a UCSC Genome Browser custom track, for example SF1 Alpha Satellite HORs in hg38.
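To load a BED file as a custom track, the UCSC Genome Browser accepts an optional track line at the top of the file. The sketch below prepends one; the function name with_track_line and the track name are hypothetical, and it assumes the BED file does not already start with a track line.

```shell
#!/usr/bin/env bash
# Write a UCSC track line followed by the contents of a BED file to stdout.
# Assumption: the BED file has no track line yet.
with_track_line() {
    local name=$1 bed=$2
    printf 'track name="%s" description="%s"\n' "$name" "$name"
    cat "$bed"
}

# Example (hypothetical file names):
#   with_track_line "AS-HOR annotation" output.bed > output.track.bed
```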
- For long contigs and assemblies, you can generate reports with Structural Variants of Alpha Satellite Higher-Order Repeats via SVASHOR.
- For short reads, you can collect stats from BED files via bed-HOR-stats.
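For a quick look at monomer counts without additional tools, a plain awk tally over the BED output may be enough. This is a minimal sketch and not how bed-HOR-stats works; the function name count_by_name is hypothetical, and it assumes the monomer name is in column 4, as in standard BED.

```shell
#!/usr/bin/env bash
# Count BED features per name and print "name<TAB>count", sorted by name.
# Assumption: the monomer name is in column 4 (standard BED layout).
# Track, browser and comment lines are skipped.
count_by_name() {
    awk -F'\t' '
        !/^(track|browser|#)/ && NF >= 4 { n[$4]++ }
        END { for (k in n) printf "%s\t%d\n", k, n[k] }
    ' "$@" | LC_ALL=C sort
}

# Run on files given on the command line, if any.
if [ "$#" -gt 0 ]; then
    count_by_name "$@"
fi
```

For example, count_by_name sample1.bed sample2.bed would tally both files together; the file names here are placeholders.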
