Population and subspecies diversity at mouse centromeres

Arora et al. (2021)

Using publicly available whole-genome sequencing data from diverse mouse strains and a customized k-mer-based analysis, we comprehensively characterized multiple dimensions of mouse centromere variation.

This manuscript uses publicly available sequencing data from three projects:

Sanger mouse genomes project: ftp://ftp-mouse.sanger.ac.uk/current_bams/
Wild mouse genomes project: http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/
Mus caroli and Mus pahari genome assemblies: Repeat associated mechanisms of genome evolution and function revealed by the Mus caroli and Mus pahari genomes

This repository contains the code and figures from the manuscript. The analysis was performed on a high-performance computing cluster and uses scripts written in bash, R, and Python.

Scripts used in the analysis:

Downloading and processing publicly available data

Sanger mouse genomes project

Note: The Sanger Mouse Genomes Project sometimes provides multiple sequencing libraries per strain. To ensure that sequencing run did not confound the results, we first processed and analyzed the libraries separately; after confirming that there were no large differences between them, we combined them.

sanger_bam_library_identifiers.sh: Extract sequencing library information from the Sanger BAM files

sanger_split_libraries.sh: Split Sanger BAM files by sequencing library
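As a rough illustration of how reads can be assigned to sequencing libraries, the sketch below parses `@RG` header lines and the per-read `RG:Z:` tag defined by the SAM format. It is not the repository's shell script: the function names are hypothetical, and it assumes the BAM has already been converted to SAM text (for example with `samtools view -h`).

```python
from typing import Dict, Iterable, Optional


def read_group_libraries(header_lines: Iterable[str]) -> Dict[str, str]:
    """Map each read group ID to its library (LB tag) from SAM @RG header lines."""
    libraries = {}
    for line in header_lines:
        if not line.startswith("@RG"):
            continue
        # Header fields after the record type are TAG:VALUE pairs separated by tabs.
        tags = dict(
            field.split(":", 1)
            for field in line.rstrip("\n").split("\t")[1:]
            if ":" in field
        )
        if "ID" in tags:
            # "unknown" is a placeholder chosen for this sketch when LB is absent.
            libraries[tags["ID"]] = tags.get("LB", "unknown")
    return libraries


def library_of_read(sam_line: str, libraries: Dict[str, str]) -> Optional[str]:
    """Return the library of one alignment line via its RG:Z: tag, if present."""
    # Optional tags start after the 11 mandatory SAM columns.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("RG:Z:"):
            return libraries.get(field[5:])
    return None
```

Grouping alignment lines by the value returned from `library_of_read` would give one set of reads per library, which could then be analyzed separately before being combined.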

Wild mouse genomes project

wild_fastq.sh: Download wild mouse BAM files and convert them to FASTQ format

Mus caroli and Mus pahari genome assemblies

caroli_pahari_process_fastq.sh: Download M. caroli and M. pahari fastq reads, map them to the Mus musculus reference (mm10), remove optical duplicates, and convert the bam files back to fastq format before proceeding with the k-mer analysis.

Processing fastq data to make k-mer tables

kmer_composition.py: Python script that reads a fastq file and outputs a k-mer table, with the k-mer in the first column and its frequency of occurrence in the fastq file in the second column
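The k-mer table described above can be sketched as follows. This is a minimal illustration rather than the contents of kmer_composition.py: it assumes four-line FASTQ records, skips k-mers containing N, does not merge reverse complements, and writes a tab-separated table. All function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, TextIO


def read_fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line (second of every four lines) of each FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip().upper()


def count_kmers(sequences: Iterable[str], k: int = 31) -> Counter:
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            if "N" not in kmer:  # assumption: ambiguous k-mers are dropped
                counts[kmer] += 1
    return counts


def write_kmer_table(counts: Counter, handle: TextIO) -> None:
    """Write one 'k-mer<TAB>frequency' row per k-mer, sorted by k-mer."""
    for kmer, n in sorted(counts.items()):
        handle.write(f"{kmer}\t{n}\n")
```

With an open FASTQ file handle `fq`, `count_kmers(read_fastq_sequences(fq))` would produce the counts for 31-mers, which `write_kmer_table` can then write to disk.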

Mapping k-mers to centromere consensus sequence

k31txt.to.fastq.py: Convert k-mer table into fastq format for mapping to centromere consensus
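Because a FASTQ record needs a name and a quality string, converting a k-mer table for mapping means inventing both. The sketch below shows one way to do it, assuming a whitespace-separated table with the k-mer in the first column and its count in the second. The record naming scheme and the placeholder quality character are assumptions of this example, not taken from k31txt.to.fastq.py.

```python
from typing import Iterable, Iterator


def kmer_table_to_fastq(table_lines: Iterable[str], quality_char: str = "I") -> Iterator[str]:
    """Yield one four-line FASTQ record per row of a k-mer table."""
    for index, line in enumerate(table_lines):
        line = line.strip()
        if not line:
            continue  # skip blank rows
        kmer, count = line.split()[:2]
        # Keep the count in the read name so it can be recovered after mapping.
        name = f"kmer{index}_count{count}"
        # Uniform placeholder qualities: k-mers carry no base-quality information.
        yield f"@{name}\n{kmer}\n+\n{quality_char * len(kmer)}\n"
```

Storing the count in the read name is a design choice that lets the mapped SAM output be joined back to the frequencies without a separate lookup table.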

centromere_kmers.sh: Use bwa to map k-mers to the centromere consensus (the centromere consensus fasta file is given at the top of the script). Produces a sam file of mapped k-mers.

process_mapped_sam.R: Process the mapped k-mer sam file into a data frame

GC correction for copy number estimation

GCcontent.py: Subsets k-mers for those that occur only once in the reference genome and calculates their GC%

GC_calculation.R: calculate GC% of each k-mer in the table
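The two GC steps above (restricting to k-mers that occur once in the reference, and computing GC% per k-mer) can be sketched in a few lines. The function names are hypothetical, and the input is assumed to be a dictionary of k-mer counts in the reference genome rather than the file formats the repository scripts use.

```python
from typing import Dict


def gc_percent(kmer: str) -> float:
    """Return the percentage of G and C bases in a k-mer."""
    if not kmer:
        raise ValueError("k-mer must not be empty")
    kmer = kmer.upper()
    gc = sum(1 for base in kmer if base in "GC")
    return 100.0 * gc / len(kmer)


def single_copy_gc(reference_counts: Dict[str, int]) -> Dict[str, float]:
    """Keep k-mers seen exactly once in the reference and attach their GC%."""
    return {
        kmer: gc_percent(kmer)
        for kmer, n in reference_counts.items()
        if n == 1
    }
```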

GCLoess.R: Loess regression of k-mer count on GC content, fit to the subset of k-mers that occur only once in the mouse reference genome (mm10).

GCcorrection.R: Correct each sample's raw k-mer counts by GC Loess regression predicted count
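To make the correction step concrete, here is a simplified sketch. It deliberately swaps the Loess fit for a per-GC-bin mean of single-copy k-mer counts, which keeps the example dependency-free but is cruder than the regression used in the analysis. It also assumes that "correcting by" the predicted count means dividing the raw count by it. Function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Optional, Tuple


def fit_gc_expectation(single_copy: Iterable[Tuple[float, float]]) -> Dict[int, float]:
    """Stand-in for the Loess fit: mean count of single-copy k-mers per GC% bin."""
    by_gc = defaultdict(list)
    for gc, count in single_copy:
        by_gc[round(gc)].append(count)  # bin to the nearest whole percent
    return {gc: mean(counts) for gc, counts in by_gc.items()}


def gc_correct(raw_count: float, gc: float, expectation: Dict[int, float]) -> Optional[float]:
    """Divide a raw count by the count expected for a single-copy k-mer of the same GC%."""
    predicted = expectation.get(round(gc))
    if not predicted:
        return None  # no single-copy k-mers observed at this GC%
    return raw_count / predicted
```

Under these assumptions the corrected value reads as an approximate copy number: a centromere k-mer observed 120 times, in a GC bin where single-copy k-mers average 12 observations, comes out at about 10 copies.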

Using k-mers mapped to centromere consensus sequence to quantify polymorphisms

Consensus_Script.R: Calculate the nucleotide frequency at each position of the minor and major consensus sequences using each k-mer's frequency and its mapping position.
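The idea of turning mapped k-mers into per-position nucleotide frequencies can be illustrated as below. This is a sketch under simplifying assumptions, not a port of the R script: each k-mer is taken to align without gaps, in the orientation of the consensus, starting at a 0-based position, and every base of the k-mer contributes the k-mer's frequency to the position it covers. The function name and input layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def nucleotide_frequencies(
    mapped_kmers: Iterable[Tuple[str, int, float]],
    consensus_length: int,
) -> List[Dict[str, float]]:
    """Return, for each consensus position, the fraction of weight per nucleotide.

    mapped_kmers holds (k-mer sequence, 0-based start on consensus, frequency).
    """
    weights = [defaultdict(float) for _ in range(consensus_length)]
    for kmer, start, freq in mapped_kmers:
        for offset, base in enumerate(kmer):
            pos = start + offset
            if 0 <= pos < consensus_length:
                weights[pos][base] += freq
    frequencies = []
    for column in weights:
        total = sum(column.values())
        # Positions with no coverage get an empty dictionary.
        frequencies.append({b: w / total for b, w in column.items()} if total else {})
    return frequencies
```

For instance, if `ACG` maps at position 0 with frequency 3 and `ATG` maps at position 0 with frequency 1, position 1 gets C at 0.75 and T at 0.25, which flags a polymorphic site.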

Mapping reads to centromere consensus sequence to calculate the centromere diversity index

CentromereMapping.sh: Maps sequencing reads to centromere consensus sequence

CentromereMapped_LocationSplit.py: Split mapped reads by the location on the consensus sequence to which they map, and output a csv file compiling the reads that map to each position on the consensus sequence
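A minimal sketch of splitting mapped reads by their position on the consensus is shown below. It reads SAM text, uses the standard POS column (1-based leftmost mapping position) and the unmapped flag bit 0x4, and writes one csv row per position. The csv column names and the choice to join sequences with semicolons are assumptions of this example, not the output layout of CentromereMapped_LocationSplit.py.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List, TextIO


def group_reads_by_position(sam_lines: Iterable[str]) -> Dict[int, List[str]]:
    """Group read sequences by their 1-based leftmost mapping position."""
    by_position = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag, position = int(fields[1]), int(fields[3])
        if flag & 0x4:
            continue  # read is unmapped
        by_position[position].append(fields[9])
    return by_position


def write_position_csv(by_position: Dict[int, List[str]], handle: TextIO) -> None:
    """Write one row per consensus position with its read count and sequences."""
    writer = csv.writer(handle)
    writer.writerow(["position", "n_reads", "sequences"])
    for position in sorted(by_position):
        reads = by_position[position]
        writer.writerow([position, len(reads), ";".join(reads)])
```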

Files used for plotting

final.set1.k31.correctedcount.txt: contains centromere 31-mers with their read-count-normalized counts and GC-corrected counts.

Pi_Estimation_metadata.csv: contains Centromere Diversity Index (CDI) values for samples

