The goal of HiCParser is to parse Hi-C data (HiCParser supports
serveral formats), and import them in R, as an InteractionSet object.
Get the latest stable R release from
CRAN. Then install HiCParser from
Bioconductor using the following code:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("HiCParser")And the development version from GitHub with:
BiocManager::install("emaigne/HiCParser")Then load the package :
library("HiCParser")So far, HiCParser supports:
-
cool and mcool formats
-
hic format
-
HiC-Pro format
-
A tabular format, where
- the first column is named “chromosome”
- the second column is named “position 1” or “position.1”
- the third column is named “position 2” or “position.2”
- the fourth column is named “x.Ry”, and x is the id of the condition (“1”, or “2”, usually), y is the id of the replicate (“1”, “2”, “3”, etc.); it should contain matrix counts
- the remaining columns are optional, and should be formatted like the fourth column
We show here how to parse one hic format file.
hicFilePath <- system.file("extdata", "hicsample_21.hic", package = "HiCParser")
data <- parseHiC(
paths = hicFilePath,
binSize = 5000000,
conditions = 1,
replicates = 1
)Note that a hic file can include several matrices, with different bin sizes. This is why the bin size should be provided.
We show here how to parse several files (actually, the same file, several times). We suppose here that we have 2 conditions, with 3 replicates for each condition.
data <- parseHiC(
paths = rep(hicFilePath, 6),
binSize = 5000000,
conditions = rep(seq(2), each = 3),
replicates = rep(seq(3), 2)
)Currently, HiCParser supports the hic format up to the version 9.
A HiC-Pro file contains a matrix file, and a bed file. A different bed file could be use for each matrix file, but the same can also be used.
matrixFilePath <-
system.file("extdata", "hicsample_21.matrix", package = "HiCParser")
bedFilePath <-
system.file("extdata", "hicsample_21.bed", package = "HiCParser")
data <- parseHiCPro(
matrixPaths = rep(matrixFilePath, 6),
bedPaths = bedFilePath,
conditions = rep(seq(2), each = 3),
replicates = rep(seq(3), 2)
)Please note that the cool and mcool format store data in HDF5 format.
The HDF5
package
is not included by default, because it requires a substantial time to be
compiled, and many users will not need the cool/mcool parser. So, in
order to use the cool/mcool parser, you should install the rhdf5
package.
The cool format include only one bin size.
if (!"rhdf5" %in% installed.packages()) {
install.packages("rhdf5")
}
coolFilePath <- system.file("extdata",
"hicsample_21.cool",
package = "HiCParser"
)
data <- parseCool(
paths = rep(coolFilePath, 6),
conditions = rep(seq(2), each = 3),
replicates = rep(seq(3), 2)
)The mcool format may include several bin sizes. It is thus compulsory to mention it. The same function is used for the cool/mcool formats.
mcoolFilePath <- system.file("extdata",
"hicsample_21.mcool",
package = "HiCParser"
)
data <- parseCool(
paths = rep(mcoolFilePath, 6),
binSize = 5000000,
conditions = rep(seq(2), each = 3),
replicates = rep(seq(3), 2)
)A tabular file is a tab-separated multi-replicate sparse matrix with a header:
chromosome position 1 position 2 C1.R1 C1.R2 C1.R3 ...
Y 1500000 7500000 145 184 72 ...
The number of interactions between position 1 and position 2 of
chromosome are reported in each condition.replicate column. There is
no limit to the number of conditions and replicates.
To load Hi-C data in this format:
hic.experiment <- parseTabular(
system.file("extdata",
"hicsample_21.tsv",
package = "HiCParser"
),
sep = "\t"
)The output is a InteractionSet. This object can store one or several samples. Please read the corresponding vignette in order to known more about this format.
library("HiCParser")
hicFilePath <- system.file("extdata", "hicsample_21.hic", package = "HiCParser")
hic.experiment <- parseHiC(
paths = rep(hicFilePath, 6),
binSize = 5000000,
conditions = rep(seq(2), each = 3),
replicates = rep(seq(3), 2)
)
#>
#> Parsing '/tmp/RtmpHFfOT6/temp_libpathc52a1c587e54/HiCParser/extdata/hicsample_21.hic'.
#>
#> Parsing '/tmp/RtmpHFfOT6/temp_libpathc52a1c587e54/HiCParser/extdata/hicsample_21.hic'.
#>
#> Parsing '/tmp/RtmpHFfOT6/temp_libpathc52a1c587e54/HiCParser/extdata/hicsample_21.hic'.
#>
#> Parsing '/tmp/RtmpHFfOT6/temp_libpathc52a1c587e54/HiCParser/extdata/hicsample_21.hic'.
#>
#> Parsing '/tmp/RtmpHFfOT6/temp_libpathc52a1c587e54/HiCParser/extdata/hicsample_21.hic'.
#>
#> Parsing '/tmp/RtmpHFfOT6/temp_libpathc52a1c587e54/HiCParser/extdata/hicsample_21.hic'.
hic.experiment
#> class: InteractionSet
#> dim: 44 6
#> metadata(0):
#> assays(1): ''
#> rownames: NULL
#> rowData names(1): chromosome
#> colnames: NULL
#> colData names(2): condition replicate
#> type: StrictGInteractions
#> regions: 9The conditions and replicates are reported in the colData slot :
SummarizedExperiment::colData(hic.experiment)
#> DataFrame with 6 rows and 2 columns
#> condition replicate
#> <integer> <integer>
#> 1 1 1
#> 2 1 2
#> 3 1 3
#> 4 2 1
#> 5 2 2
#> 6 2 3They corresponds to columns of the assays matrix (containing
interactions values):
head(SummarizedExperiment::assay(hic.experiment))
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 79 79 79 79 79 79
#> [2,] 22 22 22 22 22 22
#> [3,] 3 3 3 3 3 3
#> [4,] 1 1 1 1 1 1
#> [5,] 1 1 1 1 1 1
#> [6,] 2 2 2 2 2 2The positions of interactions are in the interactions slot of the
object:
InteractionSet::interactions(hic.experiment)
#> StrictGInteractions object with 44 interactions and 1 metadata column:
#> seqnames1 ranges1 seqnames2 ranges2 | chromosome
#> <Rle> <IRanges> <Rle> <IRanges> | <Rle>
#> [1] 21 5000001-10000000 --- 21 5000001-10000000 | 21
#> [2] 21 5000001-10000000 --- 21 10000001-15000000 | 21
#> [3] 21 5000001-10000000 --- 21 15000001-20000000 | 21
#> [4] 21 5000001-10000000 --- 21 20000001-25000000 | 21
#> [5] 21 5000001-10000000 --- 21 25000001-30000000 | 21
#> ... ... ... ... ... ... . ...
#> [40] 21 35000001-40000000 --- 21 40000001-45000000 | 21
#> [41] 21 35000001-40000000 --- 21 45000001-50000000 | 21
#> [42] 21 40000001-45000000 --- 21 40000001-45000000 | 21
#> [43] 21 40000001-45000000 --- 21 45000001-50000000 | 21
#> [44] 21 45000001-50000000 --- 21 45000001-50000000 | 21
#> -------
#> regions: 9 ranges and 1 metadata column
#> seqinfo: 1 sequence from an unspecified genome; no seqlengthsBelow is the citation output from using citation('HiCParser') in R.
Please run this yourself to check for any updates on how to cite
HiCParser.
To cite the ‘HiCParser’ HiCParser in a publication, use :
Maigné E, Zytnicki M (2024). A multiple format Hi-C data parser. doi:10.18129/B9.bioc.HiCParser https://doi.org/10.18129/B9.bioc.HiCParser, https://github.com/emaigne/HiCParser/HiCParser - R package version 0.1.0, http://www.bioconductor.org/packages/HiCParser.
As a BibTeX entry :
@Manual{hicparser,
title = {A multiple format Hi-C data parser},
author = {Elise Maigné and Matthias Zytnicki},
year = {2024},
url = {http://www.bioconductor.org/packages/HiCParser},
note = {https://github.com/emaigne/HiCParser/HiCParser - R package version 0.1.0},
doi = {10.18129/B9.bioc.HiCParser},
}
Please note that the HiCParser was only made possible thanks to many
other R and bioinformatics software authors, which are cited either in
the vignettes and/or the paper(s) describing this package.
Please note that the HiCParser project is released with a Contributor
Code of Conduct. By
contributing to this project, you agree to abide by its terms.
- Continuous code testing is possible thanks to GitHub actions through usethis, remotes, and rcmdcheck customized to use Bioconductor’s docker containers and BiocCheck.
- Code coverage assessment is possible thanks to codecov and covr.
- The documentation website is automatically updated thanks to pkgdown.
- The code is styled automatically thanks to styler.
- The documentation is formatted thanks to devtools and roxygen2.
For more details, check the dev directory.
This package was developed using biocthis.