% diffHicUG.Rnw
\documentclass{report}
<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex(relative.path=TRUE)
@
\bioctitle[\Biocpkg{diffHic} User's Guide]{\Biocpkg{diffHic}: Differential analysis of Hi-C data \\ User's Guide}
%% also: \bioctitle{Title used for both header and title page}
%% or... \title{Title used for both header and title page}
\author[1]{Aaron T. L. Lun}
\affil[1]{The Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia}
\usepackage{framed,color}
\newenvironment{combox}
{ \definecolor{shadecolor}{RGB}{255, 240, 240} \begin{shaded}\begin{center}\begin{minipage}[t]{0.95\textwidth} }
{ \end{minipage}\end{center}\end{shaded} \definecolor{shadecolor}{RGB}{240,240,240} }
\usepackage{bibentry}
\nobibliography*
\begin{document}
\maketitle
\begin{abstract}
This document contains instructions on how to use \Biocpkg{diffHic} for differential interaction analyses of Hi-C data.
It covers counting of read pairs into bin pairs, filtering out low-abundance bin pairs, normalization of sample-specific trended biases,
variance modelling and hypothesis testing, and summarization and visualization of bin pairs.
\vspace{0.05in}
\textit{First edition:} 12 December 2012 \\
\textit{Last revised:} 10 October 2017 \\
\textit{Last compiled:} \Sexpr{format(Sys.Date(), "%d %B %Y")}
\end{abstract}
\packageVersion{\Sexpr{BiocStyle::pkg_ver("diffHic")}}
\tableofcontents
\newpage
<<results="hide",echo=FALSE>>=
picdir <- "plots-ug/"
if (file.exists(picdir)) { unlink(picdir, recursive=TRUE) }
dir.create(picdir)
knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
knitr::opts_chunk$set(fig.path=picdir, dev='png', dpi=150)
@
\chapter{Introduction}
\section{Scope}
This document describes the differential analysis of Hi-C data with the \Biocpkg{diffHic} package.
Differential interactions (DIs) are defined as interactions with significant changes in intensity between experimental conditions.
DIs are identified in a statistically rigorous manner using methods in the \Biocpkg{edgeR} package \cite{edgeR}.
Knowledge of \Biocpkg{edgeR} is useful but is not necessary for this guide.
\section{How to get help}
Most questions about individual functions should be answered by the documentation.
For example, if you want to know more about \Rfunction{preparePairs}, you can bring up the documentation by typing \Rcode{?preparePairs} or \Rcode{help(preparePairs)} at the \R{} prompt.
Further detail on the methods or the theory can be found in the references at the bottom of each help page.
The entire \Biocpkg{diffHic} pipeline is also described in its own publication \cite{lun2015diffhic}.
The authors of the package always appreciate receiving reports of bugs in the package functions or in the documentation.
The same goes for well-considered suggestions for improvements.
Other questions about how to use \Biocpkg{diffHic} are best sent to the Bioconductor support site at \url{https://support.bioconductor.org}.
Please send requests for general assistance and advice to the support site, rather than to the individual authors.
Users posting to the support site for the first time may find it helpful to read the posting guide at \url{http://www.bioconductor.org/help/support/posting-guide}.
\section{A brief description of Hi-C}
The Hi-C protocol \cite{lieberman2009comprehensive} is used to study chromatin organization by identifying pairwise interactions between two distinct genomic loci.
Briefly, chromatin is cross-linked and digested with a restriction enzyme.
This releases chromatin complexes into solution, where each complex contains multiple restriction fragments corresponding to interacting loci.
Overhangs are filled in with biotin-labelled nucleotides to form blunt ends.
Proximity ligation is then performed, which favors ligation between blunt ends in the same complex.
The ligated DNA is sonicated and the sheared fragments containing ligation junctions are purified by a streptavidin pulldown.
Paired-end sequencing is performed on the purified ligation products, and pairs of interacting loci are identified by mapping the paired reads.
Of course, some caution is required due to the presence of non-specific ligation between blunt ends in different complexes.
\section{Quick start}
A typical differential analysis of Hi-C data is described below.
For simplicity, assume that the BAM files have already been processed into ``index'' files in \Robject{input}.
Let \Robject{design} contain the design matrix for this experiment.
Also assume that the boundaries of the relevant restriction fragments are present in \Robject{fragments}.
The code itself is split across several steps:
<<echo=FALSE>>=
input <- c("merged_flox_1.h5", "merged_flox_2.h5", "merged_ko_1.h5", "merged_ko_2.h5")
fragments <- readRDS("mm10-hindIII.rds")
design <- model.matrix(~factor(c("flox", "flox", "ko", "ko")))
@
% saveRDS(fragments, file="mm10-hindIII.rds")
\begin{enumerate}
\item Converting BAM files to index files.
\item Counting read pairs into pairs of genomic bins.
<<>>=
library(diffHic)
param <- pairParam(fragments=fragments)
data <- squareCounts(input, width=1e6, param=param)
@
\item Filtering out uninteresting bin pairs.
<<>>=
library(edgeR)
keep <- aveLogCPM(asDGEList(data)) > 0
data <- data[keep,]
@
\item Normalizing for sample-specific biases.
<<>>=
data <- normOffsets(data, type="loess", se.out=TRUE)
y <- asDGEList(data)
@
\item Modelling biological variability between replicates.
<<>>=
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust=TRUE)
@
\item Testing for significant differences between groups.
<<>>=
result <- glmQLFTest(fit)
clusters <- diClusters(data, result$table, target=0.05,
    cluster.args=list(tol=1))
@
\end{enumerate}
The examples in this guide use data from three studies.
The first dataset examines the chromatin structure in K562 and GM06990 cell lines \cite{lieberman2009comprehensive}.
The second compares interaction intensities between wild-type and cohesin-deficient murine neural stem cells \cite{sofueva2013cohesin}.
The final study compares ERG-overexpressing RWPE1 cells with a GFP-expressing control \cite{rickman2012oncogene}.
Obviously, readers will have to modify the code for their own analyses.
\chapter{Preparing index files from BAM files}
\label{chap:prep}
\begin{combox}
This box will appear at the start of each chapter, describing the objects required from previous chapters.
As we're starting out here, we don't really need anything.
\end{combox}
\section{A comment on aligning Hi-C libraries}
\label{sec:align}
In a typical Hi-C library, sequencing will occasionally be performed over the ligation junction between two restriction fragments.
This forms a chimeric read that contains sequences from two distinct genomic loci.
Here, correct alignment of the 5$'$ end of the chimeric read is more important than that of the 3$'$ end.
This is because the location of the 3$'$ end is already provided by mapping the 5$'$ end of the mate read.
Direct application of alignment software is not optimal, as only one mapping location will usually be reported for each read.
This means that the 5$'$ location will be ignored if the 3$'$ alignment is superior, e.g., longer or with fewer mismatches.
Instead, chimeric alignment can be performed with approaches like iterative mapping \cite{imakaev2012iterative} or read splitting \cite{seitan2013cohesin}.
In the latter, we define the ligation signature as the sequence that is obtained after ligation between blunt ends derived from cleaved restriction sites.
For example, the \textit{Hin}dIII enzyme cleaves at \Rcode{AAGCTT} with a 4 bp overhang.
This yields a signature sequence of \Rcode{AAGCTAGCTT} upon ligation of blunt ends.
The ligation signature in each chimeric read is identified with \software{cutadapt} \cite{martin2011cutadapt}, and the read is split into two segments at the center of the signature.
Each segment is then aligned separately to the reference genome using \software{Bowtie2} \cite{langmead2012bowtie}.
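The construction of the signature can be sketched in \R{} for any palindromic recognition site with a central 5$'$ overhang; the \Robject{site} and \Robject{overhang} values below are those of \textit{Hin}dIII, and this snippet is purely illustrative rather than part of the \Biocpkg{diffHic} API.
<<>>=
# Illustrative only: deriving the ligation signature from the
# recognition site and the length of its 5' overhang.
site <- "AAGCTT"
overhang <- 4
flank <- (nchar(site) - overhang)/2 # bases flanking the overhang on each side
left <- substr(site, 1, flank + overhang) # filled-in blunt end, "AAGCT"
right <- substr(site, flank + 1, nchar(site)) # partner blunt end, "AGCTT"
paste0(left, right) # ligation signature, "AAGCTAGCTT"
@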
Mapping by read splitting is performed in \Biocpkg{diffHic} using a custom Python script:
<<results='hide'>>=
system.file("python", "presplit_map.py", package="diffHic", mustWork=TRUE)
@
Users are strongly recommended to synchronize mate pair information and to mark duplicate read pairs in the resulting BAM file.
This can be achieved using the various tools in the \software{Picard} software suite (\url{http://broadinstitute.github.io/picard}).
\section{Matching mapped reads to restriction fragments}
The Hi-C protocol is based on ligation between restriction fragments.
Sequencing of the ligation product is performed to identify the interacting loci -- or, more precisely, the two restriction fragments containing the interacting loci.
The resolution of Hi-C data is inherently limited by the frequency of restriction sites and the size of the restriction fragments.
Thus, it makes sense to report the read alignment location in terms of the restriction fragment to which that read was mapped.
The boundaries of each restriction fragment can be obtained with the \Rfunction{cutGenome} function, as shown below for the human genome after digestion with the \textit{Hin}dIII restriction enzyme (recognition site of \Rcode{AAGCTT}, 5$'$ overhang of 4 bp).
<<>>=
library(BSgenome.Hsapiens.UCSC.hg19)
hs.frag <- cutGenome(BSgenome.Hsapiens.UCSC.hg19, "AAGCTT", 4)
hs.frag
@
These fragments should be stored in a \Rclass{pairParam} object.
The constructor below checks that the fragment ranges are properly ordered.
Later, this object will also hold other parameters for counting.
This simplifies coordination of the various steps in the \Biocpkg{diffHic} pipeline, as the same \Rclass{pairParam} object can be easily passed between different functions.
<<>>=
hs.param <- pairParam(hs.frag)
hs.param
@
The \Rfunction{preparePairs} function matches the mapping location of each read to a restriction fragment in the reference genome.
Mapping results for paired reads should be provided in a name-sorted BAM file.
The function converts the read position into an index, pointing to the matching restriction fragment in \Robject{hs.frag}.
The resulting pairs of indices are stored in an index file using the HDF5 format.
The fragments to which the reads mapped are referred to as ``anchors'' -- the larger index is defined as the first anchor fragment whereas the smaller is the second.
This process is demonstrated using Hi-C data generated from GM06990 cells.
<<>>=
diagnostics <- preparePairs("SRR027957.bam", hs.param, file="SRR027957.h5",
    dedup=TRUE, minq=10)
names(diagnostics)
@
The function itself returns a list of diagnostics showing the number of read pairs that are lost for various reasons.
Of particular note is the removal of reads that are potential PCR duplicates with \Rcode{dedup=TRUE}.
This requires marking of the reads beforehand using an appropriate program such as \software{MarkDuplicates} from the \software{Picard} suite.
Filtering on the minimum mapping quality score with \Rcode{minq} is also recommended to remove spurious alignments.
<<>>=
diagnostics$pairs
@
Read pairs mapping to the same restriction fragment provide little information on interactions between fragments \cite{belton2012hic}.
Dangling ends are inward-facing read pairs that are mapped to the same fragment.
These are uninformative as they are usually formed from sequencing of the restriction fragment prior to ligation.
Self-circles are outward-facing read pairs that are formed when two ends of the same restriction fragment ligate to one another.
Interactions within a fragment cannot be easily distinguished from these self-circularization events.
Both structures are removed to avoid confusion in downstream steps.
<<>>=
diagnostics$same.id
@
\section{Processing of chimeric reads}
In the BAM file, chimeric reads are represented by two separate alignments for each read.
Hard clipping is used to denote the length trimmed from each sequence in each alignment, and to determine which alignment corresponds to the 5$'$ or 3$'$ end of the read.
Only the 5$'$ end(s) is used to determine the restriction fragment index for that read pair.
The total number of chimeric read pairs will be reported by \Rfunction{preparePairs}, along with the number of pairs where both 5$'$ ends are mapped (\Rcode{mapped}) and the number where both 5$'$ ends and at least one 3$'$ end are mapped (\Rcode{multi}).
Of course, the function will also work if the mapping location is only given for the 5$'$ end, though chimeric statistics will not be properly computed.
<<>>=
diagnostics$chimeras
@
The proportion of invalid chimeric pairs is also calculated by \Rfunction{preparePairs}.
Invalid pairs are those where the 3$'$ location of a chimeric read disagrees with the 5$'$ location of the mate.
The invalid proportion can be used as an empirical measure of the mapping error rate -- or, at least, an upper bound thereof, given that error rates are likely to be lower for longer, non-chimeric alignments.
High error rates may be indicative of a fault in the mapping pipeline.
<<>>=
diagnostics$chimeras[["invalid"]]/diagnostics$chimeras[["multi"]]
@
Invalid chimeric pairs can be discarded by setting \Rcode{ichim=FALSE} in \Rfunction{preparePairs}.
However, this is not recommended for routine analyses.
Mapping errors for short 3$'$ ends may result in apparent invalidity and loss of the (otherwise correct) 5$'$ alignments.
\section{Filtering artifactual read pairs}
\subsection{Reprocessing index files for quality control}
The \Rfunction{prunePairs} function removes read pairs that correspond to artifacts in the Hi-C procedure.
The returned vector contains the number of read pairs removed for each type of artifact.
Values of \Rcode{length}, \Rcode{inward} and \Rcode{outward} correspond to removal by \Rcode{max.frag}, \Rcode{min.inward} and \Rcode{min.outward}, respectively.
Retained read pairs are stored in another index file for later use.
<<>>=
min.inward <- 1000
min.outward <- 25000
prunePairs("SRR027957.h5", hs.param, file.out="SRR027957_trimmed.h5",
    max.frag=600, min.inward=min.inward, min.outward=min.outward)
@
The \Rcode{max.frag} argument removes read pairs where the inferred length of the sequencing fragment (i.e., the ligation product) is greater than a specified value.
The length of the sequencing fragment is computed by summing, for both reads, the distance between the mapping location of the 5$'$ end and the nearest restriction site on the 3$'$ side.
Excessively large lengths are indicative of off-site cleavage, i.e., where the restriction enzyme or some other agent cuts the DNA at a location other than the restriction site.
While not completely uninformative, these are discarded as they are not expected from the Hi-C protocol.
The threshold value can be chosen based on the size selection interval in library preparation, or by examining the distribution of inferred fragment lengths from \Rfunction{getPairData}.
<<fraglenhist, fig.small=TRUE>>=
diags <- getPairData("SRR027957.h5", hs.param)
hist(diags$length[diags$length < 1000], ylab="Frequency",
    xlab="Spacing (bp)", main="", col="grey80")
@
The insert size is defined as the linear distance between two paired reads on the same chromosome.
The \Rcode{min.inward} parameter removes inward-facing intra-chromosomal read pairs where the insert size is less than the specified value.
The \Rcode{min.outward} parameter does the same for outward-facing intra-chromosomal read pairs.
This is designed to remove dangling ends or self-circles involving DNA fragments that have not been completely digested \cite{jin2013highres}.
Such read pairs are technical artifacts that are (incorrectly) retained by \Rfunction{preparePairs}, as the two reads involved are mapped to different restriction fragments.
Appropriate thresholds for both parameters can be determined using strand orientation plots.
\subsection{Setting size thresholds with strand orientation plots}
\label{sec:strorient}
The strand orientation for a read pair refers to the combination of strands for the two alignments.
These are stored as flags where setting \Rcode{0x1} or \Rcode{0x2} means that the read on the first or second anchor fragment, respectively, is mapped on the reverse strand.
If different pieces of DNA were randomly ligated together, one would expect to observe equal proportions of all strand orientations.
This can be tested by examining the distribution of strand orientations for inter-chromosomal read pairs.
Each orientation is equally represented across these read pairs, which is expected as different chromosomes are distinct pieces of DNA.
<<>>=
intra <- !is.na(diags$insert)
table(diags$orientation[!intra])
@
This can be repeated for intra-chromosomal read pairs, by plotting the distribution of insert sizes for each strand orientation \cite{jin2013highres}.
The two same-strand distributions are averaged for convenience.
At high insert sizes, the distributions will converge for all strand orientations.
This is consistent with random ligation between two separate restriction fragments.
At lower insert sizes, spikes are observed in the outward- and inward-facing distributions due to self-circularization and dangling ends, respectively.
Thresholds should be chosen in \Rfunction{prunePairs} to remove these spikes, as represented by the grey lines.
<<strorient, fig.small=TRUE>>=
# Constructing the histograms for each orientation.
llinsert <- log2(diags$insert + 1L)
intra <- !is.na(llinsert)
breaks <- seq(min(llinsert[intra]), max(llinsert[intra]), length.out=30)
inward <- hist(llinsert[diags$orientation==1L], plot=FALSE, breaks=breaks)
outward <- hist(llinsert[diags$orientation==2L], plot=FALSE, breaks=breaks)
samestr <- hist(llinsert[diags$orientation==0L | diags$orientation==3L],
    plot=FALSE, breaks=breaks)
samestr$counts <- samestr$counts/2
# Setting up the axis limits.
ymax <- max(inward$counts, outward$counts, samestr$counts)/1e6
xmax <- max(inward$mids, outward$mids, samestr$mids)
xmin <- min(inward$mids, outward$mids, samestr$mids)
# Making a plot with all orientations overlaid.
plot(0,0,type="n", xlim=c(xmin, xmax), ylim=c(0, ymax),
    xlab=expression(log[2]~"[insert size (bp)]"), ylab="Frequency (millions)")
lines(inward$mids, inward$counts/1e6, col="darkgreen", lwd=2)
abline(v=log2(min.inward), col="darkgrey")
lines(outward$mids, outward$counts/1e6, col="red", lwd=2)
abline(v=log2(min.outward), col="darkgrey", lty=2)
lines(samestr$mids, samestr$counts/1e6, col="blue", lwd=2)
legend("topright", c("inward", "outward", "same"),
    col=c("darkgreen", "red", "blue"), lwd=2)
@
As an aside, the position of the spikes in the above plots can be used to estimate some fragment lengths.
The $x$-coordinate of the outward-facing spike represents the average length of the DNA fragments after restriction digestion.
This is useful as it provides a lower bound on the spatial resolution of any given Hi-C experiment.
The position of the inward-facing spike represents the average length of the fragments after sonication.
This should be lower than the size selection thresholds used in library preparation.
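As a rough sketch, the spike positions can be read off the \Robject{inward} and \Robject{outward} histogram objects computed in the chunk above, under the simplifying assumption that each spike is the tallest bar of its distribution.
<<eval=FALSE>>=
# Approximate fragment lengths from the spike positions (assumes the
# spike is the mode of each distribution, which should be checked visually).
2^outward$mids[which.max(outward$counts)] # average post-digestion length
2^inward$mids[which.max(inward$counts)] # average post-sonication length
@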
\section{Merging technical replicates}
Hi-C experiments often involve deep sequencing as read pairs are sparsely distributed across the set of possible interactions.
As a result, multiple index files may be generated from multiple technical replicates of a single Hi-C library.
These can be merged together using the \Rfunction{mergePairs} function prior to downstream processing.
This is equivalent to summing the counts for each pair of restriction fragment indices, and is valid if one assumes Poisson sampling for each sequencing run \cite{marioni2008rnaseq}.
An example is provided below that merges several technical replicates for a Hi-C library generated from GM06990 cells \cite{lieberman2009comprehensive}.
<<>>=
prepped <- preparePairs("SRR027958.bam", hs.param, file="SRR027958.h5",
    dedup=TRUE, minq=10)
counted <- prunePairs("SRR027958.h5", hs.param, file.out="SRR027958_trimmed.h5",
    max.frag=600, min.inward=min.inward,
    min.outward=min.outward)
mergePairs(files=c("SRR027957_trimmed.h5", "SRR027958_trimmed.h5"), "merged.h5")
@
In addition, any Hi-C dataset that is processed manually by the user can be stored in an index file using the \Rfunction{savePairs} function.
This takes a data frame with the first and second anchor indices, as well as any additional information that might be useful.
The idea is to provide an entry point into the \Biocpkg{diffHic} analysis from other pipelines.
If the dataset is too large, one can save chunks at a time before merging them all together with \Rfunction{mergePairs}.
<<>>=
anchor1.id <- as.integer(runif(100, 1, length(hs.param$fragments)))
anchor2.id <- as.integer(runif(100, 1, length(hs.param$fragments)))
dummy <- data.frame(anchor1.id, anchor2.id, other.data=runif(100))
savePairs(dummy, "example.h5", hs.param)
@
For full compatibility, users should include the alignment positions and lengths for each read pair as \Rcode{anchorX.pos} and \Rcode{anchorX.len},
where \Rcode{X} is 1 or 2 for each of the paired reads.
The alignment position refers to the 1-based coordinate of the left-most base of the alignment.
The alignment length refers to the span of the alignment relative to the reference, and should be negative for alignments on the reverse strand.
This information will be used in downstream \Biocpkg{diffHic} functions, such as read counting around blacklisted regions.
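To illustrate, the \Robject{dummy} data frame from above can be extended with these fields before saving; all positions and lengths below are simulated for demonstration only, and \Rcode{"example2.h5"} is just a hypothetical file name.
<<eval=FALSE>>=
# Simulated positions/lengths, for illustration only.
dummy$anchor1.pos <- as.integer(runif(100, 1, 1e8))
dummy$anchor1.len <- sample(c(50L, -50L), 100, replace=TRUE) # negative on reverse strand
dummy$anchor2.pos <- as.integer(runif(100, 1, 1e8))
dummy$anchor2.len <- sample(c(50L, -50L), 100, replace=TRUE)
savePairs(dummy, "example2.h5", hs.param)
@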
\section{Handling DNase Hi-C experiments}
DNase Hi-C is a variant of the standard protocol whereby the DNase~I enzyme is used to fragment the genome instead of restriction enzymes \cite{ma2015dnase}.
Random fragmentation provides resolution beyond that of individual restriction fragments.
However, cleavage sites for DNase~I cannot be easily predicted to construct \Rcode{param\$fragments}.
Fragment indices have no meaning here because there are no restriction fragments for reads to be assigned to.
Instead, \Biocpkg{diffHic} handles this type of data by operating directly on the alignment position of each read.
This reflects the theoretical base-pair resolution of the data during quantification.
To indicate that the reads come from a DNase Hi-C experiment, an empty \Rclass{GRanges} object should be supplied as the \Rcode{fragments} in the \Rclass{pairParam} object.
Most \Biocpkg{diffHic} functions will detect this and behave appropriately to exploit the improved resolution.
An example of this approach is shown below, where the SRR027957 library is treated as a DNase Hi-C sample.
<<>>=
seg.frags <- emptyGenome(BSgenome.Hsapiens.UCSC.hg19)
preparePairs("SRR027957.bam", pairParam(seg.frags), file="SRR027957.h5",
    dedup=TRUE, minq=10)
@
Unlike \Rfunction{preparePairs}, no diagnostic information regarding self-circles or dangling ends is reported here.
Such metrics are irrelevant when restriction fragments are not involved.
Similarly, the \Rcode{length} field in the output of \Rfunction{getPairData} is set to \Rcode{NA} and should be ignored.
This is because the length of the sequencing fragment cannot be generally computed without knowledge of the fragmentation sites.
As a result, the \Rcode{max.frag} argument should also be set to \Rcode{NA} in \Rfunction{prunePairs}.
Metrics for inward- and outward-facing read pairs are unaffected, as these are computed independently of the fragments.
See the documentation for individual functions to determine if/how their behaviour changes for DNase Hi-C data.
Users should also note that the alignment script described in Section~\ref{sec:align} is not appropriate for DNase Hi-C experiments.
This approach is based on splitting chimeric reads at the ligation signature, which is constructed from the recognition site of a restriction enzyme.
The sequence around the ligation junction is not well-defined when DNase~I is used for cleavage.
Instead, alignment programs should be used that can directly handle chimeric reads with arbitrary breakpoints in the genome, e.g., \software{BWA} \cite{li2010fast}.
A Python script for iterative mapping \cite{imakaev2012iterative} is also provided for this purpose.
<<results='hide'>>=
system.file("python", "iter_map.py", package="diffHic", mustWork=TRUE)
@
\chapter{Counting read pairs into interactions}
\label{chap:counting}
\begin{combox}
A different dataset will be used here, so we don't need anything from the last chapter.
Horses for courses; this dataset's a lot nicer for detecting differential interactions.
\end{combox}
\section{Overview}
Prior to any statistical analysis, the read pairs in a Hi-C library must be summarized into a count for each interaction.
This count is used as an experimental measure of the interaction intensity.
Specifically, each pairwise interaction is parameterized by two genomic intervals representing the interacting loci.
The count for that interaction is defined as the number of read pairs with one read mapping to each of the intervals.
Counting is performed for each sample in the dataset, such that each interaction is associated with a set of counts.
The interaction space is defined as the genome-by-genome space over which the read pairs are distributed.
Recall that each paired read is assigned to a restriction fragment index.
The interaction space contains all index pairs $(x, y)$ for $x, y \in \{1, \ldots, N\}$, where $x \ge y$ and $N$ is the number of restriction fragments in the genome.
This can be visualized as the triangular space between $y=x$, $y=0$ and $x=N$ on the Cartesian plane.
A rectangular area in the interaction space represents a pairwise interaction between the genomic intervals spanned by the two adjacent sides of that rectangle.
The number of read pairs in this area is used as the count for the corresponding interaction.
Non-rectangular areas can also represent interactions, but these are more difficult to interpret and will not be considered here.
The examples shown here will use the neural stem cell dataset \cite{sofueva2013cohesin} in which wild-type cells are compared to cohesin-deficient cells.
Read processing has already been performed to construct an index file for each library.
Some additional work is required to obtain the restriction fragment coordinates for the \textit{Hin}dIII-digested mouse genome.
<<>>=
library(BSgenome.Mmusculus.UCSC.mm10)
mm.frag <- cutGenome(BSgenome.Mmusculus.UCSC.mm10, "AAGCTT", 4)
input <- c("merged_flox_1.h5", "merged_flox_2.h5",
    "merged_ko_1.h5", "merged_ko_2.h5")
@
\section{Counting into bin pairs}
\subsection{Overview}
Here, the genome is partitioned into contiguous non-overlapping bins of constant size.
Each interaction is defined as a pair of these bins.
This approach avoids the need for prior knowledge of the loci of interest when summarizing Hi-C counts.
Counting of read pairs between paired bins is performed for multiple libraries using the \Rfunction{squareCounts} function.
<<>>=
bin.size <- 1e6
mm.param <- pairParam(mm.frag)
data <- squareCounts(input, mm.param, width=bin.size, filter=1)
data
@
This generates an \Rclass{InteractionSet} object containing information for multiple genomic interactions \cite{lun2016infrastructure}.
Each row of the object corresponds to an interaction, i.e., bin pair, while each column represents a library.
For each interaction, the coordinates of its two ``anchor'' bins are returned.
The first/second anchor notation is used for these bins, whereby the first anchor bin is that with the ``higher'' genomic coordinate.
<<>>=
head(anchors(data, type="first"))
head(anchors(data, type="second"))
@
The returned \Rclass{InteractionSet} object contains a count matrix with the number of read pairs for each interaction in each library.
It also contains a \Rcode{totals} vector in the \Rcode{colData}, which stores the total number of read pairs for each library.
<<>>=
head(assay(data))
data$totals
@
Bin pairs can also be filtered to remove those with a count sum below \Rcode{filter}.
This removes uninformative bin pairs with very few read pairs, and reduces the memory footprint of the function.
A higher value of \Rcode{filter} may be necessary for analyses of large datasets in limited memory.
More sophisticated filtering strategies are discussed in Chapter~\ref{chap:filter}.
\subsection{Choosing a bin width}
The \Rcode{width} of the bin is specified in base pairs and determines the spatial resolution of the analysis.
Smaller bins will have greater spatial resolution as adjacent features can be distinguished in the interaction space.
Larger bins will have greater counts as a larger area is used to collect read pairs.
Optimal summarization will not be achieved if bins are too small or too large to capture the (changes in) intensity of the underlying interactions.
For this analysis, 1 Mbp bins are used to capture broad structural features.
This is also useful for demonstration purposes, as the counts are large enough for clear manifestation of biases in later chapters.
Of course, \Biocpkg{diffHic} is more than capable of handling smaller sizes (e.g., below 20 kbp) for higher-resolution analyses of looping interactions between specific elements.
In such cases, \Rcode{filter} should be increased to avoid excessive memory usage.
<<>>=
head(regions(data))
@
The boundary of each bin is rounded to the closest restriction site in \Rfunction{squareCounts}.
This is due to the inherent limits on spatial resolution in a Hi-C experiment.
The number of restriction fragments in each bin is recorded in the \Rcode{nfrags} field of the metadata.
Determination of the ideal bin size is not trivial as the features of interest are not usually known in advance.
Instead, repeated analyses with multiple bin sizes are recommended.
This provides some robustness to the choice of bin size.
Sharp interactions can be detected by pairs of smaller bins while diffuse interactions can be detected by larger bin pairs.
See Section~\ref{sec:mergebins} for more information on consolidating results from multiple bin sizes.
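In practice, this simply involves calling \Rfunction{squareCounts} once per bin size.
A minimal sketch is shown below, where the chosen widths are arbitrary examples; each resulting \Rclass{InteractionSet} object would then be analyzed as described in the rest of this guide.
<<eval=FALSE>>=
# A sketch of counting at several bin sizes (widths here are arbitrary).
all.sizes <- c(1e6, 5e5, 2e5)
all.data <- lapply(all.sizes, function(w) {
    squareCounts(input, mm.param, width=w, filter=1)
})
@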
\section{Counting with pre-defined regions}
\subsection{Connecting counts between pairs of regions}
For some studies, prior knowledge about the regions of interest may be available.
For example, a researcher may be interested in examining interactions between genes.
The coordinates can be obtained from existing annotation, as shown below for the mouse genome.
Other pre-specified regions can also be used, e.g., known enhancers or protein binding sites.
<<>>=
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
gene.body <- genes(TxDb.Mmusculus.UCSC.mm10.knownGene)
strand(gene.body) <- "*" # Removing strand info, for simplicity.
@
Counting is directly performed for these defined regions using the \Rfunction{connectCounts} function.
Interactions are defined between each pair of regions in the pre-specified set.
This may be easier to interpret than pairs of bins if the interacting regions have some biological significance.
The count matrix and the vector of totals are defined as previously described.
<<>>=
redata <- connectCounts(input, mm.param, regions=gene.body)
redata
@
Again, first/second anchor notation applies whereby the interval with the larger start coordinate in the genome is defined as the first anchor.
Note that the first anchor may not have a larger end coordinate if the supplied \Rcode{regions} are nested.
In addition, each region is rounded to the nearest restriction site.
Resorting is also performed, though the indices of the original regions can be found in the metadata as \Rcode{original} if back-referencing is necessary.
<<>>=
head(regions(redata))
@
One obvious limitation of this approach is that interactions involving unspecified regions will be ignored.
This is particularly problematic when searching for novel interacting loci.
Another issue is that the width of the regions cannot be easily changed.
This means that the compromise between spatial resolution and count size cannot be tuned.
For example, interactions will not be detected around smaller genes as the counts will be too small.
Conversely, interactions between distinct loci within a large gene body will not be resolved.
\subsection{Connecting counts between two sets of regions}
The \Rfunction{connectCounts} function can also be used to identify interactions between two sets of regions, by specifying a value for the \Rcode{second.regions} argument.
This only considers interactions between one entry in \Rcode{regions} and another in \Rcode{second.regions}.
This differs from the standard application of the function, which would consider an interaction between any pair of entries in \Rcode{regions}.
If an integer scalar is supplied as \Rcode{second.regions}, the second set is automatically defined as contiguous bins of that size across the genome.
The use of \Rcode{second.regions} is particularly useful in cases where there are defined ``viewpoints'' of interest, e.g., 4C-seq, Capture-C.
These viewpoints can be specified in \Rcode{regions}, as shown below for a set of mock probe locations for a hypothetical Capture-C experiment \cite{hughes2014analysis}.
Specifically, the viewpoint is defined as a 100 kbp bin centered at each capture probe.
The interaction profile across the rest of the genome can then be extracted by setting \Rcode{second.regions} to some bin size.
In this case, 100 kbp bins are used.
<<>>=
probe.loc <- GRanges(c("chr1", "chr2", "chr3"),
    IRanges(c(1e7, 2e7, 3e7), c(1e7, 2e7, 3e7)))
viewpoints <- resize(probe.loc, fix="center", width=1e5)
viewdata <- connectCounts(input, mm.param, regions=viewpoints,
    second.regions=1e5)
head(anchors(viewdata, type="first"))
head(anchors(viewdata, type="second"))
@
As these results demonstrate, interactions are only considered if exactly one interacting locus is from the specified \Rcode{regions}.
The identity of the other locus (i.e., from \Rcode{second.regions}) can be determined based on the \Rcode{is.second} field in the \Rclass{GRanges} object.
This approach avoids loading irrelevant interactions when only specific viewpoints are of interest.
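The \Rcode{is.second} field can be inspected directly; the sketch below assumes that this metadata field is present on the regions of the \Robject{viewdata} object created above.
<<eval=FALSE>>=
# Tabulating which regions were derived from 'second.regions'.
summary(regions(viewdata)$is.second)
@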
\section{Counting into single bins}
\label{sec:marginal}
The \Rfunction{marginCounts} function counts the number of reads mapped inside each bin.
This effectively treats each Hi-C library as single-end data to quantify the genomic coverage of each bin.
One can use these ``marginal'' counts to determine whether there are systematic differences in coverage between libraries for a given bin.
Such differences suggest that copy number variations are present, which may confound detection of differential interactions.
<<>>=
margin.data <- marginCounts(input, mm.param, width=bin.size)
margin.data
@
Each row of the \Rclass{RangedSummarizedExperiment} contains the marginal read counts for a genomic bin.
For this dataset, there are no major changes in coverage for the vast majority of bins.
The most extreme events occur at low abundances and are unlikely to be precise.
This suggests that a direct comparison of interaction intensities will be valid.
Remedial action in the presence of copy number changes is not trivial and will be discussed in Section~\ref{sec:copy}.
<<mamargin, fig.small=TRUE>>=
adjc <- cpm(asDGEList(margin.data), log=TRUE, prior.count=5)
smoothScatter(0.5*(adjc[,1]+adjc[,3]), adjc[,1]-adjc[,3],
    xlab="A", ylab="M", main="Flox (1) vs. Ko (1)")
@
\section{Additional parameter options}
\subsection{Restricting the input chromosomes}
\label{sec:restrictchr}
Users can restrict counting to particular chromosomes by setting the \Rcode{restrict} slot in the \Rclass{pairParam} object.
This is useful to ensure that only interactions between relevant chromosomes are loaded.
Sequences such as the mitochondrial genome, unassigned contigs or random chromosome segments can be ignored in routine analyses.
<<>>=
new.param <- reform(mm.param, restrict=c("chr1", "chr2"))
new.param
@
In addition, if \Rcode{restrict} is a $n$-by-2 matrix, count loading will be limited to the read pairs that are mapped between the $n$ specified pairs of chromosomes.
The example below considers all read pairs mapped between chromosomes 2 and 19.
This feature is useful when memory is limited, as each pair of chromosomes can be loaded and analyzed separately.
<<>>=
new.param <- reform(mm.param, restrict=cbind("chr2", "chr19"))
new.param
@
\subsection{Specifying regions to ignore}
Users can also discard alignments that lie within blacklisted regions by setting the \Rcode{discard} slot.
The aim is to eliminate reads within known repeat regions.
Such regions are problematic, as reads from several repeat units in the real genome may be collated into a single representative unit in the genome build.
This results in a sharp, spurious spike in interaction intensity.
The problem is exacerbated by different repeat copy numbers between biological conditions, resulting in spurious differential interactions due to changes in coverage.
Removal of reads in these repeats may be necessary to obtain reliable results.
<<>>=
dummy.repeat <- GRanges("chr1", IRanges(10000, 1000000))
new.param <- reform(mm.param, discard=dummy.repeat)
new.param
@
Coordinates of annotated repeats can be obtained from several different sources.
A curated blacklist of problematic regions is available from the ENCODE project \cite{dunham2012encode}, and can be obtained at \url{https://sites.google.com/site/anshulkundaje/projects/blacklists}.
This list is constructed empirically from the ENCODE datasets and includes obvious offenders like telomeres, microsatellites and some rDNA genes.
Alternatively, repeats can be predicted from the genome sequence using software like RepeatMasker.
These calls are available from the UCSC website (e.g., \url{hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/chromOut.tar.gz} for mouse) or they can be extracted from an appropriate masked \Rclass{BSgenome} object.
Experience suggests that the ENCODE blacklist is generally preferable.
Repeat predictions tend to be aggressive such that too much of the genome (and interactions therein) will be discarded.
\subsection{Capping the read pairs per restriction fragment pair}
Incomplete removal of PCR duplicates or read pairs in repeat regions may result in spikes of read pairs within the interaction space.
The effect of these artifacts can be mitigated by capping the number of read pairs associated with each pair of restriction fragments.
This is done by specifying a value for the \Rcode{cap} slot.
Diffuse interactions should not be affected, as the associated read pairs will be distributed sparsely across many fragment pairs.
More caution is required if sharp interactions are present, i.e., interactions between 5--10 kbp regions.
<<>>=
new.param <- reform(mm.param, cap=5)
new.param
@
\section{Loading counts from existing count matrices}
It is possible to use counts from existing count matrices, by using coercion methods available to \Rclass{ContactMatrix} objects from the \Biocpkg{InteractionSet} package \cite{lun2016infrastructure}.
Assume that there are two \Rclass{ContactMatrix} objects (\Robject{cm1} and \Robject{cm2}) spanning the same region of the binned interaction space.
The \Rfunction{mergeCMs} function converts these two objects into a single \Rclass{InteractionSet} object for use in the \Biocpkg{diffHic} analysis pipeline.
Only bin pairs with a count sum of 10 or more are retained, analogous to the output obtained with a \Rfunction{squareCounts} call with \Rcode{filter=10}.
<<echo=FALSE,results="hide">>=
# Just for testing.
N <- 30
all.starts <- round(runif(N, 1, 100))
all.ends <- all.starts + round(runif(N, 5, 20))
all.regions <- GRanges(rep(c("chrA", "chrB"), c(N-10, 10)),
    IRanges(all.starts, all.ends))
Nr <- 10
Nc <- 20
all.anchor1 <- sample(N, Nr)
all.anchor2 <- sample(N, Nc)
counts <- matrix(rpois(Nr*Nc, lambda=10), Nr, Nc)
cm1 <- ContactMatrix(counts, all.anchor1, all.anchor2, all.regions)
counts <- matrix(rpois(Nr*Nc, lambda=10), Nr, Nc)
cm2 <- ContactMatrix(counts, all.anchor1, all.anchor2, all.regions)
@
<<>>=
data.com <- mergeCMs(cm1, cm2, filter=10)
@
This procedure facilitates input from other Hi-C data processing pipelines that return count matrices rather than BAM files.
It is easily extended to datasets involving more than two samples by simply including additional \Rclass{ContactMatrix} objects in the \Rfunction{mergeCMs} call.
\section{Summary}
Counting into bin pairs is the most general method for quantifying interaction intensity.
It does not require any prior knowledge regarding the regions of interest.
The bin size can be easily adjusted to obtain the desired spatial resolution.
It is also easier/safer to compare between bin pairs (e.g., during filtering) when each bin is roughly of the same size.
Thus, bin-based counting will be the method of choice for the rest of this guide.
For simplicity, all counting steps will be performed here with the default settings, i.e., no values for \Rcode{restrict}, \Rcode{discard} or \Rcode{cap}.
However, users are encouraged to use non-default values if it can improve the outcome of their analyses.
Non-default values should be recorded in a single \Rclass{pairParam} object for consistent use across all functions in the \Biocpkg{diffHic} pipeline.
This ensures that the same read pairs will always be extracted from each index file.
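As a sketch, the non-default settings could be collected once and reused throughout; the particular values below are illustrative only, re-using the \Robject{mm.frag} and \Robject{dummy.repeat} objects constructed earlier.
<<eval=FALSE>>=
# Illustrative only: one pairParam object holding all non-default settings.
my.param <- pairParam(mm.frag, restrict=c("chr1", "chr2"),
    discard=dummy.repeat, cap=5)
my.data <- squareCounts(input, my.param, width=bin.size, filter=1)
@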
\chapter{Filtering out uninteresting interactions}
\label{chap:filter}
\begin{combox}
Here, we'll need the \Robject{data} object that was loaded in the previous chapter.
We'll also need \Robject{bin.size} and \Robject{mm.param}, as well as the file names in \Robject{input}.
\end{combox}
\section{Overview}
\subsection{Computing the average abundance in a NB model}
Filtering is often performed to remove uninteresting features in analyses of high-throughput experiments.
This reduces the severity of the multiplicity correction and increases power among the remaining tests.
The filter statistic should be independent of the $p$-value under the null hypothesis, but correlated to the $p$-value under the alternative \cite{bourgon2010independent}.
The aim is to enrich for false nulls without affecting type I error for the true nulls.
Assume that the counts for each bin pair are sampled from the negative binomial (NB) distribution.
In this model, the overall NB mean across all libraries is (probably) an independent filter statistic.
The log-mean-per-million is known as the average abundance and can be computed with the \Rfunction{aveLogCPM} function in \Biocpkg{edgeR} \cite{mccarthy2012glm}.
<<avehistplot, fig.small=TRUE>>=
library(edgeR)
ave.ab <- aveLogCPM(asDGEList(data))
hist(ave.ab, xlab="Average abundance", col="grey80", main="")
@
Any bin pair with an average abundance less than a specified threshold value will be discarded.
At the very least, the threshold should be chosen to filter out bin pairs with very low absolute counts.
This is because these bin pairs will never have sufficient evidence to reject the null hypothesis.
The discreteness of low counts will also reduce the accuracy of approximations that are used in normalization and statistical modelling.
The example below identifies those bin pairs where the overall NB mean across all samples is not less than 5.
<<>>=
count.keep <- ave.ab >= aveLogCPM(5, lib.size=mean(data$totals))
summary(count.keep)
@
The \Robject{count.keep} vector is then used to subset the \Rclass{InteractionSet} object.
The resulting \Robject{dummy} object will contain only the interesting bin pairs for downstream analysis.
The same procedure can be used for any logical or integer vector that specifies the bin pairs to be retained.
<<>>=
dummy <- data[count.keep,]
@
This count-based approach is fairly objective yet is still effective, i.e., it removes a large number of bin pairs that are likely to be uninteresting.
However, it will be less useful with greater sequencing depth where all bin pairs will have higher counts.
More sophisticated strategies can be implemented where the choice of threshold is motivated by some understanding of the Hi-C protocol.
These strategies will be described in the rest of this chapter.
\section{Directly removing low-abundance interactions}
\label{sec:direct}
\subsection{Computing the threshold from inter-chromosomal counts}
The simplest definition of an ``uninteresting'' interaction is that resulting from non-specific ligation.
These are represented by low-abundance bin pairs where no underlying contact is present to drive ligation between the corresponding bins.
Any changes in the counts for these bin pairs are uninteresting and are ignored.
In particular, the filter threshold can be defined by mandating some minimum fold change above the level of non-specific ligation.
The magnitude of non-specific ligation is empirically estimated by assuming that most inter-chromosomal contacts are not genuine.
This is reasonable given that most chromosomes are arranged in self-interacting territories \cite{bickmore2013spatial}.
The median abundance across inter-chromosomal bin pairs is used as an estimate of the non-specific ligation rate.
Here, filtering is performed to retain only those bin pairs with abundances that are at least 5-fold higher than this estimate.
This aims to remove the majority of uninteresting bin pairs.
<<>>=
direct <- filterDirect(data)
direct$threshold
direct.keep <- direct$abundances > log2(5) + direct$threshold
summary(direct.keep)
@
The \Rcode{direct.keep} vector can then be used to filter \Rcode{data}, as shown previously.
This approach is named here as ``direct'' filtering, as the average abundance for each bin pair is directly compared against a fixed threshold value.
In practice, the direct filter can be combined with \Rcode{count.keep} to ensure that the retained bin pairs have large absolute counts.
<<>>=
direct.keep2 <- direct.keep & count.keep
@
\subsection{Computing the threshold from larger bin pairs}
The procedure above assumes that no additional filtering has been performed during count loading with \Rfunction{squareCounts}, i.e., \Rcode{filter} is set to unity.
Any pre-filtering that removes low-abundance bin pairs will inflate the median and lead to overestimation of the filter threshold.
However, it may not be practical to load counts without pre-filtering.
For example, at small bin sizes, too many non-empty bin pairs may be present to fit into memory.
Counts that are too small will also yield imprecise estimates of the background ligation rate.
To overcome this, counts are loaded for a larger bin size without pre-filtering.
This reduces memory usage as the interaction space is partitioned into fewer bin pairs.
The filter threshold is estimated from the inter-chromosomal interactions as previously described, exploiting the presence of large counts for large bin sizes to achieve precise estimates.
The estimated threshold for the larger bin pairs is then converted into a corresponding threshold for the original (smaller) bin pairs, after adjusting for the differences in bin sizes.
To illustrate, imagine that a bin size of 100 kbp is of interest.
We load the counts for the smaller bin pairs, using a reasonably large \Rcode{filter} to avoid excessive memory consumption.
<<>>=
new.bin.size <- 1e5
smaller.data <- squareCounts(input, mm.param, width=new.bin.size, filter=20)
@
Direct filtering of these smaller bin pairs is performed using \Rfunction{filterDirect}, using the larger (1 Mbp) bin pairs to estimate the filter threshold.
This is done by passing the unfiltered counts for the larger bin pairs in \Robject{data} as the \Rcode{reference} parameter.
The returned statistics can then be used to directly filter the smaller bin pairs.
Here, a minimum 5-fold change over the threshold is required for the retention of each 100 kbp bin pair.
<<>>=
direct <- filterDirect(smaller.data, reference=data)
direct$threshold
small.keep <- direct$abundances > direct$threshold + log2(5)
summary(small.keep)
@
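The retained bin pairs can then be extracted by subsetting, in the same manner as before.
<<eval=FALSE>>=
# Keeping only the smaller bin pairs that pass the direct filter.
smaller.filtered <- smaller.data[small.keep,]
@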
\section{Filtering as a function of interaction distance}
\subsection{Computing the threshold directly from the bin pair counts}
A more complex filter involves adjusting the threshold according to the distance between the bins in each bin pair.
In typical Hi-C datasets, larger counts are observed at lower interaction distances.
This is probably driven by non-specific compaction of chromatin into a 3D ``globule'' conformation.
If such compaction is uninteresting, a concomitantly higher threshold is necessary to offset the increase in counts for these local interactions \cite{lin2012global}.
In the trended strategy, a trend is fitted to the abundances for all intra-chromosomal bin pairs using the log-distance as the covariate.
The bin size is added to the distance as a prior, to avoid undefined values upon log-transformation when distances are zero.
The fitted value is then used as the threshold for each bin pair.
For inter-chromosomal bin pairs, the threshold is set to that from the direct filtering approach, for completeness.
<<>>=
trended <- filterTrended(data)
@
The effect of this strategy can be visualized by plotting the interaction distance against the normalized abundance.
A power-law relationship between distance and abundance is usually observed in Hi-C data \cite{lieberman2009comprehensive}.
The average abundance (and thus, the filter threshold) decreases as the distance between the interacting loci increases.
<<avedistplot, fig.small=TRUE>>=
smoothScatter(trended$log.distance, trended$abundances,
    xlab="Log-Distance", ylab="Normalized abundance")
o <- order(trended$log.distance)
lines(trended$log.distance[o], trended$threshold[o], col="red", lwd=2)
@
The assumption here is that the majority of interactions are generated by non-specific packaging of the linear genome.
Each bin pair is only retained if its abundance is greater than the corresponding fitted value at that distance, i.e., above that expected from compaction.
This favors selection of longer-range interactions, compared to the direct filter.
<<>>=
trend.keep <- trended$abundances > trended$threshold
summary(trend.keep)
@
Of course, the threshold can also be defined at some minimum fold change above the fitted value.
This effectively increases the stringency of the filter.
The example below retains bin pairs with abundances that are two-fold higher than the expected compaction intensity.
<<>>=
trend.keep2 <- trended$abundances > trended$threshold + log2(2)
summary(trend.keep2)
@
The trended filter can also be combined with \Rcode{count.keep} to ensure that the absolute counts are large.
This is particularly important at large distances, where the drop in the threshold may lead to the inappropriate retention of very low abundance bins.
<<>>=
trend.keep3 <- trend.keep & count.keep
@
The distance between bins can also be obtained directly with the \Rfunction{pairdist} function.
Distances for inter-chromosomal bin pairs are marked as \Rcode{NA}.
These tend to dominate the output as they constitute most of the interaction space.
Of course, the majority of these bin pairs will have low counts due to the sparseness of the data.
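A minimal sketch of its use on the \Robject{data} object is shown below.
<<eval=FALSE>>=
# Distances between anchor bins; NA for inter-chromosomal bin pairs.
dist <- pairdist(data)
summary(is.na(dist))
@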
\subsection{Using larger bin pairs to save memory}
The procedure above assumes that no pre-filtering was performed on the counts for small bin pairs.
In situations with limited memory, the counts for larger bin pairs can again be used to define the thresholds.
The trend is fitted to the abundances and distances of the larger bin pairs, and interpolation (or extrapolation) is performed to obtain fitted values at the distances of the smaller bin pairs.
The interpolated trend can then be applied to filter the smaller bin pairs based on their abundances.
This entire process is done automatically within \Rfunction{filterTrended} when a \Rcode{reference} argument is supplied, as shown below.
<<>>=
trended <- filterTrended(smaller.data, reference=data)
summary(trended$abundances > trended$threshold)
@
Another advantage of this approach is that threshold estimation is more precise with larger counts.
Thus, even if there is enough memory for loading smaller bin pairs with \Rcode{filter=1}, larger bin sizes may still be preferred for computing the filter threshold.
Otherwise, median calculation or trend fitting will be confounded by discreteness at low counts.
\section{Peak-calling in the interaction space}
\subsection{Defining enrichment values for each bin pair}
Peaks in the interaction space are bin pairs that have substantially more reads than their neighbors.
The \Rfunction{enrichedPairs} function counts the number of read pairs in the ``neighborhood'' of each bin pair.
For a bin pair $x$, the function considers several neighborhood regions including bin pairs above or below $x$; bin pairs to the left and right of $x$; bin pairs in a box centered at $x$; and for intra-chromosomal bin pairs, the quadrant closest to the diagonal \cite{rao2014kilobase}.
The size of each neighborhood is determined by \Rcode{flank}, defined in terms of bins.
In this case, each bin is around 1 Mbp so the flank width has an actual size of \textasciitilde{}5 Mbp.
<<>>=
flank.width <- 5
en.data <- enrichedPairs(data, flank=flank.width)
en.data
@
The \Rfunction{filterPeaks} function uses the neighborhood counts to compute an enrichment value for each bin pair.
The average abundance across libraries for each neighborhood region is computed, and the region with the largest abundance is chosen.
The enrichment value is defined as the difference between the abundance of the bin pair and that of its chosen neighborhood region (adjusted for the number of bin pairs in the latter).
Bin pairs with high enrichment values are putative peaks that should be retained.
This extends the concept of peak calling on the linear genome (e.g., in ChIP-seq) to the 2-dimensional interaction space \cite{rao2014kilobase}.
Neighborhood regions are defined to capture high-intensity structural features like looping domains, TADs or banding patterns.
This ensures that the structural features themselves do not drive high enrichment values.
Rather, to be called a peak, a bin pair in a structural feature must be enriched relative to other bin pairs in the same feature.
<<>>=
enrichments <- filterPeaks(en.data, get.enrich=TRUE)
summary(enrichments)
@
In this example, an enrichment value above 0.5 is required for retention.
This is a modest threshold that errs on the side of caution, as weak peaks may still be DIs and should be retained for testing.
In practice, filtering of enrichment values should be combined with filtering on absolute abundances.
This eliminates low-abundance bin pairs with high enrichment values, e.g., when the neighborhood is empty.
Near-diagonal elements should also be removed, as these tend to have high enrichment scores without actually being peaks.
All of these steps can be conveniently applied through the \Rfunction{filterPeaks} wrapper function.
<<>>=
peak.keep <- filterPeaks(data, enrichments, min.enrich=0.5,
    min.count=5, min.diag=2L)
sum(peak.keep)
@
\subsection{Examining some peak-calling diagnostics}
This filtering strategy can be evaluated by examining the interaction space around each putative peak (see Chapter~\ref{chap:plaid}).
Briefly, the color intensity of each ``pixel'' is proportional to the number of read pairs between the corresponding loci on each axis.
Red boxes mark the bin pairs retained by filtering, where each side represents a bin.
In both examples, the bin pairs correspond nicely to punctate high-intensity contacts in the interaction space.
<<peakdiagplot,echo=FALSE,fig.wide=TRUE,fig.asp=0.6>>=
#keep2 <- peak.keep & !is.na(dist)
#which(keep2)[order(enrichments[keep2], decreasing=TRUE)][1:100]
#plotPlaid(input[3], param=mm.param, width=2e5, cap=20,
# first=resize(anchors(data[chosen], type="first"), fix="center", width=10e6),
# second=resize(anchors(data[chosen], type="second"), fix="center", width=10e6))
a1.list <- GRanges(c("chr11", "chr4"), IRanges(
    c(80999224, 153002112),
    c(81998782, 154006051)))
a2.list <- GRanges(c("chr11", "chr4"), IRanges(
    c(66001477, 130998845),
    c(67011476, 131997373)))
capped <- c(50, 50)
par(mfrow=c(1,2))
for (chosen in seq_along(a1.list)) {
    if (!suppressWarnings(any(overlapsAny(anchors(data[peak.keep], type="first"), a1.list[chosen], type="equal") &
            overlapsAny(anchors(data[peak.keep], type="second"), a2.list[chosen], type="equal")))) {
        stop("can't find example for peak-calling")
    }
    plotPlaid(input[3], param=mm.param, width=2e5, max.count=capped[chosen],
        first.region=resize(a1.list[chosen], fix="center", width=10e6),
        second.region=resize(a2.list[chosen], fix="center", width=10e6),
        main=paste("Example", chosen))
    box()
    rect(start(a1.list[chosen]), start(a2.list[chosen]),
        end(a1.list[chosen]), end(a2.list[chosen]), border="red")
}
@
Another diagnostic is based on the sparsity of putative peaks.
Bin pairs nested within larger ``parent'' bin pairs are identified using the \Rfunction{boxPairs} function.
Most of these parents should contain low numbers of nested bin pairs.
This is because each peak, by definition, should be isolated in its neighborhood.
The example below considers parent bin pairs of size equal to the neighborhood used to compute the enrichment values,
and calculates the percentage of parents that contain a given number (from 1 to 10, or higher) of nested bin pairs.
<<>>=
neighborhood <- (2*flank.width + 1) * metadata(data)$width
boxed <- boxPairs(data[peak.keep], reference=neighborhood)
out <- tabulate(tabulate(boxed$indices[[1]]))
out <- c(out[1:10], sum(out[11:length(out)])) # sum all >10
setNames(round(out/sum(out)*100, 1), c(1:10, ">10"))
@
Most parents should contain few bin pairs ($\le$ 10, as a rule of thumb), as shown above.
In the second example below, a larger proportion of parents with many nested bin pairs is observed.
This suggests that more aggressive filtering on the enrichment score is required.
<<>>=
peak.keep2 <- filterPeaks(data, enrichments, min.enrich=0,
    min.count=5, min.diag=2L)
boxed <- boxPairs(data[peak.keep2], reference=neighborhood)
out <- tabulate(tabulate(boxed$indices[[1]]))
out <- c(out[1:10], sum(out[11:length(out)])) # sum all >10
setNames(round(out/sum(out)*100, 1), c(1:10, ">10"))
@
Obviously, peak calling assumes that changes in intensity of broad interactions are not interesting.
This may not be appropriate for every study.
Indeed, the definition of ``broad'' depends on the bin size, whereby a peak called at large bin sizes may be discarded at smaller sizes.
Users should try out the other, more general filters before proceeding to peak calling, to check that potentially interesting differences are not discarded by the latter.
\subsection{Efficient peak calling at high resolution}
The \Rfunction{enrichedPairs} function requires that all bin pairs are loaded into memory, i.e., with \Rcode{filter=1} during \Rfunction{squareCounts}.
This may not be feasible for high resolution analyses with small bin sizes, where there is a large number of bin pairs with low counts.
In such cases, it may be more practical to use the \Rfunction{neighborCounts} function.
This re-counts the read pairs for each bin pair, storing only a limited portion of the interaction space to compute the neighborhood counts.
The need to load all non-empty bin pairs is avoided.
Only bin pairs with count sums greater than or equal to \Rcode{filter} are returned, along with neighborhood counts that can be used in \Rfunction{filterPeaks}.
This reduces memory usage at the cost of some speed.
<<>>=
en.data <- neighborCounts(input, mm.param, width=1e5, filter=20, flank=5)
en.data
enrichments <- filterPeaks(en.data, get.enrich=TRUE)
summary(enrichments)
@
\section{Filtering for pre-specified regions}
Filtering for pairs of arbitrary regions is complicated by the potential irregularity of the regions.
In particular, there is no guarantee that the supplied regions will cover the entirety of the interaction space.