Assessing differences in interactions between peaks

This workflow demonstrates how to use AQuA-tools with AQuA normalized samples in your local environment. Using locally installed AQuA-tools requires specific sample file formatting and parameter use, which are outlined below. For example purposes, this recipe can also be run in Docker and Tinker environments, but it is designed to show how you can work with your own samples not provided in remote environments. For recipes designed for use in Docker or Tinker environments, we recommend starting with Recipe_01: Calling loops from a sample.

This recipe will help you:

Prepare local input files
Define and organize the necessary variables
Run a full workflow step by step
Interpret the biological outcome

We will use the following AQuA-tools:

Motivation

With the assumption that relevant 2D interactions observed in a contact matrix (HiChIP) stem from 1D peaks (ChIP-seq), the value at the matrix bin (pixel) emanating from any two given peaks represents its contact frequency in nuclear space. Therefore, by ‘casting a net’ on the contact matrix using the combinatorics of all possible peaks within a given TAD, we get the entire interaction space between peaks of interest for a given sample. This interaction space can then be compared across conditions to ask: “is there a global observable difference in interaction counts between the interaction spaces of peaks between treatment and control?”

In this recipe, we’ll use two experimental conditions:

Case: H3K27ac HiChIP performed on the cell-line RH4 (DMSO)
Control: H3K27ac HiChIP performed on the cell-line RH4 treated with entinostat, an HDAC inhibitor.

For both experiments, we have paired linear 1D peaks from H3K27ac ChIP-seq experiments.

Chapter 0: Ingredients

I. Prerequisites

You must complete the AQuA-tools local installation before starting, which can be found here.

Installing locally will create the following structure in your home directory:

$HOME/
├── aqua_tools/                  # AQuA tools
├── lab-data/                    # Sample data
├── outputs/                     # To store output
├── genome-wide_loops/           # Sample loops files
├── bed_files/                   # Sample peaks files
├── juicer_tools_1.19.02.jar     # Needed for plot_APA

NOTE: to know your $HOME path, simply type: echo $HOME

II. Format sample `.hic` data and `.mergeStats.txt` files

Running AQuA-tools locally requires a specific folder and filename structure, described here. We’ll format sample data to match those requirements for illustration purposes here. To use your own samples, use your sample paths and make sure your .hic and .txt files meet the formatting requirements.

# -----------------------------------------------------------------------------
# Running AQuA-tools locally requires a specific sample folder and filename
# structure. This example formats provided sample data to match those requirements.
# If using your own samples, provide your own paths and ensure that your .hic and
# .mergeStats.txt files match the expected naming and folder layout:
#
#   ~/SAMPLE/
#       ├── SAMPLE.hic
#       └── SAMPLE.mergeStats.txt
#
# -----------------------------------------------------------------------------
# Each sample folder must include a .hic file and a .mergeStats.txt file.
#
# A minimal .mergeStats.txt file must contain at least this row:
#     valid_interaction_rmdup    [human_count]
#
# If a second reads column is present, AQuA-tools will use AQuA normalization.
# Otherwise, it will default to CPM normalization.
#
#     valid_interaction_rmdup    [human_count] [mouse_count]
# -----------------------------------------------------------------------------
#
# Even when running this tutorial in Docker, this formatting step is required
# because we'll be using the `--sample1_dir` argument with `query_bedpe`, which
# expects a full path to a sample folder. The `--sample1_dir` argument allows
# you to use your own local data by pointing to a folder that follows the
# expected structure.
#
# --------------------------
# Sample 1: RH4 DMSO H3K27ac
# --------------------------

# Provide full path to sample1's folder
sample1=$HOME/lab-data/hg38/RH4_DMSO_H3K27ac/RH4_DMSO_H3K27ac_version_1

# Rename the .hic file to follow required format: SAMPLE/SAMPLE.hic
mv "$sample1/RH4_DMSO_H3K27ac_version_1.allValidPairs.hic" \
   "$sample1/RH4_DMSO_H3K27ac_version_1.hic"

# Rename mergeStats.txt to match sample name
mv "$sample1/mergeStats.txt" \
   "$sample1/RH4_DMSO_H3K27ac_version_1.mergeStats.txt"

# --------------------------
# Sample 2: RH4 HDACi H3K27ac
# --------------------------

# Provide full path to sample2's folder
sample2=$HOME/lab-data/hg38/RH4_HDACi_H3K27ac/RH4_HDACi_H3K27ac_version_1

# Rename the .hic file to follow required format: SAMPLE/SAMPLE.hic
mv "$sample2/RH4_HDACi_H3K27ac_version_1.allValidPairs.hic" \
   "$sample2/RH4_HDACi_H3K27ac_version_1.hic"

# Rename mergeStats.txt to match sample name
mv "$sample2/mergeStats.txt" \
   "$sample2/RH4_HDACi_H3K27ac_version_1.mergeStats.txt"

III. Gathering necessary files

For this workflow, we will need H3K27ac peaks files (.bed), a reference TAD file (.bed), a reference CTCF file (.bed from ENCODE-3 cCRE) and a TSS annotation file (.bed from GENCODE). We will assign these files to variable names for simpler use.

NOTE: we are using reference TAD and CTCF annotations for this recipe, as HiC-informed TAD calls and ChIP-informed CTCF peaks are lacking.

# Initialise files in variables
DMSO_peaks=$HOME/bed_files/RH4_DMSO_H3K27ac-peaks_GSM3229491_hg38.bed

HDACi_peaks=$HOME/bed_files/RH4_HDACi_H3K27ac-peaks_GSM3229493_hg38.bed

TSS=$HOME/lab-data/hg38/reference/GENCODE_TSSs_hg38.bed

CTCF=$HOME/lab-data/hg38/reference/ENCODE3_cCRE-CTCF_hg38.bed

TAD=$HOME/lab-data/hg38/reference/TAD_goldsorted_span_centromeres-removed_hg38.bed

Chapter 1: Building interaction spaces

I. Building the `.bedpe`

The AQuA tool build_bedpe can create the scaffold for assessing the combinatorial space of all possible 2D interactions using .bed files. The resulting output is a .bedpe file, which can be then used to query values from a contact matrix. The total set of possible interactions from a build_bedpe self-build is represented in Figure 1. For more information about how build_bedpe works, see our Docs.

self-build

Fig 1. All possible interactions resulting from a build_bedpe self-build. Bed elements are depicted along the diagonal and interactions are shown in the triangle.

In this example, we’ll create two self-interaction .bedpe files, one for each condition (DMSO and HDACi), using their respective peak files. We set a minimum distance of 25kb between interaction anchors to avoid short-range contacts near the diagonal, and we restrict interactions to occur within TADs. This setup gives us the best chance of capturing interactions anchored in open chromatin with minimal noise from short-range contacts.

cd $HOME

build_bedpe \
 --bed_A $DMSO_peaks \
 --bed_B $DMSO_peaks \
 --min_dist 25000    \
 --TAD $TAD  > $HOME/outputs/RH4_DMSO_peaks.bedpe

build_bedpe \
 --bed_A $HDACi_peaks \
 --bed_B $HDACi_peaks \
 --min_dist 25000     \
 --TAD $TAD  > $HOME/outputs/RH4_HDACi_peaks.bedpe

Chapter 2: Constraining interaction spaces

I. Intersecting `.bedpe` with a TSS file

We can then reduce this combinatorial space of interacting peaks to interactions that have at least one TSS on either end of a .bedpe row using intersect_bedpe. This narrows our combinatorial scaffold of interactions to contain genes found in open chromatin and possibly poised for transcription. For more information about how intersect_bedpe works, see our Docs.

intersect_bedpe \
 --bed_A $TSS \
 --bedpe $HOME/outputs/RH4_DMSO_peaks.bedpe \
    > $HOME/outputs/RH4_DMSO_peaks-tss.bedpe

intersect_bedpe \
 --bed_A $TSS \
 --bedpe $HOME/outputs/RH4_HDACi_peaks.bedpe \
    > $HOME/outputs/RH4_HDACi_peaks-tss.bedpe

II. Intersecting `.bedpe` with a CTCF file

We’ll use intersect_bedpe again, this time with a .bed file containing reference CTCF sites obtained from ENCODE-3. By intersecting with CTCF sites, we narrow our focus to interactions mediated or stabilized by CTCF among regions already anchored at transcription start sites. This gives us a focused set of promoter-associated loops to continue our analysis with.

intersect_bedpe \
 --bed_A $CTCF \
 --bedpe $HOME/outputs/RH4_HDACi_peaks-tss.bedpe \
    > $HOME/outputs/RH4_DMSO_peaks-tss-ctcf.bedpe

intersect_bedpe \
 --bed_A $CTCF \
 --bedpe $HOME/outputs/RH4_HDACi_peaks-tss.bedpe \
    > $HOME/outputs/RH4_HDACi_peaks-tss-ctcf.bedpe

self-build

Fig 2. Reduction in interactions resulting from intersect_bedpe. Starting with the full set of interactions obtained from a self-build build_bedpe call, interactions are restricted to those that intersect with a TSS, and then restricted again to those that intersect with a CTCF site. TSSs are shown in red along the diagonal. CTCF sites are shown in blue along the diagonal. Filtered .bed elements are shown as gray boxes along the diagonal.

Chapter 3: Refining interaction spaces

I. Querying the `.bedpe`

The constrained set of interaction space obtained using steps above can then be paired with a contact matrix to query the respective interaction values for each row.

We will use query_bedpe with --formula max and --fix false to refine our input .bedpe coordinates. As the interaction space for any two given peaks can span multiple pixels in the contact matrix, choosing the max pixel hones into a single number with the most signal and keeps all .bedpe coordinates fixed to the resolution of choice (in this case the default of 5kb).

The refining process is visually represented in Figure 3. For more information about how query_bedpe works, see our Docs.

query

Fig 3. Refined interaction coordinates obtained from query_bedpe using --formula max and --fix FALSE.

For this step, we’ll be using our sample data folders which contain our .hic and .mergeStats.txt files.

NOTE: It’s important to use the --sample1_dir parameter when working with your own samples (or when following this recipe exactly for example purposes), rather than the --sample1 parameter, which is used in other recipes designed for use in Docker or Tinker environments.

query_bedpe \
 --bedpe $HOME/outputs/RH4_DMSO_peaks-tss-ctcf.bedpe \
 --sample1_dir $sample1 \
 --formula max \
 --fix FALSE > $HOME/outputs/RH4_DMSO_query.bedpe

query_bedpe \
 --bedpe $HOME/outputs/RH4_HDACi_peaks-tss-ctcf.bedpe \
 --sample1_dir $sample2 \
 --formula max \
 --fix FALSE > $HOME/outputs/RH4_HDACi_query.bedpe

Before continuing, let’s quickly check our output files to make sure we have bedpe coordinates in the first 6 columns and that we did not pick up any error messages.

head -n 2 "$HOME/outputs/RH4_DMSO_query.bedpe" | cut -f1-6

# chr10  102130000  102135000  chr10  102635000  102640000
# chr10  103965000  103970000  chr10  104205000  104210000

head -n 2 "$HOME/outputs/RH4_HDACi_query.bedpe" | cut -f1-6

# chr10  102130000  102135000  chr10  102635000  102640000
# chr10  103965000  103970000  chr10  104205000  104210000

Everything looks good here! Let’s move on to Chapter 4.

Chapter 4: Merging interaction spaces

I. Merging the `.bedpe`

So far, we have been working with each condition (DMSO and HDACi) separately, creating and refining the interaction spaces of each dataset.

Now, we will use union_bedpe to merge the spaces into a single .bedpe file. This step takes the union of overlapping interaction regions from the two conditions, creating a unified set of candidate loops for comparison. For more information about how union_bedpe works, see Figure 4 as well as our Docs.

union

Fig 4. Merged interaction coordinates obtained from union_bedpe.

union_bedpe \
 --bedpe $HOME/outputs/RH4_DMSO_query.bedpe \
 --bedpe $HOME/outputs/RH4_HDACi_query.bedpe \
   > $HOME/outputs/RH4_union.bedpe

Chapter 5: Visualizing interactions

I. Two-sample `plot_APA` with AQuA normalization

Finally, we can visualize the aggregate contact signal of the merged interaction space across the two conditions using plot_APA.

We’ll make a two-sample delta APA plot using both the RH4_DMSO and RH4_HDACi samples and our unified .bedpe. We’ll run plot_APA twice: once using CPM normalization and once using AQuA normalization. Because this dataset includes spike-in mouse reads, AQuA normalization is applied by default. For more information about how plot_APA works, see our Docs.

plot_APA \
 --bedpe $HOME/outputs/RH4_union.bedpe  \
 --sample1_dir $sample1  \
 --sample2_dir $sample2  \
 --out-dir $HOME/outputs/RH4_aqua

II. Two-sample `plot_APA` with CPM normalization

plot_APA \
 --bedpe $HOME/outputs/RH4_union.bedpe \
 --sample1_dir $sample1  \
 --sample2_dir $sample2  \
 --norm cpm                  \
 --out-dir $HOME/outputs/RH4_cpm

To view the resulting APA plots, navigate to the output directories created for each plot and view the .pdf.

CPM output: APA_cpm

AQuA output: APA_AQUA

Conclusion

In this recipe, we:

Built the entire possible interaction space between linear H3K27ac peaks for each condition
Constrained the interaction space to contain TSSs
Further constrained the interaction space to contain CTCF sites
Refined the coordinates of the interaction space to correspond to the maximum contact values
Merged the interaction spaces across conditions
Visualized the aggregate of the merged space in CPM and AQuA normalizations

Using CPM normalization, we observe lost interactions between peaks (centred ‘+’ blue) amongst a crackle of surrounding gains. However, when using AQuA, we observe a global overall gain in interactions. This pattern is consistent with findings from Gryder et al. 2020, who showed that AQuA normalization using an orthogonal spike-in can uncover global changes in chromatin interactions that CPM may miss. Specifically, AQuA reveals broad increases in H3K27ac-mediated looping after HDAC inhibition, which points to a shift in enhancer–promoter architecture beyond individual loops.

Why does that matter? Many biological processes depend not just on a handful of punctate loops, but on an overall loosening or tightening of the chromatin landscape. In other words, if you only focus on individual hotspots, you risk missing the forest for the trees: AQuA helps reveal when the entire forest is greening.

We designed the AQuA-tools suite to make it easy to take full advantage of the AQuA normalization method. AQuA-tools that use sample data automatically check the .mergeStats.txt file to determine whether spike-in reads are present. If spike-in data is found, AQuA normalization is applied by default. If not, the tools default to CPM normalization. Our hope is to simplify the process of analyzing your own samples, whether they include spike-in or not.

Once you’ve completed this recipe, we recommend exploring the next recipe designed for Tinker and Docker environments. There, you’ll be introduced to extract_bedpe, a powerful AQuA tool that uses inherent normalization to call loops directly from HiChIP contact matrices, with no peak file required. These next steps offer a novel and flexible approach to discovering chromatin interactions and are perfect for diving deeper into what AQuA-tools can do.

Gryder, B. E., Khan, J., & Stanton, B. Z. (2020). Measurement of differential chromatin interactions with absolute quantification of architecture (AQuA-HiChIP). Nature protocols, 15(3), 1209–1236. https://doi.org/10.1038/s41596-019-0285-9