Skip to content

Creating meaningful subsets

This is a workflow on how to use AQuA-tools in a Docker or Tinker environment, and is designed to help you create meaningful subsets after calling loops.

We will use the following AQuA tools:

Motivation

In the previous recipe we called interactions directly from the contact matrix using extract_bedpe. This approach is great for finding loops, but figuring out which ones are biologically meaningful will require filtering. In this recipe, we introduce several AQuA-tools that help create meaningful subsets from genome-wide loop calls.

In this recipe, we’ll use two independent cell-lines:
sample1: H3K27ac HiChIP performed on the cell-line GM12878
sample2: H3K27ac HiChIP performed on the cell-line K562

Chapter 0: Ingredients

I. Prerequisites

You must complete the AQuA-tools Docker installation before starting, which can be found here. Alternatively, if you have access to Tinker, AQuA-tools are already pre-installed and no installation is needed.

II. Creating a workspace

This is an optional step. We encourage users to organize their analyses into a directory structure that keeps things nice and clean.

Terminal window
working_directory=/home/ubuntu/container_outputs
mkdir -p $working_directory
cd $working_directory
mkdir -p \
data \
scripts \
results \
scratch

It’s nice to have all the files you want to play with in $working_directory/data

For now, we’ll follow the steps of this tutorial by housing ourselves in $working_directory/scratch, but we recommend writing .sh scripts $working_directory/scripts and storing the results in $working_directory/results

III. Gathering necessary files

For downstream analysis, we will need the following files:

  • TAD .bed file
  • TSS .bed file
  • CTCF .bed file
  • Genome-wide loop .bedpe files

We’ll assign the 3 .bed file paths to variable names for easier calling later:

Terminal window
cd $working_directory/scratch
# Initialise files in variables
TAD=/home/ubuntu/lab-data/hg38/reference/TAD_goldsorted_span_centromeres-removed_hg38.bed
TSS=/home/ubuntu/lab-data/hg38/reference/GENCODE_TSSs_hg38.bed
CTCF=/home/ubuntu/lab-data/hg38/reference/ENCODE3_cCRE-CTCF_hg38.bed

For the genome-wide loops .bedpe files, you must either run the last chapter in the previous recipe or have access to the pre-computed genome-wide loop files on Docker at /home/ubuntu/genome-wide_loops. If you are using the pre-computed loop files on Docker, we’ll rename them slightly for clarity in this recipe. If you computed the files yourself in the previous recipe, no re-naming is necessary.

Terminal window
# To use pre-computed genome-wide bedpe files on Docker
# navigate to scratch directory
cd $working_directory/scratch
# GM12878
# Make a copy in our scratch directory
cp $HOME/genome-wide-loops/GM12878_H3K27ac_gw-loops_hg38.bedpe .
# Rename file
mv GM12878_H3K27ac_gw-loops_hg38.bedpe GM12878_genome-wide-loops.bedpe
# K562
# Make a copy in our scratch directory
cp $HOME/genome-wide-loops/K562_H3K27ac_gw-loops_hg38.bedpe .
# Rename file
mv K562_H3K27ac_gw-loops_hg38.bedpe K562_genome-wide-loops.bedpe

IV. Defining samples

We’ll need a sample to get started. In Tinker or Docker environments, you’ll use the name of the sample as it appears on the Tinker box or Docker container.

Terminal window
# on Tinker or Docker
sample1=GM12878_H3K27ac
sample2=K562_H3K27ac

NOTE: You can list all loaded samples on your tinkerbox or Docker container using list_samples

Chapter 1: Constraining interaction spaces

I. Intersecting .bedpe with a TSS file

We can reduce the number of extracted loops to those that have at least one TSS on either end of a .bedpe row using intersect_bedpe. We’ll run intersect_bedpe two times, once for each sample. For more information about how intersect_bedpe works, see our Docs.

Terminal window
# intersections for GM12878
intersect_bedpe \
--bed_A $TSS \
--bedpe GM12878_genome-wide-loops.bedpe \
> GM12878_gw-loops_tss-int.bedpe
# intersections for K562
intersect_bedpe \
--bed_A $TSS \
--bedpe K562_genome-wide-loops.bedpe \
> K562_gw-loops_tss-int.bedpe

II. Intersecting .bedpe with a CTCF file

We can further reduce the number of extracted loops to those that have at least one CTCF site on either end of a .bedpe row using intersect_bedpe. We’ll run intersect_bedpe another two times, once for each sample.

Terminal window
# intersections for GM12878
intersect_bedpe \
--bed_A $CTCF \
--bedpe GM12878_gw-loops_tss-int.bedpe \
> GM12878_gw-loops_tss-ctcf-int.bedpe
# intersections for K562
intersect_bedpe \
--bed_A $CTCF \
--bedpe K562_gw-loops_tss-int.bedpe \
> K562_gw-loops_tss-ctcf-int.bedpe

Chapter 2: Merging interactions

I. Merge the called genome-wide loops into one file

Now that we’ve narrowed our loops to those with at least one anchor overlapping a TSS or CTCF site, we will use union_bedpe to merge the spaces into a single .bedpe file. Here, we take the union of overlapping interaction regions from the two samples and create a unified set of candidate loops for comparison. For more information about how union_bedpe works, see our Docs.

Terminal window
union_bedpe \
--bedpe GM12878_gw-loops_tss-ctcf-int.bedpe \
--bedpe K562_gw-loops_tss-ctcf-int.bedpe > union_scaffold.bedpe

Chapter 3: Annotating interactions

I. Annotate union scaffold bedpe with contact values

Next, we will run a two-sample query_bedpe analysis to annotate each loop in the union scaffold with contact values and delta contact values. We’ll use --inherent TRUE to apply inherent normalization, which will standardize the contact values and allow us to compare interaction strengths between samples. To know more about inherent scores, check out our Docs.

In addition to retrieving contact strengths, query_bedpe can also refine our loop coordinates. By setting --formula max and --fix false, we update each loop’s coordinates to the bin pair with the highest contact value within the original region, allowing us to focus on the most relevant signal. For more information about how query_bedpe works, see our Docs.

Terminal window
query_bedpe \
--bedpe union_scaffold.bedpe \
--sample1 $sample1 \
--sample2 $sample2 \
--genome hg38 \
--formula max \
--fix false \
--inherent TRUE > union_scaffold_inh-annotated.bedpe

II. Threshold with standard inherent score 1

Since we normalized our contact scores with inherent normalization, we can easily select a threshold to filter our pairs. Most normalized loop scores fall between 0 and 1, with higher values indicating stronger-than-expected contact. When we subtract one sample’s normalized score from another to calculate delta, the resulting values are centered around zero. Here, we use a threshold of >1 to define loops with strong gains in contact, and <−1 to define loops with strong losses.

Terminal window
awk '$NF>= 1{print $0}' union_scaffold_inh-annotated.bedpe > loop-gains.bedpe
awk '$NF<=-1{print $0}' union_scaffold_inh-annotated.bedpe > loop-losses.bedpe

Chapter 4: Visualizing interactions

V. Visualise gains and losses with plot_APA

We’ll use plot_APA to visualize the aggregated contact signal for our gains and losses. To increase the detail in our visualization, we can set the --resolution of the plot_APA output from the 5 kb default to 1 kb bins. Additionally, since the number of rows differs between the two files, we’ll use --loop_norm TRUE to standardize the contact scores by how many loops are in the file. For more information about how plot_APA works, see our Docs.

Terminal window
plot_APA \
--bedpe loop-gains.bedpe \
--sample1 $sample1 \
--sample2 $sample2 \
--genome hg38 \
--resolution 1000 \
--loop_norm TRUE \
--out-dir ~/container_outputs/gains
plot_APA \
--bedpe loop-losses.bedpe \
--sample1 $sample1 \
--sample2 $sample2 \
--genome hg38 \
--resolution 1000 \
--loop_norm TRUE \
--out-dir ~/container_outputs/losses

Here’s our output!

Gains: plot1

Losses: plot2

Conclusion

By viewing the delta APA plots for loop gains and losses, we can visually confirm that our filtered subsets represent cell-type-specific contact enrichment. The magenta signal in the gains set indicates increased contact frequency in K562 relative to GM12878, while the blue signal in the losses set reflects loops with stronger contact in GM12878. These contrasting patterns suggest that the differential loops we identified are not random, but instead reflect meaningful biological differences in chromatin organization between the two cell types.

We created these loop subsets with a streamlined AQuA-tools workflow. We started with genome-wide loops called with extract_bedpe and used intersect_bedpe to filter for loops anchored near biologically relevant features such as TSSs and CTCF sites. Then, query_bedpe refined our coordinates and annotated our loops with inherent normalized contact scores directly from the contact matrices. This allowed us to categorize loops into gains and losses based on a consistent threshold. From there, union_bedpe merged loops from both cell types into a single scaffold for comparison. Finally, plot_APA allowed us to visualize loop behavior between samples.

Our goal in this workflow was to highlight the power and flexibility of AQuA-tools for HiChIP analysis and demonstrate how this modular set of tools can guide discovery in complex 3D genomic data. With these types of workflows, we hope to support researchers in asking their own questions of the data without needing to build custom pipelines from the ground up.

In the next recipe, we’ll continue our exploration by looking into how to organize overlapping interactions into meaningful groups using cluster_bedpe. We’ll introduce annotate_cluster, which adds biological context to each cluster by incorporating peak files and gene annotations.