Creating meaningful subsets
This is a workflow on how to use AQuA-tools in a Docker or Tinker environment, and is designed to help you create meaningful subsets after calling loops.
We will use the following AQuA tools:
Motivation
In the previous recipe we called interactions directly from the contact matrix using extract_bedpe
. This approach is great for finding loops, but figuring out which ones are biologically meaningful will require filtering. In this recipe, we introduce several AQuA-tools that help create meaningful subsets from genome-wide loop calls.
In this recipe, we’ll use two independent cell-lines:
sample1: H3K27ac HiChIP performed on the cell-line GM12878
sample2: H3K27ac HiChIP performed on the cell-line K562
Chapter 0: Ingredients
I. Prerequisites
You must complete the AQuA-tools Docker installation before starting, which can be found here. Alternatively, if you have access to Tinker, AQuA-tools are already pre-installed and no installation is needed.
II. Creating a workspace
This is an optional step. We encourage users to organize their analyses into a directory structure that keeps things nice and clean.
working_directory=/home/ubuntu/container_outputsmkdir -p $working_directorycd $working_directorymkdir -p \ data \ scripts \ results \ scratch
It’s nice to have all the files you want to play with in $working_directory/data
For now, we’ll follow the steps of this tutorial by housing ourselves in $working_directory/scratch
, but we recommend writing .sh
scripts $working_directory/scripts
and storing the results in $working_directory/results
III. Gathering necessary files
For downstream analysis, we will need the following files:
- TAD
.bed
file - TSS
.bed
file - CTCF
.bed
file - Genome-wide loop
.bedpe
files
We’ll assign the 3 .bed
file paths to variable names for easier calling later:
cd $working_directory/scratch
# Initialise files in variablesTAD=/home/ubuntu/lab-data/hg38/reference/TAD_goldsorted_span_centromeres-removed_hg38.bed
TSS=/home/ubuntu/lab-data/hg38/reference/GENCODE_TSSs_hg38.bed
CTCF=/home/ubuntu/lab-data/hg38/reference/ENCODE3_cCRE-CTCF_hg38.bed
For the genome-wide loops .bedpe
files, you must either run the last chapter in the previous recipe or have access to the pre-computed genome-wide loop files on Docker at /home/ubuntu/genome-wide_loops
. If you are using the pre-computed loop files on Docker, we’ll rename them slightly for clarity in this recipe. If you computed the files yourself in the previous recipe, no re-naming is necessary.
# To use pre-computed genome-wide bedpe files on Docker
# navigate to scratch directorycd $working_directory/scratch
# GM12878# Make a copy in our scratch directorycp $HOME/genome-wide-loops/GM12878_H3K27ac_gw-loops_hg38.bedpe .
# Rename filemv GM12878_H3K27ac_gw-loops_hg38.bedpe GM12878_genome-wide-loops.bedpe
# K562# Make a copy in our scratch directorycp $HOME/genome-wide-loops/K562_H3K27ac_gw-loops_hg38.bedpe .
# Rename filemv K562_H3K27ac_gw-loops_hg38.bedpe K562_genome-wide-loops.bedpe
IV. Defining samples
We’ll need a sample to get started. In Tinker or Docker environments, you’ll use the name of the sample as it appears on the Tinker box or Docker container.
# on Tinker or Dockersample1=GM12878_H3K27acsample2=K562_H3K27ac
NOTE: You can list all loaded samples on your tinkerbox or Docker container using list_samples
Chapter 1: Constraining interaction spaces
I. Intersecting .bedpe
with a TSS file
We can reduce the number of extracted loops to those that have at least one TSS on either end of a .bedpe
row using intersect_bedpe
. We’ll run intersect_bedpe
two times, once for each sample. For more information about how intersect_bedpe
works, see our Docs.
# intersections for GM12878intersect_bedpe \ --bed_A $TSS \ --bedpe GM12878_genome-wide-loops.bedpe \ > GM12878_gw-loops_tss-int.bedpe
# intersections for K562intersect_bedpe \ --bed_A $TSS \ --bedpe K562_genome-wide-loops.bedpe \ > K562_gw-loops_tss-int.bedpe
II. Intersecting .bedpe with a CTCF file
We can further reduce the number of extracted loops to those that have at least one CTCF site on either end of a .bedpe row using intersect_bedpe
. We’ll run intersect_bedpe
another two times, once for each sample.
# intersections for GM12878intersect_bedpe \ --bed_A $CTCF \ --bedpe GM12878_gw-loops_tss-int.bedpe \ > GM12878_gw-loops_tss-ctcf-int.bedpe
# intersections for K562intersect_bedpe \ --bed_A $CTCF \ --bedpe K562_gw-loops_tss-int.bedpe \ > K562_gw-loops_tss-ctcf-int.bedpe
Chapter 2: Merging interactions
I. Merge the called genome-wide loops into one file
Now that we’ve narrowed our loops to those with at least one anchor overlapping a TSS or CTCF site, we will use union_bedpe
to merge the spaces into a single .bedpe
file. Here, we take the union of overlapping interaction regions from the two samples and create a unified set of candidate loops for comparison. For more information about how union_bedpe
works, see our Docs.
union_bedpe \ --bedpe GM12878_gw-loops_tss-ctcf-int.bedpe \ --bedpe K562_gw-loops_tss-ctcf-int.bedpe > union_scaffold.bedpe
Chapter 3: Annotating interactions
I. Annotate union scaffold bedpe with contact values
Next, we will run a two-sample query_bedpe
analysis to annotate each loop in the union scaffold with contact values and delta contact values. We’ll use --inherent TRUE
to apply inherent normalization, which will standardize the contact values and allow us to compare interaction strengths between samples. To know more about inherent scores, check out our Docs.
In addition to retrieving contact strengths, query_bedpe
can also refine our loop coordinates. By setting --formula max
and --fix false
, we update each loop’s coordinates to the bin pair with the highest contact value within the original region, allowing us to focus on the most relevant signal. For more information about how query_bedpe works, see our Docs.
query_bedpe \ --bedpe union_scaffold.bedpe \ --sample1 $sample1 \ --sample2 $sample2 \ --genome hg38 \ --formula max \ --fix false \ --inherent TRUE > union_scaffold_inh-annotated.bedpe
II. Threshold with standard inherent score 1
Since we normalized our contact scores with inherent normalization, we can easily select a threshold to filter our pairs. Most normalized loop scores fall between 0 and 1, with higher values indicating stronger-than-expected contact. When we subtract one sample’s normalized score from another to calculate delta, the resulting values are centered around zero. Here, we use a threshold of >1 to define loops with strong gains in contact, and <−1 to define loops with strong losses.
awk '$NF>= 1{print $0}' union_scaffold_inh-annotated.bedpe > loop-gains.bedpeawk '$NF<=-1{print $0}' union_scaffold_inh-annotated.bedpe > loop-losses.bedpe
Chapter 4: Visualizing interactions
V. Visualise gains and losses with plot_APA
We’ll use plot_APA
to visualize the aggregated contact signal for our gains and losses. To increase the detail in our visualization, we can set the --resolution
of the plot_APA
output from the 5 kb default to 1 kb bins. Additionally, since the number of rows differs between the two files, we’ll use --loop_norm TRUE
to standardize the contact scores by how many loops are in the file. For more information about how plot_APA
works, see our Docs.
plot_APA \ --bedpe loop-gains.bedpe \ --sample1 $sample1 \ --sample2 $sample2 \ --genome hg38 \ --resolution 1000 \ --loop_norm TRUE \ --out-dir ~/container_outputs/gains
plot_APA \ --bedpe loop-losses.bedpe \ --sample1 $sample1 \ --sample2 $sample2 \ --genome hg38 \ --resolution 1000 \ --loop_norm TRUE \ --out-dir ~/container_outputs/losses
Here’s our output!
Gains:
Losses:
Conclusion
By viewing the delta APA plots for loop gains and losses, we can visually confirm that our filtered subsets represent cell-type-specific contact enrichment. The magenta signal in the gains set indicates increased contact frequency in K562 relative to GM12878, while the blue signal in the losses set reflects loops with stronger contact in GM12878. These contrasting patterns suggest that the differential loops we identified are not random, but instead reflect meaningful biological differences in chromatin organization between the two cell types.
We created these loop subsets with a streamlined AQuA-tools workflow. We started with genome-wide loops called with extract_bedpe
and used intersect_bedpe
to filter for loops anchored near biologically relevant features such as TSSs and CTCF sites. Then, query_bedpe
refined our coordinates and annotated our loops with inherent normalized contact scores directly from the contact matrices. This allowed us to categorize loops into gains and losses based on a consistent threshold. From there, union_bedpe
merged loops from both cell types into a single scaffold for comparison. Finally, plot_APA
allowed us to visualize loop behavior between samples.
Our goal in this workflow was to highlight the power and flexibility of AQuA-tools for HiChIP analysis and demonstrate how this modular set of tools can guide discovery in complex 3D genomic data. With these types of workflows, we hope to support researchers in asking their own questions of the data without needing to build custom pipelines from the ground up.
In the next recipe, we’ll continue our exploration by looking into how to organize overlapping interactions into meaningful groups using cluster_bedpe
. We’ll introduce annotate_cluster
, which adds biological context to each cluster by incorporating peak files and gene annotations.