Further Analysis and Visualization Guides

Here we give some instructions for further analysis of the valid RNA-DNA interaction map.

Simple Statistics Report

When you got the processed RNA-DNA interaction data in .pairs format, you definitely want to know some basic data statistics, such as number of intra- and inter-chromosomal interactions. We provide imargi_stats.sh to generate a simple text data statistics report. The example command is:

docker run --rm -t -u 1043 -v ~/imargi_example:/imargi zhonglab/imargi imargi_stats.sh \
    -D 5end \
    -d 200000 \
    -i ./output/final_HEK_iMARGI.pairs.gz \
    -o ./output/report_final_HEK_iMARGI.txt

The -d argument sets the genomic distance threshold of defining proximal and distal interactions (intra-chromosomal). It will report the number of proximal and distal interactions. Besides, you can set a list of values separated with ,, such as -d 1000,2000,10000,20000,200000,1000000, then the report will include the statistics number with different thresholds (space is not allowed). The distance is directly calculated based on the mapped 5’ end position (default is -D 5end). You can use -D outer or -D inner argument to set using the distance of farthest or nearest ends of mapped read pair, respectively. When use -D outer or -D inner, the tool will use the extra cigar information in the .pairs data.

The generated report includes these statistics:

  • Total number of RNA-DNA interactions
  • Number of inter-chromosomal interactions
  • Number of intra-chromosomal interactions
  • Number of distal and proximal intra-chromosomal interactions

Distance Filtering of Pairs

In the intra-chromosomal interactions, the most majority are short-range (proximal) interactions, which likely represent interactions between nascent transcripts and their neighboring genomic sequences. We usually filter out those proximal interactions with a distance threshold, which depends on the requirements of further analysis. We provide imargi_distfilter.sh tool for filtering interaction based on interaction genomic distance. The example command is:

docker run --rm -t -u 1043 -v ~/imargi_example:/imargi zhonglab/imargi imargi_distfilter.sh \
    -D 5end \
    -d 20000 \
    -i ./output/final_HEK_iMARGI.pairs.gz \
    -o ./output/filter20k_final_HEK_iMARGI.pairs.gz

The command will filter out those intra-chromosomal interactions with genomic distance less than 200,000 (-d 200000). The distance is directly calculated based on the mapped 5’ end position (default is -D 5end). You can use-D outer or -D inner argument to set using the distance of farthest or nearest ends of read pair, respectively. When use -D outer or -D inner, the tool will use the extra cigar information in the .pairs data.

Annotation of Genomic Features

Annotating RNA-ends and DNA-ends with genomic features are valuable for investigation the biological meanings of RNA-DNA interactions. Using imargi_annotate.sh, users can annotate both RNA and DNA ends with gene annotations in GTF/GFF format or any other genomic features in a simple BED file (each line is a named genomic feature). The example command below will generate two new gene annotation columns named as gene1 and gene2 in the output .pairs format file.

docker run --rm -t -u 1043 -v ~/imargi_example:/imargi zhonglab/imargi imargi_annotate.sh \
    -A gtf \
    -a ./ref/gencode.v24.annotation.gtf \
    -l gene \
    -f gene_id,gene_name,gene_type \
    -C both \
    -c gene1,gene2 \
    -s rn \
    -m 1,1 \
    -G cigar1,cigar2 \
    -i ./output/final_HEK_iMARGI.pairs.gz \
    -o ./output/annot_final_HEK_iMARGI.pairs.gz

There are several important arguments you can use to customize your annotation.

  • -A: Only accept -A gtf for annotating gene with GTF file or -A bed for annotating any other genomic features in a simple BED format file. Default is -A gtf. -a give the annotation information file.
  • -l: Annotation level for GTF file. GTF file has different level information, such as exon and gene body. The default is annotate based on gene body genomic coordinates, -l gene. If you want to strictly annotate, then use -l exon mode.
  • -C: Determine Which end to be annotated. It accepts -C RNA, -C DNA or -C both. Default is -C both, then both RNA and DNA ends will be annotated. The lowercase -c argument give the names of annotation column in the output file. If -C both, then you need set two column names separated with comma ,, i.e., -c gene1,gene2. Space is not allowed. Don’t use the 18 reserved column names to name the annotation columns.
  • -f: annotation feature attributes for GTF file (-A gtf). You can set it based on your GTF file content. Default is -f gene_id,gene_name,gene_type.
  • -s: Strand specific option. The RNA end of iMARGI is reverse-strand-specific, so we annotate its gene locus using its reverse strand.’s’ means strand specific, ‘r’ means reverse-strand-specific, and ‘n’ means none strand specific, ignoring it. So the default is -s rn.
  • -m: Minimum bases of overlapping for annotation. -m 1 means at least one base overlapping. -m 0 means it must be inside of the genomic feature. If annotate both ends with -C both, then you need two values for -m separated by comma ,, such as -m 1,1.
  • -G: cigar information used for annotation. If you only want to annotate based on default position reported in .pairs file, i.e., 5’ end genomic coordinate, then set -G false,false. Default is -G cigar1,cigar2.

Data Format Conversion

For further analysis and visualization, other formats instead of .pairs format might be needed. We provide imargi_convert.sh for converting formats. It can convert .pairs format to BEDPE, .cool and GIVE interaction format with different -f argument options. The example command is below:

docker run --rm -t -u 1043 -v ~/imargi_example:/imargi zhonglab/imargi imargi_convert.sh \
    -f bedpe \
    -i ./output/final_HEK_iMARGI.pairs.gz \
    -o ./output/final_HEK_iMARGI.bedpe.gz

The command will generate a new BEDPE format file from original .pairs format file. By default, it will drop all the extra information columns when convert to BEDPE format. You can use -k argument with column names separated by comma ,, such as -k cigar1,cigar2, to keep these columns in the output BEDPE file.

If you change -f bedpe to -f cool or -f give, it will generate .cool/.mcool format or GIVE interaction format file, respectively, which can be used by HiGlass and GIVE for visualization. With -f cool, it will output both single resolution .cool file and multi-resolution .mcool file. The .mcool file can be used by HiGlass.

Visualization

There are multiple ways for visualization of RNA-DNA interactions. Here we suggest two interactive visualization software, HiGlass and GIVE.

HiGlass

HiGlass supports an interactive heatmap visualization of .cool file. So you just need to use imargi_convert.sh to convert .pairs file to .cool/.mcool file. Please read the HiGlass documentation to know how to use it. Besides, there is an Jupyter Notebook version of HiGlass, jupyter-higlass.

Note: In the iMARGI .pairs file, coordinate of RNA is c1:p1 and coordinate of DNA is c2:p2. We can directly generate .mcool file for HiGlass using imargi_convert.sh script. When HiGlass rendering the heatmap view from the .mcool file, it uses a X-Y coordinates system, where X is c1:p1 and Y is c2:p2, so it will show a heatmap of DNA x RNA matrix, i.e., row is DNA and column is RNA (such as the figure below). Currently, if you want to transpose it, you have to generate a transposed .mcool file. Set -T true when you use imargi_convert.sh script. The HiGlass team will add “customizable transpose” function to its control panel in next update version, then you won’t need to care about this.

_images/higlass_view.pngHiGLass view (row is DNA and column is RNA)

GIVE

GIVE uses links between two traditional linear genome tracks to represents genomic interactions. We just need to use imargi_convert.sh to convert the .pairs format file to GIVE interaction format file. Please read the GIVE tutorial to learn how to use it.

_images/give_view.pngGIVE view