Data processing step-by-step

Demultiplexing

MUSIC library configuration

Demultiplexing is the step to extract the following important informations from the sequencing reads and output insertions with barcode, UMI information as the read name.

Name Length Total combinations Start location
Cell Barcode # 3 14-17 bp 96 Read2 bp 1
Cell Barcode # 2 14bp 96 Read2 bp 21 - 24
Cell Barcode # 1 14bp 96 Read2 bp 42 - 45
RNA/DNA Linkers 20bp / 24bp NA Read2 bp 63 - 66
Molecular Barcode (10x barcode) 16bp 3.5 million Read1 bp 1
UMI 12bp NA Read1 bp 13

The cell barcode is designed such that the levenshtein distance between any two of the cell barcode is larger than 2. There is 7bp linker sequence between cell barcodes, and between cell barcodes and adaptor sequence. For cell barcode or adaptor sequences, we allow at most 2 mismatches to the reference sequence. For 10X barcode (Chemistry: Single Cell 3’ v1), we only allow perfect match.

Demultiplexing program

We used a customized python script to finish this step.

Type Description
Tool 1____combo_barcode_resolve.py
Input Pairend fastq files in gzip format. R1 is 28bp and R2 is 150bp.
Output Two single end fastq files. One for DNA and the other is RNA. Each sequence is the DNA/RNA insertion sequence, with the read header contains raw read name| BC3_BC2_BC1 - 10x barcode # UMI.

In the Snakemake pipeline, the command line is:

// Snakemake
python3 script/1____combo_barcode_resolve.py \
-R1 {input.all_read1} -R2 {input.all_read2} \
-lib {params.lib_name} \
-outRNA {output.outRNA_fq} \
-outDNA {output.outDNA_fq} \
-log {output.decode_log}

Sample Input:

# Input Read1 sequences (28bp):
@MN00185:343:000H532W5:1:11102:15153:1026 1:N:0:TCGTCTGA
NTCGTGAGTCACTTCCTGAGCTTAATGG
+
#AFFFFFFFFFFFFFFFFFFFFFFFFFF
@MN00185:343:000H532W5:1:11102:15334:1027 1:N:0:TCGTCTGA
NCTATCGGTGTTAAAGGGTTCGGAGACT
+
#AFFFFFFFFFFFFFFFFFFFFFFFFFF


# Input Read2 sequences (150bp)
@MN00185:343:000H532W5:1:11102:15153:1026 2:N:0:TCGTCTGA
CNNTCCTANCTNNANGTNCTCGTACAATCNAANGCNANGTGCTCANAGNGNNTCCAANACGAGGANTACNNACAACNCACANTGNGNAGTCTNTANANNGTGCNNATNACTAANAAANNAAAAAANANANANAAAANNNNANNNANNAAN
+
A##FFFFF#FF##F#FF#FFFFFFFFFFF#FF#FF#F#FFFFFFF#/F#F##FFFFF#FFFFFFF#F/F##FFFFF#F/F=#FA#/#AFFAF#//#=##//FF##FF#///FA#=FF##AFF/FF#/#/#/#//FF####F###/##//#
@MN00185:343:000H532W5:1:11102:15334:1027 2:N:0:TCGTCTGA
CNNGTTAANACNNANATNTTGTTTTCCCCNGGNTANANAATTGCTNCANTNNTTTTANTAATTTGTATCNNTTCATNCCCANATNTNATTCCNCTNANNTTGTNNAGNTATATNAGGNNGTCTATNANTNTNATCTNNNNANNNANNACN
+
A##FFFFF#FF##F#FF#FFFFFFFFFFF#FF#FF#F#FFFFFFF#FF#F##FFFFF#FFFAFFFFFFF##FFFFF#/==F#AF#/#/FFF6#6F#/##/FFF##/F#/FFFF#/F=##/FFFAF#/#/#/#/FFA####/###/##//#

Sample output:

# DNA end decoded fastq
@MN00185:343:000H532W5:1:11102:15055:1047|BC3_75-BC2_51-BC1_11-CGGAACCAGGCATCGA-lib1#GGTCTGTACCTG
TCATAAAGAACTCTACCAACTTAAGAAGAAGAAAGCAAGCAACCCCATACAAAAATGGCGA
+
=F/FA=/F=/F=FF/FF///6/A=AAFFF/A/=//F///FF/FFFFF/FF///FFF/F///
@MN00185:343:000H532W5:1:11102:12653:1047|BC3_3-BC2_23-BC1_31-GAACTGTTCCAAGCAT-lib1#ACCATGCCATCA
TCAAATTCCACAAAAAGAGTGTTTCAAATCTGCTCTGTGACTAGACACTGTGCGTATCTATAAA
+
FFFFFFFFF6FAFFFFFFFFFFFFF=FFFFFFF=FFFFFFFFF//F/FFFFFFFAAFA=F/F6/


# RNA end decoded fastq
@MN00185:343:000H532W5:1:11102:14470:1060|BC3_68-BC2_89-BC1_46-TGACGCGGTACACGTT-lib1#GTAAGGGTGAAG
AATGGCAGAATGTTTTGGTCAAAATTTTTTAGCCAGCACCCCAATTAAAAAAAAAAAAAAAAAA
+
FAFFFF==AFFFFFAFAFFAAF6=AFFFFFFA=F=/FAF=FFF=A=FFFF=FFFFF=/FF=//6
@MN00185:343:000H532W5:1:11102:23931:1064|BC3_47-BC2_8-BC1_20-CAATTTCCAACACGAG-lib1#ATGGCATAAAGT
AAGCTTTAGGAACAAAGAGCATGGATAGAGTTACGATAGAGTAGGTTAGCATAAAAACTACTGCC
+
F//=F=/AF//FFFFFFFF/FF/=F=A=A/=6F//F/FF//FA=///F/AAFFFF=F=FFFFF=F

Sample stats_log:

// Some code
2023-01-02 02:09:58,776 INFO     ---- Start ----
2023-01-02 02:10:02,958 INFO     ---- read in 10X BC ----
2023-01-02 02:10:09,158 INFO     ---- finish constructing barcode ----
2023-01-02 02:13:39,927 INFO     total lines in file /input/data/lib5_S5_R1_001.fastq.gz : 4016146
2023-01-02 02:13:39,929 INFO     incorrect 10x barcode (not perfect match) :134796
2023-01-02 02:13:39,929 INFO     reads incomplete CB :1153844
2023-01-02 02:13:39,929 INFO     Final DNA reads :2227173
2023-01-02 02:13:39,929 INFO     Final RNA reads :116670
2023-01-02 02:13:39,929 INFO     ---- finished ----

Reads cleaning

Based on our experimental design, it’s possible that we introduced some artificial sequences into the library even if the reads could have perfect matched three cell barcodes and 10x barcodes. We will remove these artifacts using cutadapt

Reads cleaning program

In Snakemake pipeline we used the following command to remove known artifacts from the demultiplexed reads:

// For DNA reads
cutadapt -a 'A{{20}}' -a 'G{{20}}' -m 20 -j {threads} -o {output.trimed_DNA} {input.decode_DNA}

// For RNA reads
cutadapt -a CGAGGAGCGCTT -a 'A{{20}}N{{20}}' -q 15 -m 20 -j {threads} -o {output.trimed_RNA} {input.decode_RNA}

Mapping DNA and RNA ends

After raw reads demultiplexing and reads cleaning, we will map DNA ends and RNA ends to reference genome individually. We chose bowtie2 for DNA ends genome mapping and BWA to map RNA ends to genome in a splice aware manner.

Prior to mapping, it is necessary to generate the indexes for both bowtie2 and BWA. The reference fasta file and gtf file are used as input for index construction. In order to ensure software compatibility and avoid potential issues, we have opted to build the indexes within the snakemake pipeline, rather than relying on pre-built indexes provided by users. Although this approach may require additional time for index generation, it helps to streamline the workflow and minimize compatibility concerns.

Building index program

Type Description
Input
  1. reference genome fasta file.
  2. reference genome gtf file.
Output
  1. bowtie2 index
  2. BWA index

The command to build index in Snakemake pipeline:

// bowtie2
bowtie2-build --threads {threads} {input.fasta_file} {params.basename}

// BWA
bwa index {input.fasta_file} -p {params.basename_bwa}

Reads mapping program

Type Description
Input Cleaned DNA/RNA reads in fastq format.
Output Sorted bam file for DNA and RNA respectively.

The command in Snakemake pipeline:

# DNA mapping and sorting using bowtie2
bowtie2 -p {threads} -t --phred33 -x {params.index_prefix} -U {input.DNA_fq} 2> {output.human_bowtie2_stats}| samtools view -bq 20 -F 4 -F 256 - | samtools sort -@ 10 -m 64G -O bam -o {output.DNA_human_bam}

# RNA reads mapping and bam file sorting
bwa mem -t {threads} -SP5M {params.index_prefix} {input.non_rRNA_fq} | sambamba view -S -t {threads} -h -f bam -F "mapping_quality >= 1 and not (unmapped or secondary_alignment) and not ([XA] != null or [SA] != null)" /dev/stdin | sambamba sort -o {output.BWA_human_bam} /dev/stdin

samtools index {output.BWA_human_bam}

The reads mapping, bam file sorting and indexing are using multi threads to accelerate. User only need to specify the total cpu that is available and all details will be taken care of by Snakemake.

Reads deduplication

After reads mapping, the next step is to remove PCR duplicates. PCR duplicates are often recognized by having the same UMI (unique molecular identifier). Deduplication is often done after reads mapping as PCR duplicates are likely to map to the same location on the genome, at the same time, the sequencing errors in the reads are often taken care of by any alignment software.

Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read is a hot topic and has been extensively discussed by many, including umitools. In our case, we wrote our own customized tool to solve this problem (given our UMI sequence is recorded in the read name that is very different from standard input).

The idea is that we sort the bam file based on the mapping coordinates and we scan the bam file once. For any read that 1) maps to the location within 8bp from the previous record, 2) sharing the same cell barcode and molecular barcode, 3) and its UMI is less than 2bp levenshitein distance away from the previous read’s UMI will be marked as a duplicate. All PCR duplicates are removed in the downstream analysis.

Dedup program

In Snakemake pipeline, the dedup step is as follows:

// Dedup step
python3 script/2____Remove_bam_duplicates.py -inbam {input.sorted_bam} -outbam {output.dedupped_bam}

We will dedup both DNA and RNA bam files.

Sample Input (only showing the first 20 rows):

A00953:623:HHY3CDSX3:3:1150:16785:30874|BC3_60-BC2_10-BC1_31-CCGATCTCACATTCGA-lib1#TAGGGGGCTTTG	0	chr1	31681	22	63M	*	0	0	TCACTCCAGGCTGGGCGACAGAGAGAGACCCTGTCTCAGAAAAAAAAAAAAAAGTAATTTGTA	FFFFFF:FF:F,F,FFFFFFFF,FFFFFF,FF:F:FF:FFFF,FFFF:FFFFF:,:,::,F,F	AS:i:-8	XS:i:-31	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:0G55C6	YT:Z:UU
A00953:623:HHY3CDSX3:3:2659:30490:37043|BC3_72-BC2_4-BC1_24-TCATGTTAGGACATCG-lib1#ATGATCTAGTCA	16	chr1	55802	24	63M	*	0	0	AGCAGGACTTTGTCGTGTTCACCTCTATCACATCATACATATAGCAAACAGTAAAACTATTTA	F,:FF::FFFFFF:FFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFF:FFFFFFF:FF,F	AS:i:-12	XN:i:0	XM:i:3	XO:i:0	XG:i:0	NM:i:3	MD:Z:37A23G0C0	YT:Z:UU
A00953:623:HHY3CDSX3:3:1111:16459:1626|BC3_42-BC2_65-BC1_32-ATTACTCAGCGCTGAA-lib1#TTTTGGGATTAC	0	chr1	56275	30	64M	*	0	0	ATTCTCCCTCATTCATTTCAATACGCTGTTCGGCCTGCTACCCCAGTTTCCCACTTAGAACAAT	,:FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF,FFFF	AS:i:0	XS:i:-5	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:64	YT:Z:UU
A00953:623:HHY3CDSX3:3:1349:7717:8625|BC3_42-BC2_65-BC1_32-ATTACTCAGCGCTGAA-lib1#TTTTGGGATTAC	0	chr1	56275	30	64M	*	0	0	ATTCTCCCTCATTCATTTCAATACGCTGTTCGGCCTGCTACCCCAGTTTCCCACTTAGAACAAT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:,FFFF	AS:i:0	XS:i:-5	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:64	YT:Z:UU
A00953:623:HHY3CDSX3:3:1520:12599:10128|BC3_42-BC2_65-BC1_32-ATTACTCAGCGCTGAA-lib1#TTTTGGGATTAC	0	chr1	56275	30	64M	*	0	0	ATTCTCCCTCATTCATTTCAATACGCTGTTCGGCCTGCTACCCCAGTTTCCCACTTAGAACAAT	,,FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFF	AS:i:0	XS:i:-5	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:64	YT:Z:UU
A00953:623:HHY3CDSX3:3:2561:10854:9330|BC3_42-BC2_65-BC1_32-ATTACTCAGCGCTGAA-lib1#TTTTGGGATTAC	0	chr1	56275	30	64M	*	0	0	ATTCTCCCTCATTCATTTCAATACGCTGTTCGGCCTGCTACCCCAGTTTCCCACTTAGAACAAT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF:FFFFFFFF	AS:i:0	XS:i:-5	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:64	YT:Z:UU
A00953:623:HHY3CDSX3:3:1410:31693:14544|BC3_52-BC2_92-BC1_41-ATTCCATGTCGATTTG-lib1#AAGATCAACCGT	0	chr1	61473	40	63M	*	0	0	TCAATTTGTTTACCATTATTACTCTTGGTATTTTTAAGAAAAGTCTTTCAATTGTTATTATAA	FFFFFFFFFFFF,FFF:FFFFFFFF:FFFFFFFFFFFFF:FFFFFF:FF:FFFFFFFFFFFFF	AS:i:-9	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:0G48C13	YT:Z:UU
A00953:623:HHY3CDSX3:3:2101:15212:13902|BC3_52-BC2_92-BC1_41-ATTCCATGTCGATTTG-lib1#AAGATCAACCGT	0	chr1	61473	24	63M	*	0	0	TCAATTTGTTTACCATTATTACTCTTGGTATTTTTAAGAAAAGTCTGTCCATTGTTCTTATAA	FFFF,FFF,F:F:FF:F:FFFFF,FFFFF:F,FF:FFF,::FFFF:,::,:,,:F,:,FF:FF	AS:i:-12	XN:i:0	XM:i:3	XO:i:0	XG:i:0	NM:i:3	MD:Z:0G45T9A6	YT:Z:UU
A00953:623:HHY3CDSX3:3:1507:31159:31344|BC3_37-BC2_69-BC1_1-TCGATTTAGAGGATCC-lib1#GTACAAATTGGC	0	chr1	62417	30	64M	*	0	0	TCACGTGTTGGTTTGGGGTAGATCATTATAGGCACATGTAGGAAACAGCTTTCAGAGATGCCTT	FFFFFFF:FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFF::FFF:FFFF:F:	AS:i:0	XS:i:-5	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:64	YT:Z:UU
A00953:623:HHY3CDSX3:3:2233:19307:4586|BC3_37-BC2_69-BC1_1-TCGATTTAGAGGATCC-lib1#GTACAAATTGGC	0	chr1	62417	30	64M	*	0	0	TCACGTGTTGGTTTGGGGTAGATCATTATAGGCACATGTAGGAAACAGCTTTCAGAGATGCCTT	FFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFF:FF	AS:i:0	XS:i:-5	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:64	YT:Z:UU
A00953:623:HHY3CDSX3:3:2410:24876:20776|BC3_37-BC2_69-BC1_1-TCGATTTAGAGGATCC-lib1#GTACAAATTGGC	0	chr1	62417	30	64M	*	0	0	TCACGTGTTGGTTTGGGGTAGATCATTATAGGCACATGTAGGAAACAGCTTTCAGAGATGCCTT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF:FFF	AS:i:0	XS:i:-5	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:64	YT:Z:UU
A00953:623:HHY3CDSX3:3:2526:13476:6637|BC3_37-BC2_69-BC1_1-TCGATTTAGAGGATCC-lib1#GTACAAATTGGC	0	chr1	62417	30	64M	*	0	0	TCACGTGTTGGTTTGGGGTAGATCATTATAGGCACATGTAGGAAACAGCTTTCAGAGATGCCTT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F,FFFFFFFFFF:FFFFFFFFFFF:FFFFFFFF	AS:i:0	XS:i:-5	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:64	YT:Z:UU
A00953:623:HHY3CDSX3:3:2560:15483:1564|BC3_37-BC2_69-BC1_1-TCGATTTAGAGGATCC-lib1#GTACAAATTGGC	0	chr1	62417	30	64M	*	0	0	TCACGTGTTGGTTTGGGGTAGATCATTATAGGCACATGTAGGAAACAGCTTTCAGAGATGCCTT	FFFFF,F,FFFF:,FF:F:FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:F,FFFF	AS:i:0	XS:i:-5	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:64	YT:Z:UU
A00953:623:HHY3CDSX3:3:2408:20202:14340|BC3_4-BC2_37-BC1_28-TGTTTGTCAATAGTGA-lib1#AGACTCTAGCTT	16	chr1	64627	40	65M	*	0	0	TGCACATGTACCCTAGAACTTAAAGTATAATAAAAAAAAATAGACTCTAGGACTCTGTATTATGA	,FFFF,F,FFFF:FF:FFFFF:,FFFF,FF::FF:FFFFFFFFFFFFF,F,FFFF::FFFFF,FF	AS:i:-8	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:50T13C0	YT:Z:UU
A00953:623:HHY3CDSX3:3:2545:21377:2754|BC3_4-BC2_37-BC1_28-TGTTTGTCAATAGTGA-lib1#AGACTCTAGCTT	16	chr1	64627	40	65M	*	0	0	TGCACATGTACCCTCGAACTTAAAGTATAATAAAAAAAAATAGACTCTAGGACTCTGTATTATGA	,FFFF,F:,::FF,,,:FFF:,:F:FF::FFF:FFFFFFF:FFFF:FFF,,FFFF,FFF:FF::F	AS:i:-11	XN:i:0	XM:i:3	XO:i:0	XG:i:0	NM:i:3	MD:Z:14A35T13C0	YT:Z:UU
A00953:623:HHY3CDSX3:3:1342:2555:31281|BC3_56-BC2_78-BC1_13-AGGTAGGGTCACTTCC-lib1#TAAAGCAACTTT	16	chr1	65955	30	63M	*	0	0	ATTTCTTTCACATAAAGGTACAAATCATACTGCTAGAGTTGTGAGGATTTTTACAGCTTTTGA	FFFFFF:FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF,FFFF	AS:i:0	XS:i:-5	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:63	YT:Z:UU
A00953:623:HHY3CDSX3:3:1365:15872:24972|BC3_56-BC2_78-BC1_13-AGGTAGGGTCACTTCC-lib1#TAAAGCAACTTT	16	chr1	65955	30	63M	*	0	0	ATTTCTTTCACATAAAGGTACAAATCATACTGCTAGAGTTGTGAGGATTTTTACAGCTTTTGA	FFF:FFFF,F:FF,FFF:::FFFFFFFFFFFFFFFFFFFF::FFFF:FFFFFFFFFFFFFFFF	AS:i:0	XS:i:-5	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:63	YT:Z:UU
A00953:623:HHY3CDSX3:3:1375:25536:31469|BC3_56-BC2_78-BC1_13-AGGTAGGGTCACTTCC-lib1#TAAAGCAACTTT	16	chr1	65955	30	63M	*	0	0	ATTTCTTTCACATAAAGGTACAAATCATACTGCTAGAGTTGTGAGGATTTTTACAGCTTTTGA	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	AS:i:0	XS:i:-5	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:63	YT:Z:UU
A00953:623:HHY3CDSX3:3:1411:10059:26115|BC3_56-BC2_78-BC1_13-AGGTAGGGTCACTTCC-lib1#TAAAGCAACTTT	16	chr1	65955	30	63M	*	0	0	ATTTCTTTCACATAAAGGTACAAATCATACTGCTAGAGTTGTGAGGATTTTTACAGCTTTTGA	FFFFF,,FFFFF:,,FFFFFFF:FFFFFFFFFFFFFFF:FFFFFFFFFFFFF:FFFFFFFFFF	AS:i:0	XS:i:-5	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:63	YT:Z:UU
A00953:623:HHY3CDSX3:3:1423:7148:17879|BC3_56-BC2_78-BC1_13-AGGTAGGGTCACTTCC-lib1#TAAAGCAACTTT	16	chr1	65955	30	63M	*	0	0	ATTTCTTTCACATAAAGGTACAAATCATACTGCTAGAGTTGTGAGGATTTTTACAGCTTTTGA	FFF,FFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF	AS:i:0	XS:i:-5	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:63	YT:Z:UU

Sample output:

A00953:623:HHY3CDSX3:3:1150:16785:30874|BC3_60-BC2_10-BC1_31-CCGATCTCACATTCGA-lib1#TAGGGGGCTTTG	0	chr1	31681	22	63M	*	0	0	TCACTCCAGGCTGGGCGACAGAGAGAGACCCTGTCTCAGAAAAAAAAAAAAAAGTAATTTGTA	FFFFFF:FF:F,F,FFFFFFFF,FFFFFF,FF:F:FF:FFFF,FFFF:FFFFF:,:,::,F,F	AS:i:-8	XS:i:-31	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:0G55C6	YT:Z:UU
A00953:623:HHY3CDSX3:3:2659:30490:37043|BC3_72-BC2_4-BC1_24-TCATGTTAGGACATCG-lib1#ATGATCTAGTCA	16	chr1	55802	24	63M	*	0	0	AGCAGGACTTTGTCGTGTTCACCTCTATCACATCATACATATAGCAAACAGTAAAACTATTTA	F,:FF::FFFFFF:FFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFF:FFFFFFF:FF,F	AS:i:-12	XN:i:0	XM:i:3	XO:i:0	XG:i:0	NM:i:3	MD:Z:37A23G0C0	YT:Z:UU
A00953:623:HHY3CDSX3:3:1111:16459:1626|BC3_42-BC2_65-BC1_32-ATTACTCAGCGCTGAA-lib1#TTTTGGGATTAC	0	chr1	56275	30	64M	*	0	0	ATTCTCCCTCATTCATTTCAATACGCTGTTCGGCCTGCTACCCCAGTTTCCCACTTAGAACAAT	,:FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF,FFFF	AS:i:0	XS:i:-5	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:64	YT:Z:UU
A00953:623:HHY3CDSX3:3:1410:31693:14544|BC3_52-BC2_92-BC1_41-ATTCCATGTCGATTTG-lib1#AAGATCAACCGT	0	chr1	61473	40	63M	*	0	0	TCAATTTGTTTACCATTATTACTCTTGGTATTTTTAAGAAAAGTCTTTCAATTGTTATTATAA	FFFFFFFFFFFF,FFF:FFFFFFFF:FFFFFFFFFFFFF:FFFFFF:FF:FFFFFFFFFFFFF	AS:i:-9	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:0G48C13	YT:Z:UU
A00953:623:HHY3CDSX3:3:1507:31159:31344|BC3_37-BC2_69-BC1_1-TCGATTTAGAGGATCC-lib1#GTACAAATTGGC	0	chr1	62417	30	64M	*	0	0	TCACGTGTTGGTTTGGGGTAGATCATTATAGGCACATGTAGGAAACAGCTTTCAGAGATGCCTT	FFFFFFF:FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFF::FFF:FFFF:F:	AS:i:0	XS:i:-5	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:64	YT:Z:UU
A00953:623:HHY3CDSX3:3:2408:20202:14340|BC3_4-BC2_37-BC1_28-TGTTTGTCAATAGTGA-lib1#AGACTCTAGCTT	16	chr1	64627	40	65M	*	0	0	TGCACATGTACCCTAGAACTTAAAGTATAATAAAAAAAAATAGACTCTAGGACTCTGTATTATGA	,FFFF,F,FFFF:FF:FFFFF:,FFFF,FF::FF:FFFFFFFFFFFFF,F,FFFF::FFFFF,FF	AS:i:-8	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:50T13C0	YT:Z:UU
A00953:623:HHY3CDSX3:3:1342:2555:31281|BC3_56-BC2_78-BC1_13-AGGTAGGGTCACTTCC-lib1#TAAAGCAACTTT	16	chr1	65955	30	63M	*	0	0	ATTTCTTTCACATAAAGGTACAAATCATACTGCTAGAGTTGTGAGGATTTTTACAGCTTTTGA	FFFFFF:FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF,FFFF	AS:i:0	XS:i:-5	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:63	YT:Z:UU

Final output and summary report

Merge bam files from individual library

Remember that the final library is splitted into 8 aliquots (8 libs) before sequencing to further increasing the combinations of molecular barocdes. As the PCR step is separately applied to each aliquots, our dedup pipeline is also applied individually.

After these steps, we merged all 8 libs into one as our final output. The final out hence is in bam format that records the cell barcode, molecular barcode, insert mapping location for DNA and RNA individually (two separate files). Notice that the bam file is also sorted, so you can see reads from different libs can be seen in the final output file.

>samtools view merge_DNA_human.sort.bam|head -n 10
A00953:623:HHY3CDSX3:3:1442:21585:7059|BC3_39-BC2_35-BC1_20-ACAACCATCCGGGACT-lib8#TCAAAAGTGCAC	0	chr1	17455	23	64M	*	0	0	TCACTGTGGGGTCCCAGGCCTCCCGGGCCGAGCCACCCGTCACCCCCTGGCTCCTGGCGGATGT	:FF:FF:,:,F:FFFFFFFFFF,F:,F:F,:FFF,FFFFFFFFF:FF,,FF:F:FF:F:FFF:,	AS:i:-16	XN:i:0	XM:i:4	XO:i:0	XG:i:0	NM:i:4	MD:Z:0G24A32C0T4	YT:Z:UU
A00953:623:HHY3CDSX3:3:2642:19705:24314|BC3_38-BC2_39-BC1_50-AAACGAACATACAGGG-lib2#AAGATAAATACG	16	chr1	31533	40	64M	*	0	0	TATTGTGAGATTCCATCTCTACAAAAATAAAATTAAATAGCCAGTCATGGTGTCACACACCTGA	FFF,F,FFFF,FFFFFFF:FFFF:FFF,,FF:F,F,FFFF,:FF:FFFFFFFFFFFFFFF:,FF	AS:i:-8	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:33A29T0	YT:Z:UU
A00953:623:HHY3CDSX3:3:1150:16785:30874|BC3_60-BC2_10-BC1_31-CCGATCTCACATTCGA-lib1#TAGGGGGCTTTG	0	chr1	31681	22	63M	*	0	0	TCACTCCAGGCTGGGCGACAGAGAGAGACCCTGTCTCAGAAAAAAAAAAAAAAGTAATTTGTA	FFFFFF:FF:F,F,FFFFFFFF,FFFFFF,FF:F:FF:FFFF,FFFF:FFFFF:,:,::,F,F	AS:i:-8	XS:i:-31	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:0G55C6	YT:Z:UU
A00953:623:HHY3CDSX3:3:2504:26106:4993|BC3_2-BC2_25-BC1_2-AGGTTGTTCTGGACCG-lib2#AAGTCATACTAG	16	chr1	33708	22	65M	*	0	0	GATGTTCTCTGCCCCACGTCCAAGCGTTCTCATTGTTCAATTCCCACCTGTGAGTGAGAACATGA	FFFFFF:FF:FFF::FFF::F,FF:FFFFFFFFFFFFFFF,FFF:FFFFFFFFFFFF:FFF:FFF	AS:i:-10	XS:i:-37	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:16T47C0	YT:Z:UU
A00953:623:HHY3CDSX3:3:1572:6524:34585|BC3_69-BC2_36-BC1_48-AGAACCTCATTGCTGA-lib5#GATACTTGATTG	16	chr1	33710	22	63M	*	0	0	TGTTCTCTGCCACATGTCCAAGCGTTCTCATTGTTCAATTCCCACCTGTGAGTGAGAACATGA	FFFFFF:FFFF,F:FFFF:FFFFF:FFFFF,FFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFF	AS:i:-8	XS:i:-32	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:11C50C0	YT:Z:UU
A00953:623:HHY3CDSX3:3:1147:4933:22842|BC3_89-BC2_87-BC1_40-AACAGGGAGTTACGTC-lib7#TAAGGTTTTTTG	0	chr1	35357	40	62M	*	0	0	TCCCAAGGTCGGCCAGGTGCAGTGGCTCATGTCTATAATCCCAACACTTTGGGAGGCTAAGG	F:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF	AS:i:-10	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:0G57G3	YT:Z:UU
A00953:623:HHY3CDSX3:3:2639:25201:8030|BC3_85-BC2_25-BC1_59-ACTTCCGCAACACACT-lib4#TTGCACCAGACC	0	chr1	46640	40	62M	*	0	0	TCCTTGAAAGTCTTAACAATTTTTTTAACCAAAGTCCTCACAAAGTCAGTTTACATTAGCCC	,:FFF:F,F,,,F,F,FF:FFFFF,F:FFF,FFFFF:FFF::FF,F:FFFFFFF,FFF:FF:	AS:i:-9	XN:i:0	XM:i:3	XO:i:0	XG:i:0	NM:i:3	MD:Z:0G8T34T17	YT:Z:UU
A00953:623:HHY3CDSX3:3:2515:30273:18756|BC3_69-BC2_90-BC1_25-GAGCCTGAGCACACAG-lib3#AATCTCCACAGC	16	chr1	46642	24	63M	*	0	0	CTTGAAATTATTAAAAATTTTTTTAACCAAAGTCCTCACAAATTAAGTTTACATTAGCCCTGA	FF:,:::F:,,FFF,,F:FFF,FFFF,,FF::F,F:F:,,,,FF,F,FFF:F,FFFF:,,F,F	AS:i:-14	XN:i:0	XM:i:4	XO:i:0	XG:i:0	NM:i:4	MD:Z:9C4C29C17C0	YT:Z:UU
A00953:623:HHY3CDSX3:3:2426:14470:19758|BC3_28-BC2_2-BC1_55-GATCAGTTCACTCACC-lib2#ATCCTTCATGTG	0	chr1	49050	22	64M	*	0	0	TCATGAGAATGTTCATTGCAGCACTCCTCACAATAGCAGAGACATGGAATCAACTTAAATGCCC	FFF,FFFFFFFFFFF:FFFFFFFFF,F,FFF:FF,:FFFFFFFFFFFFFFFFFFF,FF:FFFFF	AS:i:-8	XS:i:-33	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:0G24A38	YT:Z:UU
A00953:623:HHY3CDSX3:3:2133:13548:8046|BC3_51-BC2_11-BC1_32-AGGACTTAGTCTTCCC-lib3#TAATTGCCAATA	16	chr1	51853	23	12M1D51M	*	0	0	AGACTCCGCCTCAAAAAAAAAAAAGAAGATTGATCCGAGAGTACCTCCCCTAAGGGTACATGA	F:F:FFFF,:FFFF::FFFF:FFFFF:F,:FF:FF,:F:::,FF:FF,FF,,FFFFFFFFFFF	AS:i:-16	XN:i:0	XM:i:2	XO:i:1	XG:i:1	NM:i:3	MD:Z:12^A23A26C0	YT:Z:UU

The bam output is compressed and can be easily imported and processed by widely used packages in R or python

Summary report

MUSIC is designed as a single-cell single molecule method. It is highly useful for users to get an idea of the per cell reads, per molecular cluster reads quantity and distribution. To this end, we derived the following summary files:

File names Description
merge_DNA(RNA)_human.sort.cell_clusters.csv Number of DNA(RNA) clusters each cell have. Two columns record cell barcode and number of cluster.
merge_DNA(RNA)_human.sort.cell_reads.csv Number of DNA(RNA) reads each cell have. Two columns record cell barcode and number of cluster.
merge_DNA(RNA)_human.sort.clusters_size.csv Number of DNA(RNA) reads in each molecular complex. Three columns: 1) cell barcode, 2) molecular barcode, 3) number of DNA reads in this barcode complex.
merge_human.sort.cluster_rna_dna.csv Number of DNA and RNA reads in each molecular complex. Five columns: 1) Cell barcode, 2) molecular barcode, 3) number of DNA reads in this molecular complex, 4) number of RNA reads in this molecular complex. 5) Total reads (DNA+RNA).