Ranbow: a haplotype assembler for polyploid genomes

Ranbow is a haplotype assembler for polyploid genomes. It has been developed for the haplotype assembly of the hexaploid sweet potato genome, which is highly heterozygous. Ranbow can also be applied to other polyploid genomes. After a first phasing, Ranbow utilizes the assembled haplotypes to improve the accuracy of variant calling results and to infer the evolutionary history of the organism´s genome. Ranbow has three main modes of function:

ranbow hap: for haplotyping
ranbow eval: for evaluating of the assemble haplotypes by gold standard (long) reads
ranbow phylo: for the phylogenetic analysis

Availability

You can download the source code and the toy example files here.

Ranbow - Polyploid Haplotype Assembler

The video explains how you can simply run Ranbow on the toy example.

https://www.youtube.com/watch?v=2zZ5IfpuGZQ

Note that this video is made for previous version of Ranbow. However, the basic principle between both versions is the same. The only difference for users is that in previous version the haplotypes reconstructed from short reads and paired-end reads are reported separately, while in the new version we consider paired-end read as one read, and report the final haplotypes.

Dependencies

The code is implemented in python 2.7.13 and can be run from linux shell. To run Ranbow the following libraries and tools should be available:

python libraries:

pysam 0.10.0
numpy 1.12.1

tool:

samtools 0.1.19-44428cd.

Data preparation

We generated a random list of scaffolds from sweet potato genome in order to make a data input as a toy example. Then, the fasta, bam and vcf files were extracted. Data files for selected scaffolds are named as ‘sel.bam’, ’sel.vcf’, and ‘sel.fasta’. These files are generated by “toy_data.sh” which is available in toy folder as well.

It worth to mention that Ranbow has the feature of being applied on the whole or selected region of the scaffolds. This feature helps for better parallelization as well as phasing targeted interval in the genome.

Ranbow hap

Ranbow hap has three modes of sub-function, for simplicity we refer to them as different modes, namely index, hap, and collect. The mode of choice is declared by -mode parameter in command line.

python2_7_13 ranbow.py hap -mode index python2_7_13 ranbow.py hap -mode hap python2_7_13 ranbow.py hap -mode collect python2_7_13 ranbow.py hap -mode modVCF

The “index” mode generates the index files in order to have random access to the chromosomes and scaffolds in case of for parallelization. In “hap” mode, Ranbow construct haplotypes from mapped reads. The “collect” mode collects the haplotypes assembled with different processors and generates the final fasta files, bam files ( aligned haplotypes) and also hap formated file . The bam files are already sorted and indexed and are ready to be loaded in IGV browser for downstream analysis. The hap formatted files include haplotypes information such as name, start position, haplotype sequence, number of error correction, haplotype quality, and supporting fragments. These fields are explained in more details as follows.

Haplotype name: The name includes scaffold/chromosome name, block index and file open index

Haplotype sequence: sequence of alleles in which the alleles are coded to numbers in [0, ploidy) interval. For sweet potato they are coded with “0” to “5”.

Number of error correction: Each haplotype has a number of supporting fragments. This field represents the number of mismatch among these fragments.

Quality of haplotype: The supporting fragment coverage of each position in the haplotype. Length of quality equals to the length of haplotype.

Supporting fragments: A full list of supporting fragments.

To run Ranbow, the parameters in the following table need to be adjusted. Some of these parameters have to be passed from the command line (mentioned in following table, third column), while the others could also be passed via parameter file as well. The compulsory parameters are as follows:

Parameter	Compulsory on modes	Command line and/or parameter file	Description
`-mode`	Index, hap, collect, and modVCF	C	This parameter indicates the mode function
`-par`	Index, hap, collect, and modVCF	C	Most of the parameter can be adjusted in this file and be passed by this parameter
`-ploidy`	hap, modVCF	CF	The ploidy of the organism
`-noProcessor`	hap	CF	Number of available processors
`-bamFile`	Index, hap	CF	Sorted bam file of reads
`-refFile`	Index, hap	CF	Fasta file
`-vcfFile`	Index, hap, and modVCF	CF	Vcf file
`-selectedScf`	hap	CF	The haplotyping can be run on a group of scaffolds. This file indicates list of scaffolds and/or regions that are going to be haplotyped.
`-outputFolderBase`	Index, hap, collect, modVCF	CF	The result of haplotyping will be collected in the following folder.
`-processorIndex`	hap	C	It is possible to run Ranbow on different cores in one machine and also in different machines. All scaffolds are distributed to different sets according to their sizes. This assignment is done to minimize the maximum running time of all individual processors. Each set is assigned to one processor. -processorIndex is an index referring to one processor and should be in the following range: [0, -noProcessor)

*C: command line, F: parameter file

Here is the content of parameter file used in this toy example:

-ploidy 6 -noProcessor 4 -bamFile path to folder~>RANBOW/toy/sel.bam -refFile path to folder~>RANBOW/toy/sel.fasta -vcfFile path to folder~>RANBOW/toy/sel.vcf -selectedScf path to folder~>RANBOW/toy/scaffolds.list -outputFolderBase path to folder~>/RANBOW/toy/result

For simplicity, data files are stored in one folder (named RANBOW/toy) and the all parameters are adjusted in hap.params file. The parameter file is passed to through the “-par” argument.

python2_7_13 ranbow.py hap -par RANBOW/toy/hap.params

So here is the list of files in the folder:

RANBOW/toy > ls -lh total 30M -rw-r----- 1 moeinzad moeinzad 426 Mar 31 10:33 hap.params -rw-r----- 1 moeinzad moeinzad 641 Mar 30 09:51 scaffolds.list -rw-r----- 1 moeinzad moeinzad 29M Mar 30 11:21 sel.bam -rw-r----- 1 moeinzad moeinzad 133K Mar 30 11:21 sel.fasta -rw-r----- 1 moeinzad moeinzad 1.1M Mar 30 11:21 sel.vcf -rw-r----- 1 moeinzad moeinzad 405 Mar 30 11:18 toy_data.sh

Indexing inputs - bam, fasta, and vcf files (mode: index)

It generates index files for bam, vcf, and fasta if not exist:

python2_7_13 ranbow.py hap -mode index -par RANBOW/toy/hap.params

Then the index files are generated and listed as follows:

RANBOW/toy > ls -lht total 30M drwxr-x--- 2 moeinzad moeinzad 10 Mar 31 10:58 result -rw-r----- 1 moeinzad moeinzad 844 Mar 31 10:58 sel.vcf.index -rw-r----- 1 moeinzad moeinzad 225K Mar 31 10:58 sel.bam.bai -rw-r----- 1 moeinzad moeinzad 1.2K Mar 31 10:58 sel.fasta.fai-rw-r----- 1 moeinzad moeinzad 426 Mar 31 10:33 hap.params -rw-r----- 1 moeinzad moeinzad 29M Mar 30 11:21 sel.bam -rw-r----- 1 moeinzad moeinzad 1.1M Mar 30 11:21 sel.vcf -rw-r----- 1 moeinzad moeinzad 133K Mar 30 11:21 sel.fasta -rw-r----- 1 moeinzad moeinzad 405 Mar 30 11:18 toy_data.sh -rw-r----- 1 moeinzad moeinzad 641 Mar 30 09:51 scaffolds.list

Running haplotyper (mode: hap)

The -noProcessor should be adjusted according to the number of available processors. For example if the code is running on a cluster with 200 cores, -noProcessor can be set to 200. Then, 200 independent jobs will be executed with different set of scaffolds. These 200 jobs are independent and can be run on different machines as well. For the sake of simplicity in the toy example, we run the code utilizing 3 cores.

The -processorIndex is a compulsory parameter in command line if the number of processors is more than 1 (-noProcessor > 1). Otherwise -noProcessor is set to 1 and -processorIndex is set to 0 by default , meaning that the output is generated with one processor and the result is going to be generated in a folder named 0. Here the standard output for the first processor (-processorIndex = 0 and -noProcessor = 3) is shown as follows.

python2_7_13 ranbow.py hap -mode hap -par hap.params -processorIndex 0

('scaffold6228|size14266', 1, 14266, 'scaffold6228|size14266')

Run ranbow - single mode

scaffold6228|size14266 readBAM 0:00:04.010655 Ranbow_single 0:00:02.986332 single2file 0:00:00.060936

('scaffold12138|size4302', 1, 4302, 'scaffold12138|size4302')

Run ranbow - single mode

scaffold12138|size4302 readBAM 0:00:02.606135 Ranbow_single 0:00:00.253924 single2file 0:00:00.031988

('scaffold15599|size3549', 1, 3549, 'scaffold15599|size3549')

Run ranbow - single mode

scaffold15599|size3549 readBAM 0:00:00.834673 Ranbow_single 0:00:00.043864 single2file 0:00:00.006803

('scaffold18910|size3076', 1, 3076, 'scaffold18910|size3076')

Run ranbow - single mode

scaffold18910|size3076 readBAM 0:00:02.195878 Ranbow_single 0:00:00.220671 single2file 0:00:00.027786

('scaffold22064|size2651', 1, 2651, 'scaffold22064|size2651')

Run ranbow - single mode

scaffold22064|size2651 readBAM 0:00:01.230129 Ranbow_single 0:00:00.155012 single2file 0:00:00.017257

('scaffold25118|size2247', 1, 2247, 'scaffold25118|size2247')

Run ranbow - single mode

scaffold25118|size2247 readBAM 0:00:03.558237 Ranbow_single 0:00:00.663766 single2file 0:00:00.047705

('scaffold27129|size1948', 1, 1948, 'scaffold27129|size1948')

Run ranbow - single mode

scaffold27129|size1948 readBAM 0:00:01.748883 Ranbow_single 0:00:00.245888 single2file 0:00:00.018874

('scaffold29117|size1608', 1, 1608, 'scaffold29117|size1608')

Run ranbow - single mode

scaffold29117|size1608 readBAM 0:00:00.878690 Ranbow_single 0:00:00.033398 single2file 0:00:00.006976

The running time for each part of haplotyping is reported in the standard output. The haplotyping result will be appear in the subfolder of -outputFolderBase named as -processorIndex. For example, since we generated haplotypes for -processorIndex 0 the only generated folder is called 0

RANBOW/toy/result > ls 0

This folder contains four files as follows:

RANBOW/toy/result/0 > ls -lh

total 964K

-rw-rw---- 1 moeinzad moeinzad 470526 Apr 8 15:35 ranbow.single.hap.sam

-rw-rw---- 1 moeinzad moeinzad 324221 Apr 8 15:35 ranbow.single.frag

-rw-rw---- 1 moeinzad moeinzad 234516 Apr 8 15:35 ranbow.single.hap.fasta

-rw-rw---- 1 moeinzad moeinzad 42188 Apr 8 15:35 ranbow.single.hap

The hap format shows the connectivity between alleles in segments, the quality of alleles in assembled haplotype (the number supporting reads for the allele), and in general all information regarding the assembled haplotype in coded-allele space. Coded-allele space means the shared sequence between alleles are ignored and the alleles are decoded to numbers. The fasta and the bam files for the aligned haplotypes are generated in this folder as well.

To run the code in parallel on one machine the following command can be executed:

for i in {0..3} do python2_7_13 ranbow.py hap -mode hap -par hap.params -processorIndex $i > $i.log & done

The standard outputs for each processor is collected in “0.log”, “1.log”, and “2.log” files.

It is also possible to run Ranbow partly in one machine and partly in the other. Following our toy example, set 0 (-processorIndex = 0) and set 1 are executed in machine A and set 2 is executed in machine B.

Machine A: for i in {0..1} do python2_7_13 ranbow.py hap -mode hap -par hap.params -processorIndex $i > $i.log & done

Machine B: for i in {2..2} do python2_7_13 ranbow.py hap -mode hap -par hap.params -processorIndex $i > $i.log & done

The three generated folder are listed as follows:

RANBOW/toy/result > ls -lh total 12K drwxr-x--- 2 moeinzad moeinzad 4.0K Mar 31 11:00 0 drwxr-x--- 2 moeinzad moeinzad 4.0K Mar 31 11:03 1 drwxr-x--- 2 moeinzad moeinzad 4.0K Mar 31 11:03 2

Collecting generated data from different processors (mode: collect)

When all jobs get finished, “-mode collect ” can be executed to collect the haplotypes from different machines:

python2_7_13 ranbow.py hap -mode collect -par hap.params number of folders: 3 [samopen] SAM header is present: 28 sequences. [samopen] SAM header is present: 28 sequences.

Then the generated files are as follows:

RANBOW/toy/result > ls -lht

total 1.2M

-rw-rw---- 1 moeinzad moeinzad 172K Apr 8 15:30 ranbow.single.hap.bam

-rw-rw---- 1 moeinzad moeinzad 2.3K Apr 8 15:30 ranbow.single.hap.bam.bai

-rw-rw---- 1 moeinzad moeinzad 330K Apr 8 15:30 ranbow.single.hap

-rw-rw---- 1 moeinzad moeinzad 874K Apr 8 15:30 ranbow.single.hap.fasta

-rw-rw---- 1 moeinzad moeinzad 2.2M Apr 8 15:30 ranbow.single.frag

-rw-rw---- 1 moeinzad moeinzad 47K Apr 8 15:30 ranbow.single.hap.igv.sam

-rw-rw---- 1 moeinzad moeinzad 5.9M Apr 8 15:30 ranbow.single.frag.igv.sam

-rw-rw---- 1 moeinzad moeinzad 3.4K Apr 8 15:30 ranbow.single.hap.igv.fa

-rw-rw---- 1 moeinzad moeinzad 36K Apr 8 15:30 ranbow.single.hap.igvtype.sam

For this toy example the generated single.hap.bam files are ready to be loaded for IGV viewer for further analysis.

Revising sequence variants - vcf file (mode: modVCF)

The following command modifies and corrects the variants with the aid of assembled haplotypes.

hap/RANBOW > mypy ranbow.py hap -mode modVCF -par params.txt modified VCF file: /hap/RANBOW/toy/sel.mod.vcf log of modified SNPs: /hap/RANBOW/toy/result/ranbow.modVCF.log Modified SNPs: 853 Deleted SNPs: 49

The detailed information of the modification can be found in:

/hap/RANBOW/toy/result/ranbow.modVCF.log

Ranbow eval

To evaluate the haplotyping accuracy, we recruit the Roche 454 trimmed reads mapped to the assembly. This file or other gold standard mapped read files has to be passed with -bamFileEval parameter. The Ranbow eval has also this option of being executed in different machine in parallel. For obtaining the evaluation result the following steps needs to be done.

python2_7_13 ranbow.py eval -par hap.params -mode index

mode run in parallel :

python2_7_13 ranbow.py eval -par hap.params -mode run -processorIndex i

mode collect:

python2_7_13 ranbow.py eval -par hap.params -mode collect

the parameters are:

Parameter	Compulsory on modes	Command line and/or parameter file	Description
`-mode`	Index, run, and collect	C	This parameter indicates the mode function
`-par`	Index, run, and collect	C	Most of the parameter can be adjusted in this file and be passed by this parameter
`-ploidy`	run	CF	The ploidy of the organism
`-noProcessor`	run	CF	Number of available processors
`-bamFileEval`	Index, hap	CF	Sorted bam file of gold standard reads
`-refFile`	Index, hap	CF	Fasta file
`-vcfFile`	Index, hap, and modVCF	CF	Vcf file
`-selectedScf`	hap	CF	The haplotyping can be run on a group of scaffolds. This file indicates list of scaffolds and/or regions that are going to be haplotyped.
`-outputFolderBase`	Index, hap, collect, modVCF	CF	The result of haplotyping will be collected in the following folder.
`-processorIndex`	hap	C	It is possible to run Ranbow on different cores in one machine and also in different machines. All scaffolds are distributed to different sets according to their sizes. This assignment is done to minimize the maximum running time of all individual processors. Each set is assigned to one processor. -processorIndex is an index referring to one processor and should be in the following range: [0, -noProcessor)

*C: command line, F: parameter file

After running the above commands the following files will be generated:

toy/result > ls -lt eval/ total 28 -rw-r----- 1 moeinzad moeinzad 4561 Apr 16 18:33 result.sing drwxr-x--- 2 moeinzad moeinzad 66 Apr 16 18:33 0 drwxr-x--- 2 moeinzad moeinzad 66 Apr 16 18:33 1 drwxr-x--- 2 moeinzad moeinzad 66 Apr 16 18:33 2 drwxr-x--- 2 moeinzad moeinzad 66 Apr 16 18:33 3

The result.sing file is the evaluation of haplotypes which are assembled from reads. The two columns indicate the “Match”, “Mismatch” of the overlaps between phased haplotypes and 454 reads. In order to investigate more the overlaps, one can take a look at the file named /*/ranbow.single.eval. The following lines are selected from /0/ranbow.single.eval file as an example. These lines show the similarity and dissimilarity of the overlaps between 454 and assembled haplotypes in one block. The number in parenthesis show the similarity and dissimilarity in the overlap. The arrow indicates which haplotype is assumed to be sampled from the same chromosomes that the 454 read is also sampled. For instance (8, 1) indicate 8 allele matches and 1 allele mismatches between 454 reads and the second assembled haplotype.

454 17 -------101010100------- illumina 10 00000000000000000000011 (5, 4) 10 1000000101010110001110- (8, 1) <--- 12 --0001000001010001101-- (7, 2) 12 --01-110100111010010--- (4, 5) 11 -1101100011101101010011 (6, 3) 11 -0010100110001110010--- (4, 5)

Ranbow phylo

To infer the evolutionary event of the organism with the aid of haplotypes, Ranbow phylo can be employed to extract and tune the alignments from all phased regions and report the topology of the evolutionary phylogenetic tree.

The Ranbow phylo can be executed in parallel mode as well. Therefore the files need to be indexed first.

python2_7_13 ranbow.py phylo -par hap.params -mode index

mode run in parallel :

python2_7_13 ranbow.py phylo -par hap.params -mode run -processorIndex i

mode collect:

python2_7_13 ranbow.py phylo -par hap.params -mode collect

then the collect mode is developed in order to report the final statistics for different phylogenetic tree topologies.

hap/RANBOW > mypy ranbow.py phylo -par params.txt -mode collect 7 ((((())))) 7 (((()()))) 29 (((())())) 13 (((()))()) 43 ((()())()) 26 ((())(()))

Moreover Ranbow phylo -mode tree calculates the mutation rates in the branch trees reports the relative branch lengths (by now only works for hexaploid).

hap/RANBOW > mypy ranbow.py phylo -par params.txt -mode tree Tree topology: (((A-B)I (C-D)J ) M (E-F) N) Branch Name Mean relative distance SD relative distance A: 0.674418604651 0.813288765118 B: 0.674418604651 0.813288765118 I: 0.639534883721 0.721961003548 C: 0.732558139535 1.05295176025 D: 0.732558139535 1.05295176025 J: 0.581395348837 0.604539345654 M: 0.886627906977 1.60414025402 E: 0.779069767442 0.891186337505 F: 0.779069767442 0.891186337505 N: 1.0523255814 1.2325032897

Mutation rate for A-B and C-D branches: 0.00701902094469 Mutation rate for E-F branch: 0.00848135858809 hap/RANBOW > hap/RANBOW >