Diploid Genomics Group

Scientific overview

Human individuals are diploid by nature. Therefore, the independent determination of both haplotype sequences of an individual genome is essential to link genetic variation to gene and genome function, phenotype and disease. To address the importance of phase, we have established approaches, resources and methods to generate multiple haplotype-resolved genomes (>2.5 terabytes of data) and have obtained first key results characterizing the diploid landscape. Our work includes the following components:

  • Development of molecular genetics and bioinformatics approaches/methods to haplotype-resolve whole genomes and their application to data production
  • Analysis and annotation of haplotype-resolved human genomes at the individual and population level
  • Establishment of public resources to enable integration of phase information at the gene and genome level

In summary, this work will advance our understanding of the inherently diploid biology of genes and genomes, of genotype-phenotype relationships, and prepare the ground for ‘phase-sensitive’ personal genomics and individualized medicine. Our work has been communicated to the public in a number of press releases and articles including MPG2014 Ger, En; MPIMG2012; MPG2011 Ger, En; MPF2011; MPR2012; MPF2014; GenomXpress; Nature Methods; GenomeWeb; ScienceDaily; ZME Science.

A fosmid pool-based next-generation sequencing (NGS) approach to haplotype-resolve whole genomes

As the basis for this approach, we have established a globally unique “Haploid Reference Resource” (HRR) of 100 fosmid libraries from 100 individuals of a representative German population cohort (PopGen). Briefly, individual genomic DNA was sheared into fragments of ~40 kb. These were ligated into the pEpiFos vector and amplified in E. coli to generate ~1.5 million haploid fosmid clones (equivalent to ~7x coverage of each haploid genome). The clones were distributed into three 96-well plates to generate ‘haploid clone pools’ of ~5000 fosmids per well (Burgtorf et al., Genome Res, 2003). To increase throughput, sets of three fosmid pools were combined into super-pools of ~15,000 fosmids each. Importantly, the probability that complementary haplotypes co-occur within such a super-pool is P < 0.0112. See module (1) in Fig. 1 for an overview of the method.
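The pooling arithmetic above can be checked with a short sketch. The plate, well and pool sizes and the ~7x coverage figure are taken from the text. The Poisson model for clone coverage and the even split of coverage across super-pools are assumptions made here for illustration. The result is a rough independent estimate, not the derivation behind the quoted bound.

```python
import math

# Figures taken from the text
PLATES = 3
WELLS_PER_PLATE = 96
FOSMIDS_PER_POOL = 5_000
POOLS_PER_SUPERPOOL = 3
HAPLOID_COVERAGE = 7.0  # ~7x clone coverage of each haploid genome

pools = PLATES * WELLS_PER_PLATE                               # 288 wells
total_fosmids = pools * FOSMIDS_PER_POOL                       # 1,440,000 (~1.5 million)
superpools = pools // POOLS_PER_SUPERPOOL                      # 96
fosmids_per_superpool = FOSMIDS_PER_POOL * POOLS_PER_SUPERPOOL # 15,000


def cooccurrence_probability(haploid_coverage, n_superpools):
    """Rough chance that both haplotypes cover a given locus in one super-pool.

    Assumes (hypothetically) that clone coverage is Poisson distributed,
    spread evenly over the super-pools, and independent between haplotypes.
    """
    depth_per_haplotype = haploid_coverage / n_superpools
    p_covered = 1.0 - math.exp(-depth_per_haplotype)
    return p_covered ** 2


p_both = cooccurrence_probability(HAPLOID_COVERAGE, superpools)
print(f"{pools} pools, {superpools} super-pools, {total_fosmids:,} fosmids")
print(f"Estimated co-occurrence probability per locus: {p_both:.4f}")
```

Under these assumptions the estimate stays below the bound of 0.0112 given above, which is why reads from a single super-pool can largely be treated as haploid at any one locus.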

The super-pools were barcoded and multiplex sequenced on a SOLiD platform (2). Then, fosmids were detected from read coverage separately per pool (3), and SNPs were called. Subsequently, the fosmid sequences were combined into one virtual pool (4) to identify phase-informative heterozygous positions. Finally, fosmid sequences were separated and tiled into contiguous haplotype sequences by allelic identity at multiple heterozygous positions within the regions of overlap (5). To this end, we applied our heuristic phasing algorithm RefHap (Duitama et al., 2010). The phased haploid contigs were anchored onto their homologous autosomes to generate fully phased sets of 22 homologous chromosomes (6). For each of these modules we have developed fosmid-based wet lab protocols and/or bioinformatics algorithms to establish an independent, integrated NGS data production and analysis pipeline. For a detailed description and protocols see Suk et al., Genome Res, 2011 and Supplementary Information.
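The tiling step (5) can be illustrated with a deliberately simplified greedy sketch. This is not the RefHap algorithm. The function name, the data layout (one dictionary per fosmid, mapping a heterozygous position to allele 0 or 1) and the majority-vote rule are all assumptions chosen for illustration.

```python
def tile_fragments(fragments):
    """Greedy toy phasing of haploid fragments into two haplotypes.

    fragments: list of dicts {heterozygous_position: allele}, allele in {0, 1},
    ordered so that each fragment overlaps previously placed ones.
    Returns (haplotype_a, labels); haplotype B is the complement of A.
    """
    haplotype_a = {}
    labels = []
    for frag in fragments:
        # Compare the fragment with haplotype A at positions already phased
        shared = [pos for pos in frag if pos in haplotype_a]
        agree = sum(1 for pos in shared if haplotype_a[pos] == frag[pos])
        disagree = len(shared) - agree
        # A fragment with no overlap starts on A; its phase is then arbitrary
        on_a = agree >= disagree
        labels.append("A" if on_a else "B")
        # Extend haplotype A with newly seen positions
        for pos, allele in frag.items():
            haplotype_a.setdefault(pos, allele if on_a else 1 - allele)
    return haplotype_a, labels


# Four hypothetical fosmids covering heterozygous positions 1 to 4
fragments = [
    {1: 0, 2: 1},
    {2: 0, 3: 0},
    {3: 1, 4: 1},
    {1: 1, 2: 0},
]
haplotype_a, labels = tile_fragments(fragments)
print(labels)       # ['A', 'B', 'A', 'B']
print(haplotype_a)  # {1: 0, 2: 1, 3: 1, 4: 1}
```

A production method has to handle sequencing errors, conflicting fragments and breaks between phased blocks, none of which this sketch attempts.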

Data production, quality, and validation: To date, we have produced 17 molecularly haplotype-resolved genomes, an unprecedented number. Fourteen of these are included in three publications. The completeness of phasing has been demonstrated using the example of “Max Planck One” (MP1), the first haplotype-resolved genome, with > 99% of all heterozygous SNPs phased into haploid sequences of up to 6.3 Mb contig length, and over 50% of the genome in contigs > 1 Mb. The determination of contiguous haplotype sequences in the Mb range is crucial to translate individual genomic variation into the functionally active proteome. Accordingly, we have characterized over 700 haploid landscapes > 1 Mb (for an example see Fig. 2; Suk et al., Genome Res, 2011). In particular, we have resolved extended haplotypes in the MHC, a region of key clinical relevance. Moreover, the majority of genes were phased together with up to 5.8 Mb of upstream and downstream sequence (Suk et al., Genome Res, 2011).

The power and accuracy of our method were demonstrated by haplotype-resolving HapMap trio child NA12878, for whom trio sequencing data had been released by the 1000 Genomes Project. Where phasing data were available from both approaches, they were 100% identical. Fosmid-based phasing, however, resolved ~20% more of the heterozygous SNPs (> 98% in total), setting a new gold standard. In particular, phase information was generated for almost all potentially disease-relevant SNPs (Duitama et al., 2012).
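A comparison of this kind can be sketched as follows. The function name and the data layout (position mapped to a pair of alleles on haplotype 1 and haplotype 2) are hypothetical, and the sketch assumes that all positions lie within one phased block. Because the labels "haplotype 1" and "haplotype 2" are arbitrary in each call set, the comparison is made in both orientations.

```python
def phase_concordance(calls_a, calls_b):
    """Fraction of shared heterozygous SNPs with identical phase.

    calls_a, calls_b: dicts {position: (allele_on_hap1, allele_on_hap2)},
    restricted to heterozygous SNPs within a single phased block.
    Returns None if the two call sets share no positions.
    """
    shared = [pos for pos in calls_a if pos in calls_b]
    if not shared:
        return None
    same = sum(1 for pos in shared if calls_a[pos] == calls_b[pos])
    # Same phase, but with the two haplotype labels swapped in one call set
    swapped = sum(1 for pos in shared if calls_a[pos] == calls_b[pos][::-1])
    return max(same, swapped) / len(shared)


# Hypothetical call sets: identical phase, opposite haplotype labels
fosmid_calls = {100: (0, 1), 200: (1, 0), 300: (0, 1)}
trio_calls = {100: (1, 0), 200: (0, 1), 300: (1, 0), 400: (0, 1)}
print(phase_concordance(fosmid_calls, trio_calls))  # 1.0
```

Position 400 is phased in only one call set and is therefore ignored, which mirrors the restriction above to sites where phasing data were available from both approaches.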

Analysis and annotation of haplotype-resolved human genomes at the individual and population level

Individual level: To dissect the diplotype of each individual genome separately, we assessed (i) the molecular diplotypes encoding 17,861 autosomal genes at the sequence and protein level; (ii) the cis and trans configurations of protein-altering mutations (for an example, see Fig. 3); and (iii) the potential impact of these mutations on gene function, disease and treatment response, depending on their phase.

Key results: The fractions of diplotypes within each of the individual genomes were substantial and similar. About 85% of genes (primary transcripts), 80% of 10 kb upstream sequences, and 95% of the genes together with their upstream sequences, contain at least one heterozygous SNP and so have two different molecular forms. Approximately 90% of these diplotypes have two or more SNPs, which could exist in cis or trans and therefore require phasing. We determined the concrete pairs of molecular haplotypes in up to 95% of cases. Within each individual genome, between 16 and 22% of all autosomal genes were found to encode two different proteins. Their mutations existed significantly more frequently in cis than in trans (average ratio 60:40) in each individual genome. Over 55% of the genes with mutations in cis or trans were potentially clinically relevant in an individual and were comprehensively evaluated with respect to the individual’s disease risk and treatment measures.  
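To make the cis/trans distinction concrete, the following sketch classifies the mutations of one gene from phased genotypes. The VCF-style notation ("0|1" meaning reference allele on haplotype 1 and mutant allele on haplotype 2), the function name and the return labels are assumptions made for illustration. Homozygous sites are skipped as a simplification.

```python
def classify_configuration(phased_genotypes):
    """Classify heterozygous mutations in one gene as cis or trans.

    phased_genotypes: list of strings such as "0|1" or "1|0",
    where 1 marks the mutant allele and the bar separates the haplotypes.
    """
    on_hap1 = on_hap2 = 0
    for genotype in phased_genotypes:
        allele1, allele2 = genotype.split("|")
        if allele1 == allele2:
            continue  # homozygous site: carries no phase information
        if allele1 == "1":
            on_hap1 += 1
        else:
            on_hap2 += 1
    if on_hap1 + on_hap2 < 2:
        return "not applicable"  # fewer than two heterozygous mutations
    if on_hap1 and on_hap2:
        return "trans"  # both gene copies carry at least one mutation
    return "cis"        # all mutations on one copy; the other is unaffected


print(classify_configuration(["0|1", "0|1"]))  # cis
print(classify_configuration(["0|1", "1|0"]))  # trans
print(classify_configuration(["0|1"]))         # not applicable
```

The same unphased genotypes (two heterozygous calls) underlie both of the first two examples, which is why phase information is needed to tell whether one intact gene copy remains.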

Population level: We have performed a first population level analysis of haplotype-resolved genomes (Hoehe et al., Nature Communications, 2014 and Supplementary Information). We used a set of 14 molecularly haplotype-resolved genomes, complemented and expanded by up to 372 statistically resolved genomes from the 1000 Genomes Project. The analysis of multiple haplotype-resolved genomes allows us to address the following key questions: (i) What is the entirety/diversity of haploid and diploid gene forms that constitute the ‘diploid hardware’ of cellular and organismal functions and their variation in population samples of defined size? (ii) Is there a common set of genes that preferentially encode two different proteins? (iii) What is the distribution of cis versus trans configurations at the gene and whole genome level?

Key results: We found immense diversity of both haploid and diploid gene forms, up to 4.1 and 3.9 million, respectively, corresponding to 249 and 235 forms per gene on average, with > 85% of the genes lacking a predominant form. We identified a ‘common diplotypic proteome’, a distinctive subset of 4,269 genes encoding two different proteins in over 30% of the genomes. Mutations predicted to alter protein function existed, in each of the 386 genomes, significantly more frequently in cis than in trans, at an average cis/trans ratio of 60:40 (see Fig. 3). Global cis-abundance of mutations could be expected to preserve organismal function. Moreover, distinguishable classes of cis- versus trans-abundant genes were observed. With this work, we have identified key features characterizing the diploid nature of human genomes and provided a conceptual and analytical framework, rich resources and novel hypotheses on the functional importance of diploidy.

Taken together, the analysis of molecular haplotypes and diplotypes at the genome, transcriptome, proteome and functional genomics level is essential to understand the inherently diploid biology of genes and genomes and prepare the ground for personal genomics and precision medicine. Our present and future work is directed towards this goal.

Public Resources: The 'Max Planck Haplome Resource'

The following data and materials are available to the scientific community and are currently being expanded to establish a ‘Max Planck Haplome Resource’. This resource will provide useful haplotype information for all aspects of genome biology and functional genomics, disease gene discovery, individualized medicine and pharmacogenomics.

 

Haplotype resolved genomes

‘Max Planck One’ (MP1)

NGS data and variant files can be downloaded from:

http://www.molgen.mpg.de/~genetic-variation/MaxPlanckOneData

Accession code: European Nucleotide Archive (ENA) accession no. ERP000494 

Chromosomal haplotypes are available in a UCSC session:

http://www.molgen.mpg.de/~genetic-variation/MaxPlanckOneUCSC

 

HapMap Trio Child NA12878

NGS data files can be downloaded from:

http://www.molgen.mpg.de/~genetic-variation/SIH/Data/

Accession code: European Nucleotide Archive (ENA) accession no. ERP000819

 

12 genomes (MP2 - MP13)

NGS data and variant files can be downloaded from:

http://www.molgen.mpg.de/~genetic-variation/NGS

Accession code: European Nucleotide Archive (ENA) accession no. PRJEB7549
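The ENA accessions listed above can also be queried programmatically. The sketch below only builds query URLs for the ENA Portal API file report and performs no download. The endpoint, the `read_run` result type and the field names reflect the public ENA Portal API as understood here and should be checked against the current ENA documentation. The dictionary keys and the function name are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint of the ENA Portal API file report service
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"

# Accession numbers as listed on this page
ACCESSIONS = {
    "MP1": "ERP000494",
    "NA12878": "ERP000819",
    "MP2-MP13": "PRJEB7549",
}


def filereport_url(accession, fields=("run_accession", "fastq_ftp", "submitted_ftp")):
    """Build a file report URL listing the sequencing runs of one study."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


for label, accession in ACCESSIONS.items():
    print(label, filereport_url(accession))
```

Each URL, opened in a browser or fetched with any HTTP client, is expected to return a tab-separated table of runs with their file locations.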

 

Diplotypic Gene Sets (detailed explanations in Hoehe et al., Nature Communications, 2014)

Category 1, 2, 3 genes can be downloaded from:
http://www.molgen.mpg.de/~genetic-variation/genes_categories

‘Common diplotypic proteome’ can be downloaded from:

http://www.molgen.mpg.de/~genetic-variation/common_diplotypic_proteome

Phase-alternate genes can be downloaded from:

http://www.molgen.mpg.de/~genetic-variation/phase_alternate_genes

Phase-alternate mutations can be downloaded from:

http://www.molgen.mpg.de/~genetic-variation/phase_alternate_mutations

 

Haploid landscapes in the megabase range (detailed explanations in Suk et al., Genome Res, 2011, Supplementary Information) can be downloaded from:

http://www.molgen.mpg.de/~genetic-variation/MaxPlanckOneLandscapes

 

Algorithms

Single Individual Haplotyping (SIH) algorithm RefHap can be downloaded from:

http://www.molgen.mpg.de/~genetic-variation/SIH/Data/algorithms
