To detect Copy Number Variants (CNVs) or Structural Variants (SVs), Comparative Genomic Hybridization (CGH) and Paired-End (PE) sequencing are the two major techniques in use to date. Building on our experience in microarray design and analysis (Bertone et al. 2004, 2006; Royce et al. 2007; Yu 2007), we previously demonstrated how to use high-resolution CGH based on tiling arrays for precise copy number mapping in mammalian cells (Urban et al. 2006). In addition, we described a non-parametric, mean-shift-based (MSB) approach that detects segments of changed copy number in array-CGH data by determining local modes of the signal; it has since also been applied to read-depth signal (Wang et al. 2008). Because CGH experiments yield only approximate CNV coordinates, we developed an approach called BreakPtr for fine-mapping CNVs, with a predictive resolution (~300 bp) that could enable more precise correlations between CNVs and across individuals than previously possible (Korbel et al. 2007 Jun).
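The idea of finding local modes of the signal, mentioned above, can be illustrated with a generic one-dimensional mean-shift sketch. This is not the published MSB implementation: the Gaussian kernel, the bandwidths `h_pos` and `h_val`, the `min_jump` threshold, and both function names are assumptions chosen for illustration only.

```python
import math

def mean_shift_modes(signal, h_pos=3.0, h_val=0.2, max_iter=100, tol=1e-6):
    """For each probe, return the signal level of the local mode reached by
    mean shift in the joint (position, value) space.
    h_pos and h_val are illustrative bandwidths, not published settings."""
    modes = []
    for i, start_value in enumerate(signal):
        x, y = float(i), float(start_value)
        for _ in range(max_iter):
            w_sum = x_sum = y_sum = 0.0
            for j, s in enumerate(signal):
                # Gaussian weight: probes that are close in position AND in value
                w = math.exp(-0.5 * ((j - x) / h_pos) ** 2
                             - 0.5 * ((s - y) / h_val) ** 2)
                w_sum += w
                x_sum += w * j
                y_sum += w * s
            x_new, y_new = x_sum / w_sum, y_sum / w_sum
            shift = abs(x_new - x) + abs(y_new - y)
            x, y = x_new, y_new
            if shift < tol:
                break
        modes.append(y)
    return modes

def segment_by_mode(modes, min_jump=0.2):
    """Group consecutive probes into (start, end, level) segments, opening a
    new segment whenever adjacent modes differ by at least min_jump."""
    segments = []
    start = 0
    for i in range(1, len(modes) + 1):
        if i == len(modes) or abs(modes[i] - modes[i - 1]) >= min_jump:
            level = sum(modes[start:i]) / (i - start)
            segments.append((start, i, level))
            start = i
    return segments

# Toy log-ratio signal: 10 probes near 0 followed by 10 probes near 1.
toy = [0.05 * ((i % 3) - 1) for i in range(10)] + \
      [1.0 + 0.05 * ((i % 3) - 1) for i in range(10)]
segments = segment_by_mode(mean_shift_modes(toy))
```

On the toy signal, probes on either side of the step converge to different modes, so the step is recovered as a segment boundary without assuming a parametric noise model. The same sketch could in principle be run on binned read-depth values instead of array log-ratios.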
Apart from CGH, we also developed Paired-End Mapping (PEM), a high-throughput method that combines massive sequencing with computational analysis to identify structural variants, down to approximately 3 kb in size, between genomes on a large scale (Korbel et al. 2007 Oct). To facilitate SV detection from massive paired-end sequencing, we further developed PEMer, a cross-platform computational framework for identifying structural variants from sequences generated on different sequencing platforms, such as 454 and Solexa. The analysis pipeline aims to map SVs at high resolution from paired-end sequences and has shown improved sensitivity and specificity over previous approaches (Korbel 2009). Recently, we have also built a simulation toolbox that will help optimize the combination of different technologies for comparative genome re-sequencing, especially for reconstructing large SVs (Du et al. 2009).
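The general principle behind paired-end SV detection can be sketched as follows: a read pair whose ends map much farther apart on the reference than the expected insert size suggests a deletion in the sample, and a pair that maps much closer together suggests an insertion. The code below is a minimal illustration of that principle only, not the PEMer pipeline. The expected insert size, standard deviation, cutoff, clustering gap, and function names are all hypothetical values and names chosen for the example.

```python
def classify_span(span, expected=3000, sd=300, cutoff_sd=3.0):
    """Label one mapped span relative to an assumed insert-size distribution.
    expected, sd, and cutoff_sd are illustrative, not published parameters."""
    if span > expected + cutoff_sd * sd:
        return "deletion"    # ends map too far apart on the reference
    if span < expected - cutoff_sd * sd:
        return "insertion"   # ends map too close together on the reference
    return "concordant"

def call_candidates(pairs, min_support=2, max_gap=1000, **span_kwargs):
    """pairs: list of (left_pos, right_pos) mapping coordinates on the reference.
    Cluster discordant pairs of the same type whose left ends lie within
    max_gap of the previous pair, and keep clusters with enough support.
    Returns (type, inner_start, inner_end, support) tuples."""
    labelled = sorted(
        (left, right, classify_span(right - left, **span_kwargs))
        for left, right in pairs
    )
    clusters = []
    for left, right, label in labelled:
        if label == "concordant":
            continue
        last = clusters[-1] if clusters else None
        if last and last["type"] == label and left - last["lefts"][-1] <= max_gap:
            last["lefts"].append(left)
            last["rights"].append(right)
        else:
            clusters.append({"type": label, "lefts": [left], "rights": [right]})
    return [
        (c["type"], max(c["lefts"]), min(c["rights"]), len(c["lefts"]))
        for c in clusters
        if len(c["lefts"]) >= min_support
    ]

# Toy input: two pairs spanning a putative deletion, two concordant pairs,
# and one unsupported short pair that is filtered out.
toy_pairs = [(1000, 4000), (10000, 18000), (10200, 18300),
             (30000, 33100), (50000, 51000)]
candidates = call_candidates(toy_pairs)
```

Requiring several independent pairs per cluster is one simple way to trade sensitivity for specificity; a real pipeline would also need to handle orientation anomalies (inversions), mapping quality, and platform-specific insert-size distributions.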
In addition to developing and improving experimental and computational detection methods, we have also carried out detailed computational analyses of copy number variants and segmental duplications. Previously, we showed that Segmental Duplications (SDs), compared with CNVs, preferentially occur on the protein network periphery, a finding that provides evidence of adaptive events at the periphery (Kim et al. 2007). We also examined the correlation between CNVs, gene duplications, and protein families, and showed trends that likely reflect CNV formation biases and natural selection, which differentially influence distinct protein families (Korbel 2008). More recently, we analyzed a high-resolution CNV map of the Olfactory Receptor (OR) gene superfamily and found considerable variation in OR dosage in humans (Hasin 2008). Furthermore, we examined the formation signatures of CNVs and SDs and found that Alu elements, believed to be the major mediator of SD formation, have a decreased effect in the formation of younger SDs and CNVs, while non-homologous end joining (NHEJ) and repeats, such as LINE-1 elements and microsatellites, have become more prominent drivers (Kim et al. 2008).