How are variations obtained?
SNP calling pipeline
For Solexa Data

1. Use BWA (version: 0.6.2-r126) to map Illumina short reads to the dog reference sequence CanFam3.

2. Use Picard (version: 1.87) to remove duplicate reads generated by PCR during library construction.

3. For each sample, use tools in GATK (version: 2.5-2-gf57256b) to realign reads around known indels (downloaded from Ensembl, ftp://ftp.ensembl.org/pub/release-73/variation/vcf/canis_familiaris/) and to recalibrate base quality scores so that each base receives a more accurate quality score.

4. Use GATK UnifiedGenotyper to call a raw SNP set from the refined data of all individuals, then use the variant quality score recalibration (VQSR) procedure in GATK to identify a high-quality set of SNPs.
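
The four steps above can be sketched as a list of command lines assembled in Python. This is an illustration under stated assumptions, not the exact commands used for this database: it assumes paired-end reads aligned with bwa aln/sampe, and all file names, jar locations, and the read-group string are hypothetical. The page does not say which known sites were used for base quality recalibration, so reusing the Ensembl VCF there is an assumption. VQSR is left out because its training resources are not specified here. Flags are written from the usage messages of these tool versions and should be checked against your installation.

```python
from typing import List

Command = List[str]

# Hypothetical jar location; adjust to the local installation.
GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]


def per_sample_commands(sample: str, ref: str, fq1: str, fq2: str,
                        known_vcf: str) -> List[Command]:
    """Build commands for steps 1-3 (mapping, duplicate removal, refinement)."""
    # Read group passed to bwa sampe; GATK needs read groups in the BAM.
    rg = "@RG\\tID:%s\\tSM:%s\\tPL:ILLUMINA" % (sample, sample)
    sam = sample + ".sam"
    sorted_bam = sample + ".sorted.bam"
    dedup = sample + ".dedup.bam"
    intervals = sample + ".intervals"
    realigned = sample + ".realigned.bam"
    recal_table = sample + ".recal.grp"
    recal_bam = sample + ".recal.bam"
    return [
        # Step 1: map each read file, then pair the alignments.
        ["bwa", "aln", "-f", sample + "_1.sai", ref, fq1],
        ["bwa", "aln", "-f", sample + "_2.sai", ref, fq2],
        ["bwa", "sampe", "-r", rg, "-f", sam, ref,
         sample + "_1.sai", sample + "_2.sai", fq1, fq2],
        # Step 2: coordinate-sort, then remove PCR duplicates with Picard.
        ["java", "-jar", "SortSam.jar", "INPUT=" + sam,
         "OUTPUT=" + sorted_bam, "SORT_ORDER=coordinate"],
        ["java", "-jar", "MarkDuplicates.jar", "INPUT=" + sorted_bam,
         "OUTPUT=" + dedup, "METRICS_FILE=" + sample + ".metrics.txt",
         "REMOVE_DUPLICATES=true", "CREATE_INDEX=true"],
        # Step 3a: realign reads around known indels.
        GATK + ["-T", "RealignerTargetCreator", "-R", ref, "-I", dedup,
                "-known", known_vcf, "-o", intervals],
        GATK + ["-T", "IndelRealigner", "-R", ref, "-I", dedup,
                "-targetIntervals", intervals, "-known", known_vcf,
                "-o", realigned],
        # Step 3b: recalibrate base qualities (known sites are an assumption).
        GATK + ["-T", "BaseRecalibrator", "-R", ref, "-I", realigned,
                "-knownSites", known_vcf, "-o", recal_table],
        GATK + ["-T", "PrintReads", "-R", ref, "-I", realigned,
                "-BQSR", recal_table, "-o", recal_bam],
    ]


def joint_calling_command(ref: str, bams: List[str], out_vcf: str) -> Command:
    """Build the step 4 command: raw SNP calls across all individuals."""
    cmd = GATK + ["-T", "UnifiedGenotyper", "-R", ref, "-glm", "SNP"]
    for bam in bams:
        cmd += ["-I", bam]
    return cmd + ["-o", out_vcf]


if __name__ == "__main__":
    for c in per_sample_commands("dog1", "canFam3.fa", "dog1_1.fq",
                                 "dog1_2.fq", "known_indels.vcf"):
        print(" ".join(c))
    print(" ".join(joint_calling_command(
        "canFam3.fa", ["dog1.recal.bam", "dog2.recal.bam"], "raw_snps.vcf")))
```

The commands are only built and printed, not executed, so the sketch can be inspected before anything is run.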

For SOLiD Data

SNP identification and validation

SNPs were detected by the BioScope diBayes tool where valid adjacent two-base mismatches occurred between a read and the reference. A Frequentist algorithm was applied where read coverage of the reference was high, and a Bayesian algorithm where it was low. The Frequentist algorithm was based on the null hypothesis that a given position is homozygous and that any other valid adjacent mismatches are errors following a Poisson distribution. The Bayesian algorithm calculated the posterior probability of each site from the expected polymorphism rate in the genome, the GC content, the coverage, the position in the read, the quality value of the color call, and the prior errors derived from the 6-mer probe annealing error. The SNP calling stringency was set to medium. To confirm the accuracy of the SNPs, we randomly selected 85 sites from our SNP list, together with the whole mitochondrial genome, for validation by traditional Sanger sequencing.
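
The idea behind the Frequentist test can be illustrated with a small Python sketch. It is not the diBayes implementation: the per-read error rate, the significance threshold, and the function names are all assumptions made for illustration. Under the null hypothesis that the site is homozygous, the number of mismatching reads is treated as Poisson-distributed with mean coverage × error rate, and a site becomes a SNP candidate when the observed mismatch count is very unlikely under that model.

```python
import math


def poisson_tail(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam)."""
    # Sum the probabilities of 0..k-1 events and take the complement.
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - below)


def is_candidate_snp(coverage: int, mismatches: int,
                     error_rate: float = 0.01, alpha: float = 1e-3) -> bool:
    """Reject the 'homozygous, mismatches are errors' null when p < alpha.

    error_rate and alpha are illustrative values, not diBayes settings.
    """
    expected_errors = coverage * error_rate
    return poisson_tail(mismatches, expected_errors) < alpha


if __name__ == "__main__":
    # 10 mismatching reads out of 30 is hard to explain by error alone.
    print(is_candidate_snp(coverage=30, mismatches=10))
    # A single mismatching read is compatible with sequencing error.
    print(is_candidate_snp(coverage=30, mismatches=1))
```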

Small indels identification

In the first stage, each one-end-anchored (OEA) mate-pair read was re-aligned to the reference genome, and a more aggressive gapped-alignment method was applied to the other tag within the expected range determined by the insert size. Each paired tag was allowed fewer than 5 mismatches. The second stage was filtering and annotation. Two criteria were used to filter the results: 1) the indel was detected by at least two reads with different start points; 2) the total number of mismatches of the two mate-paired reads was less than 10.
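
The two filtering criteria can be expressed as a short Python sketch. The record type and field names are hypothetical, and the sketch adopts one reading of the criteria: mate pairs with 10 or more total mismatches are discarded first, and the indel is kept only if the remaining supporting reads have at least two different start points.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SupportingPair:
    """One mate pair supporting an indel (hypothetical record layout)."""
    start: int             # alignment start of the read carrying the indel
    total_mismatches: int  # mismatches summed over both tags of the pair


def passes_indel_filter(pairs: List[SupportingPair],
                        min_distinct_starts: int = 2,
                        max_total_mismatches: int = 10) -> bool:
    """Apply both criteria to the mate pairs supporting one indel."""
    # Criterion 2: keep pairs whose total mismatches are below the limit.
    good = [p for p in pairs if p.total_mismatches < max_total_mismatches]
    # Criterion 1: require support from reads with different start points.
    return len({p.start for p in good}) >= min_distinct_starts


if __name__ == "__main__":
    support = [SupportingPair(start=100, total_mismatches=3),
               SupportingPair(start=105, total_mismatches=4)]
    print(passes_indel_filter(support))
```

Counting distinct start points rather than reads guards against an indel supported only by copies of the same fragment.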

Data Source
Reference sequences used for SNP calling.
Assembly | Species | Sequence type | Version | Source
CanFam 3.0 | Canis familiaris | Gene | 3.0 | Ensembl BioMart: Ensembl Genes 75
Canis_familiaris.CanFam3.1.75.cds | Canis familiaris | CDS | 3.1 | Ensembl FTP
Canis_familiaris.CanFam3.1.75.pep | Canis familiaris | Protein | 3.1 | Ensembl FTP
Canis_familiaris.CanFam3.1.75.gtf | Canis familiaris | Gene Structure | 3.1 | Ensembl FTP