Gene Haplotype Alleles

The Gene Haplotype Alleles feature displays the chromosome-phased 1000 Genomes Phase 1 data for protein-coding regions. These data comprise the genomes of 1,092 individuals from 14 populations in Africa, Europe, East Asia and the Americas, constructed using a combination of low-coverage whole-genome and exome sequencing.

The variant genotypes have been phased by the 1000 Genomes Project (i.e., the two alleles of each diploid genotype have been assigned to two haplotypes, one inherited from each parent).
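As a rough illustration of what phasing provides, the sketch below splits phased VCF-style genotype strings (such as "0|1") into two haplotypes for one individual. The function name and the toy genotypes are hypothetical, and this is not the Genome Browser's own code.

```python
def split_phased_genotypes(genotypes):
    """Split phased genotype strings like '0|1' into two haplotypes.

    genotypes: one string per variant site for a single diploid individual,
    where 0 is the reference allele and 1 or higher is an alternate allele.
    Returns a pair of tuples holding one allele index per site.
    """
    first, second = [], []
    for gt in genotypes:
        if "|" not in gt:
            # An unphased genotype ('0/1') cannot be assigned to a haplotype.
            raise ValueError(f"genotype {gt!r} is not phased")
        a, b = gt.split("|")
        first.append(int(a))
        second.append(int(b))
    return tuple(first), tuple(second)


hap_a, hap_b = split_phased_genotypes(["0|1", "1|1", "0|0"])
print(hap_a, hap_b)  # (0, 1, 0) (1, 1, 0)
```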

How to use the "Gene Haplotype Alleles" section

Click on any protein-coding gene in the UCSC Genes track and scroll to the Common Gene Haplotype Alleles section. (The feature is currently implemented only on GRCh37/hg19 protein-coding genes.) There will be a table of haplotypes for the protein-coding portion of the gene. Each row in the table represents a unique gene haplotype as found in the 1000 Genomes Phase 1 project data. The table is sortable on any column by clicking on the column headers.

Haplotype Frequency
Haplotype frequencies are based upon all relevant chromosomes in the data set. The total number is almost always 2,184 (1,092 for Y and location-dependent for X). Hover over the frequency calculations to show the number of a particular haplotype in the dataset (e.g., "N=1327 of 2184"). If appropriate, the homozygous frequency will also be shown and will reflect the number of individuals in the dataset.
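The counting described above can be sketched as follows for an autosomal gene, where every individual contributes two chromosomes. The function name and the toy haplotype strings are assumptions made for illustration; genes on X or Y would need a different chromosome total.

```python
from collections import Counter


def haplotype_frequencies(individuals):
    """Count haplotypes over chromosomes and homozygotes over individuals.

    individuals: list of (hap_a, hap_b) pairs, one pair per diploid subject.
    Returns (chromosome_counts, homozygous_counts, total_chromosomes).
    """
    chrom_counts = Counter()
    hom_counts = Counter()
    for hap_a, hap_b in individuals:
        chrom_counts[hap_a] += 1
        chrom_counts[hap_b] += 1
        if hap_a == hap_b:
            hom_counts[hap_a] += 1
    return chrom_counts, hom_counts, 2 * len(individuals)


# Hypothetical subjects, each with two phased haplotypes.
subjects = [("AT", "AT"), ("AT", "GT"), ("GT", "GC")]
counts, homs, total = haplotype_frequencies(subjects)
for hap, n in counts.most_common():
    print(f"{hap}: N={n} of {total} ({n / total:.1%}), homozygous in {homs[hap]}")
```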

The reference haplotype (made up entirely of reference variants) may not be represented in the 1000 Genomes data. If it is, it is marked in the "Ref" column.

Variant sites
By default, only non-synonymous, common variant sites are displayed. Common variants are defined as occurring in at least 1% of 1000 Genomes subject chromosomes.

includes rare and synonymous variant sites found in 1000 Genomes subjects in the list of haplotypes.

limits the display to haplotypes defined by common and non-synonymous variation.

displays variant sites (and full sequence) as DNA bases.

displays variant sites (and full sequence) as predicted amino acids. Predicted stop codons are represented by "]" and predicted frameshifts by "[>>]".

The reference variant is shown at the top of each variant site column. This is the value found in the GRCh37/hg19 reference genome at that variant site. In most cases it is a single letter (AA code/DNA base). In the case of an insertion with respect to the reference genome, the reference value is shown as "-". Large deletions are represented by the first two sequence letters followed by "+++".

Hovering the pointer over any of the variant site links will show a more complete description of that variant. For example, the variant description "AA:16 A|T chr9:136137554 SNP: G|A (0.995|0.005) rs55917063" consists of the following elements:

AA:16 A|T - AA residue number and variants (AA view only)
chr9:136137554 - genome location
SNP: G|A (0.995|0.005) - nucleotide variants and allele frequencies
rs55917063 - dbSNP variant name (if one is known) 
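For readers who copy these hover descriptions into scripts, here is a minimal parsing sketch. The regular expression is inferred only from the single example above, so other variant types may not match it; the function and field names are hypothetical.

```python
import re

# Pattern inferred from the one example description; treat it as an assumption.
PATTERN = re.compile(
    r"(?:AA:(?P<aa_pos>\d+)\s+(?P<aa_alleles>\S+)\s+)?"  # optional, AA view only
    r"(?P<chrom>chr\w+):(?P<pos>\d+)\s+"                 # genome location
    r"(?P<kind>\w+):\s+(?P<alleles>\S+)\s+"              # variant type and alleles
    r"\((?P<freqs>[^)]+)\)"                              # allele frequencies
    r"(?:\s+(?P<rsid>rs\d+))?"                           # optional dbSNP name
)


def parse_variant_description(text):
    """Return the elements of a variant description as a dictionary."""
    match = PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized variant description: {text!r}")
    d = match.groupdict()
    return {
        "aa_position": int(d["aa_pos"]) if d["aa_pos"] else None,
        "aa_alleles": d["aa_alleles"].split("|") if d["aa_alleles"] else None,
        "chrom": d["chrom"],
        "position": int(d["pos"]),
        "type": d["kind"],
        "alleles": d["alleles"].split("|"),
        "frequencies": [float(f) for f in d["freqs"].split("|")],
        "rsid": d["rsid"],
    }


example = "AA:16 A|T chr9:136137554 SNP: G|A (0.995|0.005) rs55917063"
print(parse_variant_description(example))
```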

Clicking on any non-reference variant shown in the variant sites columns will link to the full details of that variant site in the 1000 Genomes phase 1 track.

Full sequence
Each haplotype allele sequence is generated from GRCh37/hg19 reference DNA, with variants spliced in, then translated into amino acids.
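The splice-then-translate idea can be sketched as below. This is a simplified illustration that handles only single-base substitutions on a positive-strand coding sequence, uses the standard genetic code, and marks stop codons with "*" rather than the display symbols used in the table. The function names and the toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def splice_variants(reference_cds, substitutions):
    """Apply single-base substitutions {0-based offset: base} to a coding sequence."""
    seq = list(reference_cds)
    for offset, base in substitutions.items():
        seq[offset] = base
    return "".join(seq)


def translate(cds):
    """Translate complete codons; any trailing partial codon is ignored."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


reference = "ATGGCTTAA"                          # toy coding sequence: Met, Ala, stop
haplotype = splice_variants(reference, {3: "A"})  # G to A at the fourth base
print(translate(reference), translate(haplotype))  # MA* MT*
```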

shows the predicted effects of variation on gene sequence for each of the haplotypes. If variant sites are currently displayed as DNA bases, then the predicted DNA sequence is shown (for coding regions only). If variant sites are displayed as amino acids, the predicted protein sequence is shown.

simultaneously shows the DNA sequence above the protein sequence for easy comparison. Showing protein sequence with the DNA triplets is the easiest way to visualize the synonymous variants.

shows the simplified protein sequences view.

hides the full sequence view completely.

Green vertical highlights accentuate the variant sites within the full sequence.

Bold red letters mark the effects of variation. Synonymous changes are only evident when DNA bases are displayed.

Blue vertical highlights show a variant that has been sorted on by clicking its column header. Sorting on a variant can be used to quickly locate one site out of many in the full sequence view.

The AA residue number is shown when hovering over any part of the sequence in amino acid view.

Rare haplotypes
By default, only common gene haplotype alleles are displayed. Common haplotype alleles are defined as occurring in at least 1% of the relevant 1000 Genomes subject chromosomes.

includes all haplotypes. Some large gene models cover many variants and therefore have a very large number of distinct haplotypes represented in the 1000 Genomes project data. If this is the case, only the 100 most frequently occurring haplotypes will be shown in the table, though the true number will be noted.

limits the display to only common haplotypes.
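The two display rules in this section (the 1% threshold for common haplotypes and the cap of 100 rows) can be sketched as a simple filter. The function name, parameter names, and example counts are hypothetical; only the 1% and 100 values come from the text above.

```python
def select_haplotypes(counts, total_chromosomes, include_rare=False,
                      min_frequency=0.01, max_rows=100):
    """Pick the haplotypes to show in the table.

    counts: {haplotype: number of chromosomes carrying it}.
    Returns (rows_to_show, true_number_of_haplotypes_before_the_cap).
    """
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    if not include_rare:
        ranked = [(hap, n) for hap, n in ranked
                  if n / total_chromosomes >= min_frequency]
    return ranked[:max_rows], len(ranked)


# Hypothetical counts over 2,184 chromosomes; H4 falls below 1%.
counts = {"H1": 1500, "H2": 600, "H3": 70, "H4": 14}
shown, true_count = select_haplotypes(counts, 2184)
print(f"showing {len(shown)} of {true_count} common haplotypes")
```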

Populations
Each haplotype is found in one or more subjects participating in phase 1 of the 1000 Genomes project. The 1000 Genomes populations, defined below, are grouped into broad categories (a.k.a. super populations). Haplotype distributions are available both for 1000 Genomes populations and for the major groupings. They are hidden by default.

displays the distribution of the haplotypes across the major population groups.

displays the distribution across the more specific 1000 Genomes groups.

changes display from the 1000 Genomes grouping to the major grouping.

hides the population columns.

Each population group is shown as a column in the table, and each row shows the percent of that haplotype's occurrences that are found in each group. This is not the same as the percent of each group that has the haplotype. Hover over the distribution numbers to show the frequency of occurrence of the haplotype within each group. For example, hovering over 25.7 might show "N=304 of 1183 (found in 71.0% of all ASN)", meaning that of the 1183 occurrences of the haplotype, 304 (25.7%) are found in the ASN group, and that 71.0% of all East Asian copies of this gene (in 1000 Genomes phase 1 data) match this haplotype. To see the number of 1000 Genomes chromosomes covered for each group, hover over the column header (e.g., ASN will usually show "East Asian [N=572]").
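The difference between the two percentages is easy to mix up, so here is a small sketch that computes both. The haplotype counts are invented for illustration, and the group sizes assume an autosomal gene with two chromosomes per individual; the function and key names are hypothetical.

```python
def population_breakdown(hap_counts, group_sizes):
    """Compute both percentages for one haplotype.

    hap_counts: {group: chromosomes in that group carrying the haplotype}.
    group_sizes: {group: chromosomes covered for that group}.
    """
    total = sum(hap_counts.values())
    if total == 0:
        return {}
    return {
        group: {
            # Percent of this haplotype's occurrences that fall in the group.
            "share_of_haplotype": 100.0 * n / total,
            # Percent of the group's chromosomes that carry the haplotype.
            "within_group": 100.0 * n / group_sizes[group],
        }
        for group, n in hap_counts.items()
    }


# Hypothetical counts; sizes are 2 chromosomes per individual in each major group.
hap_counts = {"AFR": 100, "AMR": 50, "ASN": 200, "EUR": 150}
group_sizes = {"AFR": 492, "AMR": 362, "ASN": 572, "EUR": 758}
for group, pct in population_breakdown(hap_counts, group_sizes).items():
    print(f"{group}: {pct['share_of_haplotype']:.1f}% of haplotype, "
          f"{pct['within_group']:.1f}% of group")
```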

Scoring
By default, scoring is hidden. Three types of scores are provided to help users find haplotype alleles that occur more or less frequently than expected or that have unusual distributions in populations. See definitions below.

displays all score columns.

hides all score columns.

Population group definitions

The numbers listed here are of individuals, but the numbers used in generating the haplotypes table are frequently the number of relevant chromosomes (e.g., 2184 not 1092).

Major Groups

Includes only major groups for which there are data in phase 1 of the 1000 Genomes project.
AFR  African            246 individuals
AMR  Ad Mixed American  181 individuals
ASN  East Asian         286 individuals
EUR  European           379 individuals

1000 Genomes Groups

Includes only 1000 Genomes groups for which there are data in phase 1 of the project.

African:
ASW  African Ancestry in Southwest US  61 individuals
LWK  Luhya in Webuye, Kenya            97 individuals
YRI  Yoruba in Ibadan, Nigeria         88 individuals

Ad Mixed American:
CLM  Colombian in Medellin, Colombia               60 individuals
MXL  Mexican Ancestry in Los Angeles, California   66 individuals
PUR  Puerto Rican in Puerto Rico                   55 individuals

East Asian:
CHB  Han Chinese in Beijing, China   97 individuals
CHS  Han Chinese South              100 individuals
JPT  Japanese in Tokyo, Japan        89 individuals

European:
CEU  Utah residents with Northern and Western European ancestry  85 individuals
FIN  Finnish in Finland                                          93 individuals
GBR  British in England and Scotland                             89 individuals
IBS  Iberian populations in Spain                                14 individuals
TSI  Toscani in Italia                                           98 individuals

Scoring definitions

Scores alone cannot be used to draw definitive conclusions about any haplotype.

Hap score
The haplotype score is based on the normalized (-log10) probability of finding exactly N subject chromosomes with this haplotype, given the frequencies of individual variants and assuming they are independent. The score is normalized by multiplying the base probability by the total number of variants. Normalization allows comparing the scores between genes with many variant sites and those with few. The score will be positive if the haplotype is more frequent than expected by chance and negative if less frequent. Larger scores will result when minor variant alleles occur together more frequently than expected, which might reflect co-selection or may merely be an artifact of more recent events. A negative haplotype score may be more informative. For haplotypes made from common, non-synonymous variants, haplotype scores above 606 are seen in only 2% of genes. Likewise, a score of less than -199 is only seen in 2% of genes.
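To make the idea concrete, the following is a rough sketch of a signed -log10 binomial probability under the independence assumption. It is not the Genome Browser's actual calculation: the normalization by the number of variants is left out because the text does not specify it in enough detail, and the way the sign is chosen here is an assumption. All names are hypothetical.

```python
import math


def log10_binomial_pmf(k, n, p):
    """log10 of P(exactly k successes in n trials), computed in log space."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return (log_comb + k * math.log(p) + (n - k) * math.log1p(-p)) / math.log(10)


def haplotype_score_sketch(observed, total_chromosomes, allele_frequencies):
    """Signed -log10 probability of seeing exactly `observed` chromosomes.

    allele_frequencies: frequency of this haplotype's allele at each variant
    site; their product is the expected haplotype frequency if the sites
    were independent.
    """
    expected_freq = math.prod(allele_frequencies)
    surprise = -log10_binomial_pmf(observed, total_chromosomes, expected_freq)
    # Assumed sign convention: positive when seen more often than expected.
    more_than_expected = observed >= expected_freq * total_chromosomes
    return surprise if more_than_expected else -surprise


# Hypothetical two-site haplotype observed on 1327 of 2184 chromosomes.
print(haplotype_score_sketch(1327, 2184, [0.6, 0.5]))
```

Working in log space avoids floating-point underflow, since the raw probabilities for counts far from expectation are far too small to represent directly.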

Hom score
The homozygous score is based upon the (-log10) probability of finding exactly N individuals with this haplotype on both chromosomes, given the actual frequency of the haplotype in subject chromosomes. The score will be positive if the haplotype is found homozygous in more individuals than expected and negative when found in fewer than expected. Negative values might suggest that the haplotype is deleterious when homozygous. For haplotypes made from common, non-synonymous variants, homozygous scores above 92 are seen in only 2% of genes. Likewise, a score of less than -15 is only seen in 2% of genes.
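The comparison behind this score can be illustrated by contrasting the observed number of homozygous individuals with the number expected from the haplotype frequency. The sketch assumes random pairing of chromosomes, so the expected homozygous fraction is the squared haplotype frequency; that assumption, the function name, and the example counts are not taken from the text above.

```python
def homozygote_counts(haplotype_chromosomes, total_chromosomes, observed_homozygotes):
    """Return (expected homozygous individuals, observed minus expected).

    Assumes a diploid, autosomal gene and random pairing of chromosomes.
    """
    individuals = total_chromosomes // 2
    freq = haplotype_chromosomes / total_chromosomes
    expected = freq * freq * individuals
    return expected, observed_homozygotes - expected


# Hypothetical: haplotype on 1327 of 2184 chromosomes, 450 homozygous individuals.
expected, excess = homozygote_counts(1327, 2184, 450)
print(f"expected {expected:.1f} homozygotes, observed excess {excess:+.1f}")
```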

Pop score
The population score (only visible when population distributions are displayed) is the fixation index (FST) based upon the difference in variance between sub-population haplotype frequencies and the total haplotype frequency. Note that this calculation is based upon the frequency of the haplotype within each population, rather than the distribution of that haplotype across populations. Nevertheless, large population scores should reflect large skews in distribution in more frequently occurring haplotypes. For haplotypes made from common, non-synonymous variants, population scores above 0.424 are seen in only 2% of genes and scores above 0.506 are seen in only 1% of genes.
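One common variance-based formulation of FST divides the size-weighted variance of the sub-population frequencies by p(1 - p), where p is the overall haplotype frequency. The sketch below uses that formulation as an assumption; the Genome Browser's exact calculation may differ, and the function and argument names are hypothetical.

```python
def fst_sketch(sub_counts, sub_totals):
    """Variance-based FST for one haplotype.

    sub_counts: {group: chromosomes carrying the haplotype}.
    sub_totals: {group: chromosomes covered in that group}.
    """
    grand_total = sum(sub_totals.values())
    p_bar = sum(sub_counts[g] for g in sub_totals) / grand_total
    if p_bar in (0.0, 1.0):
        return 0.0  # haplotype absent or fixed everywhere: no differentiation
    # Weight each group's squared deviation by its share of all chromosomes.
    variance = sum(
        sub_totals[g] / grand_total * (sub_counts[g] / sub_totals[g] - p_bar) ** 2
        for g in sub_totals
    )
    return variance / (p_bar * (1.0 - p_bar))


# Hypothetical haplotype counts over the four major groups (autosomal sizes).
sizes = {"AFR": 492, "AMR": 362, "ASN": 572, "EUR": 758}
print(fst_sketch({"AFR": 100, "AMR": 50, "ASN": 200, "EUR": 150}, sizes))
```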

Notes:
  1. If the gene is on the negative ('-' or "reverse") strand, all variant sites and sequences will be presented with respect to the negative strand. This differs from the way variants are displayed in the 1000 Genomes phase 1 variations track, which are always shown as they appear on the positive ('+' or "forward") strand.
  2. Only variant sites occurring within coding exons are currently included in haplotypes. Variants occurring within intron splice junctions are not included.
  3. Haplotypes are defined by the set of variant sites included and the variant allele at each of those sites. Therefore haplotype and homozygous frequency calculations depend upon which variant sites are included. Likewise all scores are specific to the haplotype as defined by the variant sites included, and population scores are also specific to the population groups that are examined.
  4. The haplotypes displayed are not pregenerated but are derived from 1000 Genomes VCF files and other Genome Browser datasets at the time they are requested. Consequently, scoring is calculated in the context of a single gene model and the variant sites used to derive haplotypes.
  5. If the number of variants covered exceeds 200, then the haplotype table will not be displayed and the reason will be noted.
  6. Certain viewing options are expensive operations that slow the gene page response time. If this section is not being actively used, it is recommended that previous choices be cleared by pressing "Reset to defaults" at the bottom of this section.
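As an illustration of note 1, alleles reported on the positive strand can be converted to the gene's strand by reverse-complementing them. This is a minimal sketch with hypothetical function names, not the Genome Browser's own conversion code.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def to_gene_strand(alleles, strand):
    """Convert positive-strand alleles to the gene's strand ('+' or '-')."""
    if strand == "-":
        return [reverse_complement(allele) for allele in alleles]
    return list(alleles)


# A G|A variant on the positive strand reads as C|T for a negative-strand gene.
print(to_gene_strand(["G", "A"], "-"))  # ['C', 'T']
```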