Description
This track shows pairs of exactly identical sequences
on either side of gaps in the 26 Apr 2016 Mus musculus (house mouse)/GCA_001624185.1_129S1_SvImJ_v1 genome assembly.
A gap is any run of N's, including a single N. In each case, the end of the
upstream sequence before the gap is duplicated exactly at the beginning
of the downstream sequence following the gap in the assembly.
The predictions are based on the genome sequence alone.
Item count: 19,571; Bases covered: 1,316,488,996
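The definition above (a gap is any run of N's, and the end of the upstream sequence reappears at the start of the downstream sequence) can be illustrated with a minimal Python sketch. This is not the procedure used to build the track; it is a direct string comparison under stated assumptions. The function names, the `max_len` window and the `min_len` threshold are hypothetical, and the sequence is assumed to be an upper-case string already loaded into memory. In practice a minimum length would be needed, because very short matches occur by chance.

```python
import re


def find_gaps(seq):
    """Return half-open (start, end) intervals for every run of N's.

    A single N counts as a gap, matching the definition in the text.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"N+", seq)]


def flank_duplicate_length(seq, gap_start, gap_end, max_len=1000, min_len=1):
    """Length of the longest sequence that ends right before the gap and is
    repeated exactly right after it; 0 if none of at least min_len bases.

    max_len and min_len are illustrative parameters, not track settings.
    """
    upstream = seq[max(0, gap_start - max_len):gap_start]
    downstream = seq[gap_end:gap_end + max_len]
    # Do not let a flank run across a neighbouring gap.
    upstream = upstream.rsplit("N", 1)[-1]
    downstream = downstream.split("N", 1)[0]
    # Try the longest candidate first and shrink until a match is found.
    for k in range(min(len(upstream), len(downstream)), min_len - 1, -1):
        if upstream[-k:] == downstream[:k]:
            return k
    return 0


if __name__ == "__main__":
    example = "ACGTTTGACA" + "NNN" + "GACATTTT"
    for start, end in find_gaps(example):
        print(start, end, flank_duplicate_length(example, start, end))
```

Soft-masked (lower-case) sequence would need `seq.upper()` before calling these functions.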
Methods
These duplicate sequences were found by taking 1,000 bases before and
after each gap and aligning them with the blat command:
blat -q=dna -minIdentity=95 -repMatch=10 upstreamContig.fa downstreamContig.fa
The PSL output was then filtered for perfect matches: no mismatches,
and therefore matching sequences of equal size,
where the alignment ends exactly at the end of the upstream sequence
and begins exactly at the start of the downstream sequence.
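The filtering criteria above can be sketched in Python against the standard 21-column PSL layout. This is an illustrative sketch, not the script used for the track. It assumes that the upstream sequence is the blat target (the first file on the command line) and the downstream sequence is the query, that only plus-strand alignments are of interest, and that insert-free alignments are required. The function names are hypothetical, and the sketch does not check that the query and target names belong to the same gap, since the naming scheme is not given here.

```python
def is_perfect_flank_match(fields):
    """Return True if one PSL record (a list of 21 string fields) is a
    perfect match that ends at the end of the target (upstream) sequence
    and starts at the start of the query (downstream) sequence."""
    mismatches = int(fields[1])
    n_count = int(fields[3])
    q_num_insert = int(fields[4])
    t_num_insert = int(fields[6])
    strand = fields[8]
    q_start, q_end = int(fields[11]), int(fields[12])
    t_size = int(fields[14])
    t_start, t_end = int(fields[15]), int(fields[16])
    return (
        mismatches == 0
        and n_count == 0
        and q_num_insert == 0
        and t_num_insert == 0
        and strand == "+"                          # assumption: same strand only
        and q_end - q_start == t_end - t_start     # equal-size matching sequence
        and t_end == t_size                        # ends at end of upstream
        and q_start == 0                           # begins at start of downstream
    )


def filter_psl(lines):
    """Keep PSL records that pass is_perfect_flank_match.

    Header lines and anything that is not a 21-column record are skipped.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        if is_perfect_flank_match(fields):
            kept.append(fields)
    return kept
```

A typical use would be `filter_psl(open("output.psl"))`, where `output.psl` is a placeholder name for the blat output file.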
Credits
Thank you to Joel Armstrong and Benedict Paten of the
Computational Genomics Lab
at the
U.C. Santa Cruz Genomics Institute
for identifying this characteristic of genome assemblies.
The data and presentation of this track were prepared by
Hiram Clawson,
U.C. Santa Cruz Genomics Institute