Compressed Full-Text Indexes for Highly Repetitive Collections
PhD Thesis, Department of Computer Science, University of Helsinki, 2012
Abstract
This thesis studies problems related to compressed full-text indexes. A full-text index is a data structure for indexing textual (sequence) data, so that the occurrences of any query string in the data can be found efficiently. While most full-text indexes require much more space than the sequences they index, recent compressed indexes have overcome this limitation. These compressed indexes combine a compressed representation of the index with some extra information that allows decompressing any part of the data efficiently. This way, they provide functionality similar to that of the uncompressed indexes, while using only slightly more space than the compressed data.
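As a point of reference for the kind of functionality described above, the simplest classical full-text index is a plain suffix array: sort the starting positions of all suffixes of the text, then locate a pattern's occurrences by binary search over the sorted order. The sketch below is illustrative only (it is not from the thesis; the function names and the quadratic-time construction are chosen for brevity):

```python
import bisect

def suffix_array(text):
    # Sort suffix start positions by the suffixes they denote.
    # Quadratic-time construction; fine for an illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, pattern):
    # All suffixes that start with the pattern occupy a contiguous
    # range of the suffix array; find that range by binary search.
    m = len(pattern)
    prefixes = [text[i:i + m] for i in sa]
    lo = bisect.bisect_left(prefixes, pattern)
    hi = bisect.bisect_right(prefixes, pattern)
    return sorted(sa[lo:hi])

print(occurrences("banana", suffix_array("banana"), "ana"))  # [1, 3]
```

The suffix array for "banana" itself takes several times the space of the text, which is exactly the limitation that the compressed indexes studied in the thesis overcome.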
The efficiency of data compression is usually measured in terms of entropy. While entropy-based estimates predict the compressed size of most texts accurately, they fail with highly repetitive collections of texts. Examples of such collections include different versions of a document and the genomes of a number of individuals from the same population. While the entropy of a highly repetitive collection is usually similar to that of a text of the same kind, the collection can often be compressed much better than the entropy-based estimate.
Most compressed full-text indexes are based on the Burrows-Wheeler transform (BWT). Originally intended for data compression, the BWT has deep connections with full-text indexes such as the suffix tree and the suffix array. With some additional information, these indexes can be simulated with the Burrows-Wheeler transform. The first contribution of this thesis is the first BWT-based index that can compress highly repetitive collections efficiently.
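To make the transform concrete, here is a minimal, naïve sketch of the BWT (not the construction used in the thesis): append a terminator that sorts before every other character, sort all rotations of the string, and take the last column. The terminator convention and quadratic approach are illustrative simplifications:

```python
def bwt(text):
    # Append a terminator '$' that is lexicographically smaller than
    # any other character, so the transform is invertible.
    s = text + "$"
    # Sort all rotations of the string; the BWT is the last column
    # of the sorted rotation matrix.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("banana"))  # annb$aa
```

Because sorting rotations of s terminated by '$' is equivalent to sorting the suffixes of s, the BWT is essentially the text permuted in suffix array order, which is the connection that lets BWT-based indexes simulate the suffix array.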
Compressed indexes allow us to handle much larger data sets than the corresponding uncompressed indexes. To take full advantage of this, we need algorithms for constructing the compressed index directly, instead of first constructing an uncompressed index and then compressing it. The second contribution of this thesis is an algorithm for merging the BWT-based indexes of two text collections. By using this algorithm, we can derive better space-efficient construction algorithms for BWT-based indexes.
The basic BWT-based indexes provide functionality similar to that of the suffix array. With some additional structures, the functionality can be extended to that of the suffix tree. One of these structures is the longest common prefix (LCP) array, which stores the lengths of the longest common prefixes of lexicographically adjacent suffixes of the text. The third contribution of this thesis is a space-efficient algorithm for constructing this array, and a new compressed representation of the array.
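As an illustration of the array in question, a naïve sketch (not the thesis's space-efficient algorithm) computes the LCP array directly from the suffix array by comparing each pair of lexicographically adjacent suffixes character by character:

```python
def suffix_array(s):
    # Quadratic-time construction, for illustration only.
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    # lcp[i] = length of the longest common prefix of the
    # lexicographically adjacent suffixes s[sa[i-1]:] and s[sa[i]:].
    lcp = [0] * len(sa)
    for i in range(1, len(sa)):
        a, b = s[sa[i - 1]:], s[sa[i]:]
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        lcp[i] = k
    return lcp

sa = suffix_array("banana$")
print(sa)                        # [6, 5, 3, 1, 0, 4, 2]
print(lcp_array("banana$", sa))  # [0, 0, 1, 3, 0, 0, 2]
```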
In the case of individual genomes, the highly repetitive collection can be considered a sample from a larger collection. This collection consists of a reference sequence and a set of possible differences from the reference, so that each sequence contains a subset of the differences. The fourth contribution of this thesis is a BWT-based index that extrapolates the larger collection from the sample and indexes it.
Thesis

Jouni Sirén:
Compressed Full-Text Indexes for Highly Repetitive Collections (PhD Thesis).
Department of Computer Science, Series of Publications A, Report A-2012-5, University of Helsinki, June 2012.
Original papers

Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki:
Storage and Retrieval of Highly Repetitive Sequence Collections.
Journal of Computational Biology 17(3):281–308, 2010.
Jouni Sirén:
Compressed Suffix Arrays for Massive Data.
Proc. SPIRE 2009, Springer LNCS 5721, pp. 63–74, Saariselkä, Finland, August 25–27, 2009.

Jouni Sirén:
Sampled Longest Common Prefix Array.
Proc. CPM 2010, Springer LNCS 6129, pp. 227–237, New York, USA, June 21–23, 2010.

Jouni Sirén, Niko Välimäki, and Veli Mäkinen:
Indexing Finite Language Representation of Population Genotypes.
Proc. WABI 2011, Springer LNCS 6833, pp. 270–281, Saarbrücken, Germany, September 5–7, 2011.