Run-Length Compressed Suffix Array

RLCSA [7, 3] is a compressed suffix array implementation that has been optimized for highly repetitive text collections. Examples of such collections include version control data and individual genomes. This implementation also serves as a testbed for many techniques used with compressed suffix arrays. The most up-to-date description of RLCSA can be found in [7].

There is also Adam Novak's fork of RLCSA. Among other things, the fork includes SWIG bindings and an utility for merging indexes.

The current version includes experimental support for:

See README in the package for further information. The implementation is available for download under the MIT / X11 License.

There is also a separate package containing implementations of additional document retrieval algorithms:

Full experimental results from [12] are also included in the document listing package.

News

Downloads

Current versions

RLCSA

Document listing

Data

The following data sets were used in experiments with RLCSA. Some of them are available from the Pizza&Chili Corpus (Chile, Italy).

The full test environment from [10, 11, 12] is also available upon request. Note that setting it up requires much manual work, and some of the implementations use 32-bit libraries.

References

  1. Jouni Sirén, Niko Välimäki, Veli Mäkinen, and Gonzalo Navarro: Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections.
    Proc. SPIRE 2008, Springer LNCS 5280, pp. 164-175, Melbourne, Australia, November 10-12, 2008.
  2. Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki: Storage and Retrieval of Individual Genomes.
    Proc. RECOMB 2009, Springer LNCS 5541, pp. 121-137, Tucson, Arizona, USA, May 18-21, 2009.
  3. Jouni Sirén: Compressed Suffix Arrays for Massive Data.
    Proc. SPIRE 2009, Springer LNCS 5721, pp. 63-74, Saariselkä, Finland, August 25-27, 2009.
  4. Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki: Storage and Retrieval of Highly Repetitive Sequence Collections.
    Journal of Computational Biology 17(3):281-308, 2010.
  5. Jouni Sirén: Sampled Longest Common Prefix Array.
    Proc. CPM 2010, Springer LNCS 6129, pp. 227-237, New York, USA, June 21-23, 2010.
  6. Paolo Ferragina, Jouni Sirén, and Rossano Venturini: Distribution-Aware Compressed Full-Text Indexes.
    Proc. ESA 2011, Springer LNCS 6942, pp. 760-771, Saarbrücken, Germany, September 5-7, 2011.
  7. Jouni Sirén: Compressed Full-Text Indexes for Highly Repetitive Collections (PhD Thesis).
    Department of Computer Science, Series of Publications A, Report A-2012-5, University of Helsinki, June 2012.
  8. Travis Gagie, Kalle Karhu, Gonzalo Navarro, Simon J. Puglisi, and Jouni Sirén: Document Listing on Repetitive Collections.
    Proc. CPM 2013, Springer LNCS 7922, pp. 107-119, Bad Herrenalb, Germany, June 17-19, 2013.
  9. Paolo Ferragina, Jouni Sirén, and Rossano Venturini: Distribution-aware compressed full-text indexes.
    Algorithmica 67(4):529-546, 2013.
  10. Gonzalo Navarro, Simon J. Puglisi, and Jouni Sirén: Document Retrieval on Repetitive Collections.
    Proc. ESA 2014, Springer LNCS 8737, pp. 725-736, Wroclaw, Poland, September 8-10, 2014.
  11. Travis Gagie, Aleksi Hartikainen, Juha Kärkkäinen, Gonzalo Navarro, Simon J. Puglisi, and Jouni Sirén: Document Counting in Compressed Space.
    Proc. DCC 2015, IEEE, pp. 103-112, Snowbird, Utah, USA, April 7-9, 2015.
  12. Travis Gagie, Aleksi Hartikainen, Kalle Karhu, Juha Kärkkäinen, Gonzalo Navarro, Simon J. Puglisi, and Jouni Sirén: Document retrieval on repetitive string collections.
    Information Retrieval Journal 20(3):253-291, 2017.