A new implementation is available on GitHub.

GCSA [2, 3, 4] is a generalised compressed suffix array for finite languages a.k.a. labeled directed acyclic graphs (labeled DAGs). The implementation now supports general alphabet and multiple automata. The most up-to-date description of GCSA can be found in [3,4].

See README in the package for further information.

The implementation is available for download under the MIT / X11 License. Our implementation of RLCSA [1, 3] is required for compiling GCSA. (The current version of GCSA should always work with the current version of RLCSA.)



Additional downloads

Prebuilt indexes

Building GCSA for the entire human genome requires more memory than is available on most systems. To alleviate the problem, we provide some prebuilt indexes for download. These indexes should work on any 64-bit Intel/AMD system running May 2013 version of GCSA compiled with g++.


  1. Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki: Storage and Retrieval of Highly Repetitive Sequence Collections.
    Journal of Computational Biology 17(3):281-308, 2010.
  2. Jouni Sirén, Niko Välimäki, and Veli Mäkinen: Indexing Finite Language Representation of Population Genotypes.
    Proc. WABI 2011, Springer LNCS 6833, pp. 270-281, Saarbrücken, Germany, September 5-7, 2011.
  3. Jouni Sirén: Compressed Full-Text Indexes for Highly Repetitive Collections (PhD Thesis).
    Department of Computer Science, Series of Publications A, Report A-2012-5, University of Helsinki, June 2012.
  4. Jouni Sirén, Niko Välimäki, and Veli Mäkinen: Indexing Graphs for Path Queries with Applications in Genome Research.
    IEEE/ACM Transactions on Computational Biology and Bioinformatics 11(2):375-388, 2014.