This is a fork of partitioned_elias_fano. Modifications in this repository fix build breaks on Ubuntu 16.04 LTS and update the build system to avoid generating executables and libraries in the source tree.
Please note that this repository depends on submodules, so it helps to clone with the --recursive
flag.
sudo apt-get install libboost-all-dev
git clone --recursive [email protected]:BitFunnel/partitioned_elias_fano.git
cmake . -DCMAKE_BUILD_TYPE=Release
make
First generate an index using create_freq_index
. One way to create
the collection file used as input is to use the 'IndexExporter' class
in the mg4j-workbench project.
One can use mg4j to generate an index, from say the gov2 corpus, and then export
this index and an accompanying query log to format usable by
create_freq_index
and Runner
.
Usage:
create_freq_index type input output --check
type Index type. Choices are ef, single, uniform, opt
block_optpfor, block_varint, and block_interpolative.
input Collection file name. Format described below.
output Path to output file.
Use Runner
to perform performance measurements. The index
parameter
to Runner
is the output
file generated by create_freq_index
. Runner
runs performance trials at various thread counts and outputs summary data
along with detailed, query-by-query statistics.
Usage:
Runner type index queries maxthreads [output]
type Index type. Choices are ef, single, uniform, opt
block_optpfor, block_varint, and block_interpolative.
index Index file name.
queries Query log file name. One query per line.
Terms expressed as non-negative integers.
threads Run performance trials with thread counts in [1..threads].
output Base name for output files containing performance
measurements for various thread counts. If output is
omitted, summary data will be printed to stdout.
Original README.md follows.
NOTE: This repository is maintained only for historical reasons. This code is now part of ds2i.
This repository contains the code used for the experiments in the paper
- Giuseppe Ottaviano and Rossano Venturini, Partitioned Elias-Fano Indexes, ACM SIGIR 2014.
The code is tested on Linux with GCC 4.9 and OSX Mavericks with Clang.
The following dependencies are needed for the build.
- CMake >= 2.8, for the build system
- Boost >= 1.42
The code depends on several git submodules. If you have cloned the repository
without --recursive
, you will need to perform the following commands before
building:
$ git submodule init
$ git submodule update
To build, it should be sufficient to do:
$ cmake . -DCMAKE_BUILD_TYPE=Release
$ make
It is also preferable to perform a make test
, which runs the unit tests.
The directory test/test_data
contains a small document collection used in the
unit tests. The binary format of the collection is described in the next
section.
To create an index use the command create_freq_index
. The available index
types are listed in index_types.hpp
. For example, to create an index using the
optimal partitioning algorithm using the test collection, execute the command:
$ ./create_freq_index opt test/test_data/test_collection test_collection.index.opt --check
where test/test_data/test_collection
is the basename of the collection, that
is the name without the .{docs,freqs,sizes}
extensions, and
test_collection.index.opt
is the filename of the output index. --check
perform a verification step to check the correctness of the index.
To perform BM25 queries it is necessary to build an additional file containing the parameters needed to compute the score, such as the document lengths. The file can be built with the following command:
$ ./create_wand_data test/test_data/test_collection test_collection.wand
Now it is possible to query the index. The command queries
parses each line of
the standard input as a tab-separated collection of term-ids, where the i-th
term is the i-th list in the input collection. An example set of queries is
again in test/test_data
.
$ ./queries opt test_collection.index.opt test_collection.wand < test/test_data/queries
A binary sequence is a sequence of integers prefixed by its length, where both the sequence integers and the length are written as 32-bit little-endian unsigned integers.
A collection consists of 3 files, <basename>.docs
, <basename>.freqs
,
<basename>.sizes
.
-
<basename>.docs
starts with a singleton binary sequence where its only integer is the number of documents in the collection. It is then followed by one binary sequence for each posting list, in order of term-ids. Each posting list contains the sequence of document-ids containing the term. -
basename.freqs
is composed of a one binary sequence per posting list, where each sequence contains the occurrence counts of the postings, aligned with the previous file (note however that this file does not have an additional singleton list at its beginning). -
basename.sizes
is composed of a single binary sequence whose length is the same as the number of documents in the collection, and the i-th element of the sequence is the size (number of terms) of the i-th document.
- Giuseppe Ottaviano [email protected]
- Rossano Venturini [email protected]