Welcome. This repository contains the required data to reproduce the experiments from the following paper. We request you cite this work if you decide to use our algorithm for your own papers.
Joel Mackenzie, Antonio Mallia, Matthias Petri, J. Shane Culpepper, and Torsten Suel: Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study. In Proc. ECIR, 2019, 339-352.
@inproceedings{mm+19-ecir,
author = {Mackenzie, Joel and Mallia, Antonio and Petri, Matthias and Culpepper, J. Shane and Suel, Torsten},
title = {Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study},
booktitle = {Proc. ECIR},
year = {2019},
pages = {339--352},
}
The core bisection algorithm exists in the PISA project, which is included here as a submodule. We also use Indri to index the text collections.
The original paper can be found here: Compressing Graphs and Indexes with Recursive Graph Bisection, ACM Proceedings
@inproceedings{dk+16-kdd,
author = {L. Dhulipala and I. Kabiljo and B. Karrer and G. Ottaviano and S. Pupyrev and A. Shalita},
title = {Compressing Graphs and Indexes with Recursive Graph Bisection},
booktitle = {Proc. SIGKDD},
year = {2016},
pages = {1535--1544}
}
We thank the authors of this paper for their assistance in our reproducibility study.
This repository aims to show how we go from end-to-end for our experiments, allowing
others to repeat them. The main scripts are build.sh
and end-to-end.sh
.
Note that the Indri configuration files in configuration/indri_config
must be
populated with correct paths before proceeding.
So, the steps are as follows:
- Configure the Indri parameter file correctly.
- Build the code:
./build.sh
- Run the code, end to end:
./end-to-end.sh <path_to_indri_config> <output_basename>
An example of running end-to-end.sh and explanation is shown below for the gov2
collection: ./end-to-end.sh ./configuration/indri_config/gov2.param my-gov2
- Create the Indri index for the specified collection. Output path depends on the Indri configuration file.
- Convert the Indri index into the ds2i binary collection format:
my-gov2_original_ds2i/
- Generate a random shuffle of the binary collection:
my-gov2_random_ds2i/
- Generate a minhash shuffle of the binary collection:
my-gov2_minhash_ds2i/
- Generate a bp shuffle of the binary collection:
my-gov2_bisection_ds2i/
- Generate a frequency index for each collection:
my-gov2_indexes/
As mentioned in the paper, two of the collections are generated from dumps that
are difficult to reproduce. We have archived these collections and we are happy
to make them available to anyone interested in reproducing our experiments.
Please email [email protected]
if you are looking for this data or
any further assistance with these experiments.
The URL ordering files and the bisection mappings can be found on Cloudstor.