Skip to content

Latest commit

 

History

History
148 lines (117 loc) · 4.94 KB

README.md

File metadata and controls

148 lines (117 loc) · 4.94 KB

IRWS-A2

Dependencies

Current list of dependencies:

  • Java 21
  • Maven
  • Makefile
  • Lucene (8.6.3)
    • lucene-core
    • lucene-queryparser
    • lucene-analyzers-common

Performance

MAP BM25 Boolean Classic IB LMDirichlet
EnglishAnalyzer 0.3177 0.0860 0.2013 0.3217 0.2875
StandardAnalyzer 0.2627 0.0802 0.1886 0.2626 0.2493
R-Precision BM25 Boolean Classic IB LMDirichlet
EnglishAnalyzer 0.3620 0.1362 0.2789 0.3618 0.3345
StandardAnalyzer 0.3123 0.1368 0.2543 0.3184 0.2980
P_5 BM25 Boolean Classic IB LMDirichlet
EnglishAnalyzer 0.6800 0.2560 0.4640 0.6480 0.5840
StandardAnalyzer 0.6400 0.2080 0.4800 0.6080 0.5760
P_10 BM25 Boolean Classic IB LMDirichlet
EnglishAnalyzer 0.6040 0.2160 0.4320 0.5920 0.5280
StandardAnalyzer 0.5560 0.1720 0.4160 0.5640 0.4960
P_100 BM25 Boolean Classic IB LMDirichlet
EnglishAnalyzer 0.2720 0.0980 0.2072 0.2740 0.2384
StandardAnalyzer 0.2248 0.0916 0.1776 0.2260 0.2060

Building & Running

Important

Before building and running, you must first place the dataset folder in src/main/resources/ directory. When you download it from the Google Drive it will be called Assignment 2.zip, but it should be unzipped and moved to a directory called dataset. Inside this dataset folder should be the following folders: dtds, fbis, fr94, ft, latimes.

This folder has not been added to the GitHub repo because of size limitations.

However, you can also do this automatically by following the instructions in the All in one or Before Running section below.

For example, part of the tree structure should look like this:

.
├── output
├── resources
│   ├── assets
│   └── dataset
│       ├── dtds
│       ├── fbis
│       ├── fr94
│       ├── ft
│       └── latimes
├── src
│   └── main
│       └── java
│           └── apple_sauce
│               ├── eNums
│               ├── models
└──             └── parsers

All in one

To run this program all in one, you may try the following command:

chmod +x start.sh
./start.sh

This will download all the needed dependency, dataset, build the program and run it.

However, you still need to download the trec_eval from Github and run it manually. Please follow the instructions in the Trec Eval section below.

Before Running

Fo manually run the program, please follow the instructions below.

You need to download and unzip the dataset from the Google Drive. The link is here:

https://drive.google.com/file/d/17KpMCaE34eLvdiTINqj1lmxSBSu8BtDP

To download it from the command line, firstly you need to install pip,gdown and unzip:

sudo apt install python3-pip
pip install gdown
sudo apt install unzip

If gdown is not added to PATH, then you need to source it:

echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

Then run the following command:

chmod +x dataset.sh
./dataset.sh

This will download the dataset and unzip it into the correct directory.

Also, before running make, make sure you've installed maven:

sudo apt install maven

Commands

There is a Makefile which has the following options:

make clean
make build
make run
make (This runs "clean", "build" and then "run" in order)

Choosing which analyzer and similarity measure to use

When running the program, you'll be prompted to choose an analyzer to use. To achieve the best MAP score, the EnglishAnalyzer should be selected.

For the similarity measure, IB Similarity and BM25 Similarity can be chosen, as both gave sufficient MAP scores during our testing phase.

We'd recommend running a build using both versions.

Trec Eval

To run trec eval, you must first run the program and generate the results file. Then you can run the following command:

Download trec_eval from Github

git clone https://github.com/usnistgov/trec_eval.git

Go to the trec_eval directory

cd trec_eval

Run the following command

./treceval ../[your_qrels_file] ../output/[your_eval_file]