add section on helper files
pikulet committed Apr 21, 2019
1 parent 8bfcd81 commit 1cffd82
Showing 2 changed files with 50 additions and 1 deletion.
5 changes: 4 additions & 1 deletion PostingList.py
@@ -8,7 +8,10 @@
A Postings class that collects all the posting lists.
Each posting list is a dictionary mapping docIDs to a list [term frequency, list_of_positions]
For example, doc1 has "egg" in position 1 and 15, doc2 has "egg" in position 3.
"egg" -> { doc1 : [2, [1, 15]], doc2 : [1, [3]] }
{ doc1 : [2, [1, 15]], doc2 : [1, [3]] }
The Dictionary class stores the offset for the term "egg", which is used to retrieve this
posting list (represented as a python dictionary).
'''
class PostingList():

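A minimal sketch of the offset-based retrieval described in this docstring, assuming the pickle serialisation mentioned in the README (the function name is illustrative, not the actual Dictionary/PostingList API):

    import pickle

    def load_posting_list(postings_path, offset):
        # The Dictionary stores a byte offset for each term; seeking to
        # that offset and unpickling yields the posting-list dictionary.
        with open(postings_path, "rb") as f:
            f.seek(offset)
            return pickle.load(f)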
46 changes: 46 additions & 0 deletions README.txt
@@ -37,6 +37,14 @@ data). More specifically, this was in document 2044863. We did not add any special handling
because the case is isolated and it is unlikely that the user wants to look something up from the document.
There are also other unicode characters that are not recognised. These can be easily resolved with utf-8 encoding.

# System Architecture

We divided our system into three main components: indexing, helper modules, and searching.

The indexing modules implement the indexing algorithm.
The helper modules are used by BOTH indexing and searching (file reading and writing, shared term normalisation methods, shared constants).
The searching modules handle document ranking and query expansion.
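A possible layout under this split (the two top-level script names are an assumption; the helper modules are described in detail below):

    index.py              - indexing algorithm
    search.py             - document ranking and query expansion
    data_helper.py        - shared file I/O and tokenisation
    properties_helper.py  - document properties and their access constants
    constants.py          - test files, intermediate file names, search parameters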

# Indexing

We first tokenise the documents using the nltk tokenisers with stemming and case folding as in previous homeworks.
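A minimal sketch of such a pipeline, assuming NLTK's sentence/word tokenisers and the Porter stemmer (our actual tokeniser settings may differ):

    import nltk
    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()

    def tokenise(text):
        # Case-fold first, then tokenise into sentences and words,
        # then stem each token.
        return [stemmer.stem(word)
                for sentence in nltk.sent_tokenize(text.lower())
                for word in nltk.word_tokenize(sentence)]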
@@ -170,6 +178,44 @@ A summary of the document properties we kept track of:
- BIGRAM_TITLE_LENGTH: length of the document TITLE-only biword index, for normalisation of biword terms
- TRIGRAM_TITLE_LENGTH: length of the document TITLE-only triword index, for normalisation of triword terms

# Helper Modules

We have three helper modules that are consistently used across indexing and searching.

## data_helper.py

The data helper contains shared methods as well as direct file reading and writing.
The shared methods include the NLTK tokenisation methods, since we want to apply consistent tokenisation to both documents and queries.

The direct file reading and writing is handled by pickle, but we abstract all of it here so that we could experiment with different
serialisation modules (JSON, XML). We wanted to experiment with this aspect because our initial searching was very slow and inefficient.
We did online research (see references) on serialisation methods and tested them out. In the end, our results still directed us back to pickle.
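A minimal sketch of this abstraction with pickle as the active backend (the wrapper names are illustrative, not the actual data_helper.py API):

    import pickle

    def write_data(obj, filename):
        # Single choke point for serialisation: swapping in another
        # backend (e.g. json) only requires changing these two functions.
        with open(filename, "wb") as f:
            pickle.dump(obj, f)

    def read_data(filename):
        with open(filename, "rb") as f:
            return pickle.load(f)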

## properties_helper.py

The properties helper module manages document properties, and stores the constants needed to access each specific property.
Having this module was useful in our experimentation, because we could easily add more document metadata. Document properties are not
used for retrieval itself; they mostly help with the relative ranking of the retrieved documents and with relevance feedback.

For example:
(1) lengths are used in normalisation of scores
(2) court and date metadata can be weighted into the scores
(3) document vectors are used for relevance feedback
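A minimal sketch of how these properties might be accessed (the keys and accessor are illustrative, not the actual properties_helper.py constants):

    # Illustrative keys into each document's metadata list.
    CONTENT_LENGTH = 0   # (1) used to normalise scores
    COURT = 1            # (2) weighted into the ranking
    DATE = 2             # (2) weighted into the ranking
    DOC_VECTOR = 3       # (3) used for relevance feedback

    def get_property(doc_properties, doc_id, key):
        # doc_properties maps docID -> [length, court, date, vector].
        return doc_properties[doc_id][key]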

## constants.py

(1) Test files
This file records the test files we were working with. For example, we only indexed the first 100 entries for most of the indexing
test phase. We can run this reduced indexing locally, and run the full indexing of all 17 000 entries on Tembusu.

(2) Intermediate files
As shown above, we have many intermediate files (other than dictionary.txt and postings.txt). Their file names are stored here, to be
written to at indexing time and read from at searching time.

(3) Searching parameters
To facilitate our experimentation, we tried many settings for the search. These settings are encapsulated here; for example, we can
set weights and ranking orders.
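An illustrative sketch of such a module (all names and values here are invented for illustration, not the actual constants.py):

    # (1) Test files: index a small slice locally, the full set on Tembusu.
    NUM_TEST_ENTRIES = 100

    # (2) Intermediate files, written at indexing time and read at searching time.
    DOCUMENT_PROPERTIES_FILE = "document-properties.txt"
    BIGRAM_POSTINGS_FILE = "bigram-postings.txt"

    # (3) Searching parameters tweaked during experimentation.
    TITLE_WEIGHT = 2.0
    CONTENT_WEIGHT = 1.0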

# Searching

At the top level of searching, query expansion is first done to the original query string to produce
