Commit
edit readme
theJrLinguist committed Apr 21, 2019
1 parent 2aa90f7 commit 1b23860
Showing 2 changed files with 28 additions and 33 deletions.
36 changes: 23 additions & 13 deletions README.txt
@@ -6,6 +6,8 @@ [email protected]
[email protected]
[email protected]

@@@ write the final decision

== Python Version ==

We're using Python Version <3.6> for this assignment.
@@ -34,7 +36,7 @@ data). More specifically, this was in document 2044863. We did not add any speci
because the case is isolated and it is unlikely that the user wants to look something up from the document.
There are also other Unicode characters that are not recognised; these can be resolved with UTF-8 encoding.

-# Indexing Algorithm
+# Indexing

We first tokenise the documents using the nltk tokenisers with stemming and case folding as in previous homeworks.
Our aim is to model the baseline tf-idf method as closely as possible. The increased features would mostly come
@@ -117,11 +119,11 @@ and take up little space (document length for biword and triword vectors) are do
stored. The operations which can be done quickly, and take up too much space when stored (term frequencies
and document frequencies) are done at searching time.
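The indexing-time versus search-time split described above can be sketched as follows. This is an illustrative reconstruction assuming the lnc log-frequency weighting described later in the README; `compute_doc_length` is an assumed name, not the submission's actual indexing code:

```python
import math

def compute_doc_length(term_freqs):
    """Precompute a document's lnc vector length at indexing time.

    term_freqs: dict mapping term -> raw frequency in one document.
    Only this single float needs to be stored per document; the term
    and document frequencies themselves are re-derived from the
    postings lists at search time.
    """
    weights = [1 + math.log10(tf) for tf in term_freqs.values() if tf > 0]
    return math.sqrt(sum(w * w for w in weights))

# Example: a document containing "cat cat dog"
length = compute_doc_length({"cat": 2, "dog": 1})
```

Storing only the scalar length keeps the index small while still allowing cosine normalisation at search time.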

-# Summary of index content
+## Summary of index content



-# Searching Algorithm
+# Searching

At the top level of searching, query expansion is first applied to the original query string to produce
multiple versions of the same query (see section on query expansion). Every query can be one of the following four types:
@@ -138,7 +140,7 @@ For a maximally complex query of type 1 (including boolean operator and phrases)
these four types of queries can be permuted and experimented with to determine the importance of preserving
the additional information of phrases and boolean operators.

-### Final decision here
+### Final decision here !!!!!!!!!!!

Before any query is processed by the Vector Space Model (VSM) evaluation class Eval, it is parsed into a
list where each item is either a single term or a list of terms, which represents a phrase. The terms are
@@ -178,7 +180,15 @@ algorithm in BooleanMerge is used (see below). After the reduced postings lists
proceeds as in a non-boolean query using the VSM evaluation. This is to ensure that even though a strict
intersection of all terms is enforced, the documents can still be ranked.
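The strict intersection of postings lists is a standard sorted-list merge; a minimal sketch of the kind of reduction BooleanMerge might perform (`intersect` and `intersect_all` are illustrative names, and the real class also has to carry positional data):

```python
def intersect(p1, p2):
    """Merge-intersect two sorted lists of document IDs."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

def intersect_all(postings):
    """Reduce several postings lists pairwise, smallest first,
    to keep intermediate results short."""
    postings = sorted(postings, key=len)
    out = postings[0]
    for p in postings[1:]:
        out = intersect(out, p)
    return out
```

The surviving documents are then handed to the VSM evaluation so they can still be ranked.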

-### Vector Space Model evaluation
+In addition, given that title and content were indexed separately, it is possible to run the same query twice
+to derive the cosine scores from searching the title and content fields, which are then combined using a linearly
+weighted function of the form:
+    a*(score from title search) + b*(score from content search)
+This functionality was implemented, but due to lack of training data, it was not possible to learn the appropriate
+weights that should be assigned to each field. One experiment was done on assigning an equal weight, which performed
+worse than a simple tf-idf baseline and was hence omitted.
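The linear field combination above can be sketched as follows (`combine_field_scores` is an illustrative name; the equal-weight experiment mentioned in the text corresponds to a = b = 0.5):

```python
def combine_field_scores(title_scores, content_scores, a=0.5, b=0.5):
    """Combine per-document cosine scores from the title and content
    searches with a linear weighting a*title + b*content.

    title_scores / content_scores: dicts mapping doc_id -> cosine score.
    A document missing from one field contributes 0 for that field.
    """
    combined = {}
    for doc_id in set(title_scores) | set(content_scores):
        combined[doc_id] = (a * title_scores.get(doc_id, 0.0)
                            + b * content_scores.get(doc_id, 0.0))
    return combined
```

With labelled training data, a and b could be fit by a simple grid search over held-out queries.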

+## Vector Space Model evaluation

The VSM evaluation follows the lnc.ltc ranking scheme, such that we compute tf-idf for the query, but only
log(tf) for the documents. To evaluate each query, a list of (term, term frequency) tuples is created from
@@ -220,9 +230,13 @@ first found, followed by "B C", and the two postings lists are then merged toget
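The lnc.ltc scheme described above, with tf-idf on the query side and log-frequency only on the document side, can be sketched as follows. This is a minimal illustration; the function names and use of log base 10 are assumptions, not the submission's Eval class:

```python
import math
from collections import Counter

def ltc_query_weights(query_terms, df, n_docs):
    """ltc weighting for the query: (1 + log tf) * log(N / df),
    cosine-normalised. Terms absent from the dictionary are skipped."""
    tf = Counter(query_terms)
    w = {t: (1 + math.log10(f)) * math.log10(n_docs / df[t])
         for t, f in tf.items() if df.get(t, 0) > 0}
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}

def lnc_ltc_score(query_terms, doc_tfs, doc_length, df, n_docs):
    """Cosine score of one document: lnc on the document side
    (1 + log tf, normalised by the precomputed document length),
    ltc on the query side."""
    q = ltc_query_weights(query_terms, df, n_docs)
    score = 0.0
    for t, w in q.items():
        if doc_tfs.get(t, 0) > 0:
            score += w * (1 + math.log10(doc_tfs[t]))
    return score / doc_length if doc_length else 0.0
```

The document length here is the scalar stored at indexing time, so only the query-side weights are computed per search.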

## Query expansion

-### Relaxing AND and phrasal queries
+### Relaxing boolean and phrasal queries

The first stage of query expansion, as explained above, involves relaxing the restrictions placed on the query from
phrases and boolean operators. Since the terms in the user queries may not be the exact terms desired, we need to
relax the AND portion of the query, so that even if the term given is not correct, the results for other parts of the
query can still be returned. To achieve a baseline tf-idf framework, all boolean operators and phrase markers were
stripped from the query string.

### WordNet/Thesaurus Query Expansion

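The body of this section is collapsed in the diff. As a rough illustration of the technique the heading names, synonym expansion with NLTK's WordNet interface might look like this; the function is an assumed sketch, not the submission's code, and the `synsets` parameter is injectable only so the sketch is testable without the WordNet corpus installed:

```python
def expand_with_synonyms(terms, max_synonyms=3, synsets=None):
    """Return the original terms plus up to max_synonyms WordNet
    synonyms per term (illustrative sketch)."""
    if synsets is None:
        # Requires nltk and the downloaded 'wordnet' corpus.
        from nltk.corpus import wordnet
        synsets = wordnet.synsets
    expanded = list(terms)
    for term in terms:
        seen = []
        for synset in synsets(term):
            for name in synset.lemma_names():
                word = name.replace("_", " ").lower()
                if word != term and word not in seen:
                    seen.append(word)
        expanded.extend(seen[:max_synonyms])
    return expanded
```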
@@ -255,10 +269,6 @@ retrieval. For simplicity, the original query vector is made to be a free text q
are removed and phrases are converted to single word terms. The additional documents found from relevance feedback
are appended after the already returned documents.
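Relevance feedback of this flavour is typically a Rocchio-style update, moving the query vector towards the centroid of the relevant documents. A sketch under the assumption that the query and documents are dicts of term weights; the alpha/beta constants and the function name are illustrative, not the submission's values:

```python
def rocchio_expand(query_vec, relevant_doc_vecs, alpha=1.0, beta=0.75):
    """Rocchio-style query modification: new_q = alpha*q + beta*centroid,
    where the centroid averages the judged-relevant document vectors."""
    if not relevant_doc_vecs:
        return dict(query_vec)
    centroid = {}
    for vec in relevant_doc_vecs:
        for term, w in vec.items():
            centroid[term] = centroid.get(term, 0.0) + w / len(relevant_doc_vecs)
    new_query = {t: alpha * w for t, w in query_vec.items()}
    for term, w in centroid.items():
        new_query[term] = new_query.get(term, 0.0) + beta * w
    return new_query
```

The expanded query is then re-evaluated, and any newly found documents are appended after the original results, as described above.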

-## Zoning

## Experimental Results

F2 results for documents appended in the following order:
@@ -288,20 +298,20 @@ Mean Average F2: 0.184953056130269

This performed worse than the baseline tf-idf.


== Files included with this submission ==

### need to regenerate class diagram

# data_helper.py - Manage the direct file reading and writing
# index.py - The driver file for indexing
# index-helper.py - The helper file for indexing, includes helper methods and data structures
# search.py - The driver file for search and query processing.
# search_helper.py - The helper file for search, query parsing and evaluation.
# PositionalMerge.py - The helper file for merging postings and positional lists to identify phrase queries.
# IntersectMerge.py - The helper file for merging of postings lists in Boolean queries.
# Eval.py - Evaluation class for computing cosine scores based on Vector Space Model (VSM).
-# query_expander.py - ###
+# QueryExpansion.py - File including code for query expansion, WordNet/thesaurus expansion, and relevance feedback.

== Statement of individual work ==

Expand Down
25 changes: 5 additions & 20 deletions search_helper.py
@@ -14,10 +14,6 @@
PHRASE_MARKER = "\""
INVALID_TERM_DF = -1

-######################## FILE READING FUNCTIONS ########################
-
-### Retrieve the posting list for a particular term
-###
def get_posting(postings_handler, dictionary, term):
'''
Retrieves the posting lists for a particular term. Each posting is
@@ -37,9 +33,6 @@ def get_posting(postings_handler, dictionary, term):
# Term does not exist in dictionary
return INVALID_TERM_DF, list()

-### Retrieve a query format given the query file
-###

def get_query(query):
'''
Parses a query string into a list of terms, where a term is either a single word string,
@@ -198,10 +191,12 @@ def process_query(postings_handler, dictionary, doc_properties, query, is_title)
[SINGLE_TERMS_WEIGHT, BIWORD_PHRASES_WEIGHT, TRIWORD_PHRASES_WEIGHT])
return score_dict

-### comment properly
def get_best_documents(postings_handler, dictionary, doc_properties, query):
'''
-Returns the top documents based on the content and title fields separately.
+This function runs search on the content and title fields separately, and then
+combines the cosine scores returned from each field.
:param postings_handler: a handler to access the postings list file.
:param dictionary: the dictionary mapping terms to pointers to each posting list in the postings handler.
:param doc_properties: the dictionary mapping documents to various properties such as document vector length.
@@ -251,14 +246,4 @@ def relevance_feedback(postings_handler, dictionary, doc_properties, query, rele
posting_lists = get_posting_lists(postings_handler, terms, dictionary)
new_query_scores = Eval(terms, posting_lists, dictionary, doc_properties, query_vector=tf_idf).eval_query()
top_docs = get_top_scores_from_dict(new_query_scores)
return top_docs

-'''
-def identify_courts(query_string):
-    # Returns courts that exist within a query string.
-    courts = []
-    for court in COURT_HIERARCHY:
-        if court in query_string:
-            courts.append(court)
-    return courts
-'''
