nilayjain/text-search-engine
We've used the Vector Space model. The corpus consists of roughly 1500 to 1550 documents, and it is attached with the assignment.

From the directory that this readme file is in, type the following command to run the code:

    python ir.py

The script takes a few seconds to run. Type in your query when prompted on the console. The ranked results will be displayed, each with its docid and weight. Entering an empty query (i.e. just pressing Enter) exits the program.

Source of the Dataset:
http://qwone.com/~jason/20Newsgroups/
20 Newsgroups sorted by date; duplicates and some headers removed.

You can also run the code on other documents by placing them in the corpus folder and renaming them as doc****, where **** is a 4-digit number.

NOTE:
1. Because we have used the Vector Space model, which ranks documents instead of returning a fixed result set, precision and recall are not reported.
2. The output is given in ascending order of cosine score (ascending order of relevance), so the most relevant result is printed last and can be seen on the screen itself. (You won't have to scroll up to get to the first result.)

TEST CASES:

1. query : hello world
Last 10 lines of Output:
The docid is 840 and the weight is 0.0123868243244
The docid is 414 and the weight is 0.0125227040535
The docid is 795 and the weight is 0.0126194983527
The docid is 075 and the weight is 0.0127447172068
The docid is 446 and the weight is 0.0128253567852
The docid is 442 and the weight is 0.0131724993109
The docid is 336 and the weight is 0.0131857066489
The docid is 339 and the weight is 0.0133365372412
The docid is 293 and the weight is 0.0137057664574
The docid is 828 and the weight is 0.0171555644517
____________________________________

2. query : help me
Last 10 lines of Output:
The docid is 222 and the weight is 0.064405360291
The docid is 714 and the weight is 0.0645867135002
The docid is 717 and the weight is 0.0654263948886
The docid is 152 and the weight is 0.0670786638134
The docid is 583 and the weight is 0.0719265769142
The docid is 858 and the weight is 0.0886557609352
The docid is 431 and the weight is 0.092779776323
The docid is 197 and the weight is 0.0941716886973
The docid is 313 and the weight is 0.115243508121
The docid is 322 and the weight is 0.129980295508
____________________________________

3. query : please give us full marks
Last 10 lines of Output:
The docid is 359 and the weight is 0.0304982690928
The docid is 903 and the weight is 0.0342245838741
The docid is 908 and the weight is 0.0370779530946
The docid is 029 and the weight is 0.0377193952148
The docid is 379 and the weight is 0.0389999670468
The docid is 079 and the weight is 0.0443838393164
The docid is 1219 and the weight is 0.0458483706654
The docid is 319 and the weight is 0.0515440732627
The docid is 330 and the weight is 0.0539118537392
The docid is 1501 and the weight is 0.0554254846328

THANK YOU
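For readers unfamiliar with the Vector Space model, the ranking described above can be sketched roughly as follows. This is an illustration only, not the code in ir.py: the function names (tokenize, build_index, weight_vector, rank) are hypothetical, and the weighting scheme (log-scaled term frequency, inverse document frequency, unit-length normalisation) is an assumption, since this readme only states that documents are ranked by cosine score. The sketch is written for Python 3.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weight_vector(counts, idf):
    """Weight each term by (1 + log10 tf) * idf, then scale to unit length."""
    vec = {t: (1 + math.log10(c)) * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def build_index(docs):
    """Return (idf, doc_vectors) for a dict mapping docid -> document text."""
    n_docs = len(docs)
    term_counts = {docid: Counter(tokenize(text)) for docid, text in docs.items()}

    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for counts in term_counts.values():
        df.update(counts.keys())

    idf = {term: math.log10(n_docs / freq) for term, freq in df.items()}
    doc_vectors = {
        docid: weight_vector(counts, idf) for docid, counts in term_counts.items()
    }
    return idf, doc_vectors


def rank(query, idf, doc_vectors):
    """Return (docid, score) pairs with a non-zero score, lowest score first."""
    q_vec = weight_vector(Counter(tokenize(query)), idf)
    scores = []
    for docid, d_vec in doc_vectors.items():
        # Both vectors have unit length, so the dot product is the cosine.
        score = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
        if score > 0:
            scores.append((docid, score))
    return sorted(scores, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Tiny made-up corpus; the docids are placeholders, not real corpus files.
    docs = {
        "001": "hello world hello",
        "002": "goodbye world",
        "003": "cats and dogs",
    }
    idf, doc_vectors = build_index(docs)
    for docid, score in rank("hello world", idf, doc_vectors):
        print("The docid is", docid, "and the weight is", score)
```

Sorting in ascending order mirrors the choice described in the NOTE: the best match is printed last, so it stays visible at the bottom of the console.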
About
A simple text search engine in python that uses vector space model.