Implementation of a search engine on the cacm and CS276 (Stanford) collections.

pvnieo/searchy

Search engine

Implementation of a search engine for a collection of files.

Installation

Searchy runs on Python >= 3.6. Use pip to install the dependencies:

pip3 install -r requirements.txt

Install the data packages required by nltk with the following command:

python3 -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet');"

Usage

Use the searchy.py script to index a collection:

usage: searchy.py [-h] [-q QUERY] [-m {bool,vect}]
                  [-n {cos,dice,jaccard,overlap}] [-t THRESHOLD]
                  [-w {f,tfidf,nf}] [-s] [-f] [--no-cache]
                  collection

Builds a search engine on a collection of documents

positional arguments:
  collection            Path to collection file (CACM format), directory or
                        url to zip

optional arguments:
  -h, --help            show this help message and exit
  -q QUERY, --query QUERY
                        Execute a search query
  -m {bool,vect}, --model {bool,vect}
                        Search engine model
  -n {cos,dice,jaccard,overlap}, --norm {cos,dice,jaccard,overlap}
                        Vectorial search norm
  -t THRESHOLD, --threshold THRESHOLD
                        Vectorial search norm threshold
  -w {f,tfidf,nf}, --weighting {f,tfidf,nf}
                        Vectorial weighting method
  -s, --silent          Disable verbose mode
  -f, --force           Force re-indexing overwrite cache
  --no-cache            Disable disk cache
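
The help text above can be mirrored with a small argparse setup. This is only a sketch reconstructed from the printed usage, not the actual code of searchy.py: the function name build_parser, the float type for the threshold, and the default model "vect" are assumptions (the last one is inferred from the vector example, which runs without -m).

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="searchy.py",
        description="Builds a search engine on a collection of documents",
    )
    parser.add_argument(
        "collection",
        help="Path to collection file (CACM format), directory or url to zip",
    )
    parser.add_argument("-q", "--query", help="Execute a search query")
    # Assumption: "vect" is the default model.
    parser.add_argument("-m", "--model", choices=["bool", "vect"], default="vect",
                        help="Search engine model")
    parser.add_argument("-n", "--norm", choices=["cos", "dice", "jaccard", "overlap"],
                        help="Vectorial search norm")
    # Assumption: the threshold is parsed as a float.
    parser.add_argument("-t", "--threshold", type=float,
                        help="Vectorial search norm threshold")
    parser.add_argument("-w", "--weighting", choices=["f", "tfidf", "nf"],
                        help="Vectorial weighting method")
    parser.add_argument("-s", "--silent", action="store_true",
                        help="Disable verbose mode")
    parser.add_argument("-f", "--force", action="store_true",
                        help="Force re-indexing overwrite cache")
    parser.add_argument("--no-cache", action="store_true",
                        help="Disable disk cache")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["-m", "bool", "-q", "processes & proofs", "data/CACM/cacm.all"]
)
```

Here args.model is "bool", args.query holds the query string, and args.no_cache stays False because the flag was not passed.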

Usage example

Vector model

Queries are plain sentences. Here we search the CACM collection.

$ ./searchy.py data/CACM/cacm.all
Loading data/CACM/cacm.all
Using cache 64f76a63
  documents 	 3204
  tokens 	 113754
  terms 	 5961
memory: 0.42 mb
🔍  > Processes and Proofs of Theorems and Programs
 -----
 3079. An Algorithm for Reasoning About Equality [93.99%]
 -----
.T
An Algorithm for Reasoning About Equality
.W
A simple technique for reasoning about equalities
that is fast and complete for ground formulas
...
 -----
 3140. Social Processes and Proofs of Theorems and Programs [93.87%]
 -----
.T
Social Processes and Proofs of Theorems and Programs
.W
It is argued that formal verifications of
programs, no matter how obtained, will not play the
same key role in the development of computer science and software
engineering as proofs do in mathematics.  Furthermore the absence
...

total results: 260     2.94 s
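
The four values accepted by --norm (cos, dice, jaccard, overlap) match the names of standard vector-space similarity measures. The sketch below shows their common textbook definitions on sparse term-weight vectors. It assumes searchy uses these definitions, which the README does not confirm. The names dot, sq_norm and similarity are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def sq_norm(u):
    """Squared Euclidean norm of a sparse vector."""
    return sum(w * w for w in u.values())


def similarity(query, doc, norm="cos"):
    """Textbook similarity between a query vector and a document vector."""
    p = dot(query, doc)
    nq, nd = sq_norm(query), sq_norm(doc)
    if nq == 0 or nd == 0:
        return 0.0  # an empty vector matches nothing
    if norm == "cos":
        return p / math.sqrt(nq * nd)
    if norm == "dice":
        return 2 * p / (nq + nd)
    if norm == "jaccard":
        return p / (nq + nd - p)
    if norm == "overlap":
        return p / min(nq, nd)
    raise ValueError(f"unknown norm: {norm}")


# Toy vectors with raw term frequencies as weights.
query = {"proof": 1.0, "program": 1.0}
doc = {"proof": 1.0, "program": 1.0, "social": 1.0}
scores = {n: similarity(query, doc, n) for n in ("cos", "dice", "jaccard", "overlap")}
```

With binary weights these reduce to the familiar set-based coefficients. A threshold such as the one set with -t would then keep only documents whose score exceeds it.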

To load the Stanford collection quickly, you can download it and extract it into the dumps/pa1-data/pa1-data directory, so that the layout looks like this:

dumps/pa1-data/pa1-data/0
dumps/pa1-data/pa1-data/1
...
dumps/pa1-data/pa1-data/9

Then load it with searchy:

$ ./searchy.py dumps/pa1-data

Alternatively, you can pass the URL directly as the argument, which performs the previous steps automatically.

$ ./searchy.py http://web.stanford.edu/class/cs276/pa/pa1-data.zip
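
For readers who prefer to script the manual download, the following sketch fetches a zip archive and unpacks it using only the standard library. It is not taken from searchy. The helper names extract_zip_bytes and fetch_collection are hypothetical, and the default destination dumps/pa1-data assumes the archive contains a top-level pa1-data folder.

```python
import io
import urllib.request
import zipfile
from pathlib import Path


def extract_zip_bytes(data, dest):
    """Unpack an in-memory zip archive into the directory dest."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        archive.extractall(dest)
    return dest


def fetch_collection(url, dest="dumps/pa1-data"):
    """Download a zip archive from url and extract it under dest.

    Assumption: the whole archive fits in memory.
    """
    with urllib.request.urlopen(url) as response:
        return extract_zip_bytes(response.read(), dest)
```

Calling fetch_collection("http://web.stanford.edu/class/cs276/pa/pa1-data.zip") should then produce the layout shown above, provided the archive has the assumed top-level folder.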

Boolean model

Queries must use the following boolean format: (word1 & word2) | ~word3. The allowed boolean operators are & (and), | (or), and ~ (negation).

$ ./searchy.py -m bool data/CACM/cacm.all
Loading data/CACM/cacm.all
Using cache 64f76a63
  documents 	 3204
  tokens 	 113754
  terms 	 5961
memory: 0.42 mb
🔍  > processes & Proofs & theorems & programs
 -----
 3140. Social Processes and Proofs of Theorems and Programs [100.00%]
 -----
.T
Social Processes and Proofs of Theorems and Programs
.W
It is argued that formal verifications of
programs, no matter how obtained, will not play the
same key role in the development of computer science and software
engineering as proofs do in mathematics.  Furthermore the absence
of continuity, the inevitability of change, and the complexity of
specification of significantly many real programs make the form
al verification process difficult to justify and manage.  It is felt
that ease of formal verification should not dominate program
language design.
.K
Formal mathematics, mathematical proofs,
program verification, program specification
2.10 4.6 5.24

total results: 1     2.96 s
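
A boolean query in the format above can be evaluated against an inverted index with a small recursive-descent parser. The sketch below is one possible approach, not searchy's implementation. It assumes the usual precedence (~ binds tighter than &, which binds tighter than |), only lowercases terms where a real engine would likely also normalize them, and uses the hypothetical names evaluate, index and all_docs.

```python
import re

# A token is an operator, a parenthesis, or a run of other non-space characters.
TOKEN = re.compile(r"\s*([()&|~]|[^\s()&|~]+)")


def evaluate(query, index, all_docs):
    """Return the set of document ids matching a boolean query.

    index maps a term to the set of ids of documents that contain it;
    all_docs is the set of every document id (needed for negation).
    """
    tokens = [t.lower() for t in TOKEN.findall(query)]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():
        result = parse_and()
        while peek() == "|":
            advance()
            result = result | parse_and()  # union
        return result

    def parse_and():
        result = parse_not()
        while peek() == "&":
            advance()
            result = result & parse_not()  # intersection
        return result

    def parse_not():
        tok = peek()
        if tok == "~":
            advance()
            return all_docs - parse_not()  # complement
        if tok == "(":
            advance()
            result = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return result
        if tok is None or tok in ("&", "|", ")"):
            raise ValueError("expected a term")
        advance()
        return set(index.get(tok, ()))

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result


# Toy inverted index with made-up document ids.
index = {"processes": {1, 2}, "proofs": {2, 3}, "theorems": {2}, "programs": {2, 4}}
all_docs = {1, 2, 3, 4, 5}
hits = evaluate("processes & Proofs & theorems & programs", index, all_docs)
```

On this toy index the conjunctive query matches only document 2, because it is the only id present in all four posting sets.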
