This is the repository of the Poogle search engine. The final report is available here.
We use Apache Maven to manage compilation, testing, and execution.
We use AWS DynamoDB for storage.
To run the code, you'll need to create the following DynamoDB tables:

- `URL`: uses `url` as the partition key and contains a global secondary index `md5-weight-index`, which uses `md5` as the partition key and `weight` as the sort key. The table also has the attributes `date` and `outboundLinks`.
- `DOCUMENT`: uses `md5` as the partition key and contains the attributes `date` and `document`.
- `INVIDX`: uses `word` as the partition key and `md5` as the sort key.
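If you prefer to create these tables programmatically, the sketch below shows one way to do it with the AWS SDK for Java v2. The attribute types (`S`/`N`) and on-demand billing mode are assumptions for illustration, not requirements of the project.

```java
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.*;

public class CreateTables {
    public static void main(String[] args) {
        try (DynamoDbClient ddb = DynamoDbClient.create()) {
            // URL table: partition key "url", plus the md5-weight-index GSI on (md5, weight).
            ddb.createTable(CreateTableRequest.builder()
                    .tableName("URL")
                    .attributeDefinitions(
                            AttributeDefinition.builder().attributeName("url")
                                    .attributeType(ScalarAttributeType.S).build(),
                            AttributeDefinition.builder().attributeName("md5")
                                    .attributeType(ScalarAttributeType.S).build(),
                            AttributeDefinition.builder().attributeName("weight")
                                    .attributeType(ScalarAttributeType.N).build())
                    .keySchema(KeySchemaElement.builder()
                            .attributeName("url").keyType(KeyType.HASH).build())
                    .globalSecondaryIndexes(GlobalSecondaryIndex.builder()
                            .indexName("md5-weight-index")
                            .keySchema(
                                    KeySchemaElement.builder().attributeName("md5")
                                            .keyType(KeyType.HASH).build(),
                                    KeySchemaElement.builder().attributeName("weight")
                                            .keyType(KeyType.RANGE).build())
                            .projection(Projection.builder()
                                    .projectionType(ProjectionType.ALL).build())
                            .build())
                    .billingMode(BillingMode.PAY_PER_REQUEST)
                    .build());
            // DOCUMENT (partition key "md5") and INVIDX (partition key "word",
            // sort key "md5") follow the same pattern without a GSI.
            // Non-key attributes (date, document, outboundLinks) need no definition.
        }
    }
}
```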
To run each part of the code, add and apply the corresponding command below in a Run Configuration:
clean install exec:java@crawler # Run the crawler and update the database
clean install exec:java@pagerank # Run the pagerank MapReduce job and update the database
clean install exec:java@indexer # Run the indexer MapReduce job and update the database
clean install exec:java@server # Start the search engine server
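From a terminal, the same goals can be run directly with Maven, for example `mvn clean install exec:java@crawler` (this assumes the named `exec-maven-plugin` executions are defined in the project's `pom.xml`).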
For the frontend on the client side, go to the `./client` directory and execute `npm start`. Open http://localhost:3000 to view the web page in a browser.
We have two major versions of the crawler: one implemented with a thread pool, and another implemented with Apache Storm and Kafka.
The main entry point is `edu.upenn.cis.cis455.crawler.Crawler`; currently all crawler files are on the `crawler-k8s-threadpool` branch. The crawler uses a Bloom filter to skip duplicate and already-seen URLs, keeps an LRU cache of robots.txt files, and groups several URLs together for batch database updates. Non-HTML content is parsed with Apache Tika. All web metadata and documents are stored in Amazon DynamoDB.
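As a rough illustration of the deduplication and robots.txt caching described above (this is not the actual crawler code; the class name, the use of Guava's Bloom filter, and the capacity numbers are all assumptions), the pattern looks roughly like this:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class FrontierFilter {
    // Bloom filter over URLs already enqueued or crawled (1M expected insertions, 1% FPP).
    private final BloomFilter<CharSequence> seen =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

    // LRU cache of robots.txt contents keyed by host, capped at 1000 entries.
    private final Map<String, String> robotsCache =
            new LinkedHashMap<String, String>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                    return size() > 1_000;
                }
            };

    /** Returns true the first time a URL is seen, false for likely duplicates. */
    public synchronized boolean markIfNew(String url) {
        if (seen.mightContain(url)) {
            return false;
        }
        seen.put(url);
        return true;
    }

    public synchronized void cacheRobots(String host, String robotsTxt) {
        robotsCache.put(host, robotsTxt);
    }

    public synchronized String getRobots(String host) {
        return robotsCache.get(host);
    }
}
```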
We intend to deploy the thread-pool version of the distributed crawler with Kubernetes: each crawler runs as a separate program, except that they all share a common URL queue hosted on Amazon SQS. We will also explore the more powerful distributed crawler implemented with Apache Storm and Kafka. Our plan is to host Storm and Kafka (nimbus, ZooKeeper, supervisors, etc.) on a Kubernetes cluster.
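A minimal sketch of what sharing the URL frontier through SQS could look like with the AWS SDK for Java v2 is shown below; the queue name, batch size, and class name are illustrative assumptions rather than the project's actual configuration.

```java
import java.util.ArrayList;
import java.util.List;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.*;

public class SharedFrontier {
    private final SqsClient sqs = SqsClient.create();
    private final String queueUrl;

    public SharedFrontier(String queueName) {
        this.queueUrl = sqs.getQueueUrl(
                GetQueueUrlRequest.builder().queueName(queueName).build()).queueUrl();
    }

    /** Push a discovered URL onto the shared frontier. */
    public void enqueue(String url) {
        sqs.sendMessage(SendMessageRequest.builder()
                .queueUrl(queueUrl).messageBody(url).build());
    }

    /** Pull up to 10 URLs for this crawler instance and delete them from the queue. */
    public List<String> dequeueBatch() {
        ReceiveMessageResponse resp = sqs.receiveMessage(ReceiveMessageRequest.builder()
                .queueUrl(queueUrl).maxNumberOfMessages(10).waitTimeSeconds(10).build());
        List<String> urls = new ArrayList<>();
        for (Message m : resp.messages()) {
            urls.add(m.body());
            sqs.deleteMessage(DeleteMessageRequest.builder()
                    .queueUrl(queueUrl).receiptHandle(m.receiptHandle()).build());
        }
        return urls;
    }
}
```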
We have implemented an EMR-based PageRank.
The main function is located at `edu.upenn.cis.cis455.pagerank.PageRankInterface`; it takes three arguments:
- The input file location containing URLs and their outbound links.
- The desired output directory.
- A boolean that is true if we want to distribute less weight to intra-domain links, and false if we want to treat intra-domain and inter-domain links the same.
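A typical invocation therefore passes an input path, an output path, and `true` or `false` as the three program arguments, for example S3 locations such as `s3://my-bucket/links/` and `s3://my-bucket/pagerank-out/` when running on EMR (these paths are purely illustrative).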
We have implemented an EMR-based indexer that creates an inverted index for the crawled document corpus.
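The core map/reduce logic of an inverted index can be sketched as below. This is an illustrative Hadoop example under an assumed input format (`md5<TAB>document text` lines), not the project's actual indexer, which writes its postings into the INVIDX table.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndexSketch {

    /** Reads "md5<TAB>text" lines and emits (word, md5) pairs. */
    public static class TokenMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) {
                return;
            }
            Text md5 = new Text(parts[0]);
            for (String token : parts[1].toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    context.write(new Text(token), md5);
                }
            }
        }
    }

    /** Collapses the md5 values for each word into one de-duplicated posting list. */
    public static class PostingReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> md5s, Context context)
                throws IOException, InterruptedException {
            Set<String> unique = new HashSet<>();
            for (Text m : md5s) {
                unique.add(m.toString());
            }
            context.write(word, new Text(String.join(",", unique)));
        }
    }
}
```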
We used React.js to develop the frontend of the search engine, with reference to this Medium article. Code was adapted from https://github.com/5ebs/Google-Clone with substantial modification. Users can see the URL and a snippet preview of each web page. Our search engine caches search results in BerkeleyDB, so when a user searches the same query again, the engine responds quickly with the cached result.
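The result-caching idea can be illustrated with a small Berkeley DB Java Edition wrapper like the one below; the class name, database name, and string-based serialization are assumptions for illustration, not the engine's actual code.

```java
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;
import java.io.File;
import java.nio.charset.StandardCharsets;

public class QueryCache {
    private final Environment env;
    private final Database db;

    public QueryCache(File dir) {
        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setAllowCreate(true);
        env = new Environment(dir, envCfg);
        DatabaseConfig dbCfg = new DatabaseConfig();
        dbCfg.setAllowCreate(true);
        db = env.openDatabase(null, "queryCache", dbCfg);
    }

    /** Store the serialized result list for a query string. */
    public void put(String query, String resultJson) {
        db.put(null,
                new DatabaseEntry(query.getBytes(StandardCharsets.UTF_8)),
                new DatabaseEntry(resultJson.getBytes(StandardCharsets.UTF_8)));
    }

    /** Return the cached result for a query, or null on a cache miss. */
    public String get(String query) {
        DatabaseEntry value = new DatabaseEntry();
        OperationStatus status = db.get(null,
                new DatabaseEntry(query.getBytes(StandardCharsets.UTF_8)), value, LockMode.DEFAULT);
        return status == OperationStatus.SUCCESS
                ? new String(value.getData(), StandardCharsets.UTF_8) : null;
    }

    public void close() {
        db.close();
        env.close();
    }
}
```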
- Crawler can handle non-HTML data.
- Crawler can store partial metadata about the web documents.
- Indexer uses metadata of web pages to improve the rankings.
- Crawler: `edu.upenn.cis.cis455.crawler`
- PageRank: `edu.upenn.cis.cis455.pagerank`
- Indexer: `edu.upenn.cis.cis455.indexer`
- Search Engine: `edu.upenn.cis.cis455.searchengine`