Genesis Crawler is a dark web-focused crawling tool built with Docker Compose. It consists of two main variants:
- Generic Crawler: Crawls and gathers data from generic websites.
- Specific Crawler: Loads custom parsers from a server to specifically crawl supported websites in a fine-tuned manner.
The project is designed for heavy-duty web scraping, equipped with tools to detect illegal material and categorize content. Built on top of multitor, it provides enhanced anonymity and protection while crawling the dark web.
- Docker Compose Setup: Easy to set up and deploy via Docker Compose.
- Anonymity with multitor: Ensures anonymity while crawling dark web content.
- Customizable Crawling: Custom parsers are used to specifically crawl certain websites, giving fine-grained control over the crawling process.
- Illegal Content Detection: Equipped with tools to detect and categorize illegal content (see the sketch after this list).
- Two Crawling Variants:
- Generic crawler for general data collection.
- Specific crawler with custom parsers for in-depth site crawling.
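
As a rough illustration of how content categorization might work, here is a minimal sketch using the `transformers` library (listed in the technology stack below). The model checkpoint and candidate labels are illustrative assumptions, not the project's actual detection pipeline.

```python
# Illustrative sketch: zero-shot categorization of crawled page text.
# Assumptions: the model checkpoint and candidate labels below are
# placeholders; Genesis Crawler's real detection pipeline may differ.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",  # common public checkpoint (assumption)
)

candidate_labels = ["marketplace", "forum", "news", "potentially illegal"]
page_text = "Example page text extracted by the crawler."

result = classifier(page_text, candidate_labels=candidate_labels)
# The top-ranked label and its score give a coarse category for the page.
print(result["labels"][0], round(result["scores"][0], 3))
```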
The project leverages multiple programming languages, tools, and libraries to achieve its goals:
- Languages:
  - Python: Used for the core logic, handling web requests, processing data, and managing workflows.
- Libraries/Frameworks:
  - Web Scraping & Parsing: `requests`, `beautifulsoup4`, `lxml`, `urllib3`, and `html-similarity` are used for fetching and parsing HTML content from the web.
  - Data Processing: `pandas`, `numpy`, `scikit-learn`, and `gensim` provide tools for data manipulation, machine learning, and natural language processing.
  - Natural Language Processing (NLP): `spacy`, `nltk`, and `thefuzz` enable advanced text analysis and similarity checking.
  - Machine Learning & AI: `transformers`, `torch`, and `onnxruntime` support deep learning and model inference.
  - Database & Search: `elasticsearch`, `pymongo`, and `redis` for efficient data storage and retrieval.
  - Task Management & Scheduling: `celery`, `schedule`, and `eventlet` handle distributed tasks and job scheduling.
  - Security & Encryption: `fernet` is used for data encryption (see the first sketch after this list).
  - Networking & Proxying: `socks`, `aiohttp_socks`, and `requests[socks]` enable proxy-based web requests, especially useful for dark web crawling (see the second sketch after this list).
  - Error Logging: `logdna` and `raven` help monitor and log errors during the crawling process.
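As a concrete illustration of the encryption layer, here is a minimal Fernet sketch (Fernet ships with the `cryptography` package). How keys are stored and what exactly gets encrypted are assumptions, not the project's actual scheme.

```python
# Minimal Fernet sketch: symmetric encryption of crawled data.
# Assumption: in the real project the key would come from secure
# configuration, not be generated ad hoc like this.
from cryptography.fernet import Fernet

key = Fernet.generate_key()
cipher = Fernet(key)

token = cipher.encrypt(b"scraped page content")  # opaque, URL-safe token
plain = cipher.decrypt(token)                    # back to the original bytes
assert plain == b"scraped page content"
```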
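And here is a minimal sketch of a proxied request with `requests[socks]`, assuming a local Tor (or multitor) SOCKS listener on port 9050; the port and the `.onion` address are placeholders, not values taken from the project.

```python
# Minimal sketch: fetching a page through a Tor SOCKS proxy.
# Assumptions: a Tor/multitor SOCKS listener on 127.0.0.1:9050 and a
# placeholder .onion URL; Genesis Crawler's actual proxy wiring may differ.
import requests

proxies = {
    # socks5h (not socks5) resolves hostnames through the proxy,
    # which is required for .onion addresses.
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

response = requests.get(
    "http://exampleonionaddress.onion/",  # placeholder address
    proxies=proxies,
    timeout=60,
)
print(response.status_code, len(response.text))
```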
To get started with Genesis Crawler, follow these steps:
Clone the repository from GitHub and navigate to the project directory:

```bash
git clone https://github.com/msmannan00/Genesis-Crawler.git
cd genesis-crawler
```
Ensure you have Docker and Docker Compose installed on your machine; all remaining dependencies are handled inside the containers via Docker Compose.
Use Docker Compose to build the crawler:

```bash
./run.sh build
```

To simply start the crawler, run:

```bash
./run.sh
```

To refresh the set of unique URLs, removing both duplicate URLs and URLs that are no longer active, run:

```bash
./run.sh invoke_unique_crawler
```

This will start the crawler, which can then begin collecting data.
For specific website crawling, you can provide your own parsers. Load them onto the server and configure the crawler to use these custom parsers for enhanced scraping capabilities; a hypothetical sketch of such a parser follows.
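The parser interface is defined by the server, so the following is only a hypothetical sketch of what a site-specific parser might look like; the class name, method signature, and CSS selectors are invented for illustration.

```python
# Hypothetical site-specific parser sketch built on beautifulsoup4 + lxml.
# Assumptions: the class name, parse() signature, and CSS selectors are
# invented for illustration; the real interface is defined by the server.
from bs4 import BeautifulSoup


class ExampleMarketParser:
    """Extracts structured listings from one supported site."""

    def parse(self, html: str) -> list[dict]:
        soup = BeautifulSoup(html, "lxml")
        records = []
        for listing in soup.select("div.listing"):  # hypothetical selector
            title = listing.select_one("h2.title")
            price = listing.select_one("span.price")
            records.append({
                "title": title.get_text(strip=True) if title else None,
                "price": price.get_text(strip=True) if price else None,
            })
        return records
```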
We welcome contributions to improve Genesis Crawler. If you'd like to contribute, please fork the repository and submit a pull request.
- Fork the repository.
- Create a new feature branch (`git checkout -b feature-branch`).
- Commit your changes (`git commit -m 'Add some feature'`).
- Push to the branch (`git push origin feature-branch`).
- Create a new Pull Request.
Genesis Crawler is licensed under the MIT License.
This project is intended for research purposes only. The authors of Genesis Crawler do not support or endorse illegal activities, and users of this project are responsible for ensuring their actions comply with the law.
GitHub Repository URL: https://github.com/msmannan00/Genesis-Crawler