This tool collects data from multiple sources, including GitHub repositories and websites, and can consolidate the collected data into a specified token context window using Ollama and Llama models.
- Python 3.7 or higher
- uv or pip (Python package installer; the install steps below use uv)
- Clone this repository:

  ```bash
  git clone https://github.com/bacalhau-project/scraper.git
  cd scraper
  ```
- Install the required Python packages:

  ```bash
  uv venv .venv --seed
  source .venv/bin/activate
  uv pip install -r requirements.txt
  ```
- Install Ollama:
  - For Linux:

    ```bash
    curl https://ollama.ai/install.sh | sh
    ```

  - For macOS:

    ```bash
    brew install ollama
    ```

  - For Windows: download the installer from Ollama's official website.
- Pull the Llama3.1 model using Ollama:

  ```bash
  ollama pull llama3.1:8b
  ```
The script provides several options:
- Download data:

  ```bash
  python main.py --download
  ```
- Consolidate data into multiple files (max 5MB each):

  ```bash
  python main.py --consolidate
  ```
- Consolidate data to a specific token context size using Ollama:

  ```bash
  python main.py --context-consolidate 2048 --model llama3.1:8b
  ```

  You can adjust the context size (e.g., 2048) and the model name as needed; a sketch of what this kind of consolidation might look like follows this list.
- Perform all operations:

  ```bash
  python main.py --download --consolidate --context-consolidate 2048
  ```
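The project's actual consolidation logic lives in `main.py` and is not reproduced here; the snippet below is only a minimal sketch of how a token-budget consolidation step could be approached with the official `ollama` Python client. The function `consolidate_to_context` and the rough 4-characters-per-token heuristic are illustrative assumptions, not part of this repository.

```python
# Hypothetical sketch only -- the repository's real implementation may differ.
import ollama  # official Ollama Python client (pip install ollama)


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token for English prose)."""
    return len(text) // 4


def consolidate_to_context(text: str, context_size: int, model: str = "llama3.1:8b") -> str:
    """Repeatedly summarize `text` until the rough token estimate fits `context_size`."""
    while estimate_tokens(text) > context_size:
        response = ollama.generate(
            model=model,
            prompt=f"Summarize the following content as concisely as possible:\n\n{text}",
        )
        summary = response["response"]
        if estimate_tokens(summary) >= estimate_tokens(text):
            break  # the model could not compress further; stop to avoid looping forever
        text = summary
    return text
```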
Edit the `config.json` file to specify:
- Output directory
- GitHub repositories to clone/update
- Websites to scrape
- Maximum depth for web crawling
- Maximum pages per site
- Number of worker threads
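The exact key names are defined by this repository's `config.json`; the example below is only an illustrative guess that mirrors the options listed above, with assumed field names and placeholder values.

```json
{
  "output_dir": "./output",
  "github_repos": ["https://github.com/bacalhau-project/bacalhau"],
  "websites": ["https://docs.bacalhau.org"],
  "max_depth": 3,
  "max_pages_per_site": 100,
  "worker_threads": 4
}
```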
If you encounter any issues with Ollama or the Llama model:
- Ensure Ollama is running:

  ```bash
  ollama serve
  ```
- Check available models:

  ```bash
  ollama list
  ```
- If the Llama3.1 model is missing, pull it again:

  ```bash
  ollama pull llama3.1:8b
  ```
- For more detailed Ollama usage, refer to the Ollama documentation.
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.