A Python-based OSINT (Open Source Intelligence) tool that collects stories from multiple sources and identifies related content to build a comprehensive intelligence picture. Think of it as mosaic intelligence: individual stories are tiles that contribute to a bigger picture.
Live Demo: https://arandomguyhere.github.io/News_Feeder/
- Multi-Source Collection: Gathers stories from:
  - NewsAPI (80k+ news sources)
  - GDELT (Global news events database)
  - Google News scraping
  - Bing News scraping
- NLP Processing:
  - Named Entity Recognition (people, organizations, locations, events)
  - Keyword extraction
  - Text similarity analysis
- Story Correlation:
  - Finds related stories across sources
  - Clusters stories by similarity
  - Identifies shared entities and topics
  - Temporal and semantic relationship detection
- Output Formats:
  - JSON (stories, clusters, summaries)
  - HTML reports with visualizations
  - Entity graphs and timelines

Prerequisites:
- Python 3.8+
- pip
- Internet connection

- Clone the repository:

      git clone https://github.com/arandomguyhere/News_Feeder.git
      cd News_Feeder

- Run the setup script:

      ./setup.sh

  This will:
  - Create a virtual environment
  - Install all dependencies
  - Download spaCy language model
  - Create necessary directories
  - Set up configuration files

- Configure API keys (optional):

      cp .env.example .env
      # Edit .env and add your NewsAPI key (optional)

Activate the virtual environment:

    source venv/bin/activate

Run the aggregator:

    python aggregator.py

Which Script to Use?
- aggregator.py: Recommended. Multi-source (NewsAPI, GDELT, Google/Bing News)
- mosaic_intelligence.py: Specialized for Google News with 50+ cyber threat searches

Use aggregator.py for general OSINT, or mosaic_intelligence.py for focused cybersecurity intelligence.

The aggregator will:
- Collect stories from all configured sources
- Process them with NLP to extract entities and keywords
- Find related stories and create clusters
- Generate output reports in data/output/
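
To make the "find related stories and create clusters" step concrete, here is a minimal sketch of threshold-based clustering. It is not the project's actual correlator: the `jaccard` and `cluster_stories` helpers are hypothetical, Jaccard overlap of keyword sets stands in for whatever similarity measure the tool really uses, and the sample stories are invented. It only illustrates how a value like `similarity_threshold` decides which stories end up grouped together.

```python
from itertools import combinations
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Fraction of terms two stories share (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_stories(terms: Dict[str, Set[str]], threshold: float) -> List[Set[str]]:
    """Group story IDs whose term overlap meets the threshold (single-link)."""
    parent = {story_id: story_id for story_id in terms}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair of stories that is similar enough.
    for a, b in combinations(terms, 2):
        if jaccard(terms[a], terms[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters: Dict[str, Set[str]] = {}
    for story_id in terms:
        clusters.setdefault(find(story_id), set()).add(story_id)
    return list(clusters.values())


# Invented sample data: story ID -> extracted keywords/entities.
stories = {
    "s1": {"ransomware", "hospital", "breach"},
    "s2": {"ransomware", "hospital", "outage"},
    "s3": {"election", "parliament"},
}
clusters = cluster_stories(stories, threshold=0.15)
print(clusters)
```

With these inputs, s1 and s2 share two of four distinct terms, so they fall into one cluster at a threshold of 0.15 but would stay separate at 0.6.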
Edit config/config.yaml to customize:
- Data sources: Enable/disable collectors, add queries
- Processing settings: Similarity thresholds, entity types
- Output options: Formats, report types
Example configuration:

    sources:
      newsapi:
        enabled: true
        queries:
          - "cybersecurity attack"
          - "data breach"
      gdelt:
        enabled: true
        queries:
          - "cyberattack"
          - "intelligence"
    processing:
      similarity_threshold: 0.15  # 0.0-1.0 (lower = more connections)
      max_story_age_hours: 48
    output:
      format: "both"  # json, html, or both

JSON output files:
- stories_*.json: All collected stories with metadata
- clusters_*.json: Related story clusters with shared entities
- summary_*.json: High-level statistics and overview
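
If you want to post-process the JSON output, something like the following sketch may help. The `load_latest` helper is hypothetical, and since the exact file-name suffix and JSON schema are not documented here, it simply picks the most recently modified file matching a prefix and makes no assumption about the structure inside.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_latest(output_dir: str, prefix: str) -> Optional[Any]:
    """Return the parsed contents of the newest '<prefix>_*.json' file, or None."""
    candidates = sorted(
        Path(output_dir).glob(f"{prefix}_*.json"),
        key=lambda p: p.stat().st_mtime,  # oldest first, newest last
    )
    if not candidates:
        return None
    with candidates[-1].open(encoding="utf-8") as fh:
        return json.load(fh)


stories = load_latest("data/output", "stories")
if stories is None:
    print("No stories file found; run the aggregator first.")
else:
    print(f"Loaded stories data of type {type(stories).__name__}")
```

The same call with "clusters" or "summary" as the prefix would load the other two file types.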
Interactive reports showing:
- Story clusters by topic
- Shared entities across stories
- Timeline of events
- Source distribution
Get a free API key at newsapi.org:
- Free tier: 100 requests/day
- Access to 80,000+ news sources
Add to .env:
NEWSAPI_KEY=your_key_here
Note: The aggregator works without NewsAPI using GDELT and web scraping, but NewsAPI provides richer content.
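
The fallback described in the note can be sketched roughly as follows. This is an illustration only: `enabled_collectors` is a hypothetical helper, the collector names are borrowed from the file names under src/collectors/, and the sketch assumes NEWSAPI_KEY has already been exported into the process environment (how the project itself reads .env is not shown here).

```python
import os
from typing import List, Mapping


def enabled_collectors(env: Mapping[str, str]) -> List[str]:
    """List the collectors that can run with the credentials found in `env`."""
    # GDELT and web scraping need no API key, so they are always available.
    collectors = ["gdelt", "web_scraper"]
    # Only add NewsAPI when a non-empty key is present.
    if env.get("NEWSAPI_KEY", "").strip():
        collectors.insert(0, "newsapi")
    return collectors


print(enabled_collectors(os.environ))
```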
News_Feeder/
├── aggregator.py                  # Main orchestration script
├── config/
│   └── config.yaml                # Configuration file
├── src/
│   ├── collectors/                # Data collection modules
│   │   ├── newsapi_collector.py
│   │   ├── gdelt_collector.py
│   │   └── web_scraper_collector.py
│   ├── processors/                # NLP processing
│   │   └── nlp_processor.py
│   └── correlators/               # Story correlation
│       └── story_correlator.py
├── data/
│   └── output/                    # Generated reports
└── logs/                          # Log files

Edit config/config.yaml to add your own search queries:

    sources:
      gdelt:
        queries:
          - "your topic here"
          - "another topic"

A higher threshold means stricter matching (fewer, more closely related clusters):

    processing:
      similarity_threshold: 0.5  # Range: 0.0-1.0

Extend the BaseCollector class in src/collectors/:

    from typing import List

    # BaseCollector lives in src/collectors/; import it and the Story type from the project first.
    class MyCollector(BaseCollector):
        def collect(self) -> List[Story]:
            # Your collection logic: return the stories gathered from your source
            pass

- Threat Intelligence: Track cybersecurity incidents across sources
- Geopolitical Analysis: Monitor evolving situations globally
- Brand Monitoring: Find all mentions of topics/entities
- Research: Aggregate information on specific subjects
- News Analysis: Understand story coverage patterns
Contributions welcome! Areas for improvement:
- Additional data sources (Twitter, Reddit, etc.)
- Advanced NLP (semantic embeddings, topic modeling)
- Graph database integration (Neo4j)
- Real-time monitoring
- Web dashboard
- Advanced visualizations
MIT License - see LICENSE file
This tool is for defensive security and research purposes only:
- ✅ Threat intelligence gathering
- ✅ Security research
- ✅ News analysis
- ✅ Academic research
- ❌ Malicious reconnaissance
- ❌ Privacy invasion
- ❌ Unauthorized data harvesting
Always respect:
- Website terms of service
- Rate limiting
- robots.txt files
- Privacy laws and regulations
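
As one way to honor robots.txt before scraping a page, the standard library's `urllib.robotparser` can evaluate a site's rules. The sketch below is not taken from the project's scraper: `is_allowed`, the user agent string, and the sample rules are all made up for illustration, and the rules are parsed from an in-memory string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Invented robots.txt content for demonstration.
rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(rules, "news-feeder-bot", "https://example.com/news/story"))
print(is_allowed(rules, "news-feeder-bot", "https://example.com/private/report"))
```

In a real collector you would fetch each site's robots.txt once, cache the parser, and pause between requests to stay within the site's rate limits.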
"spaCy model not found":

    python -m spacy download en_core_web_sm

"No stories collected":
- Check your internet connection
- Verify API keys (if using NewsAPI)
- Check logs in logs/aggregator.log

"Import errors":

    source venv/bin/activate
    pip install -r requirements.txt

For issues and questions:
- Open an issue on GitHub
- Check existing documentation
- Review logs in logs/
Built with Python, spaCy, and open-source intelligence principles
You can view your generated reports online using GitHub Pages!
Your GitHub Pages URL: https://arandomguyhere.github.io/News_Feeder/
- Go to Settings → Pages in your GitHub repository
- Under "Build and deployment", set Source to GitHub Actions
- Go to the Actions tab → Deploy to GitHub Pages → Run workflow
- Wait 2-3 minutes and visit your Pages URL
See GITHUB_PAGES_SETUP.md for detailed instructions.
- Interactive HTML reports
- JSON data exports
- Story connections and clusters
- Timeline visualizations
Reports update automatically on push to main, or manually via Actions.