🔍 OSINT Story Aggregator - Mosaic Intelligence

A Python-based OSINT (Open Source Intelligence) tool that collects stories from multiple sources and identifies related content to build comprehensive intelligence pictures. Think of it as mosaic intelligence - individual stories are tiles that contribute to a bigger picture.

🌐 Live Demo: https://arandomguyhere.github.io/News_Feeder/

🎯 Features

Multi-Source Collection: Gathers stories from:
- NewsAPI (80k+ news sources)
- GDELT (Global news events database)
- Google News scraping
- Bing News scraping
NLP Processing:
- Named Entity Recognition (people, organizations, locations, events)
- Keyword extraction
- Text similarity analysis
Story Correlation:
- Finds related stories across sources
- Clusters stories by similarity
- Identifies shared entities and topics
- Temporal and semantic relationship detection
Output Formats:
- JSON (stories, clusters, summaries)
- HTML reports with visualizations
- Entity graphs and timelines
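
As a rough illustration of the "shared entities" idea in the list above, here is a minimal sketch. It is not the project's code: it assumes each story has already been reduced to a set of entity strings, and the function name `shared_entities` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_entities(stories):
    """Map each pair of story ids to the entities both stories mention.

    `stories` is a dict of story id -> set of entity strings (an assumed
    shape, not the aggregator's real data model).
    """
    # Inverted index: entity -> ids of the stories that mention it
    index = defaultdict(set)
    for story_id, entities in stories.items():
        for entity in entities:
            index[entity].add(story_id)

    # Every entity mentioned by two or more stories links those stories
    links = defaultdict(set)
    for entity, ids in index.items():
        for a, b in combinations(sorted(ids), 2):
            links[(a, b)].add(entity)
    return dict(links)


stories = {
    "s1": {"Microsoft", "CISA"},
    "s2": {"CISA", "Volt Typhoon"},
    "s3": {"Reuters"},
}
print(shared_entities(stories))
```

Building the inverted index first means only stories that actually share an entity are compared, which is one plausible way to keep the pairing cheap when most stories are unrelated.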

🚀 Quick Start

Prerequisites

Python 3.8+
pip
Internet connection

Installation

Clone the repository:

git clone https://github.com/arandomguyhere/News_Feeder.git
cd News_Feeder

Run the setup script:

./setup.sh

This will:

Create a virtual environment
Install all dependencies
Download spaCy language model
Create necessary directories
Set up configuration files

Configure API keys (optional):

cp .env.example .env
# Edit .env and add your NewsAPI key (optional)

Usage

Activate the virtual environment:

source venv/bin/activate

Run the aggregator:

python aggregator.py

📌 Which Script to Use?

aggregator.py - Recommended - Multi-source (NewsAPI, GDELT, Google/Bing News)

mosaic_intelligence.py - Specialized for Google News with 50+ cyber threat searches

Use aggregator.py for general OSINT, or mosaic_intelligence.py for focused cybersecurity intelligence.

The aggregator will:

Collect stories from all configured sources
Process them with NLP to extract entities and keywords
Find related stories and create clusters
Generate output reports in data/output/

📋 Configuration

Edit config/config.yaml to customize:

Data sources: Enable/disable collectors, add queries
Processing settings: Similarity thresholds, entity types
Output options: Formats, report types

Example configuration:

sources:
  newsapi:
    enabled: true
    queries:
      - "cybersecurity attack"
      - "data breach"

  gdelt:
    enabled: true
    queries:
      - "cyberattack"
      - "intelligence"

processing:
  similarity_threshold: 0.15  # 0.0-1.0 (lower = more connections)
  max_story_age_hours: 48

output:
  format: "both"  # json, html, or both
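
To make the `max_story_age_hours` setting concrete, the sketch below shows one way such a cutoff could be applied. How the aggregator actually filters is not documented here, so treat this as an assumption: the `filter_recent` helper and the `published` field name are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def filter_recent(stories, max_age_hours, now=None):
    """Keep stories published within the last `max_age_hours` hours.

    Each story is assumed to be a dict with a timezone-aware
    `published` datetime (a hypothetical field name).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=max_age_hours)
    return [s for s in stories if s["published"] >= cutoff]


now = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
stories = [
    {"title": "fresh", "published": now - timedelta(hours=5)},
    {"title": "stale", "published": now - timedelta(hours=72)},
]
print([s["title"] for s in filter_recent(stories, 48, now=now)])
```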

📊 Understanding Output

JSON Files

stories_*.json: All collected stories with metadata
clusters_*.json: Related story clusters with shared entities
summary_*.json: High-level statistics and overview
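
If you want to post-process these files, a small helper along the following lines may be handy. It is a sketch only: it assumes the files live in data/output/ and picks the newest one by modification time, and it makes no assumption about the JSON schema inside. The name `load_latest` is hypothetical.

```python
import json
from pathlib import Path


def load_latest(prefix, output_dir="data/output"):
    """Return the parsed contents of the most recently modified
    `<prefix>_*.json` file in `output_dir`, or None if there is none."""
    candidates = sorted(
        Path(output_dir).glob(f"{prefix}_*.json"),
        key=lambda p: p.stat().st_mtime,
    )
    if not candidates:
        return None
    with candidates[-1].open(encoding="utf-8") as fh:
        return json.load(fh)


clusters = load_latest("clusters")
print("no clusters file yet" if clusters is None else type(clusters).__name__)
```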

HTML Reports

Interactive reports showing:

Story clusters by topic
Shared entities across stories
Timeline of events
Source distribution

🔑 API Keys

NewsAPI (Optional)

Get a free API key at newsapi.org:

Free tier: 100 requests/day
Access to 80,000+ news sources

Add to .env:

NEWSAPI_KEY=your_key_here

Note: The aggregator works without NewsAPI using GDELT and web scraping, but NewsAPI provides richer content.
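
The fallback described in the note could look roughly like this. It is an illustrative sketch, not the aggregator's logic: the `active_sources` helper and the source labels are hypothetical, and only the `NEWSAPI_KEY` variable name comes from this README.

```python
import os


def active_sources(env=None):
    """Return the collectors that can run with the current environment.

    The source names here are illustrative labels, not identifiers
    taken from the project.
    """
    env = os.environ if env is None else env
    sources = ["gdelt", "google_news", "bing_news"]  # need no key
    if env.get("NEWSAPI_KEY", "").strip():
        sources.insert(0, "newsapi")
    return sources


print(active_sources({"NEWSAPI_KEY": "your_key_here"}))
print(active_sources({}))
```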

🏗️ Architecture

News_Feeder/
├── aggregator.py          # Main orchestration script
├── config/
│   └── config.yaml        # Configuration file
├── src/
│   ├── collectors/        # Data collection modules
│   │   ├── newsapi_collector.py
│   │   ├── gdelt_collector.py
│   │   └── web_scraper_collector.py
│   ├── processors/        # NLP processing
│   │   └── nlp_processor.py
│   └── correlators/       # Story correlation
│       └── story_correlator.py
├── data/
│   └── output/           # Generated reports
└── logs/                 # Log files

🔧 Advanced Usage

Custom Queries

Edit config/config.yaml to add your own search queries:

sources:
  gdelt:
    queries:
      - "your topic here"
      - "another topic"

Adjusting Similarity Threshold

A higher threshold means stricter matching: fewer stories are linked, and those that are linked are more closely related:

processing:
  similarity_threshold: 0.5  # Range: 0.0-1.0
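
The effect of the threshold is easiest to see on a toy example. The sketch below is an assumption-laden illustration, not the project's correlator: it uses Jaccard similarity over keyword sets (the aggregator's real metric may differ) and groups stories that are linked directly or through other stories. The names `jaccard` and `cluster` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(keywords, threshold):
    """Group story ids whose keyword sets are linked, directly or through
    other stories, by a similarity of at least `threshold`."""
    parent = {sid: sid for sid in keywords}

    def find(x):
        # Follow parent pointers to the group's root (with path halving)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(keywords, 2):
        if jaccard(keywords[a], keywords[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for sid in keywords:
        groups.setdefault(find(sid), []).append(sid)
    return sorted(sorted(g) for g in groups.values())


keywords = {
    "s1": {"ransomware", "hospital", "outage"},
    "s2": {"ransomware", "hospital", "fbi"},
    "s3": {"ransomware", "school"},
    "s4": {"election", "poll"},
}
print(cluster(keywords, 0.15))  # [['s1', 's2', 's3'], ['s4']]
print(cluster(keywords, 0.5))   # [['s1', 's2'], ['s3'], ['s4']]
```

At 0.15 the single shared keyword "ransomware" is enough to pull s3 into the cluster; at 0.5 only s1 and s2, which share two of four distinct keywords, stay together.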

Adding New Collectors

Extend BaseCollector class in src/collectors/:

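from typing import List

# BaseCollector and Story are defined by the project; import them from
# wherever your checkout of src/collectors/ provides them.
class MyCollector(BaseCollector):
    def collect(self) -> List[Story]:
        # Your collection logic goes here; return the stories gathered
        return []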
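
For a fuller picture of the pattern, here is a self-contained sketch. Because the real `BaseCollector` and `Story` definitions are not shown in this README, the versions below are stand-ins with assumed fields and a single assumed method; `StaticCollector` is a hypothetical example class.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class Story:
    """Stand-in for the project's Story type; fields are assumptions."""
    title: str
    url: str
    source: str


class BaseCollector(ABC):
    """Stand-in for the project's base class; only `collect` is assumed."""

    @abstractmethod
    def collect(self) -> List[Story]:
        ...


class StaticCollector(BaseCollector):
    """Toy collector that wraps a fixed list of (title, url) pairs."""

    def __init__(self, items):
        self.items = items

    def collect(self) -> List[Story]:
        return [Story(title=t, url=u, source="static") for t, u in self.items]


collector = StaticCollector([("Example headline", "https://example.com/a")])
for story in collector.collect():
    print(story.source, story.title, story.url)
```

A real collector would replace the fixed list with a network call and should honor the rate limits and terms of service of whatever source it reads.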

📈 Use Cases

Threat Intelligence: Track cybersecurity incidents across sources
Geopolitical Analysis: Monitor evolving situations globally
Brand Monitoring: Find all mentions of topics/entities
Research: Aggregate information on specific subjects
News Analysis: Understand story coverage patterns

🤝 Contributing

Contributions welcome! Areas for improvement:

Additional data sources (Twitter, Reddit, etc.)
Advanced NLP (semantic embeddings, topic modeling)
Graph database integration (Neo4j)
Real-time monitoring
Web dashboard
Advanced visualizations

📝 License

MIT License - see LICENSE file

⚠️ Ethical Use

This tool is for defensive security and research purposes only:

✅ Threat intelligence gathering
✅ Security research
✅ News analysis
✅ Academic research
❌ Malicious reconnaissance
❌ Privacy invasion
❌ Unauthorized data harvesting

Always respect:

Website terms of service
Rate limiting
robots.txt files
Privacy laws and regulations

🐛 Troubleshooting

"spaCy model not found":

python -m spacy download en_core_web_sm

"No stories collected":

Check your internet connection
Verify API keys (if using NewsAPI)
Check logs in logs/aggregator.log

"Import errors":

source venv/bin/activate
pip install -r requirements.txt

📚 Resources

📞 Support

For issues and questions:

Open an issue on GitHub
Check existing documentation
Review logs in logs/

Built with Python, spaCy, and open-source intelligence principles

🌐 View Reports Online (GitHub Pages)

You can view your generated reports online using GitHub Pages!

Your GitHub Pages URL: https://arandomguyhere.github.io/News_Feeder/

Setup (One-Time)

Go to Settings → Pages in your GitHub repository
Under "Build and deployment", set Source to GitHub Actions
Go to Actions tab → Deploy to GitHub Pages → Run workflow
Wait 2-3 minutes and visit your Pages URL

See GITHUB_PAGES_SETUP.md for detailed instructions.

What You'll See

📊 Interactive HTML reports
📄 JSON data exports
🔗 Story connections and clusters
📈 Timeline visualizations

Reports update automatically on push to main, or manually via Actions.
