arandomguyhere/News_Feeder

🔍 OSINT Story Aggregator - Mosaic Intelligence

A Python-based OSINT (Open Source Intelligence) tool that collects stories from multiple sources and identifies related content to build comprehensive intelligence pictures. Think of it as mosaic intelligence - individual stories are tiles that contribute to a bigger picture.

🌐 Live Demo: https://arandomguyhere.github.io/News_Feeder/

🎯 Features

  • Multi-Source Collection: Gathers stories from:

    • NewsAPI (80k+ news sources)
    • GDELT (Global news events database)
    • Google News scraping
    • Bing News scraping
  • NLP Processing:

    • Named Entity Recognition (people, organizations, locations, events)
    • Keyword extraction
    • Text similarity analysis
  • Story Correlation:

    • Finds related stories across sources
    • Clusters stories by similarity
    • Identifies shared entities and topics
    • Temporal and semantic relationship detection
  • Output Formats:

    • JSON (stories, clusters, summaries)
    • HTML reports with visualizations
    • Entity graphs and timelines
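As a rough illustration of the "shared entities" idea in the list above, the sketch below intersects per-story entity sets. The story IDs and entities are made up, and `shared_entities` is a hypothetical helper rather than a function from this repository; in the real tool the entity sets would come from the NLP step.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def shared_entities(
    entities_by_story: Dict[str, Set[str]],
) -> Dict[Tuple[str, str], Set[str]]:
    """Return the entities that each pair of stories has in common.

    Pairs with no overlap are left out of the result.
    """
    shared: Dict[Tuple[str, str], Set[str]] = {}
    for (id_a, ents_a), (id_b, ents_b) in combinations(entities_by_story.items(), 2):
        overlap = ents_a & ents_b
        if overlap:
            shared[(id_a, id_b)] = overlap
    return shared


# Made-up example data, for illustration only.
example = {
    "story-1": {"Acme Corp", "London"},
    "story-2": {"Acme Corp", "Berlin"},
    "story-3": {"Paris"},
}
pairs = shared_entities(example)
print(pairs)
```

Only `story-1` and `story-2` overlap here, so the result holds a single pair.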

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • pip
  • Internet connection

Installation

  1. Clone the repository:
git clone https://github.com/arandomguyhere/News_Feeder.git
cd News_Feeder
  2. Run the setup script:
./setup.sh

This will:

  • Create a virtual environment
  • Install all dependencies
  • Download spaCy language model
  • Create necessary directories
  • Set up configuration files
  3. Configure API keys (optional):
cp .env.example .env
# Edit .env and add your NewsAPI key (optional)

Usage

Activate the virtual environment:

source venv/bin/activate

Run the aggregator:

python aggregator.py

📌 Which Script to Use?

  • aggregator.py - Recommended - Multi-source (NewsAPI, GDELT, Google/Bing News)
  • mosaic_intelligence.py - Specialized for Google News with 50+ cyber threat searches

Use aggregator.py for general OSINT, or mosaic_intelligence.py for focused cybersecurity intelligence.

The aggregator will:

  1. Collect stories from all configured sources
  2. Process them with NLP to extract entities and keywords
  3. Find related stories and create clusters
  4. Generate output reports in data/output/

📋 Configuration

Edit config/config.yaml to customize:

  • Data sources: Enable/disable collectors, add queries
  • Processing settings: Similarity thresholds, entity types
  • Output options: Formats, report types

Example configuration:

sources:
  newsapi:
    enabled: true
    queries:
      - "cybersecurity attack"
      - "data breach"

  gdelt:
    enabled: true
    queries:
      - "cyberattack"
      - "intelligence"

processing:
  similarity_threshold: 0.15  # 0.0-1.0 (lower = more connections)
  max_story_age_hours: 48

output:
  format: "both"  # json, html, or both
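A quick sanity check can catch typos in these values before a run. The sketch below assumes the YAML has already been parsed into a plain dict by whatever loader you use; `validate_config` is a hypothetical helper, and the README does not say which checks (if any) the project itself performs.

```python
from typing import Any, Dict, List

VALID_FORMATS = {"json", "html", "both"}  # the values listed in the example above


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the checked values look sane."""
    problems: List[str] = []
    processing = config.get("processing", {})

    threshold = processing.get("similarity_threshold")
    if not isinstance(threshold, (int, float)) or not 0.0 <= threshold <= 1.0:
        problems.append("processing.similarity_threshold must be a number in 0.0-1.0")

    max_age = processing.get("max_story_age_hours")
    if not isinstance(max_age, (int, float)) or max_age <= 0:
        problems.append("processing.max_story_age_hours must be a positive number")

    fmt = config.get("output", {}).get("format")
    if fmt not in VALID_FORMATS:
        problems.append("output.format must be one of: json, html, both")

    return problems


# Mirrors the example configuration shown above.
example_config = {
    "processing": {"similarity_threshold": 0.15, "max_story_age_hours": 48},
    "output": {"format": "both"},
}
print(validate_config(example_config))
```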

📊 Understanding Output

JSON Files

  • stories_*.json: All collected stories with metadata
  • clusters_*.json: Related story clusters with shared entities
  • summary_*.json: High-level statistics and overview
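To inspect the newest file of one kind, something like the following may help. It is a sketch with assumptions: `load_latest` is a hypothetical helper, it picks the file by modification time because the README does not say what the `*` part of the name contains, and the JSON schema itself is not documented here.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_latest(output_dir: str, prefix: str) -> Optional[Any]:
    """Load the most recently modified '<prefix>_*.json' file, or None if there is none."""
    candidates = sorted(
        Path(output_dir).glob(f"{prefix}_*.json"),
        key=lambda p: p.stat().st_mtime,  # oldest first, newest last
    )
    if not candidates:
        return None
    with candidates[-1].open(encoding="utf-8") as fh:
        return json.load(fh)
```

For example, `load_latest("data/output", "clusters")` would return the parsed contents of the newest clusters file, or `None` if no run has produced one yet.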

HTML Reports

Interactive reports showing:

  • Story clusters by topic
  • Shared entities across stories
  • Timeline of events
  • Source distribution

🔑 API Keys

NewsAPI (Optional)

Get a free API key at newsapi.org:

  • Free tier: 100 requests/day
  • Access to 80,000+ news sources

Add to .env:

NEWSAPI_KEY=your_key_here

Note: The aggregator works without NewsAPI using GDELT and web scraping, but NewsAPI provides richer content.
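The optional-key behaviour described in the note can be sketched as below. This assumes `NEWSAPI_KEY` has been exported into the process environment; how the project actually loads `.env` is not stated here, and `enabled_sources` and the source labels are hypothetical.

```python
import os
from typing import List, Mapping, Optional


def enabled_sources(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """List the sources that can run given the available environment variables."""
    env = os.environ if env is None else env
    sources = ["gdelt", "google_news", "bing_news"]  # hypothetical labels; no key needed
    if env.get("NEWSAPI_KEY", "").strip():
        sources.insert(0, "newsapi")  # only usable when a non-empty key is present
    return sources


print(enabled_sources({}))
print(enabled_sources({"NEWSAPI_KEY": "your_key_here"}))
```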

🏗️ Architecture

News_Feeder/
├── aggregator.py          # Main orchestration script
├── config/
│   └── config.yaml        # Configuration file
├── src/
│   ├── collectors/        # Data collection modules
│   │   ├── newsapi_collector.py
│   │   ├── gdelt_collector.py
│   │   └── web_scraper_collector.py
│   ├── processors/        # NLP processing
│   │   └── nlp_processor.py
│   └── correlators/       # Story correlation
│       └── story_correlator.py
├── data/
│   └── output/           # Generated reports
└── logs/                 # Log files

🔧 Advanced Usage

Custom Queries

Edit config/config.yaml to add your own search queries:

sources:
  gdelt:
    queries:
      - "your topic here"
      - "another topic"

Adjusting Similarity Threshold

A higher threshold means stricter matching, which produces fewer but more closely related clusters:

processing:
  similarity_threshold: 0.5  # Range: 0.0-1.0
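To see the effect of the threshold on toy data, the sketch below compares the two values used in this README (0.15 and 0.5). The README does not say which similarity metric the correlator uses, so Jaccard similarity over keyword sets is swapped in here as a simple stand-in; `jaccard`, `related_pairs`, and the sample stories are all illustrative.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def related_pairs(keywords: Dict[str, Set[str]], threshold: float) -> List[Tuple[str, str]]:
    """Return story-id pairs whose keyword similarity is at least `threshold`."""
    return [
        (id_a, id_b)
        for (id_a, kw_a), (id_b, kw_b) in combinations(keywords.items(), 2)
        if jaccard(kw_a, kw_b) >= threshold
    ]


# Made-up keyword sets: a-b score 0.5, a-c and b-c score 0.25.
stories = {
    "a": {"ransomware", "hospital", "breach"},
    "b": {"ransomware", "hospital", "outage"},
    "c": {"ransomware", "bank"},
}
loose = related_pairs(stories, 0.15)
strict = related_pairs(stories, 0.5)
print(loose)   # [('a', 'b'), ('a', 'c'), ('b', 'c')]
print(strict)  # [('a', 'b')]
```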

Adding New Collectors

Extend BaseCollector class in src/collectors/:

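from typing import List

# BaseCollector and Story are defined by the project; import them from
# wherever they live under src/ in your checkout.
class MyCollector(BaseCollector):
    def collect(self) -> List[Story]:
        # Your collection logic goes here; return the stories you gathered.
        return []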
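For experimenting outside the repository, a self-contained version of the same pattern might look like this. `Story` and `BaseCollector` below are hypothetical stand-ins written for this example; the project's real classes may have different fields and methods.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class Story:
    """Stand-in for the project's story type (fields are assumptions)."""
    title: str
    url: str
    source: str


class BaseCollector(ABC):
    """Stand-in for the project's base class: subclasses implement collect()."""

    @abstractmethod
    def collect(self) -> List[Story]:
        ...


class StaticCollector(BaseCollector):
    """Toy collector that turns a fixed list of dicts into stories."""

    def __init__(self, items: List[dict]) -> None:
        self._items = items

    def collect(self) -> List[Story]:
        return [
            Story(title=item["title"], url=item["url"], source="static")
            for item in self._items
            if item.get("title") and item.get("url")  # skip incomplete records
        ]


collector = StaticCollector([
    {"title": "Example headline", "url": "https://example.com/a"},
    {"title": ""},  # dropped: no title or URL
])
print(collector.collect())
```

A fixed-input collector like this is also handy as a test double when checking the later pipeline stages.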

📈 Use Cases

  • Threat Intelligence: Track cybersecurity incidents across sources
  • Geopolitical Analysis: Monitor evolving situations globally
  • Brand Monitoring: Find all mentions of topics/entities
  • Research: Aggregate information on specific subjects
  • News Analysis: Understand story coverage patterns

🤝 Contributing

Contributions welcome! Areas for improvement:

  • Additional data sources (Twitter, Reddit, etc.)
  • Advanced NLP (semantic embeddings, topic modeling)
  • Graph database integration (Neo4j)
  • Real-time monitoring
  • Web dashboard
  • Advanced visualizations

📝 License

MIT License - see LICENSE file

⚠️ Ethical Use

This tool is for defensive security and research purposes only:

  • ✅ Threat intelligence gathering
  • ✅ Security research
  • ✅ News analysis
  • ✅ Academic research
  • ❌ Malicious reconnaissance
  • ❌ Privacy invasion
  • ❌ Unauthorized data harvesting

Always respect:

  • Website terms of service
  • Rate limiting
  • robots.txt files
  • Privacy laws and regulations

🐛 Troubleshooting

"spaCy model not found":

python -m spacy download en_core_web_sm
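Before re-downloading, you can check whether the model is already visible to the active interpreter. spaCy models install as regular Python packages, so a standard-library lookup is enough; `spacy_model_installed` is a hypothetical helper name.

```python
import importlib.util


def spacy_model_installed(name: str = "en_core_web_sm") -> bool:
    """Return True if a package with the model's name can be found for import."""
    return importlib.util.find_spec(name) is not None


print(spacy_model_installed())
```

If this prints `False` inside the virtual environment, running the download command above should fix it.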

"No stories collected":

  • Check your internet connection
  • Verify API keys (if using NewsAPI)
  • Check logs in logs/aggregator.log

"Import errors":

source venv/bin/activate
pip install -r requirements.txt

📚 Resources

📞 Support

For issues and questions:

  • Open an issue on GitHub
  • Check existing documentation
  • Review logs in logs/

Built with Python, spaCy, and open-source intelligence principles

🌐 View Reports Online (GitHub Pages)

You can view your generated reports online using GitHub Pages!

Your GitHub Pages URL: https://arandomguyhere.github.io/News_Feeder/

Setup (One-Time)

  1. Go to Settings → Pages in your GitHub repository
  2. Under "Build and deployment", set Source to GitHub Actions
  3. Go to Actions tab → Deploy to GitHub Pages → Run workflow
  4. Wait 2-3 minutes and visit your Pages URL

See GITHUB_PAGES_SETUP.md for detailed instructions.

What You'll See

  • 📊 Interactive HTML reports
  • 📄 JSON data exports
  • 🔗 Story connections and clusters
  • 📈 Timeline visualizations

Reports update automatically on push to main, or manually via Actions.
