
This is an open source web crawler example based on Java technologies.


vishalzanzrukia/java-web-crawler


java-web-crawler

This is an open source web crawler example built on Java technologies, with the following features:

  • Automatic restart after a crawl cycle finishes
  • Configurable delay between two cycles
  • Ability to resume crawling from the state it was in when a JVM or server crash or shutdown occurred
  • Configuration to run crawler processes for different domains
  • Configurable per-domain URL filters
  • Configurable per-domain parsers
  • Configuration to enable or disable robots.txt rules
  • Configurable maximum number of URL visits per second
  • Configurable maximum crawl depth
  • Configurable maximum number of bytes to download per page
  • Sitemap parsing support
  • Retry support with parsing
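
To make the per-domain settings above more concrete, here is a minimal sketch of how a depth limit, a URL filter and a visits-per-second limit could interact in a single crawl loop. It is not taken from this repository: `DomainCrawlSketch`, `DomainConfig`, `Visit` and `crawl` are hypothetical names, and an in-memory link map stands in for real page fetching and parsing.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names come from the repository.
class DomainCrawlSketch {

    /** Per-domain settings: depth limit, minimum delay between visits, URL filter. */
    static final class DomainConfig {
        final int maxDepth;
        final long minDelayMillis;
        final Predicate<String> urlFilter;

        DomainConfig(int maxDepth, long minDelayMillis, Predicate<String> urlFilter) {
            this.maxDepth = maxDepth;
            this.minDelayMillis = minDelayMillis;
            this.urlFilter = urlFilter;
        }
    }

    /** A URL paired with the depth at which it was discovered. */
    static final class Visit {
        final String url;
        final int depth;

        Visit(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    /**
     * Breadth-first crawl over an in-memory link graph (a stand-in for fetching pages).
     * Returns the URLs in the order they were visited.
     */
    static List<String> crawl(String seed, Map<String, List<String>> links, DomainConfig config) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<Visit> frontier = new ArrayDeque<>();
        frontier.add(new Visit(seed, 0));
        seen.add(seed);

        while (!frontier.isEmpty()) {
            Visit current = frontier.poll();
            visited.add(current.url);
            try {
                // Crude rate limit: a 10 ms delay caps the crawl at 100 visits per second.
                Thread.sleep(config.minDelayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            if (current.depth >= config.maxDepth) {
                continue; // do not follow links beyond the configured depth
            }
            for (String next : links.getOrDefault(current.url, List.of())) {
                // Enqueue only URLs that pass the domain's filter and were not seen before.
                if (config.urlFilter.test(next) && seen.add(next)) {
                    frontier.add(new Visit(next, current.depth + 1));
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "https://example.com/", List.of("https://example.com/a", "https://example.com/logo.png"),
                "https://example.com/a", List.of("https://example.com/b"),
                "https://example.com/b", List.of("https://example.com/c"));

        // Depth limit 2, 10 ms between visits, skip PNG images.
        DomainConfig config = new DomainConfig(2, 10, url -> !url.endsWith(".png"));
        System.out.println(crawl("https://example.com/", links, config));
    }
}
```

Running `main` prints `[https://example.com/, https://example.com/a, https://example.com/b]`: the PNG link is rejected by the filter, and `/c` is never reached because `/b` already sits at the maximum depth. A real crawler would keep one such configuration per domain.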

Technology Stack

  • Spring Boot
  • Spring Integration
  • Redis
  • Jsoup
  • ActiveMQ
  • ElasticSearch

NOTE: This project is still in progress and is not ready for use yet.
