jdevelop/webspider

webspider

Open web spider platform. It uses Akka Cluster for distributed processing, together with Distributed PubSub.

The webspider-demo module contains a simple web application that starts one task scheduler node and a couple of web processing nodes, and exposes its interface at http://localhost:8080/

Planned features

  • extract text from HTML/PDF documents
  • process only documents whose names or content types match given patterns
  • extract data using XPath expressions from XHTML pages or from HTML pages that are not well-formed
  • maintain website graph (links between ancestor / successor pages)
  • process websites behind authentication (HTTP Basic/Digest, form-based authentication)
  • handle failures and resume processing from the point where the application was aborted
  • provide extension API for document type handlers, protocol handlers
  • concurrent processing of website pages
  • minimize traffic using bzip/gzip encoding when possible, and avoid downloading the same link more than once
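
Two of the planned features above, pattern-based filtering and not fetching the same link twice, can be illustrated with a small language-neutral sketch. This is not webspider's code or API: the `Frontier` class, its method names, and the example URLs are all hypothetical, and the sketch is written in Python only for brevity. It assumes that URLs differing only in their `#fragment` point to the same download.

```python
import re
from collections import deque
from urllib.parse import urldefrag


class Frontier:
    """Hypothetical crawl queue that hands out each URL at most once."""

    def __init__(self, allowed_patterns):
        # Assumed configuration: regexes a URL must match to be crawled.
        self._allowed = [re.compile(p) for p in allowed_patterns]
        self._seen = set()
        self._queue = deque()

    def offer(self, url):
        """Enqueue url if it is new and matches a pattern; report whether it was accepted."""
        # Drop the "#fragment" part, since it does not change what is downloaded.
        normalized, _fragment = urldefrag(url)
        if normalized in self._seen:
            return False
        if not any(p.search(normalized) for p in self._allowed):
            return False
        self._seen.add(normalized)
        self._queue.append(normalized)
        return True

    def next_url(self):
        """Return the next URL to fetch, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


frontier = Frontier([r"\.html?$", r"\.pdf$"])
results = [
    frontier.offer("http://example.com/index.html"),
    frontier.offer("http://example.com/index.html#top"),  # duplicate after normalization
    frontier.offer("http://example.com/logo.png"),        # matches no pattern
    frontier.offer("http://example.com/report.pdf"),
]
print(results)  # [True, False, False, True]
```

In a distributed setup such as the one described above, the set of seen URLs would have to be shared or partitioned across nodes; the in-memory set here only shows the idea on a single process.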

Supported protocols:

  • HTTP(S)