A minimal, async Clojure web crawling library.
("linkin": from Linkin Park's "Crawling", and because it follows links)
- Uses http-kit and core.async for async fetching
- Uses Jsoup to reliably extract links from scraped pages
- Allows the user to pass in their own function to handle the body content of scraped pages
- Respects the robots.txt of the target website (allow & disallow rules)
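To illustrate how allow/disallow rules behave, here is a simplified, self-contained sketch of robots.txt rule matching (longest matching prefix wins). This is not linkin's internal implementation, just an illustration of the idea:

```clojure
(require '[clojure.string :as str])

;; Parse the Allow/Disallow lines of a robots.txt body into [directive prefix] pairs.
(defn parse-rules
  [robots-txt]
  (for [line  (str/split-lines robots-txt)
        :let  [[_ directive prefix] (re-matches #"(?i)\s*(Allow|Disallow):\s*(\S*)\s*" line)]
        :when directive]
    [(str/lower-case directive) prefix]))

;; A path is allowed unless the longest matching rule prefix is a Disallow.
(defn allowed?
  [rules path]
  (let [matching      (filter (fn [[_ prefix]] (str/starts-with? path prefix)) rules)
        [directive _] (apply max-key (fn [[_ prefix]] (count prefix))
                             ["allow" ""] matching)]
    (= directive "allow")))

(def rules
  (parse-rules "User-agent: *\nDisallow: /private/\nAllow: /private/public/"))

(allowed? rules "/private/secret.html")        ;; => false
(allowed? rules "/private/public/index.html")  ;; => true
```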
It's not on Clojars yet, so you'll need to build and install it locally using a method of your own choosing (lein-localrepo is a good choice). Then add the following dependency to your project.clj:
[linkin "0.1.0-SNAPSHOT"]
Then:
(require 'linkin.core)
(linkin.core/crawl "http://example.com" linkin.core/simple-body-parser)
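You can pass your own handler in place of simple-body-parser. The exact handler contract isn't shown here; the sketch below assumes linkin calls the handler with the page URL and body as strings (verify against simple-body-parser in linkin.core before relying on this):

```clojure
;; ASSUMPTION: linkin invokes the handler with the page URL and its body
;; as strings; check simple-body-parser in linkin.core for the real contract.
(defn word-count-parser
  [url body]
  (let [words (count (re-seq #"\w+" (or body "")))]
    (println url "->" words "words")
    words))

;; Standalone check on a sample body:
(word-count-parser "http://example.com" "<p>Hello crawling world</p>")

;; Then pass it to the crawler:
;; (linkin.core/crawl "http://example.com" word-count-parser)
```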
Still to do:
- Throttling (i.e. don't DoS target sites)
- Control throttling via the Crawl-delay directive in robots.txt
- Control of max depth / number of pages crawled
- Ability to spider across domains
- Pass options through to http-kit (e.g. following redirects)
- Filtering by content type
- Stats while running
- Better URL normalization (for detecting URLs we've seen before) - see http://en.wikipedia.org/wiki/URL_normalization
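As a starting point for that last item, here is a hedged sketch of basic RFC 3986 normalization using java.net.URI: lower-case the scheme and host, drop default ports, and use "/" for an empty path. normalize-url is a hypothetical helper, not part of linkin's current API:

```clojure
(require '[clojure.string :as str])
(import 'java.net.URI)

;; Hypothetical helper: normalize a URL so that trivially different
;; spellings of the same address compare equal.
(defn normalize-url
  [s]
  (let [u      (URI. s)
        scheme (str/lower-case (.getScheme u))
        host   (str/lower-case (.getHost u))
        port   (.getPort u)
        port   (if (or (and (= scheme "http")  (= port 80))
                       (and (= scheme "https") (= port 443)))
                 -1
                 port)
        path   (if (str/blank? (.getPath u)) "/" (.getPath u))]
    (str (URI. scheme nil host port path (.getQuery u) nil))))

(normalize-url "HTTP://Example.COM:80")  ;; => "http://example.com/"
```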
Copyright © 2014 Rory Gibson
Distributed under the Eclipse Public License version 1.0.