
Crawlify

Installation

composer require dezento/crawlify

Overview

Crawlify is a lightweight crawler for manipulating HTML, XML, and JSON using DomCrawler. It uses GuzzleHttp\Pool to make concurrent requests, which means every Guzzle Request Option is available.
The results it returns are wrapped in Laravel Collections.

Examples

CRAWL JSON
use Dezento\Crawlify;


$links = [];
for ($i = 1; $i <= 100; $i++) {
    $links[] = 'https://jsonplaceholder.typicode.com/posts/' . $i;
}

$json = (new Crawlify(collect($links))) // accepts an array or a Collection of links
->settings([
  'type' => 'JSON' // Crawlify's own option
])
->fetch()
->get('fulfilled')
->map(fn ($p) => collect(json_decode($p->response)))
->dd();
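To show what the `map()` step above does to each fulfilled response, here is a minimal plain-PHP sketch (no Crawlify required); the response body is hypothetical, shaped like one jsonplaceholder post:

```php
<?php
// Hypothetical body, shaped like one fulfilled jsonplaceholder response.
$body = '{"userId": 1, "id": 1, "title": "sunt aut facere", "body": "quia"}';

// json_decode(..., true) yields an associative array; the example above
// wraps the same data with collect() to get a Laravel Collection.
$post = json_decode($body, true);

echo $post['title']; // sunt aut facere
```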

CRAWL XML

For traversing XML, refer to the DomCrawler documentation.

$xml = (new Crawlify([
    'https://www.nytimes.com/svc/collections/v1/publish/https://www.nytimes.com/section/world/rss.xml',
]))
->fetch()
->get('fulfilled')
->map(fn ($item) =>
  collect($item->response->filter('item')->children())
  ->map(fn ($data) => $data->textContent)
)->dd();
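As a rough plain-PHP illustration of what `filter('item')->children()` extracts, using PHP's built-in DOMDocument instead of DomCrawler and a hypothetical one-item feed:

```php
<?php
// Hypothetical RSS fragment with a single <item>.
$feed = '<rss><channel><item><title>Headline</title><link>https://example.com/story</link></item></channel></rss>';

$doc = new DOMDocument();
$doc->loadXML($feed);

// Collect the text of each child of the first <item>, mirroring the
// ->map(fn ($data) => $data->textContent) step in the example above.
$texts = [];
foreach ($doc->getElementsByTagName('item')->item(0)->childNodes as $child) {
    $texts[] = $child->textContent;
}

echo implode(', ', $texts); // Headline, https://example.com/story
```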

CRAWL HTML

For traversing HTML, refer to the DomCrawler documentation.

$html = (new Crawlify([
  'https://en.wikipedia.org/wiki/Category:Lists_of_spider_species_by_family'
]))
->settings([
  #'proxy' => 'http://username:[email protected]:10',
  'concurrency' => 5,
  'delay' => 0
])
->fetch()
->get('fulfilled')
->map(fn ($item) =>
  collect($item->response->filter('a')->links())
  ->map(fn($el) => $el->getUri())
)
->reject(fn($a) => $a->isEmpty())
->dd();
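Under the hood, `links()` and `getUri()` collect anchor URIs. A simplified plain-PHP version of that step looks like this (DomCrawler additionally resolves relative URIs against the page URL, which this sketch skips; the HTML fragment is hypothetical):

```php
<?php
// Hypothetical HTML fragment with two absolute links.
$html = '<ul><li><a href="https://example.com/a">A</a></li><li><a href="https://example.com/b">B</a></li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Gather every href, as the ->map(fn($el) => $el->getUri()) step does.
$uris = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $uris[] = $a->getAttribute('href');
}

echo implode("\n", $uris);
```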

OPTIONS
->settings([
  'proxy' => 'http://username:[email protected]:10',
  'concurrency' => 5,
  'delay' => 0,
  // ...
])

For options, refer to the Guzzle Request Options documentation. The only Crawlify custom option is 'type' => 'JSON'.
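Putting that together, a settings array can mix ordinary Guzzle Request Options with the Crawlify flag. The values below are illustrative, not defaults:

```php
$settings = [
    'type'        => 'JSON',  // Crawlify's own option
    'concurrency' => 5,       // parallel requests in the Guzzle pool
    'delay'       => 250,     // Guzzle option: delay before sending, in ms
    'timeout'     => 10,      // Guzzle option: seconds before a request fails
    'headers'     => ['User-Agent' => 'crawlify-demo'],  // Guzzle option
];
```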

Note

The dd() helper requires symfony/var-dumper; install it before use:

composer require symfony/var-dumper

About

Simple Concurrent Crawler
