This repository has been archived by the owner on Oct 17, 2020. It is now read-only.
short-d/crawler

Crawler

Explore the web in parallel on thousands of machines

TODO

  • Productionize master & worker with app framework
  • Configure continuous delivery to bootstrap productivity
  • Health check for workers & master
  • Abstract out worker scheduling to support custom algorithms
  • Reassign a job to a new worker when the previously assigned worker fails
  • Support checkpoint for master
  • Auto recover system when the master is back online
  • Support pluggable worker script
  • Load worker script from network file system
  • TLS (Transport Layer Security)
  • Create CLI to trigger crawling for a certain site
  • Support Docker Swarm
  • Support k8s

Getting Started

cd bin

Start Master

go run master.go 8080

Output

Master started at 8080

Start Workers

Run each of the following commands in a separate terminal:

go run worker.go 8081 localhost 8080
go run worker.go 8082 localhost 8080
go run worker.go 8083 localhost 8080

Output

On the master side:

Worker registed: ID(0) IP(localhost) PORT(8081) SECRET(encrypted)
Worker registed: ID(1) IP(localhost) PORT(8082) SECRET(encrypted)
Worker registed: ID(2) IP(localhost) PORT(8083) SECRET(encrypted)

On the worker side (one line per worker terminal):

Registered with master: id(0)
Registered with master: id(1)
Registered with master: id(2)
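
The registration output above suggests that the master hands out sequential IDs, starting at 0, in the order workers connect. A minimal sketch of that bookkeeping follows. It is not the project's actual code: the `Worker` and `Registry` types, their fields, and the `Register` method are hypothetical names chosen for illustration, and the real master also tracks a secret that is omitted here.

```go
package main

import (
	"fmt"
	"sync"
)

// Worker describes one registered worker (hypothetical type, illustrative fields).
type Worker struct {
	ID   int
	Host string
	Port int
}

// Registry assigns sequential IDs, starting at 0, in registration order.
// It is safe for concurrent use because workers may register at the same time.
type Registry struct {
	mu      sync.Mutex
	workers []Worker
}

// Register records a worker and returns the ID assigned to it.
func (r *Registry) Register(host string, port int) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	id := len(r.workers) // the next free ID equals the number of workers so far
	r.workers = append(r.workers, Worker{ID: id, Host: host, Port: port})
	return id
}

func main() {
	reg := &Registry{}
	for _, port := range []int{8081, 8082, 8083} {
		id := reg.Register("localhost", port)
		fmt.Printf("registered worker id=%d host=localhost port=%d\n", id, port)
	}
}
```

The mutex matters only if registrations arrive concurrently, which is likely when each worker registers over its own connection.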

Try It Out

Send a gRPC request with the following payload to the master:

{
 "url": "https://leetcode.com"
}

Output

On the master side:

Start exploring https://leetcode.com
https://leetcode.com
/support/
/jobs/
/bugbounty/
/terms/
/privacy/
/region/
mailto:[email protected]?subject=Billing%20Issue&body=Name:%0D%0A%0D%0AUsername:%0D%0A%0D%0AMessage:%0D%0A%0D%0A
mailto:[email protected]?subject=General%20Support&body=Name:%0D%0A%0D%0AUsername:%0D%0A%0D%0AMessage:%0D%0A%0D%0A
mailto:[email protected]?subject=Other%20Inquiries&body=Name:%0D%0A%0D%0AUsername:%0D%0A%0D%0AMessage:%0D%0A%0D%0A
Finish exploring https://leetcode.com
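
The master output lists links exactly as they appear in the page, including relative paths such as `/support/` and `mailto:` links. A crawler that follows these links would typically resolve them against the page URL and skip schemes it cannot fetch. The sketch below shows one way to do that with the standard `net/url` package. The `resolveLinks` helper is a hypothetical name, and whether the project performs this step is not stated in this README.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLinks (hypothetical helper) turns raw href values into absolute
// HTTP(S) URLs relative to base. Links with other schemes, such as mailto:,
// and links that fail to parse are dropped.
func resolveLinks(base string, hrefs []string) ([]string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, href := range hrefs {
		ref, err := url.Parse(href)
		if err != nil {
			continue // skip malformed links
		}
		abs := baseURL.ResolveReference(ref)
		if abs.Scheme != "http" && abs.Scheme != "https" {
			continue // not fetchable over HTTP
		}
		out = append(out, abs.String())
	}
	return out, nil
}

func main() {
	hrefs := []string{"/support/", "/jobs/", "mailto:someone@example.com"}
	links, err := resolveLinks("https://leetcode.com", hrefs)
	if err != nil {
		panic(err)
	}
	for _, l := range links {
		fmt.Println(l)
	}
}
```

With the sample input, only the two path links survive, each prefixed with `https://leetcode.com`.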

On the worker side:

// Worker 0

Start extracting links from https://leetcode.com
/support/
/jobs/
/bugbounty/
/terms/
/privacy/
/region/
mailto:[email protected]?subject=Billing%20Issue&body=Name:%0D%0A%0D%0AUsername:%0D%0A%0D%0AMessage:%0D%0A%0D%0A
mailto:[email protected]?subject=General%20Support&body=Name:%0D%0A%0D%0AUsername:%0D%0A%0D%0AMessage:%0D%0A%0D%0A
mailto:[email protected]?subject=Other%20Inquiries&body=Name:%0D%0A%0D%0AUsername:%0D%0A%0D%0AMessage:%0D%0A%0D%0A
Start extracting links from /support/
Start extracting links from /bugbounty/
Start extracting links from /region/
Start extracting links from /privacy/
// Worker 1
Start extracting links from /jobs/
Start extracting links from mailto:[email protected]?subject=General%20Support&body=Name:%0D%0A%0D%0AUsername:%0D%0A%0D%0AMessage:%0D%0A%0D%0A
// Worker 2
Start extracting links from mailto:[email protected]?subject=Other%20Inquiries&body=Name:%0D%0A%0D%0AUsername:%0D%0A%0D%0AMessage:%0D%0A%0D%0A
Start extracting links from mailto:[email protected]?subject=Billing%20Issue&body=Name:%0D%0A%0D%0AUsername:%0D%0A%0D%0AMessage:%0D%0A%0D%0A
Start extracting links from /terms/
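
Each worker log line corresponds to one link-extraction job. As a rough illustration of what such a job involves, the sketch below pulls `href` values out of anchor tags with a regular expression from the standard library. This is an assumption-laden simplification, not the project's implementation: `extractLinks` and `hrefPattern` are hypothetical names, only double-quoted attributes are handled, and a proper HTML parser such as `golang.org/x/net/html` would be more robust.

```go
package main

import (
	"fmt"
	"regexp"
)

// hrefPattern matches a double-quoted href attribute inside an <a ...> tag.
// It is deliberately simple and will miss single-quoted or unquoted values.
var hrefPattern = regexp.MustCompile(`(?i)<a\s[^>]*?href="([^"]*)"`)

// extractLinks (hypothetical helper) returns the href values found in html,
// in document order, without resolving or deduplicating them.
func extractLinks(html string) []string {
	var links []string
	for _, m := range hrefPattern.FindAllStringSubmatch(html, -1) {
		links = append(links, m[1]) // m[1] is the captured attribute value
	}
	return links
}

func main() {
	page := `<a href="/support/">Support</a> <a class="nav" href="/jobs/">Jobs</a>`
	for _, link := range extractLinks(page) {
		fmt.Println(link)
	}
}
```

The links are returned as written in the page, which matches the raw relative paths shown in the output above.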

Authors

  • Yang Liu - Initial work - byliuyang
  • Vinod Krishnan - Incremental improvements - vtkrishn

License

This project is licensed under the MIT License.
