Web crawler written in Go as an example project.
Parallelizes HTTP calls with a configurable goroutine worker pool.
To simplify things, it only follows absolute https URLs found in the `href` attribute of `<a>` tags.
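
The link extraction can be done with golang.org/x/net/html (see the parsing example linked at the end of this README). The sketch below shows roughly what that step looks like; the function name `extractLinks` and the exact filtering are illustrative assumptions, not necessarily the code in main.go.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"strings"

	"golang.org/x/net/html"
)

// extractLinks returns every absolute https URL found in the href
// attribute of an <a> tag in the given HTML document.
// (Illustrative sketch; not the repository's actual implementation.)
func extractLinks(body io.Reader) ([]string, error) {
	doc, err := html.Parse(body)
	if err != nil {
		return nil, err
	}
	var links []string
	var visit func(n *html.Node)
	visit = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, attr := range n.Attr {
				if attr.Key == "href" && strings.HasPrefix(attr.Val, "https://") {
					links = append(links, attr.Val)
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			visit(c)
		}
	}
	visit(doc)
	return links, nil
}

func main() {
	page := `<a href="https://example.com/">x</a> <a href="/relative">y</a>`
	links, err := extractLinks(strings.NewReader(page))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(links) // [https://example.com/]
}
```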
Crawls up to a specified depth (defaults to 3), or crawls indefinitely if the user sets `-depth 0`.
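
The worker pool follows the usual goroutine-and-channel pattern: a configurable number of goroutines consume URLs from a shared channel and fetch them concurrently. The sketch below shows the general shape under that assumption; identifiers such as `crawlAll`, `fetch`, and `numWorkers` are made up for illustration and are not the repository's own.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"sync"
)

// fetch downloads one page; a real crawler would parse the body for links
// and feed newly discovered URLs back into the queue.
func fetch(url string) {
	resp, err := http.Get(url)
	if err != nil {
		log.Printf("fetch %s: %v", url, err)
		return
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
}

// crawlAll starts numWorkers goroutines that consume URLs from the channel
// until it is closed, then waits for all of them to finish.
func crawlAll(urls <-chan string, numWorkers int) {
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range urls {
				fetch(u)
			}
		}()
	}
	wg.Wait()
}

func main() {
	urls := make(chan string, 2)
	urls <- "https://example.com/"
	urls <- "https://go.dev/"
	close(urls)
	crawlAll(urls, 2)
}
```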
Ensure Go is installed:

```
brew install go
```

Or follow the official installation instructions at https://go.dev/doc/install.
To install the latest version and run it:

```
go install github.com/cjlint/go-webcrawler@latest
go-webcrawler -url google.com
```
Alternatively, clone the repository, then build and run from the repository root:

```
go build
./go-webcrawler -url google.com
```
Go logs write to stderr by default, so use `2>&1` if you want to collect the logs in an output file:

```
go-webcrawler -url google.com -depth 3 > out 2>&1
```
A small unit test suite is included in `main_test.go`:

```
go test
```
Integration tests were not included due to time constraints, but they could be written with the `httptest` package.
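
A rough sketch of what such a test could look like is below. It only sets up a local `httptest` TLS server with two linked pages plus a stand-in request; the call into the crawler itself is left as a comment, since the crawler's exported entry point is not shown in this README.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"testing"
)

// TestCrawlTwoPages spins up a local TLS server with two pages that link to
// each other. The call into the crawler is left as a comment because its
// entry point is project-specific; a plain GET stands in for it here.
func TestCrawlTwoPages(t *testing.T) {
	mux := http.NewServeMux()
	srv := httptest.NewTLSServer(mux)
	defer srv.Close()

	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// An absolute https link, the only kind the crawler follows.
		fmt.Fprintf(w, `<a href="%s/page2">next</a>`, srv.URL)
	})
	mux.HandleFunc("/page2", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<p>no links here</p>")
	})

	// The real test would run the crawler against srv.URL with srv.Client(),
	// so the server's self-signed certificate is trusted, and then assert on
	// the set of visited pages. As a stand-in, check the root page responds.
	resp, err := srv.Client().Get(srv.URL)
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		t.Fatalf("expected 200 OK, got %d", resp.StatusCode)
	}
}
```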
Resources:

- https://pkg.go.dev/golang.org/x/net/html#example-Parse for HTML parsing code