Web crawler written in Go as an example project.
Parallelizes HTTP calls with a configurable goroutine worker pool.
To simplify things, it only follows absolute https URLs found in the `href` attribute of `<a>` tags.
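
The link extraction can be done with golang.org/x/net/html (see the parsing example linked at the end of this README). The sketch below shows roughly what that step looks like; the function name `extractLinks` and the exact filtering are illustrative assumptions, not necessarily the code in main.go.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"strings"

	"golang.org/x/net/html"
)

// extractLinks returns every absolute https URL found in the href
// attribute of an <a> tag in the given HTML document.
// (Illustrative sketch; not the repository's actual implementation.)
func extractLinks(body io.Reader) ([]string, error) {
	doc, err := html.Parse(body)
	if err != nil {
		return nil, err
	}
	var links []string
	var visit func(n *html.Node)
	visit = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, attr := range n.Attr {
				if attr.Key == "href" && strings.HasPrefix(attr.Val, "https://") {
					links = append(links, attr.Val)
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			visit(c)
		}
	}
	visit(doc)
	return links, nil
}

func main() {
	page := `<a href="https://example.com/">x</a> <a href="/relative">y</a>`
	links, err := extractLinks(strings.NewReader(page))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(links) // [https://example.com/]
}
```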
Crawls up to a specified depth (defaults to 3), or crawls indefinitely if the user sets `-depth 0`.
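
The worker pool follows the usual goroutine-and-channel pattern: a configurable number of goroutines consume URLs from a shared channel and fetch them concurrently. The sketch below shows the general shape under that assumption; identifiers such as `crawlAll`, `fetch`, and `numWorkers` are made up for illustration and are not the repository's own.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"sync"
)

// fetch downloads one page; a real crawler would parse the body for links
// and feed newly discovered URLs back into the queue.
func fetch(url string) {
	resp, err := http.Get(url)
	if err != nil {
		log.Printf("fetch %s: %v", url, err)
		return
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
}

// crawlAll starts numWorkers goroutines that consume URLs from the channel
// until it is closed, then waits for all of them to finish.
func crawlAll(urls <-chan string, numWorkers int) {
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range urls {
				fetch(u)
			}
		}()
	}
	wg.Wait()
}

func main() {
	urls := make(chan string, 2)
	urls <- "https://example.com/"
	urls <- "https://go.dev/"
	close(urls)
	crawlAll(urls, 2)
}
```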
Ensure Go is installed:

```
brew install go
```

Or follow the official installation instructions at https://go.dev/doc/install.
To install the latest version and run it:

```
go install github.com/cjlint/go-webcrawler@latest
go-webcrawler -url google.com
```
Alternatively, clone the repository, then build and run from the repository root:

```
go build
./go-webcrawler -url google.com
```
Go logs write to stderr by default, so use `2>&1` if you want to collect the logs in an output file:

```
go-webcrawler -url google.com -depth 3 > out 2>&1
```
A small unit test suite is included in `main_test.go`:

```
go test
```
Integration tests were not included due to time constraints, but they could be written with the `httptest` package.
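
A rough sketch of what such a test could look like is below. It only sets up a local `httptest` TLS server with two linked pages plus a stand-in request; the call into the crawler itself is left as a comment, since the crawler's exported entry point is not shown in this README.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"testing"
)

// TestCrawlTwoPages spins up a local TLS server with two pages that link to
// each other. The call into the crawler is left as a comment because its
// entry point is project-specific; a plain GET stands in for it here.
func TestCrawlTwoPages(t *testing.T) {
	mux := http.NewServeMux()
	srv := httptest.NewTLSServer(mux)
	defer srv.Close()

	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// An absolute https link, the only kind the crawler follows.
		fmt.Fprintf(w, `<a href="%s/page2">next</a>`, srv.URL)
	})
	mux.HandleFunc("/page2", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<p>no links here</p>")
	})

	// The real test would run the crawler against srv.URL with srv.Client(),
	// so the server's self-signed certificate is trusted, and then assert on
	// the set of visited pages. As a stand-in, check the root page responds.
	resp, err := srv.Client().Get(srv.URL)
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		t.Fatalf("expected 200 OK, got %d", resp.StatusCode)
	}
}
```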
Resources:

- https://pkg.go.dev/golang.org/x/net/html#example-Parse for HTML parsing code