Commit 7492f5f

Merge pull request #37 from peterbencze/development
New version
2 parents 59d925e + 42d95d4 commit 7492f5f

17 files changed, with 1011 additions and 707 deletions.

README.md (+36 -16)

--- a/README.md
+++ b/README.md
@@ -11,7 +11,7 @@ Add the following dependency to your pom.xml:
 <dependency>
     <groupId>com.github.peterbencze</groupId>
     <artifactId>serritor</artifactId>
-    <version>1.1</version>
+    <version>1.2</version>
 </dependency>
 ```

@@ -26,38 +26,58 @@ See the [Wiki](https://github.com/peterbencze/serritor/wiki) page.
 BaseCrawler provides a skeletal implementation of a crawler to minimize the effort to create your own. First, create a class that extends BaseCrawler. In this class, you can customize the behavior of your crawler. There are callbacks available for every stage of crawling. Below you can find a sample implementation:
 ```java
 public class MyCrawler extends BaseCrawler {
-
+
     public MyCrawler() {
-        config.addSeedAsString("http://yourspecificwebsite.com");
-        config.setFilterOffsiteRequests(true);
+        // Enable offsite request filtering
+        config.setOffsiteRequestFiltering(true);
+
+        // Add a crawl seed, this is where the crawling starts
+        CrawlRequest request = new CrawlRequestBuilder("http://example.com").build();
+        config.addCrawlSeed(request);
     }

     @Override
-    protected void onResponseComplete(HtmlResponse response) {
-        List<WebElement> links = response.getWebDriver().findElements(By.tagName("a"));
-        links.stream().forEach((WebElement link) -> crawlUrlAsString(link.getAttribute("href")));
+    protected void onResponseComplete(final HtmlResponse response) {
+        // Crawl every link that can be found on the page
+        response.getWebDriver().findElements(By.tagName("a"))
+                .stream()
+                .forEach((WebElement link) -> {
+                    CrawlRequest request = new CrawlRequestBuilder(link.getAttribute("href")).build();
+                    crawl(request);
+                });
     }

     @Override
-    protected void onNonHtmlResponse(NonHtmlResponse response) {
-        System.out.println("Received a non-HTML response from: " + response.getCurrentUrl());
+    protected void onNonHtmlResponse(final NonHtmlResponse response) {
+        System.out.println("Received a non-HTML response from: " + response.getCrawlRequest().getRequestUrl());
     }
-
+
     @Override
-    protected void onUnsuccessfulRequest(UnsuccessfulRequest request) {
-        System.out.println("Could not get response from: " + request.getCurrentUrl());
+    protected void onUnsuccessfulRequest(final UnsuccessfulRequest request) {
+        System.out.println("Could not get response from: " + request.getCrawlRequest().getRequestUrl());
     }
 }
 ```
 That's it! In just a few lines you can make a crawler that extracts and crawls every URL it finds, while filtering duplicate and offsite requests. You also get access to the WebDriver, so you can use all the features that are provided by Selenium.

-By default, the crawler uses [HtmlUnitDriver](https://github.com/SeleniumHQ/selenium/wiki/HtmlUnitDriver) but you can also set your preferred WebDriver:
+By default, the crawler uses [HtmlUnit headless browser](http://htmlunit.sourceforge.net/):
 ```java
-config.setWebDriver(new ChromeDriver());
+public static void main(String[] args) {
+    MyCrawler myCrawler = new MyCrawler();
+
+    // Use HtmlUnit headless browser
+    myCrawler.start();
+}
 ```
+Of course, you can also use any other browsers by specifying a corresponding WebDriver instance:
+```java
+public static void main(String[] args) {
+    MyCrawler myCrawler = new MyCrawler();

-## Support
-The developers would like to thank [Precognox](http://precognox.com/) for the support.
+    // Use Google Chrome
+    myCrawler.start(new ChromeDriver());
+}
+```

 ## License
 The source code of Serritor is made available under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
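
The README text in this diff says the crawler filters duplicate and offsite requests, but the diff itself does not show how that is done. As a rough illustration only, and not Serritor's actual implementation, the sketch below shows one way such a filter could work using nothing but the JDK. The class `CrawlFilterSketch` and its methods `hostOf` and `accept` are hypothetical names introduced here. The sketch also rejects `null`, because Selenium's `getAttribute("href")` returns `null` for an anchor that has no `href` attribute, a case the sample `onResponseComplete` callback would otherwise pass straight to `CrawlRequestBuilder`.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, NOT part of Serritor: a sketch of duplicate and
// offsite filtering under the assumption that "offsite" means "different host".
final class CrawlFilterSketch {

    private final String allowedHost;                 // host of the crawl seed
    private final Set<String> seen = new HashSet<>(); // URLs already accepted

    CrawlFilterSketch(String seedUrl) {
        this.allowedHost = hostOf(seedUrl);
    }

    // Returns the lower-cased host of a URL, or null if it is missing or unparsable.
    static String hostOf(String url) {
        if (url == null) {
            return null;
        }
        try {
            String host = URI.create(url.trim()).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null; // malformed URL
        }
    }

    // Returns true only the first time an onsite URL is offered.
    boolean accept(String url) {
        String host = hostOf(url);
        if (host == null || !host.equals(allowedHost)) {
            return false; // null, malformed, or offsite
        }
        return seen.add(url.trim()); // Set.add returns false for a duplicate
    }

    public static void main(String[] args) {
        CrawlFilterSketch filter = new CrawlFilterSketch("http://example.com");
        List<String> links = Arrays.asList(
                "http://example.com/a",
                "http://example.com/a",       // duplicate
                "http://other.example.org/b", // offsite
                null,                         // anchor without href
                "http://example.com/c");
        for (String link : links) {
            System.out.println(link + " -> " + filter.accept(link));
        }
    }
}
```

Inside a callback like `onResponseComplete`, a check such as `filter.accept(href)` could guard the call to `crawl(request)`. The sketch compares URLs as plain strings, so it does no normalization of paths, fragments, or query parameters; a real crawler would likely need more than this.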

pom.xml (+9 -4)

--- a/pom.xml
+++ b/pom.xml
@@ -3,7 +3,7 @@
     <modelVersion>4.0.0</modelVersion>
     <groupId>com.github.peterbencze</groupId>
     <artifactId>serritor</artifactId>
-    <version>1.1</version>
+    <version>1.2</version>
     <packaging>jar</packaging>

     <name>Serritor</name>
@@ -61,12 +61,17 @@
         <dependency>
             <groupId>org.seleniumhq.selenium</groupId>
             <artifactId>selenium-java</artifactId>
-            <version>3.0.1</version>
+            <version>3.4.0</version>
         </dependency>
         <dependency>
             <groupId>org.seleniumhq.selenium</groupId>
             <artifactId>htmlunit-driver</artifactId>
-            <version>2.23.2</version>
+            <version>2.27</version>
+        </dependency>
+        <dependency>
+            <groupId>com.google.guava</groupId>
+            <artifactId>guava</artifactId>
+            <version>22.0</version>
         </dependency>
     </dependencies>

@@ -115,7 +120,7 @@
             <plugin>
                 <groupId>org.sonatype.plugins</groupId>
                 <artifactId>nexus-staging-maven-plugin</artifactId>
-                <version>1.6.7</version>
+                <version>1.6.8</version>
                 <extensions>true</extensions>
                 <configuration>
                     <serverId>ossrh</serverId>
