Releases: apache/incubator-stormcrawler
stormcrawler-3.2.0
What's Changed
- Release 3.1.0 by @rzo1 in #1316
- Bump Apache Storm from 3.1.1 to 2.6.4 & archetype 3.0 to 3.1.0 by @kunalpal97 in #1319
- #1299 - Add DISCLAIMER to JAR files by @rzo1 in #1320
- #1300 - Fix "files in jars have odd dates" by @rzo1 in #1321
- Bump org.yaml:snakeyaml from 2.2 to 2.3 by @dependabot in #1307
- Bump org.awaitility:awaitility from 4.2.0 to 4.2.2 by @dependabot in #1310
- Bump org.jacoco:jacoco-maven-plugin from 0.8.11 to 0.8.12 by @dependabot in #1305
- Bump org.netpreserve:jwarc from 0.29.0 to 0.30.0 by @dependabot in #1304
- Bump org.apache.maven.plugins:maven-surefire-plugin from 3.2.1 to 3.5.0 by @dependabot in #1308
- Bump aws.version from 1.12.663 to 1.12.772 by @dependabot in #1302
- Bump org.apache.solr:solr-solrj from 9.6.1 to 9.7.0 by @dependabot in #1309
- Bump com.microsoft.playwright:playwright from 1.46.0 to 1.47.0 by @dependabot in #1306
- Bump org.wiremock:wiremock from 3.5.4 to 3.9.1 by @dependabot in #1311
- Bump selenium.version from 4.24.0 to 4.25.0 by @dependabot in #1314
- #1323 Update archetype Storm version from 2.6.4 by @mvolikas in #1325
- Regenerated License file after dependency upgrades by @github-actions in #1322
- Bump OpenSearch to 2.17 + fix archetype version in README by @jnioche in #1324
- Bump org.mockito:mockito-core from 5.13.0 to 5.14.0 by @dependabot in #1334
- Bump junit.version from 5.11.0 to 5.11.1 by @dependabot in #1333
- Bump org.apache.maven.plugins:maven-archetype-plugin from 3.2.1 to 3.3.0 by @dependabot in #1332
- Bump org.apache.maven.archetype:archetype-packaging from 3.2.1 to 3.3.0 by @dependabot in #1330
- Regenerated License file after dependency upgrades by @github-actions in #1326
- Regenerated License file after dependency upgrades by @github-actions in #1335
- Bump log4j2.version from 2.23.0 to 2.24.1 by @dependabot in #1328
- Regenerated License file after dependency upgrades by @github-actions in #1337
- Bump org.jetbrains:annotations from 24.1.0 to 25.0.0 by @dependabot in #1331
- Regenerated License file after dependency upgrades by @github-actions in #1338
- Bump com.github.crawler-commons:urlfrontier-API from 2.3.1 to 2.4 by @dependabot in #1327
- Regenerated License file after dependency upgrades by @github-actions in #1340
- Store metadata as WARC Metadata records by @jnioche in #1341
- Improve robustness of WARC generation by @jnioche in #1342
- Bump org.apache.maven.plugins:maven-surefire-plugin from 3.5.0 to 3.5.1 by @dependabot in #1350
- Bump junit.version from 5.11.1 to 5.11.2 by @dependabot in #1345
- Fix configuration for Github's linguist by @mvolikas in #1344
- Bump testcontainers.version from 1.20.1 to 1.20.2 by @dependabot in #1346
- Bump org.mockito:mockito-core from 5.14.0 to 5.14.1 by @dependabot in #1349
- Bump aws.version from 1.12.772 to 1.12.773 by @dependabot in #1351
- Bump org.apache.maven.plugins:maven-javadoc-plugin from 3.10.0 to 3.10.1 by @dependabot in #1347
- Regenerated License file after dependency upgrades by @github-actions in #1352
- #1354 Fix: fix some typos in project by @psxjoy in #1355
- Fix #1312 "Sha512 hash of source release is missing the file part " by @rzo1 in #1356
- Bump de.thetaphi:forbiddenapis from 3.7 to 3.8 by @dependabot in #1359
- Bump org.jetbrains:annotations from 25.0.0 to 26.0.0 by @dependabot in #1358
- Regenerated License file after dependency upgrades by @github-actions in #1360
- Trivial: version number in warc/README fix #1317 by @jnioche in #1363
- Bugfix nofollow instructions in rel tags ignored by @jnioche in #1362
- Bump org.jetbrains:annotations from 26.0.0 to 26.0.1 by @dependabot in #1368
- Bump com.microsoft.playwright:playwright from 1.47.0 to 1.48.0 by @dependabot in #1366
- Connect to a remote instance using web sockets by @jnioche in #1361
- Bump aws.version from 1.12.773 to 1.12.776 by @dependabot in #1367
- Bump org.mockito:mockito-core from 5.14.1 to 5.14.2 by @dependabot in #1369
- Regenerated License file after dependency upgrades by @github-actions in #1370
- Bump tika.version from 2.9.2 to 3.0.0 by @dependabot in #1365
- Apache Storm 2.7.0 by @rzo1 in #1371
- Regenerated License file after dependency upgrades by @github-actions in #1372
- #1353 Fix for URLFrontier spout not taking into account the crawl ID by @klockla in #1373
- Bump junit.version from 5.11.2 to 5.11.3 by @dependabot in #1375
- Bump com.ibm.icu:icu4j from 75.1 to 76.1 by @dependabot in #1376
- Bump aws.version from 1.12.776 to 1.12.777 by @dependabot in #1377
- Bump org.wiremock:wiremock from 3.9.1 to 3.9.2 by @dependabot in #1378
- Bump testcontainers.version from 1.20.2 to 1.20.3 by @dependabot in #1379
- Remove references to ES in OpenSearch module by @jnioche in #1374
- Regenerated License file after dependency upgrades by @github-actions in #1380
- Fix #1313 "Exclude "__files" from Source Release Artifacts"" by @rzo1 in #1384
- #1301 - add build doc for the source release by @rzo1 in #1383
- [1385] bugfix - check for null before the for-each loop by @jnioche in #1386
- Sync conf files in root and archetype + explicit values for sniff conf by @jnioche in #1388
- Detect multi addresses separated by ; in a single String. Fixes #1382 by @jnioche in #1387
- Bump org.apache.maven.plugins:maven-archetype-plugin from 3.3.0 to 3.3.1 by @dependabot in #1390
- Bump selenium.version from 4.25.0 to 4.26.0 by @dependabot in #1393
- Bump org.apache.maven.plugins:maven-surefire-plugin from 3.5.1 to 3.5.2 by @dependabot in #1392
- Bump org.apache.maven.plugins:maven-javadoc-plugin from 3.10.1 to 3.11.1 by @dependabot in #1394
- Bump org.apache.maven.archetype:archetype-packaging from 3.3.0 to 3.3.1 by @dependabot in #1395
- Regenerated License file after dependency upgrades by @github-actions in #1398
- #620 Add support for shards - SolrSpout by @mvolikas in #1343
- #1403 - Downgrade log4j2 to Storm's version. Fixes #1403 by @tballison in #1404
- #140...
Apache StormCrawler 3.1.0 (Incubating)
Disclaimer
Apache StormCrawler is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
Release Summary
This is our second release since joining the ASF Incubator as a podling. It includes the new Playwright module, which can be used to scrape dynamic content.
What's Changed
- send email if CI build fails by @pjfanning in #1217
- Fixes #1214 - "Update Release Docs with Feedback from 3.0 RC2 Vote" by @rzo1 in #1218
- Fix #1223 - Remove declareOutputFields from Solr StatusUpdaterBolt by @mvolikas in #1224
- Apache StormCrawler 3.0 (Incubating) by @rzo1 in #1225
- Fix #1226 "Add FileSpout TestCase for Custom Metadata Injections" by @rzo1 in #1227
- 1024 Playwright protocol implementation, fixes #1024 by @jnioche in #1228
- Fix #1230: Set sitemap key before outlink processing by @mvolikas in #1231
- #1220 - Add disclaimer for binary test artifacts by @rzo1 in #1234
- #1221 - Switch Source to tar.gz by @rzo1 in #1233
- #1215 - Update RAT exclusions. Fixes licenses by @rzo1 in #1235
- #1236 - Fix Typos in StormCrawler by @rzo1 in #1237
- #1222 - Fix Release Docs by @rzo1 in #1232
- #1238 - Avoid use of star imports by @rzo1 in #1239
- Fix #1244 "Migrate to JUnit 5" by @rzo1 in #1245
- Fix #1216 - Add RAT Exclusion File for standalone RAT by @rzo1 in #1243
- #1248 - Use pre-compiled patterns for mime type matching in TikaParser by @rzo1 in #1249
- #1251 - Update to Storm 2.6.3 by @rzo1 in #1252
- #626: Add routing field in metadata - Solr StatusUpdaterBolt by @mvolikas in #1242
- #851 Merge branch 851 into main by @mvolikas in #1256
- #1259 - Enable Dependabot by @rzo1 in #1260
- #1261 - Automatically generate THIRD-PARTY.txt via GitHub Action by @rzo1 in #1262
- #1257 - Update to Storm 2.6.4 by @rzo1 in #1258
- #1162 - Replace Coveralls with JaCoCo by @sigee in #1255
- Bump testcontainers.version from 1.19.7 to 1.20.1 by @dependabot in #1277
- Bump org.apache.maven.plugins:maven-javadoc-plugin from 3.5.0 to 3.10.0 by @dependabot in #1267
- Bump actions/setup-java from 3 to 4 by @dependabot in #1264
- Bump actions/checkout from 3 to 4 by @dependabot in #1265
- Bump org.jsoup:jsoup from 1.17.2 to 1.18.1 by @dependabot in #1271
- Regenerated License file after dependency upgrades by @github-actions in #1280
- Bump tika.version from 2.9.1 to 2.9.2 by @dependabot in #1269
- Bump com.ibm.icu:icu4j from 74.2 to 75.1 by @dependabot in #1272
- Bump org.apache.maven.plugins:maven-enforcer-plugin from 3.4.1 to 3.5.0 by @dependabot in #1289
- Bump org.apache.maven.plugins:maven-jar-plugin from 3.3.0 to 3.4.2 by @dependabot in #1288
- Bump org.apache.maven.plugins:maven-compiler-plugin from 3.11.0 to 3.13.0 by @dependabot in #1285
- Bump org.apache.rat:apache-rat-plugin from 0.15 to 0.16.1 by @dependabot in #1283
- Bump org.apache:apache from 31 to 33 by @dependabot in #1275
- Bump junit.version from 5.10.2 to 5.11.0 by @dependabot in #1278
- Bump org.apache.solr:solr-solrj from 9.5.0 to 9.6.1 by @dependabot in #1281
- Bump org.apache.maven.archetype:archetype-packaging from 2.4 to 3.2.1 by @dependabot in #1287
- Bump org.mockito:mockito-core from 5.10.0 to 5.13.0 by @dependabot in #1279
- Bump com.microsoft.playwright:playwright from 1.43.0 to 1.46.0 by @dependabot in #1268
- Bump selenium.version from 4.18.1 to 4.24.0 by @dependabot in #1266
- Bump log4j2.version from 2.23.0 to 2.24.0 by @dependabot in #1284
- Regenerated License file after dependency upgrades by @github-actions in #1282
- Fix #1290 "Add close/cleanup method to ParseFilters" by @rzo1 in #1291
- Bump opensearch.version from 2.12.0 to 2.16.0 by @dependabot in #1276
- Regenerated License file after dependency upgrades by @github-actions in #1292
- Aligned version of OpenSearch in test with recent upgrade to 2.16 by @jnioche in #1293
- Bump actions/cache from 3 to 4 by @dependabot in #1263
- Revert "Bump log4j2.version from 2.23.0 to 2.24.0" by @rzo1 in #1294
- #1295 - Add workflow to publish SNAPSHOTS to repository.a.o by @rzo1 in #1296
- Regenerated License file after dependency upgrades by @github-actions in #1297
New Contributors
- @sigee made their first contribution in #1255
- @github-actions made their first contribution in #1280
Full Changelog: stormcrawler-3.0...stormcrawler-3.1.0
Apache StormCrawler 3.0 (Incubating)
Disclaimer
Apache StormCrawler is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
Release Summary
This is our first release since joining the ASF Incubator as a podling. It is a breaking release: the group IDs have been renamed and the Elasticsearch module has been removed.
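For code bases migrating across this break, the package rename listed under "Changed package names to org.apache" can be illustrated with a small helper. This is only a sketch under stated assumptions: the old prefix `com.digitalpebble.stormcrawler`, the new prefix `org.apache.stormcrawler`, and the class name `PackageRenameSketch` are illustrative and not taken from these release notes, so check them against your own sources before relying on them.

```java
import java.util.List;
import java.util.stream.Collectors;

public class PackageRenameSketch {
    // Assumed prefixes (not stated in the release notes); verify before use.
    static final String OLD_PREFIX = "com.digitalpebble.stormcrawler";
    static final String NEW_PREFIX = "org.apache.stormcrawler";

    /** Replaces the old package prefix with the new one in a single source line. */
    static String rewriteLine(String line) {
        return line.replace(OLD_PREFIX, NEW_PREFIX);
    }

    /** Applies rewriteLine to every line and returns a new list. */
    static List<String> rewriteAll(List<String> lines) {
        return lines.stream()
                .map(PackageRenameSketch::rewriteLine)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> before = List.of(
                "import com.digitalpebble.stormcrawler.Metadata;",
                "import java.util.Map;");
        // Lines without the old prefix pass through unchanged.
        rewriteAll(before).forEach(System.out::println);
    }
}
```

A plain string replacement is enough here because the change is a prefix swap. Maven coordinates in `pom.xml` files and class names in topology or configuration files would need the same treatment and are not covered by this sketch.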
What's Changed
- Handling of DateTimeParseException in WARCSpout by @michaeldinzinger in #1140
- Generate THIRD-PARTY.txt file, fixes #1145 by @jnioche in #1146
- Remove coveralls maven plugin, fixes #1148 by @jnioche in #1149
- OpenSearch - better handling of mappings by @jnioche in #1155
- Delete CODE_OF_CONDUCT.md by @pjfanning in #1158
- Create DISCLAIMER by @pjfanning in #1159
- Update NOTICE by @pjfanning in #1160
- Changed package names to org.apache by @jnioche in #1165
- Create .asf.yaml by @pjfanning in #1161
- Fix #1174 - Exclude optional artifact from storm-hdfs by @rzo1 in #1175
- Fix #1164 - Change license headers by @rzo1 in #1173
- Removed devs section from pom.xml by @jnioche in #1181
- Fix #1167 - Remove Elasticsearch module by @rzo1 in #1182
- Remove hyphens in storm-crawler by @jnioche in #1177
- Fixes #1178 "Set version to 3.0-SNAPSHOT" by @rzo1 in #1183
- Fixes #1169 - Use Apache Parent POM & Enable RAT by @rzo1 in #1180
- Removed ref to Discord in README by @jnioche in #1184
- Fix #1168 - Add a modified version of CONTRIBUTING.md by @rzo1 in #1186
- Fix #1163 - Change the GitHub templates for PRs to be more ASF specific by @rzo1 in #1185
- Upgrade to Storm 2.6.2, fix #1188 by @jnioche in #1189
- link to ASF web site .asf.yaml by @pjfanning in #1192
- Update README.md by @jnioche in #1195
- 1200 - Fix license headers by @jnioche in #1201
- #1197 - Allow to disable SSL/TLS verification in OpenSearchConnection by @rzo1 in #1199
- Fix #1202 - Add release documentation and comply with source package naming requirements by @rzo1 in #1203
- #1207 -- add forbidden-apis by @tballison in #1208
- #1209 fix for emulation error in tests run on silicon by @joshfischer1108 in #1210
- Resolves #1211 "Fix License Header" by @rzo1 in #1212
- #1205 update archetype in README by @joshfischer1108 in #1206
- Introduce "skip.format.code" to skip code formatting by default by @rzo1 in #1213
New Contributors
- @pjfanning made their first contribution in #1158
- @tballison made their first contribution in #1208
- @joshfischer1108 made their first contribution in #1210
Full Changelog: 2.11...stormcrawler-3.0
StormCrawler 2.11
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
What's Changed
- Upgrade to OpenSearch 2.11 #1113 by @jnioche in #1114
- Use mock server for selenium tests, fix #1116 by @jnioche in #1119
- Issue #728: Adding asterisk for metadata transfer by @michaeldinzinger in #1117
- WARCSpout loads inputs using HDFS by @jnioche in #1122
- Fix wrong most recent date was set by @chhsiao90 in #1126
- Glob field mapping for indexer.md.mapping by @jnioche in #1130
- Add committer statement by @michaeldinzinger in #1134
- Implement configurable getDocumentID in DeletionBolt by @chhsiao90 in #1135
- Add two tests for SiteMapParserBolt by @michaeldinzinger in #1138
- dependency upgrades by @jnioche in #1139
New Contributors
- @chhsiao90 made their first contribution in #1126
Full Changelog: 2.10...2.11
What's new in StormCrawler 2.10
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
What's Changed
- Selenium test by @jnioche in #1093
- refactoring timeouts Selenium by @jnioche in #1102
- Improvements and fixes to HttpRobotRulesParser when following redirects by @sebastian-nagel in #1103
and a lot more!
Full Changelog: 2.9...2.10
See https://digitalpebble.blogspot.com/2023/10/focus-on-protocol-improvements-in.html for more details on the protocol improvements
What's new in StormCrawler 2.9
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
What's Changed
- Change HttpProtocol to defer to configured values for retryOnConnectionFailure and followRedirects by @ndtreviv in #1056
- Cache redirected robots.txt for target host only if path is /robots.txt and query is empty by @sebastian-nagel in #1057
- Issue #1043: Fixing problems after restart of Frontier service by @michaeldinzinger in #1054
- #1049 Replace "Collapse and Expand Results" Solr query with "Result Grouping" query. by @syefimov in #1053
- OpenSearch 2.7.0 + renamed OpenSearchConnection by @jnioche in #1064
- BasicURLNormalizer .unmangleQueryString() returns invalid results if "&" symbol in a parents path #1059 by @syefimov in #1062
- Dependency upgrades. fixes #1066 by @jnioche in #1067
- Automatic creation of index definitions should use the bolt type by @jnioche in #1069
- mechanism to retrieve more generic value of configuration by @jnioche in #1071
- Create DeletionBolt.java for Solr. #1050 by @syefimov in #1073
- Increase the number of redirects to 5 for Robots.txt fetching by @michaeldinzinger in #1074
- Issue #1042: Adapt parsing of robots.txt files by @michaeldinzinger in #1055
- Test URL Filtering from the command line by @jnioche in #1081
New Contributors
- @michaeldinzinger made their first contribution in #1054
- @syefimov made their first contribution in #1053
Full Changelog: 2.8...2.9
What's new in StormCrawler 2.8
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
What's Changed
- Enforce Java 11 in archetypes by @msghasan in #1029
- Fix #1027: Ensure SC can be build with Java 17 by @rzo1 in #1030
- Indexer ES document id by @Mikwiss in #1028
- JsoupFilter as Interface by @Mikwiss in #1026
- Create method to add SearchHit info to metadata by @Mikwiss in #1034
- Status ES document id by @Mikwiss in #1036
- Limit the amount of text to be returned by the text extraction, #1038 by @jnioche in #1039
- Allow override on HttpProtocol's method addHeadersToRequest by @Mikwiss in #1041
- Fixes #1045. Remove range syntax from snakeyaml by @rzo1 in #1046
- Fix #1032: Catch the exception inside the loop to avoid breaking if one remote instance is misbehaving by @rzo1 in #1047
Full Changelog: 2.7...2.8
What's new in StormCrawler 2.7
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
What's Changed
- Dependency upgrades #1016
- OpenSearch module in #1011
- Maven archetype for OpenSearch
- [WARC] Backward compatible storage of HTTP/2 headers by @sebastian-nagel in #1010
- Ignore empty fields indexer in #1019
- Handle single quotes in value of http-equiv="refresh" #1020
Full Changelog: 2.6...2.7
What's new in StormCrawler 2.6
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
Highlights
- Using URLFrontier in archetype
- URLFilter becomes an abstract class
- Fixed deactivation of maxDepthFilter
- JSoupParserBolt improve performance of link extraction
- Multiple dependency upgrades
Full Changelog: storm-crawler-2.5...2.6
What's new in StormCrawler 2.5
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
In a nutshell
- various dependency upgrades (JSoup, CrawlerCommons, Tika, Elasticsearch)
- Java 11
- bugfix: AggregationSpout sometimes does not release the IsInQuery boolean
- various improvements to URLFrontier module
In more detail
- FEATURE-964: custom crawl delay per page by @juli-alvarez in #967
- Issue 970 HttpProtocol doesn't consider http.content.limit in test for filesize by @wowasa in #972
- Add ChannelManager for local channel management and constants to Spout.java by @FelixEngl in #982
- Fix error when spaces in path to test-resources of StatusBoltTest in ElasticSearch-Module by @FelixEngl in #985
- Add unit test basics for URLFrontier. by @FelixEngl in #984
- Fix starvation and busy waiting of StatusUpdaterBolt.java, add Constants. by @FelixEngl in #983
- Fix starvation and busy waiting of ES StatusUpdaterBolt (Fixes #986) by @FelixEngl in #988
- Fix starvation and busy waiting of ES IndexerBolt by @FelixEngl in #989
- HttpProtocol use the md protocol.set-headers to add custom header by url by @Mikwiss in #993
Full Changelog: 2.4...storm-crawler-2.5