Additional Data Extractors #106

wodow · 2014-05-15T16:46:06Z

I have a minor fork that additionally extracts the first <h1> and <h2> so that my client program can use them (see wodow@6d22504).

I imagine this would be a good example use case for an "additional data extractor", as currently marked as TODO in the code in configuration.py and crawler.py.

Do you have any thoughts on how AdditionalDataExtractors should work?

I would be happy to submit a PR on this.
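
To make the question concrete, here is a rough sketch of one possible shape. It is not goose's actual API: `HeadingExtractor`, `FirstHeadingParser`, `run_additional_extractors` and the `extract(raw_html)` signature are all hypothetical names, and it uses the standard library's `html.parser` rather than goose's own parser so that it runs on its own.

```python
from html.parser import HTMLParser


class FirstHeadingParser(HTMLParser):
    """Collects the text of the first <h1> and the first <h2> in a document."""

    TARGETS = ("h1", "h2")

    def __init__(self):
        super().__init__()
        self.found = {}     # tag name -> text of its first occurrence
        self._open = None   # heading tag currently being read, if any
        self._chunks = []   # text pieces seen inside the open heading

    def handle_starttag(self, tag, attrs):
        # Only start recording for a heading we have not captured yet.
        if self._open is None and tag in self.TARGETS and tag not in self.found:
            self._open = tag
            self._chunks = []

    def handle_data(self, data):
        if self._open is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            # Join the pieces and collapse runs of whitespace.
            self.found[tag] = " ".join("".join(self._chunks).split())
            self._open = None


class HeadingExtractor:
    """Hypothetical additional data extractor: raw HTML in, dict out."""

    def extract(self, raw_html):
        parser = FirstHeadingParser()
        parser.feed(raw_html)
        parser.close()
        return {"h1": parser.found.get("h1"), "h2": parser.found.get("h2")}


def run_additional_extractors(raw_html, extractors):
    """Merge the output of every configured extractor into one dict.

    Assumption: the crawler would call something like this once per article
    and attach the result to the article object.
    """
    additional_data = {}
    for extractor in extractors:
        additional_data.update(extractor.extract(raw_html))
    return additional_data
```

The idea would be that the configuration holds a list of such extractor objects and the crawler merges whatever they return, so a client program can add its own fields without forking.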


c24b · 2014-05-15T20:59:05Z

👍 This could be very interesting if extended

grangier · 2014-05-16T10:31:05Z

Hello,

At the moment, additional data extractors are neither used nor configured. I guess the idea here is to have the configuration handle both a custom document cleaner and a custom extractor.

It could be a great improvement to goose. I can look into that.

xav
