Additional Data Extractors #106

Open
wodow opened this issue May 15, 2014 · 2 comments

Comments

wodow commented May 15, 2014

I have a minor fork that additionally extracts the first <h1> and <h2>, for consideration by my client program (see wodow@6d22504).

I imagine this would be a good example use case for an "additional data extractor", which is currently marked as a TODO in configuration.py and crawler.py.

Do you have any thoughts on how AdditionalDataExtractors should work?

I would be happy to submit a PR on this.
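
To make the use case concrete, here is a minimal sketch of what such an extractor could look like. It is only an illustration under stated assumptions: `FirstHeadingParser`, `HeadingExtractor` and its `extract` method are hypothetical names, not part of goose, and it uses only the standard library's `html.parser` rather than goose's own parser.

```python
from html.parser import HTMLParser


class FirstHeadingParser(HTMLParser):
    """Collects the text of the first <h1> and the first <h2> in a document."""

    def __init__(self):
        super().__init__()
        self.found = {}        # tag name -> text of its first occurrence
        self._current = None   # heading tag currently being read, if any
        self._chunks = []      # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        # Only start recording for a heading we have not captured yet.
        if tag in ("h1", "h2") and tag not in self.found and self._current is None:
            self._current = tag
            self._chunks = []

    def handle_data(self, data):
        if self._current is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join fragments and collapse runs of whitespace.
            self.found[tag] = " ".join("".join(self._chunks).split())
            self._current = None


class HeadingExtractor:
    """Hypothetical additional data extractor: raw HTML in, dict out."""

    def extract(self, raw_html):
        parser = FirstHeadingParser()
        parser.feed(raw_html)
        parser.close()
        return {"h1": parser.found.get("h1"), "h2": parser.found.get("h2")}
```

A client program would call `HeadingExtractor().extract(raw_html)` and get back a dict with `h1` and `h2` keys, each `None` when the page has no such heading.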

c24b commented May 15, 2014

👍 This could be very interesting if extended.

@grangier
Owner

Hello,

At the moment, additional data extractors are neither used nor configured. I guess the idea here is to have the configuration handle both a custom document cleaner and a custom extractor.

It could be a great improvement to goose. I can look into that.
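
As a rough sketch of that idea, the configuration could carry two optional hooks that the crawl step consults. Everything here is an assumption for illustration: this `Configuration` class and the `crawl` function are hypothetical stand-ins, not the existing code in configuration.py or crawler.py, and the hooks are modelled as plain callables.

```python
class Configuration:
    """Hypothetical stand-in, not goose's actual Configuration class."""

    def __init__(self, document_cleaner=None, additional_data_extractor=None):
        # Both hooks are optional; None means "skip this step".
        self.document_cleaner = document_cleaner
        self.additional_data_extractor = additional_data_extractor


def crawl(raw_html, config):
    """Toy pipeline: clean the HTML, then collect extra data if configured."""
    html = config.document_cleaner(raw_html) if config.document_cleaner else raw_html
    additional_data = {}
    if config.additional_data_extractor is not None:
        additional_data = config.additional_data_extractor(html)
    return {"html": html, "additional_data": additional_data}
```

With this shape, a caller who sets neither hook gets the HTML back unchanged with an empty `additional_data` dict, so existing behaviour would not need to change.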

xav
