I have a minor fork that additionally extracts the first `<h1>` and `<h2>`, for consideration by my client program (see wodow@6d22504).
I imagine this would be a good example use case for an "additional data extractor", which is currently marked as TODO in the code in `configuration.py` and `crawler.py`.
Do you have any thoughts on how `AdditionalDataExtractor`s should work?
I would be happy to submit a PR on this.
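To make the question concrete, here is a rough sketch of the kind of extractor I have in mind. Everything in it is an assumption for illustration: the `HeadingExtractor` class, its `extract` method, and the returned dict shape are hypothetical and are not goose's API, and it uses only the standard library's `html.parser` rather than goose's own parser.

```python
from html.parser import HTMLParser


class FirstHeadingParser(HTMLParser):
    """Collects the text of the first <h1> and the first <h2> in a document."""

    def __init__(self):
        super().__init__()
        self.found = {}        # tag name -> text of its first occurrence
        self._current = None   # heading tag currently being read, if any
        self._chunks = []      # text fragments of the heading being read

    def handle_starttag(self, tag, attrs):
        # Only start capturing for a heading we have not seen yet.
        if tag in ("h1", "h2") and tag not in self.found and self._current is None:
            self._current = tag
            self._chunks = []

    def handle_data(self, data):
        if self._current is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join fragments (including text from nested inline tags)
            # and collapse runs of whitespace.
            self.found[tag] = " ".join("".join(self._chunks).split())
            self._current = None


class HeadingExtractor:
    """Hypothetical additional data extractor; not part of goose."""

    def extract(self, raw_html):
        parser = FirstHeadingParser()
        parser.feed(raw_html)
        parser.close()
        return {"h1": parser.found.get("h1"), "h2": parser.found.get("h2")}
```

The idea would be that the configuration holds a list of such objects and the crawler merges each returned dict into the article's additional data, but how that wiring should look is exactly what I am asking about.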
At the moment, additional data extractors are neither used nor configured. I guess the idea here is to have the configuration handle both a custom document cleaner and a custom extractor.
This could be a great improvement to goose. I can look into it.