Improve interoperability with form2request and Scrapy Cloud support #3
>>> from form2request import form2request
>>> from formasaurus import build_submission
>>> form, data, submit_button = build_submission(html, "search", {"search query": "foo"})
>>> request_data = form2request(form, data, click=submit_button)
I am not sure whether the recommended default approach should be `click=submit_button` or `click=submit_button or False`, i.e. if Formasaurus cannot find a button, should we let form2request click the default submit button if it finds one, or force form2request not to click any submit button?
I suspect Formasaurus does not actively discard submit buttons, i.e. if there is one it will return it, in which case this would be a non-issue. If it turns out to be an actual issue, we can always update this line of the docs accordingly.
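One caveat with the `click=submit_button or False` spelling, sketched below under the assumption that `submit_button` is either `None` or an lxml element: lxml elements without child elements evaluate as false, so `or False` could discard a button that was actually found. `ChildlessButton` and `resolve_click` are hypothetical names used only for illustration; neither is part of Formasaurus or form2request.

```python
class ChildlessButton:
    """Hypothetical stand-in for an lxml element that has no child elements."""

    def __len__(self):
        # An lxml element reports its number of child elements here,
        # and its truth value follows that count.
        return 0


def resolve_click(submit_button):
    """Return a value for ``click`` (hypothetical helper).

    None (no button found) becomes False, meaning "do not click anything";
    any actual button object is passed through unchanged.
    """
    return False if submit_button is None else submit_button


button = ChildlessButton()

# `or` tests truthiness, so a childless button is silently replaced by False.
assert (button or False) is False

# The explicit None check keeps the button and only maps None to False.
assert resolve_click(button) is button
assert resolve_click(None) is False
```

If the docs ever do recommend forcing "no click" when no button is found, an explicit `is None` check along these lines may be the safer wording.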
@@ -260,7 +319,7 @@ def classify_proba(self, form, threshold=0.0):
         return self._probs2dict(probs, threshold)
 
     def train(self, annotations):
-        """Train FormExtractor on a list of FormAnnotation objects."""
+        """Train formtype_model on a list of FormAnnotation objects."""
Not sure the new wording is accurate, but there is no `FormExtractor` class, so I strongly suspect that the previous wording was out of date.
raise NotImplementedError(
    f"{clf.__class__.__name__} serialization is not implemented"
)
Not implemented since the current `FormFieldClassifier` class hard-codes `prob=True`.
@@ -1,111 +0,0 @@
-{
Removed as it seems out of date, with references to the missing `FormExtractor` class.
Regarding naming:
`_form2request` is from an earlier draft where form2request was actually used as a dependency, but I ended up going for a simpler approach, where both libraries have APIs that make it easy for them to work together, while keeping them independent. I could not come up with a better name for the private module, so I kept it. Glad to change it.
I do not like `build_submission` that much either, but I could not think of something better for a function that “finds a form of the given type, maps input data to form fields by field type, and finds the submission button”.
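To make the three steps in that description concrete, here is a toy sketch over plain, already-classified dictionaries. It is not the Formasaurus implementation: `build_submission_sketch`, the dictionary layout, and the `"submit button"` field-type label are assumptions made for illustration, and the real function works on parsed HTML rather than on dictionaries.

```python
def build_submission_sketch(forms, form_type, fields):
    """Toy version of the three steps (hypothetical, not the real API).

    forms: list of dicts such as
        {"type": "search",
         "fields": {"q": "search query", "go": "submit button"}}
      where "fields" maps a field name to its already classified field type.
    fields: dict mapping a field type to the value to submit.
    """
    # Step 1: find the first form of the requested type.
    form = next((f for f in forms if f["type"] == form_type), None)
    if form is None:
        raise ValueError(f"no form of type {form_type!r} found")

    # Step 2: map the input data to field names by field type.
    data = {
        name: fields[field_type]
        for name, field_type in form["fields"].items()
        if field_type in fields
    }

    # Step 3: find the submission button, or None if the form has none.
    submit_button = next(
        (
            name
            for name, field_type in form["fields"].items()
            if field_type == "submit button"  # assumed label
        ),
        None,
    )
    return form, data, submit_button


forms = [
    {"type": "login", "fields": {"user": "username", "pw": "password"}},
    {"type": "search", "fields": {"q": "search query", "go": "submit button"}},
]
form, data, submit_button = build_submission_sketch(
    forms, "search", {"search query": "foo"}
)
```

Seen this way, the function does three separable things, which may be part of why a single short name is hard to find.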