Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Py3 upgrade and Pacer Refactoring #171

Merged
merged 16 commits into from
Feb 2, 2017
Merged

Py3 upgrade and Pacer Refactoring #171

merged 16 commits into from
Feb 2, 2017

Conversation

voutilad
Copy link
Contributor

Currently the test_extract_written_documents_report test is skipped/disabled. Didn't delete it while we review if the logic is covered by the new tests.

Did some small cleanup in support of Py3 using six. Wasn't major.

Pacer refactoring now supports transition to newer versions of requests when needed and should be easier to follow. Added a bunch of tests around it as well.

Changed the setup.py to have the test requirements stuff baked in for mock and vcrpy, but since they aren't needed for execution they're not in the requirements.txt file. If someone runs either tox or python setup.py test they get installed on the fly.

… tricks. added tox for testing. still an issue with the title case function due to how python handles unicode strings now.
…t tries to see if a string starts with unicode or not.
…ompatability wrapper function around calls to the requests response objects.
…he mock not closing a connection. removed my stupid broken non-fix for test_pacer.py
…kie jar instance. refactored out posts to PACER as it turns out you need some black magick voodoo to form the post body into something it will enjoy.
added mocks dependency for unit tests (to tox.ini and requirements-dev.txt
start refactoring some of the Pacer stuff into a PacerSession class that extends requests.Session to handle PACER nuances
tests passing locally with tox using free login.
…irements.txt file. still need to update README.rst about changes.

refactored the BadLoginException into the juriscraper.pacer.http module as it fits better next to the place that raises it.
added default timeout value of 300 to pacer sessions since it seemed commonly set elsewhere
…ll supporting the legacy test site that does not seem supported at the moment.
Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK! This was a big PR, but it's essentially a go. I made a lot of comments for tweaks, but I don't think there's much that's substantive.

if isinstance(s, unicode):
s = s.encode('ascii', 'ignore')
#if isinstance(s, six.text_type):
# s = s.encode('ascii', 'ignore')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why commented out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to remember...but I believe it was doing nothing but cause problems since Py3 works fine with unicode and Py2 does better. I couldn't figure out the point of forcing things to ASCII as it didn't seem to impact the tests. Either I'm missing something or it's legacy stuff.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. This code isn't used often, so we can leave this and uncomment it if needed.


# Fix misspellings
for i, j in MISSPELLINGS.iteritems():
for i, j in six.iteritems(MISSPELLINGS):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just do .items() in Python 2.7 and in 3.x?

@@ -31,6 +32,8 @@
ALL_CAPS = re.compile(r'^[A-Z\s%s%s%s]+$' % (PUNCT, WEIRD_CHARS, NUMS))
UC_INITIALS = re.compile(r"^(?:[A-Z]{1}\.{1}|[A-Z]{1}\.{1}[A-Z]{1})+,?$")
MAC_MC = re.compile(r'^([Mm]a?c)(\w+.*)')


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine.

return CAPFIRST.sub(lambda m: m.group(0).upper(), word)
except UnicodeEncodeError:
# starts with unicode
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a corner case within a corner case, but can't we uppercase unicode characters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was all due to the test case of ['Reading between the lines of steve jobs’s ‘thoughts on music’'expecting to be converted to u'Reading Between the Lines of Steve Jobs’s ‘thoughts on Music’'].

That character (\u2018) before thoughts seems to break the regex patterns in Python 3. In Python 3.x it was giving this as output 'Reading Between the Lines of Steve Jobs’s ‘Thoughts on Music’'. Which failed the original test. Since I'm presuming the tests reflect the expected design I figured the workaround in Python3 was to simply test if the first character was unicode and if so skip the word.

I guess now that I think of it, something like canelo álvarez would end up Canelo álvarez with this logic. I'm not sure what the original expectation was beyond what I saw in the test cases. ¯_(ツ)_/¯

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually seems like the tests are wrong. We should be capitalizing thoughts in the example above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The very next test case has the same situation:
Input: 'seriously, ‘repair permissions’ is voodoo'
Expected Output: u'Seriously, ‘repair Permissions’ is Voodoo'

Should I change these and fix the unicode stuff I did?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, please. This code is pretty old. Not sure why the test cases suck.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it's been this way a loooong time maybe since first commit: https://github.com/freelawproject/juriscraper/blob/626723b4373d65463e03d233bc1756128d039e1b/tests/tests.py

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, seems about right.

@@ -182,7 +205,7 @@ def fix_camel_case(s):
s_out = s
else:
s_out = s[0]
for i in xrange(1, len(s)):
for i in six.moves.range(1, len(s)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just make this range instead of xrange? Remove the need for six?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to iteritems...yeah we shouldn't have an issue simplifying. I believe the catch is in Python 3.x range acts like xrange in that I think it's a lazy sequence while in Python 2.x range returns a list immediately. The length of s shouldn't be an issue here, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah, should be fine.

@@ -60,7 +61,7 @@ def test_fix_future_year_typo(self):
'12/01/2806': '12/01/2806', # Should not change
'12/01/2886': '12/01/2886', # Should not change
}
for before, after in expectations.iteritems():
for before, after in six.iteritems(expectations):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use .items() here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, wasn't aware of that. Seems to be functionally the same although it doesn't return an iterator. I think for this case since the data isn't large it's moot. I'll back out the calls to six.iteritems.

@@ -387,7 +385,7 @@ def test_make_short_name(self):
def test_quarter(self):
answers = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 3, 8: 3, 9: 3, 10: 4,
11: 4, 12: 4}
for month, q in answers.iteritems():
for month, q in six.iteritems(answers):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.items()?

@@ -400,7 +398,7 @@ def test_is_first_month_in_quarter(self):
6: False,
7: True,
}
for month, is_first in answers.iteritems():
for month, is_first in six.iteritems(answers):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

@@ -907,7 +905,7 @@ def test_colo_coloctapp(self):
}

scraper = colo.Site()
for raw_string, data in tests.iteritems():
for raw_string, data in six.iteritems(tests):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

"""
data = {'name': ('filename', 'junk')}

self.session.post('http://free.law', files=data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https! :)

@voutilad voutilad merged commit 33756fc into master Feb 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants