replace the bioguide scraper with one that can do a deep parse of the bioguide #304

Open
wants to merge 2 commits into
base: main

JoshData
Member

@JoshData JoshData commented Aug 2, 2015

This is a bit crazy, but it kind of works. The bioguide is written with a fairly consistent format. This commit replaces bioguide.py with a new deep parser for bioguide entries.

It produces output like this:

activities:
 [...snip...]
- text: Minister to England 1815-1817, assisted in concluding the convention of commerce
    with Great Britain
- date:
    end: 1825
    start: 1817
  text: Secretary of State in the Cabinet of President James Monroe
- date: 1829-03-03
  text: decision in the 1824 election of the President of the United States fell,
    according to the Constitution of the United States, upon the House of Representatives,
    as none of the candidates had secured a majority of the electors chosen by the
    states, and Adams, who stood second to Andrew Jackson in the electoral vote, was
    chosen and served from March 4, 1825, to
- date: 1834
  text: elected as a Republican to the U.S. House of Representatives for the Twenty-second
    and to the eight succeeding Congresses, becoming a Whig
- text: served from March 4, 1831, until his death
- text: chairman, Committee on Manufactures (Twenty-second through Twenty-sixth, and
    Twenty-eighth and Twenty-ninth Congresses), Committee on Indian Affairs (Twenty-seventh
    Congress), Committee on Foreign Affairs (Twenty-seventh Congress)
- date: 1834
  text: unsuccessful candidate for Governor of Massachusetts
- text: interment in the family burial ground at Quincy, Mass.
- text: subsequently reinterred in United First Parish Church
born:
  date: 1767-07-11
  location: Braintree, Mass.
died:
  date: 1848-02-23
  location: the U.S. Capitol Building, Washington, D.C.
elected:
- dates:
    end: 1808-06-08
    end-reason: resignation
    start: 1803-03-04
  elections:
  - how: elected
    party: Federalist
    type: senate
family-relations:
- relation: son
  to:
    name: John Adams
- relation: father
  to:
    name: Charles Francis Adams
- relation: brother-in-law
  to:
    name: William Stephens Smith
name: ADAMS, John Quincy
name-info: []
roles:
- state: MA
  type: Senator
- state: MA
  type: Representative
- ordinal: 6
  type: President of the United States

for

ADAMS, John Quincy, (son of John Adams, father of Charles Francis Adams, brother-in-law
of William Stephens Smith), a Senator and a Representative from Massachusetts and
6th President of the United States; born in Braintree, Mass., July 11, 1767; acquired
his early education in Europe at the University of Leyden; was graduated from Harvard
University in 1787; studied law; was admitted to the bar and commenced practice
in Boston, Mass.; appointed Minister to Netherlands 1794, Minister to Portugal 1796,
Minister to Prussia 1797, and served until 1801; commissioned to make a commercial
treaty with Sweden in 1798; elected to the Massachusetts State senate in 1802; unsuccessful
candidate for election to the U.S. House of Representatives in 1802; elected as
a Federalist to the United States Senate and served from March 4, 1803, until June
8, 1808, when he resigned, a successor having been elected six months early after
Adams broke with the Federalist party; Minister to Russia 1809-1814; member of the
commission which negotiated the Treaty of Ghent in 1814; Minister to England 1815-1817,
assisted in concluding the convention of commerce with Great Britain; Secretary
of State in the Cabinet of President James Monroe 1817-1825; decision in the 1824
election of the President of the United States fell, according to the Constitution
of the United States, upon the House of Representatives, as none of the candidates
had secured a majority of the electors chosen by the states, and Adams, who stood
second to Andrew Jackson in the electoral vote, was chosen and served from March
4, 1825, to March 3, 1829; elected as a Republican to the U.S. House of Representatives
for the Twenty-second and to the eight succeeding Congresses, becoming a Whig in
1834; served from March 4, 1831, until his death; chairman, Committee on Manufactures
(Twenty-second through Twenty-sixth, and Twenty-eighth and Twenty-ninth Congresses),
Committee on Indian Affairs (Twenty-seventh Congress), Committee on Foreign Affairs
(Twenty-seventh Congress); unsuccessful candidate for Governor of Massachusetts
in 1834; died in the U.S. Capitol Building, Washington, D.C., February 23, 1848;
interment in the family burial ground at Quincy, Mass.; subsequently reinterred
in United First Parish Church.

The output is rough, and there are lots of incorrect parses. In this case, the parser doesn't recognize one of the elections: the election to the House shows up under activities instead of under elected. Maybe some of these issues can be fixed. The schema of the output is also a little unpredictable because it's trying to handle a lot of cases in the input.
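
As an illustration of what the unpredictable schema means for anyone consuming this output: in the sample above, an activity's date can be missing, a bare year (1834), a full date (1829-03-03), or a mapping with start and end keys. The sketch below is not part of this PR. It shows one way a consumer might flatten those shapes, and the helper name `date_bounds` is hypothetical.

```python
from typing import Any, Optional, Tuple


def date_bounds(value: Any) -> Tuple[Optional[str], Optional[str]]:
    """Return (start, end) as strings for the date shapes seen in the sample.

    Hypothetical helper, not part of the PR. Handles:
      - None (the activity has no date key)
      - a scalar: an int year, a string, or a date object from a YAML loader
      - a mapping with optional "start" and "end" keys
    """
    if value is None:
        return (None, None)
    if isinstance(value, dict):
        start = value.get("start")
        end = value.get("end")
        return (
            None if start is None else str(start),
            None if end is None else str(end),
        )
    # A scalar date is treated as both the start and the end.
    text = str(value)
    return (text, text)


# Activities copied from the sample output above.
activities = [
    {"text": "served from March 4, 1831, until his death"},
    {
        "date": {"start": 1817, "end": 1825},
        "text": "Secretary of State in the Cabinet of President James Monroe",
    },
    {"date": 1834, "text": "unsuccessful candidate for Governor of Massachusetts"},
]

for activity in activities:
    print(date_bounds(activity.get("date")), activity["text"])
```

Treating a scalar as both start and end is only one possible convention. A consumer that cares about precision might instead keep track of whether the value was a year or a full date.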

I've posted the complete output here:

https://www.govtrack.us/data/misc/bioguide-parsed.yaml (30 MB)

@JoshData JoshData force-pushed the bioguide-deep-parse branch from 5cf842f to a9cd8b7 Compare August 2, 2015 22:36
@JoshData JoshData mentioned this pull request Aug 2, 2015
@dannguyen
Contributor

This is great!

@dannguyen
Contributor

For a future feature, one of the things that I've thought could be easy is to include an education field. Many of the entries share the same phrasing:

was graduated from Harvard University in 1787; studied law; was admitted to the bar and commenced practice in Boston, Mass

Of course, there's a lot of nuance. Some bios mention admittance but not graduation, so the resulting schema would have to account for the different educational outcomes and actions, as well as multiple schools and degrees. But my estimate is that there's a large amount of low-hanging fruit, i.e. "was graduated from [some college]", that could be filled out. I think the topic of educational background is fascinating, even beyond the obvious finding that Harvard by far has the most alumni in the federal legislative structure.
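
A rough sketch of that low-hanging fruit, assuming the bio is split on semicolons and only clauses beginning with "was graduated from" or "graduated from" are considered. None of this is in the PR: the function name `extract_education` and the output keys `school` and `year` are hypothetical, and the pattern ignores the nuance mentioned above (degrees, attendance without graduation, several schools in one clause).

```python
import re

# Matches "was graduated from <school>" with an optional trailing "in <year>".
GRADUATED = re.compile(
    r"(?:was )?graduated from (?P<school>.+?)(?: in (?P<year>\d{4}))?$"
)


def extract_education(bio):
    """Return a list of {"school", "year"} dicts found in a bioguide entry."""
    results = []
    for clause in bio.split(";"):
        # Collapse wrapped lines and extra spaces, drop a trailing period.
        clause = " ".join(clause.split()).rstrip(".")
        match = GRADUATED.match(clause)
        if match is None:
            continue
        year = match.group("year")
        results.append(
            {
                "school": match.group("school"),
                "year": int(year) if year is not None else None,
            }
        )
    return results


sample = (
    "was graduated from Harvard University in 1787; studied law; "
    "was admitted to the bar and commenced practice in Boston, Mass"
)
print(extract_education(sample))
```

Matching per clause keeps the pattern simple, but it would miss entries that phrase graduation differently or list the school after other text in the same clause.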

2 participants