An EUR-Lex parser for Python.
You can install this package as follows:
pip install -U eurlex

After installing this package, you can download and parse any document from EUR-Lex. For example, the 32019R0947 regulation:
from eurlex import get_html_by_celex_id, parse_html
# Retrieve and parse the document with CELEX ID "32019R0947" into a Pandas DataFrame
celex_id = "32019R0947"
html = get_html_by_celex_id(celex_id)
df = parse_html(html)
# Get the first line of Article 1
df_article_1 = df[df.article == "1"]
df_article_1_line_1 = df_article_1.iloc[0]
# Display the subtitle of Article 1
print(df_article_1_line_1.article_subtitle)
>>> "Subject matter"
# Display the corresponding text
print(df_article_1_line_1.text)
>>> "This Regulation lays down detailed provisions for the operation of unmanned aircraft systems as well as for personnel, including remote pilots and organisations involved in those operations."

Every document on EUR-Lex displays a CELEX number at the top of the page. More information on CELEX numbers can be found on the EUR-Lex website.
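To make the CELEX ID used above easier to read, here is a minimal sketch that splits it into its parts. It assumes the common layout of a one-character sector, a four-digit year, a one- or two-letter document type, and a four-digit number; other CELEX layouts (for example treaties or corrigenda) are not covered. The names split_celex_id and CelexParts are hypothetical helpers for illustration and are not part of the eurlex package.

```python
import re
from typing import NamedTuple


class CelexParts(NamedTuple):
    """Hypothetical container for the parts of a simple CELEX ID."""
    sector: str
    year: str
    doc_type: str
    number: str


# Assumption: sector (1 character), year (4 digits),
# document type (1-2 letters), document number (4 digits).
_CELEX_PATTERN = re.compile(
    r"^(?P<sector>[0-9A-Z])(?P<year>\d{4})(?P<doc_type>[A-Z]{1,2})(?P<number>\d{4})$"
)


def split_celex_id(celex_id: str) -> CelexParts:
    """Split a simple CELEX ID into its parts, or raise ValueError."""
    match = _CELEX_PATTERN.match(celex_id)
    if match is None:
        raise ValueError(f"Unsupported CELEX ID layout: {celex_id!r}")
    return CelexParts(**match.groupdict())


parts = split_celex_id("32019R0947")
print(parts.sector, parts.year, parts.doc_type, parts.number)
```

For "32019R0947" this yields sector 3 (legislation), year 2019, type R (regulation) and number 0947.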
For more information about the methods in this package, see the unit tests and doctests.
The following columns are available in the parsed dataframe:
text: The text
type: The type of the data
document: The document in which the text is found
article: The article in which the text is found
article_subtitle: The subtitle of the article (when available)
ref: The indentation level of the text within the article (e.g. ["(1)", "(a)"] when the text is found under paragraph (1), subparagraph (a))
In some cases, additional fields are available. For example, the group field contains the bold text under which a text is found.
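As an illustration of how the article and ref columns fit together, the sketch below builds a citation-style label from one row's values. It assumes that ref holds a list of marker strings, as in the ["(1)", "(a)"] example above; format_reference is a hypothetical helper and not part of the eurlex package.

```python
from typing import Sequence


def format_reference(article: str, ref: Sequence[str]) -> str:
    """Join an article number and its ref markers into a readable label.

    Assumption: `ref` is a sequence of marker strings such as "(1)" or "(a)",
    ordered from the outermost level to the innermost.
    """
    return f"Article {article}" + "".join(ref)


# Values as they might appear in a single row of the parsed dataframe
print(format_reference("1", ["(1)", "(a)"]))  # Article 1(1)(a)
print(format_reference("2", []))              # Article 2
```

An empty ref is treated as text that sits directly under the article.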
Feel free to send any issues, ideas or pull requests.