Every webpage contains data. Sometimes, this data proves useful in a context other than the user interface of the webpage itself. Happily, with the right tools, we can extract data from any webpage. Indeed, we might think of a webpage as a "really bad" API (credit: Forest Gregg), with which we can interact and collect information.
At DataMade, we use the lxml library to access webpages. The lxml library parses and processes HTML. It’s well-documented, expansive, and popular. The lxml documentation provides extensive discussion of the library and all it has to offer, and this guide gives a snappy overview of scraping in general.
For now, we will cover a few basics and then consider a sophisticated scraper: python-legistar.
Keep your project organized with a virtual environment, which safely isolates the requirements of a unique Python app. We recommend using virtualenv and virtualenvwrapper.
mkvirtualenv scraper-tutorial
pip install requests
pip install lxml
Next, let’s write some code. Create a new Python file (e.g., new_scraper.py), and import the requests and lxml libraries at the top of the file:
import requests
import lxml.html
from lxml.etree import tostring
Now, identify a website that holds useful, precious, and/or interesting data. For this tutorial, we will use the calendar of events available in Legistar from the Los Angeles Metro Transportation Authority.
Get the text of the webpage, using the requests library:
entry = requests.get('https://metro.legistar.com/Calendar.aspx').text
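Note that requests returns .text even when the server responds with an error page. If you would rather fail loudly on a bad response, raise_for_status (part of the requests API) can help:
response = requests.get('https://metro.legistar.com/Calendar.aspx')
response.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
entry = response.text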
Then, parse the retrieved page with the lxml library:
page = lxml.html.fromstring(entry)
The fromstring method returns parsed HTML – either the root of a full HTML document or a bare fragment, depending on the input.
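For instance, a toy illustration (the HTML strings here are made up, not part of the Metro page):
fragment = lxml.html.fromstring('<p>Hello</p>')  # a bare fragment parses to that element
document = lxml.html.fromstring('<html><body><p>Hello</p></body></html>')
print(fragment.tag)  # p
print(document.tag)  # html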
Next, we need to locate and isolate elements within the DOM. We can do this with xpath – a collection of precise expressions used to traverse an XML document. (Here's a cheatsheet.)
Let’s first inspect the webpage in question: Visit the site, right click, and select "Inspect" to open the Elements tab of the developer console. You'll see a little something like this:
<body>
  …
  <form>
    …
    <div id='ctl00_Div1'>
      <div id='ctl00_Div2'>
        <div id='ctl00_divMiddle'>
          …
          <div id='ctl00_ContentPlaceHolder1_MultiPageCalendar'>
            …
            <table>
              ...
          ...
</body>
That’s a messy DOM! Notice that the page contains a body (of course) and a form with several nested divs and tables.
We want to access a particular table, i.e., the table with information about Metro meetings. To do so, we can write something like this:
div_id = 'ctl00_ContentPlaceHolder1_divGrid'
events_table = page.xpath("//body/form//div[@id='%s']//table" % div_id)[0]
Let’s break this down.
"/" tells xpath
to look for direct children: with "body/form," xpath
looks for instances of forms that sit directly within the body, rather than forms nested further within the DOM.
On the other hand, "//" tells xpath to look for all descendants: with "//table," xpath looks for all table descendants of the specified div. How do you specify a div? @id allows us to search for a div with a unique identifier. It prevents the hassle of iterating over handfuls of nested divs, which – in this example – could get rather sticky.
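To see the difference concretely, here are both flavors of the table query side by side (a quick sketch using the same div_id as above):
direct_tables = page.xpath("//div[@id='%s']/table" % div_id)   # only tables sitting directly inside the div
nested_tables = page.xpath("//div[@id='%s']//table" % div_id)  # tables at any depth beneath the div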
xpath returns a list. In this example, we know that we want the first element in this list, so we conclude the above code snippet by grabbing the zeroth index.
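If the page ever changes and the query matches nothing, grabbing index zero will raise an IndexError. A slightly more defensive version of the same lookup (optional, but cheap insurance) might look like this:
matches = page.xpath("//body/form//div[@id='%s']//table" % div_id)
if not matches:
    raise ValueError('No events table found - did the page change?')
events_table = matches[0]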
Want to see what’s inside the element? You can print your table to the console. Call the tostring function, and check out the results.
print(tostring(events_table))
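One note: tostring returns bytes by default. Passing encoding='unicode' gives you a string instead, and pretty_print=True indents the markup – both are standard lxml options:
print(tostring(events_table, encoding='unicode', pretty_print=True))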
We want to get the content inside the table cells. First, get all table rows. Then, iterate over each row, get the tds from the row, and iterate over the tds – isolating the content of individual table cells.
table_rows = events_table.xpath(".//tr")  # every row in the events table
for row in table_rows:
    tds = row.xpath("./td")  # direct td children of this row
    for td in tds:
        print(td.text_content())
You can do whatever you like with the table cell content. Save it in a list! Create a dict with the headers! Import it to the database! Or simply print it, and enjoy your mastery over the DOM.
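As one sketch of the dict approach – assuming the first row of the table holds th header cells, which is worth confirming in the inspector – you could zip the headers against each subsequent row:
headers = [th.text_content().strip() for th in table_rows[0].xpath('.//th')]  # assumes row 0 holds <th> headers
events = []
for row in table_rows[1:]:
    cells = [td.text_content().strip() for td in row.xpath('./td')]
    if len(cells) == len(headers):  # skip spacer or malformed rows
        events.append(dict(zip(headers, cells)))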
Coming soon!