Wikidata and friends

Wikidata

The Wikidata project is an open knowledge base of "things". The QID (or Q number) is the unique identifier of a data item on Wikidata, comprising the letter "Q" followed by one or more digits. It is used to help people and machines understand the difference between items with the same or similar names.
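The QID format described above is simple enough to sanity-check before use. A minimal sketch (the pattern here is derived only from the "Q followed by one or more digits" description, not from any official validator):

```python
import re

# "Q" followed by one or more digits, per the QID description above.
QID_PATTERN = re.compile(r"Q\d+")

def is_valid_qid(value: str) -> bool:
    """Return True if the string looks like a Wikidata QID, e.g. "Q9361374"."""
    return bool(QID_PATTERN.fullmatch(value))
```

A check like this can catch typos (a missing "Q", a stray lowercase letter) before a bad code reaches the pipeline.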

A number of open data projects use Wikidata codes to augment their data, and so do we. Nearly all companies and brands of a certain size have a Wikidata entry. Some examples include Starbucks, McDonald’s, Travelodge and Travelodge. The latter two, distinct hotel chains that share a name, illustrate rather well why disambiguation is needed.

If possible, please apply a Wikidata QID to the POIs generated by your spider. The simplest way is to add it to the spider's item_attributes field, from where pipeline code applies it automatically. For example:

import scrapy

class TravelodgeGBSpider(scrapy.Spider):
    name = "travelodge_gb"
    item_attributes = {"brand": "Travelodge UK", "brand_wikidata": "Q9361374"}
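The "automatically applied by pipeline code" step amounts to merging the spider's item_attributes into each item it yields. The class below is an illustrative sketch only (the name and exact behaviour of the real ATP pipeline differ):

```python
# Hypothetical pipeline name; sketch of the item_attributes merge idea only.
class ApplySpiderAttributesPipeline:
    def process_item(self, item, spider):
        # Copy each spider-level attribute onto the item, without
        # overwriting anything the spider set explicitly per-item.
        for key, value in getattr(spider, "item_attributes", {}).items():
            item.setdefault(key, value)
        return item
```

This is why a single item_attributes line on the spider is enough: every POI the spider emits picks up the brand and brand_wikidata fields.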

Many of the companies that we write spiders for will be significant enough to have been added to the Name Suggestion Index (NSI) project (see below). We provide a custom scrapy nsi command for running name queries against the NSI dataset, e.g.:

$ pipenv run scrapy nsi --name travelodge
"Travelodge UK", "Q9361374"
       -> https://www.wikidata.org/wiki/Q9361374
       -> https://www.wikidata.org/wiki/Special:EntityData/Q9361374.json
       -> British budget hotel chain
       -> https://www.travelodge.co.uk/
       -> item_attributes = {"brand": "Travelodge UK", "brand_wikidata": "Q9361374"}
"Travelodge", "Q7836087"
       -> https://www.wikidata.org/wiki/Q7836087
       -> https://www.wikidata.org/wiki/Special:EntityData/Q7836087.json
       -> midscale hotel chain run by Wyndham Hotels & Resorts
       -> https://www.wyndhamhotels.com/travelodge
       -> item_attributes = {"brand": "Travelodge", "brand_wikidata": "Q7836087"}

Note the Python code fragments, generated as a convenience for you to copy and paste into your spider. If the NSI query does not give you what you want immediately, you may need to dig deeper, perhaps going directly to the Wikidata search page.
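The Special:EntityData URLs shown in the output above return JSON describing the entity. A minimal sketch of pulling the English label and description out of such a payload (the dictionary layout below matches the Wikibase entity JSON, but treat it as an assumption and check a real payload):

```python
def entity_summary(entity_json: dict, qid: str, lang: str = "en"):
    """Extract (label, description) for one entity from a
    Special:EntityData JSON payload, e.g. Q9361374.json."""
    entity = entity_json["entities"][qid]
    label = entity["labels"].get(lang, {}).get("value")
    description = entity["descriptions"].get(lang, {}).get("value")
    return label, description
```

This is roughly where the "British budget hotel chain" description in the scrapy nsi output comes from.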

Name Suggestion Index (NSI)

The NSI is essentially a well curated subset of Wikidata brand and operator (company) QIDs. The main goal of the project is to aid OpenStreetMap editing of POIs by allowing easy import of common data. This is well described in a video of a presentation given at OSM State of the Map 2019.

The success of the NSI can be gauged from the rapid increase, over the past few years, in the number of OSM elements carrying Wikidata codes.

Running our custom scrapy nsi command, but this time with a direct QID parameter specified, yields the NSI tag suggestions for the QID. This is the bottom line of the output below:

$ pipenv run scrapy nsi --code Q9361374
"Travelodge UK", "Q9361374"
       -> https://www.wikidata.org/wiki/Q9361374
       -> https://www.wikidata.org/wiki/Special:EntityData/Q9361374.json
       -> British budget hotel chain
       -> https://www.travelodge.co.uk/
       -> item_attributes = {"brand": "Travelodge UK", "brand_wikidata": "Q9361374"}
       -> {'displayName': 'Travelodge (Europe)', 'id': 'travelodge-dec11e', 'locationSet': {'include': ['es', 'gb', 'ie']}, 'tags': {'brand': 'Travelodge', 'brand:wikidata': 'Q9361374', 'internet_access': 'wlan', 'internet_access:fee': 'customers', 'internet_access:ssid': 'Travelodge WiFi', 'name': 'Travelodge', 'tourism': 'hotel'}}

Note that the OSM POI category attributes for a hotel ('tourism': 'hotel') are part of the tag set that the NSI is "suggesting" in this case.

You may also notice that the bottom line has an 'id' – an internal NSI identifier of the brand. This ID is not stable, so you shouldn't hardcode it into spiders. Instead, our pipeline will retrieve it directly from NSI each time a spider launches, based on the Wikidata QID specified in the spider. If the brand/operator doesn't have a Wikidata QID in NSI, you can either raise an issue in the NSI repository, or contribute the missing QID to NSI yourself.
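The "retrieve it directly from NSI" step boils down to scanning the NSI entries for ones whose Wikidata tag matches the spider's QID. A sketch, assuming entries shaped like the scrapy nsi --code output above (not the real pipeline code):

```python
def find_nsi_entries(nsi_items: list, qid: str) -> list:
    """Return NSI entries whose brand:wikidata (or operator:wikidata)
    tag matches the given QID. Entry shape mirrors the
    'scrapy nsi --code' output, e.g. {'id': ..., 'tags': {...}}."""
    return [
        item
        for item in nsi_items
        if item.get("tags", {}).get("brand:wikidata") == qid
        or item.get("tags", {}).get("operator:wikidata") == qid
    ]
```

Because the lookup is keyed on the QID at spider launch, the unstable NSI 'id' never needs to appear in spider code.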

Finding scrapers to build by category

To look for missing or existing scrapers in a given category, review https://nsi.guide/?t=brands and supply the category to the nsi --detect-missing command.

$ pipenv run scrapy nsi --detect-missing brands/shop/supermarket
Fetched 915 brands/shop/supermarket from NSI
Missing by wikidata: 619
"3hreeSixty", "Q7797310"
       -> https://www.wikidata.org/wiki/Q7797310
       -> https://www.wikidata.org/wiki/Special:EntityData/Q7797310.json

You can also search by a location code such as "za" or "us-ct.geojson":

$ pipenv run scrapy nsi --detect-missing za
Missing by wikidata: 66
...

Check carefully whether this is simply an existing scraper missing a Wikidata entry, or whether a scraper is truly missing.

Automatic POI categorisation

The ATP item pipeline will automatically attempt to enhance the POIs it sees with OSM category data from the NSI. It only does this when there is an unambiguous match of QID, country location (if appropriate) and category suggestion.
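The "unambiguous match" rule can be sketched as: find the NSI entries matching the QID and country, and apply tags only when exactly one remains. This is an illustrative simplification, not the real ApplyNSICategoriesPipeline logic (which also weighs the category suggestion):

```python
def pick_nsi_match(entries, qid, country):
    """Return the single NSI entry matching the QID and country code,
    or None when the match is missing or ambiguous (in which case
    nothing would be applied). Entries use the locationSet shape seen
    in the 'scrapy nsi --code' output."""
    matches = [
        e for e in entries
        if e.get("tags", {}).get("brand:wikidata") == qid
        and country in e.get("locationSet", {}).get("include", [])
    ]
    return matches[0] if len(matches) == 1 else None
```

With one match, tags such as tourism=hotel can be copied onto the POI with confidence; with zero or several, the safe choice is to do nothing.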

In the great majority of cases the automation produces the correct answer. Where it fails for some reason, the automation can be disabled for the spider and the correct categories applied in the spider code itself. See the ApplyNSICategoriesPipeline docs.

A virtuous circle?

Wikidata, NSI, OSM and ATP support each other with respect to data integrity and consistency. Eyeballs in one project are able to pick up problems in another.

Provided that a site does a good job of publishing its data, and that the ATP spider does a good job of reading it, a range of possibilities opens up.

For example, it is far easier to support an editor changing a POI from non-branded (no QID) to branded when we can point to a co-located scraped POI. And where OSM already has good branded data, it is easier to suggest a change or removal based on changes in the branded, scraped data set.

Currently, after each weekly run of the full project, we publish a cross correlation (by QID) of the data in ATP, NSI and OSM. This picks up various anomalies between and within the projects, leading to edits that can make the open data world a better place.

Others do similar and interesting things.