Skip to content
This repository has been archived by the owner on Jun 1, 2022. It is now read-only.

Make structured hours information available to the API #721

Closed
simonw opened this issue Jul 8, 2021 · 20 comments
Closed

Make structured hours information available to the API #721

simonw opened this issue Jul 8, 2021 · 20 comments
Labels
exporters Tools for exporting data

Comments

@simonw
Copy link
Collaborator

simonw commented Jul 8, 2021

Part of #705 - we have this from our scrapers.

@simonw simonw added the exporters Tools for exporting data label Jul 8, 2021
@simonw
Copy link
Collaborator Author

simonw commented Jul 8, 2021

A few of our scrapers populate the opening_hours key with structured data. For the moment I'm only going to trust the vaccinefinder_org scraper for this. Some example records: https://vial.calltheshots.us/dashboard/?sql=select+json_extract_path%28import_json%3A%3Ajson%2C+%27opening_hours%27%29%2C+%2A+from+source_location+where+source_name%3D%27vaccinefinder_org%27+and++json_array_length%28json_extract_path%28import_json%3A%3Ajson%2C+%27opening_hours%27%29%29+%3E+0+order+by+id+desc+limit+10%3AaIbOyXDwGDBH94OoKXxg7A80MEEI637R7XRRdXMstQ0

Data looks like this:

[
    {
        "day": "monday",
        "opens": "09:00",
        "closes": "15:00"
    },
    {
        "day": "tuesday",
        "opens": "09:00",
        "closes": "15:00"
    },
    {
        "day": "wednesday",
        "opens": "09:00",
        "closes": "15:00"
    },
    {
        "day": "thursday",
        "opens": "09:00",
        "closes": "15:00"
    },
    {
        "day": "friday",
        "opens": "09:00",
        "closes": "15:00"
    }
]

@simonw
Copy link
Collaborator Author

simonw commented Jul 8, 2021

Running this long query returns source names that have opening_hours:

select source_name, count(*)
from source_location
where json_array_length(json_extract_path(import_json::json, 'opening_hours')) > 0
group by source_name
az_arcgis	11
az_ph_pinal_clinics_gov	7
dc_district	8
il_juvare	100
ma_immunizations	139
md_arcgis	150
mn_gov	382
nyc_arcgis	291
prepmod	542
tx_harriscounty_gov	396
tx_memorialhermann	5
us_carbon_health	8934
vaccinefinder_org	41561
wa_prepmod	296
wv_dhhr	9
wyo_appt_portal	94

@simonw
Copy link
Collaborator Author

simonw commented Jul 8, 2021

The Location model currently has a hours text field:

hours = models.TextField(
blank=True,
null=True,
help_text="Do not enter hours here for mobile clinics! File a report and put mobile clinic hours in the public notes.",
)

I'm going to add hours_json which will not be editable through the admin but will instead be populated by our scrapers - at least for the moment.

@simonw
Copy link
Collaborator Author

simonw commented Jul 8, 2021

I'll also add hours_json_provenance_source_location and hours_json_last_updated_at.

simonw added a commit that referenced this issue Jul 8, 2021
Also hours_json_provenance_source_location and hours_json_last_updated_at
simonw added a commit that referenced this issue Jul 8, 2021
@simonw
Copy link
Collaborator Author

simonw commented Jul 8, 2021

Back-populating this with a migration would be interesting. Normally this is risky because of the reduced bandwidth allowed between Cloud Build and Cloud SQL, but in this case I may be able to compose a SQL "UPDATE" that does all of the work with minimal bandwidth between the two - using this pattern: https://til.simonwillison.net/django/migration-using-cte

@simonw
Copy link
Collaborator Author

simonw commented Jul 8, 2021

Some queries I used to figure this out:

with scraped_opening_hours as (select 
  matched_location_id,
  json_extract_path(import_json::json, 'opening_hours') as opening_hours
from
  source_location
where
  source_name = 'vaccinefinder_org'
  and json_array_length(json_extract_path(import_json::json, 'opening_hours')) > 0
  and matched_location_id is not null
order by matched_location_id, last_imported_at)
select count(*) from scraped_opening_hours 

Returns 41562

What if a location (in this query represented by matched_location_id) has more than one matching source location? Here's a query that de-dupes based on the matched_location_id column using DISTINCT ON https://www.postgresql.org/docs/9.3/sql-select.html#SQL-DISTINCT

with scraped_opening_hours as (select distinct on (matched_location_id)
  matched_location_id,
  json_extract_path(import_json::json, 'opening_hours') as opening_hours
from
  source_location
where
  source_name = 'vaccinefinder_org'
  and json_array_length(json_extract_path(import_json::json, 'opening_hours')) > 0
  and matched_location_id is not null
order by matched_location_id, last_imported_at)
select count(*) from scraped_opening_hours

Returns 40931. This confirms that there are indeed some locations with multiple vaccinefinder_org source locations. I checked those out using:

with scraped_opening_hours as (select 
  matched_location_id,
  json_extract_path(import_json::json, 'opening_hours') as opening_hours
from
  source_location
where
  source_name = 'vaccinefinder_org'
  and json_array_length(json_extract_path(import_json::json, 'opening_hours')) > 0
  and matched_location_id is not null
order by matched_location_id, last_imported_at)
select matched_location_id, count(*) from scraped_opening_hours
group by matched_location_id having count(*) > 1

This returned 613 rows. Checked that against a simpler version of a similar query:

select matched_location_id, count(*)
from source_location
where source_name = 'vaccinefinder_org'
group by matched_location_id
having count(*) > 1

That returned 1719 rows - then I added the filter for just ones with opening hours:

select matched_location_id, count(*)
from source_location
where source_name = 'vaccinefinder_org'
  and json_array_length(json_extract_path(import_json::json, 'opening_hours')) > 0
group by matched_location_id
having count(*) > 1

Which returned 614 rows. That one row difference is because of the matched_location_id: null row that isn't included in the first query.

@simonw
Copy link
Collaborator Author

simonw commented Jul 8, 2021

Here's the update query:

with scraped_opening_hours as (
  select distinct on (matched_location_id)
    id as source_location_id,
    matched_location_id,
    json_extract_path(import_json::json, 'opening_hours') as opening_hours,
    last_imported_at
  from
    source_location
  where
    source_name = 'vaccinefinder_org'
    and json_array_length(json_extract_path(import_json::json, 'opening_hours')) > 0
    and matched_location_id is not null
  order by matched_location_id, last_imported_at
)
update location
  set
    hours_json = scraped_opening_hours.opening_hours,
    hours_json_last_updated_at=scraped_opening_hours.last_imported_at,
    hours_json_provenance_source_location_id=scraped_opening_hours.source_location_id
  from
    scraped_opening_hours
  where
    location.id = scraped_opening_hours.matched_location_id

@simonw
Copy link
Collaborator Author

simonw commented Jul 8, 2021

It took 31.4s to run against my local development environment, updating 40,396 locations. select count(*) from location where hours_json is not null confirms that number, and the data looks good.

I'm going to try this on staging.

@simonw
Copy link
Collaborator Author

simonw commented Jul 8, 2021

Before running the data migration on staging, this query returns 0 rows:

select id, name, full_address, hours, hours_json, hours_json_last_updated_at, hours_json_provenance_source_location_id
from location
where hours_json_provenance_source_location_id is not null
limit 100

https://vial-staging.calltheshots.us/dashboard/?sql=select+id%2C+name%2C+full_address%2C+hours%2C+hours_json%2C+hours_json_last_updated_at%2C+hours_json_provenance_source_location_id%0D%0Afrom+location%0D%0Awhere+hours_json_provenance_source_location_id+is+not+null%0D%0Alimit+100%3AgHeD2WniBNJhseQ_IhgUPpEsWi8vG5xMt-iDPjgalpw

@simonw
Copy link
Collaborator Author

simonw commented Jul 8, 2021

Next steps:

  • Update scraper code to populate these columns if the scraper is vaccinefinder_org and offers newly updated hours
  • Add the hours to the JSON exposed in the API
  • Ship it all to production

@simonw
Copy link
Collaborator Author

simonw commented Jul 8, 2021

I'm going to rename the location.derive_availability_and_inventory() method to derive_availability_and_inventory_and_hours() and put the logic there.

@simonw
Copy link
Collaborator Author

simonw commented Jul 8, 2021

@simonw
Copy link
Collaborator Author

simonw commented Jul 8, 2021

I'm going to rename the location.derive_availability_and_inventory() method to derive_availability_and_inventory_and_hours() and put the logic there.

Actually I'll go with derive_details() since that's less of a mouthful.

@simonw
Copy link
Collaborator Author

simonw commented Jul 8, 2021

Here's an example of the new "derive_details" debug tool for a location with hours on staging: https://vial-staging.calltheshots.us/location/ltbpc

San_Bernardino_Health_Center_-_VIAL

@simonw
Copy link
Collaborator Author

simonw commented Jul 8, 2021

That same page also shows the new APIv0 JSON preview: https://vial-staging.calltheshots.us/location/ltbpc

{
    "id": "ltbpc",
    "name": "San Bernardino Health Center",
    "provider": null,
    "state": "CA",
    "latitude": 34.0678,
    "longitude": -117.28565,
    "location_type": "Unknown",
    "phone_number": "(714) 922-4100",
    "full_address": "1873 Commercenter W;\nSan Bernardino, CA 92408",
    "city": "San Bernardino",
    "county": "San Bernardino",
    "zip_code": "92408",
    "hours": {
        "unstructured": null,
        "structured": [
            {
                "day": "thursday",
                "opens": "07:00",
                "closes": "18:00"
            },
            {
                "day": "sunday",
                "opens": "07:00",
                "closes": "16:30"
            }
        ]
    },
    "website": "https://myturn.ca.gov/",
    "vaccines_offered": null,
    "concordances": [
        "vaccinefinder_org:04d8dfb6-5c63-4ab4-9a87-8b9d2dc498eb",
        "us_carbon_health:a9db417a-f0b7-4e6c-b20e-a81935982974"
    ],
    "last_verified_by_vts": null,
    "vts_url": "https://www.vaccinatethestates.com/?lng=-117.28565&lat=34.06780#ltbpc"
}

Note that there are no guarantees that the unstructured and structured hours fields will reflect each other. For the moment I'll solve that by telling people about it - not sure what a good long term solution for that is.

@simonw
Copy link
Collaborator Author

simonw commented Jul 8, 2021

The staging API export failed after I deployed this code. No error in Sentry though.

{
  "insertId": "hhq9h4g3g8rhqh",
  "jsonPayload": {
    "targetType": "HTTP",
    "url": "https://vial-staging.calltheshots.us/api/exportVaccinateTheStates",
    "jobName": "projects/django-vaccinateca/locations/us-west2/jobs/vaccinatethestates-api-export-staging",
    "@type": "type.googleapis.com/google.cloud.scheduler.logging.AttemptFinished",
    "status": "UNKNOWN"
  },
  "httpRequest": {},
  "resource": {
    "type": "cloud_scheduler_job",
    "labels": {
      "job_id": "vaccinatethestates-api-export-staging",
      "project_id": "django-vaccinateca",
      "location": "us-west2"
    }
  },
  "timestamp": "2021-07-08T20:33:00.654933771Z",
  "severity": "ERROR",
  "logName": "projects/django-vaccinateca/logs/cloudscheduler.googleapis.com%2Fexecutions",
  "receiveTimestamp": "2021-07-08T20:33:00.654933771Z"
}

@simonw
Copy link
Collaborator Author

simonw commented Jul 8, 2021

I bet it's because I forgot to add hours_json to this .only() call - frustrating that our tests didn't catch that:

def location_json_queryset(queryset: QuerySet[Location]) -> QuerySet[Location]:
return (
queryset.select_related(
"state",
"county",
"location_type",
"provider__provider_type",
).prefetch_related("concordances")
).only(
"public_id",
"name",
"state__abbreviation",
"latitude",
"longitude",
"location_type__name",
"import_ref",
"phone_number",
"full_address",
"city",
"county__name",
"google_places_id",
"vaccinefinder_location_id",
"vaccinespotter_location_id",
"vaccines_offered",
"zip_code",
"hours",
"website",
"preferred_contact_method",
"provider__name",
"provider__vaccine_info_url",
"provider__provider_type__name",
"dn_latest_non_skip_report",
)

simonw added a commit that referenced this issue Jul 8, 2021
@simonw
Copy link
Collaborator Author

simonw commented Jul 8, 2021

The export worked! https://staging-api.vaccinatethestates.com/v0/locations.json now has the updated data - here's an extract: more locations.json | jq '.content[0]'

{
  "id": "rec00SICtL8KJiLim",
  "name": "RITE AID PHARMACY 06466",
  "provider": {
    "name": "Rite-Aid Pharmacy",
    "provider_type": "Pharmacy",
    "vaccine_info_url": "https://www.riteaid.com/Covid-19"
  },
  "state": "CA",
  "latitude": 32.71998,
  "longitude": -117.16902,
  "location_type": "Pharmacy",
  "phone_number": "619-231-7405",
  "full_address": "1411 KETTNER BOULEVARD, SAN DIEGO, CA 92101",
  "city": null,
  "county": "San Diego",
  "zip_code": null,
  "hours": {
    "unstructured": "Monday - Sunday: 8:00 am-8:00 pm",
    "structured": [
      {
        "day": "monday",
        "opens": "08:00",
        "closes": "20:00"
      },
      {
        "day": "tuesday",
        "opens": "08:00",
        "closes": "20:00"
      },
      {
        "day": "wednesday",
        "opens": "08:00",
        "closes": "20:00"
      },
      {
        "day": "thursday",
        "opens": "08:00",
        "closes": "20:00"
      },
      {
        "day": "friday",
        "opens": "08:00",
        "closes": "20:00"
      },
      {
        "day": "saturday",
        "opens": "10:00",
        "closes": "18:00"
      },
      {
        "day": "sunday",
        "opens": "10:00",
        "closes": "17:00"
      }
    ]
  },
  "website": null,
  "vaccines_offered": null,
  "concordances": [
    "google_places:ChIJt9skOKxU2YARERJMhNf4QfA",
    "vaccinefinder:f8bd637a-1a6e-4262-b3f0-7c7a6b9b887d",
    "vaccinespotter_org:7382057",
    "vaccinefinder_org:f8bd637a-1a6e-4262-b3f0-7c7a6b9b887d",
    "rite_aid:106466",
    "_tag_provider:rite_aid",
    "us_carbon_health:48063e1f-820f-4770-a9c5-3c4cb67077c7"
  ],
  "last_verified_by_vts": "2021-04-22T22:28:33.637831+00:00",
  "vts_url": "https://www.vaccinatethestates.com/?lng=-117.16902&lat=32.71998#rec00SICtL8KJiLim"
}

@simonw simonw closed this as completed Jul 8, 2021
@simonw
Copy link
Collaborator Author

simonw commented Jul 8, 2021

Deployed to production. Here are some locations on production that now have hours_json (thanks to the data migration): https://vial.calltheshots.us/dashboard/?sql=select+hours_json%2C+public_id%2C+name+from+location+where+hours_json+is+not+null+order+by+id+desc+limit+100%3A_x5vX6w2roOqNmg5mE_LSTVggRkIuJcByUPIHa2q8Nw

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
exporters Tools for exporting data
Projects
None yet
Development

No branches or pull requests

1 participant