Skip to content

Conversation

@stephanGarland
Copy link
Contributor

@stephanGarland stephanGarland commented Jul 4, 2025

This fixes #33.

It does so by adding a geom column to country which contains that country's boundaries. Unfortunately, the only accurate dataset I could find is GADM, which as a comment mentions, doesn't have ADM_0 available on its own. I downloaded the global dataset, converted the GeoPackage to Shapefile, and then imported ADM_0 as had been done with other datasets previously.

The geonames are also replaced with their higher-resolution version.

Additionally, as much as possible has been made modular.

This largely accomplishes two things:
  * Addresses Issue hotosm#33 (country borders not being respected)
  * Makes changing data sources more modular

The dataset being used for country boundaries comes from
[GADM](https://gadm.org/). Unfortunately, while they do have a global
ADM_0 set, it's only available as a layer in a single GeoPackage which
is 2.6 GB, and thus also requires initial extraction / conversion with
ogr2ogr to be useful here.

The geonames dataset is replaced with the finer-resolution version, as
the original one was missing several cities in testing.

The new query for finding a match uses the `&&` PostGIS operator for the
geometries, which returns True for any intersection. This method is an
order of magnitude faster (~2 msec vs. ~42 msec in my tests) than
Voronoi and ST_CONTAINS, and does not seem to suffer from errors that I
can tell (see included additional tests), but it does heavily rely on
the source dataset being accurate. Interestingly, using the `geom`
columns on their own as a comparison with ST_CONTAINS _does_ result in
erroneous results.
This improves the caching behavior, with an mtime check to prompt the
user to consider re-downloading file[s] as updates may have occurred,
and also has a more robust copying mechanism between temporary
directories and the cache directory.
This adds the ability to perform post-import cleanup actions on
datasources where known issues exist. The one I found was a mis-spelling
of Simpson Bay Village in Sint Marteen. It uses psycopg's SQL formatting
to safely create the necessary DML (currently supported: DELETE and
UPDATE), with safety checks such as rolling back if more rows are
affected than were expected.
path="export/dump/cities500.zip",
zip_name="cities500.zip",
)
self.geonames_old: URLConfig = URLConfig(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the others are being temporarily maintained while the desired data sources are decided in #33 .

zip_name="countries10m.zip",
headers=[
Header(key="Referer", value="https://www.naturalearthdata.com/"),
Header(key="User-Agent", value="curl/8.7.1"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oddly they block Python's UA, but allow curl's.

def _create_spatial_index(self, conn):
"""Create spatial index for efficient processing."""
self.logger.info("Creating spatial index on geometry")
def _cleanup_geonames_db(self, conn):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally the fixes would also be added to the upstream data source, of course.

self.logger.info(f" - {cities_file}")
self.logger.info(f" - {voronoi_file}")

def _vacuum_full_and_analyze_db(self, conn):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added due to repeated runs causing bloat in the DB, as Postgres' default behavior for autovacuum won't kick in until far more dead tuples exist than we're likely to add.

WARNING: VACUUM FULL locks the table while running (on my Mac, it takes ~10 seconds to complete both country and geocoding), but I assume if you're actively importing data, there won't be traffic hitting it? If there is, this can be removed, and another method found; or this can be scheduled to run during off-peak times.

GeoTestCase(-6.3390, 54.1751, "Newry", "GB"),
GeoTestCase(55.478017, -21.297475, "Saint-Pierre", "RE"),
GeoTestCase(-6.271183, 55.687669, "Bowmore", "GB"),
GeoTestCase(88.136284, 26.934422, "Mirik", "IN"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this one since it was the original cause raising the issue. Feel free to add any others you'd like to check!

@spwoodcock
Copy link
Member

Wow, this is huge & looks like seriously great work!

I'll check it out in detail as soon as possible 😄
Thank you!

@stephanGarland
Copy link
Contributor Author

Wow, this is huge

Yeah… sorry. FWIW about 1/4 is tests, and another 1/4 are classes.

Also I tried to make atomic commits with logical splits, if that makes it easier.

@spwoodcock
Copy link
Member

Haha huge in a good way, dont worry the code is nicely modular! 🙌

Sorry for the delay reviewing this - hoping to find time soon!

domain: the domain portion of the URL (i.e. after http[s]:// and before the first /)
path: the path portion of the URL (i.e. everything after the domain)
alpha3_column: the name of the column in the file with an ISO 3166-1 alpha-3 code
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite specific, unlike the rest. Perhaps this should be refactored to either a base class / inherited class, or some other means.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could be right that perhaps we no longer need voronois! Let's test this other approach and see 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Country borders should be respected over geographically nearest city

2 participants