feat: added country bounding #34
base: main
Conversation
This largely accomplishes two things:

* Addresses issue hotosm#33 (country borders not being respected)
* Makes changing data sources more modular

The dataset used for country boundaries comes from [GADM](https://gadm.org/). Unfortunately, while they do have a global ADM_0 set, it's only available as a layer in a single 2.6 GB GeoPackage, and thus also requires initial extraction/conversion with ogr2ogr to be useful here. The geonames dataset is replaced with the finer-resolution version, as the original one was missing several cities in testing.

The new query for finding a match uses the PostGIS `&&` operator on the geometries, which returns true whenever their bounding boxes intersect. This method is an order of magnitude faster (~2 ms vs. ~42 ms in my tests) than Voronoi plus ST_Contains and, as far as I can tell, does not suffer from errors (see the included additional tests), but it does rely heavily on the source dataset being accurate. Interestingly, using the `geom` columns on their own as a comparison with ST_Contains _does_ produce erroneous results. A rough sketch of the lookup is below.
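As a minimal sketch, assuming a `country` table with `geom` and `alpha3` columns (these names are illustrative, not the actual schema):

```python
import psycopg

def country_for_point(conn: psycopg.Connection, lon: float, lat: float) -> str | None:
    """Return the alpha-3 code of the country whose boundary matches the point."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT alpha3
            FROM country
            -- && compares bounding boxes only, which is what makes it fast,
            -- but it relies on the GADM polygons being accurate.
            WHERE geom && ST_SetSRID(ST_MakePoint(%s, %s), 4326)
            LIMIT 1
            """,
            (lon, lat),
        )
        row = cur.fetchone()
        return row[0] if row else None
```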
This improves the caching behavior: an mtime check prompts the user to consider re-downloading file[s] when updates may have occurred, and there is a more robust copying mechanism between temporary directories and the cache directory (sketched below).
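As a rough sketch of both pieces (the 30-day threshold and directory layout here are assumptions for illustration):

```python
import shutil
import time
from pathlib import Path

STALE_AFTER_SECONDS = 30 * 24 * 3600  # assumed threshold for this sketch

def cached_file_is_stale(path: Path) -> bool:
    """True if the cached file is old enough to suggest re-downloading."""
    return (time.time() - path.stat().st_mtime) > STALE_AFTER_SECONDS

def copy_into_cache(tmp_file: Path, cache_dir: Path) -> Path:
    """Copy via a temporary name, then rename, so readers never see a partial file."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    partial = cache_dir / (tmp_file.name + ".partial")
    shutil.copy2(tmp_file, partial)  # copy2 preserves the source mtime
    final = cache_dir / tmp_file.name
    partial.replace(final)  # atomic rename on the same filesystem
    return final
```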
This adds the ability to perform post-import cleanup actions on data sources with known issues. The one I found was a misspelling of Simpson Bay Village in Sint Maarten. It uses psycopg's SQL formatting to safely create the necessary DML (currently supported: DELETE and UPDATE), with safety checks such as rolling back if more rows are affected than expected. A sketch of the guard follows.
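A minimal sketch of the rollback guard, assuming psycopg 3 (the function and the example values are illustrative, not the PR's actual code):

```python
from psycopg import sql

def apply_update_fix(conn, table, column, old_value, new_value, expected_rows):
    """Apply an UPDATE, rolling back if it touches more rows than expected."""
    with conn.transaction():  # raising inside the block rolls everything back
        with conn.cursor() as cur:
            cur.execute(
                sql.SQL("UPDATE {} SET {} = %s WHERE {} = %s").format(
                    sql.Identifier(table),
                    sql.Identifier(column),
                    sql.Identifier(column),
                ),
                (new_value, old_value),
            )
            if cur.rowcount != expected_rows:
                raise RuntimeError(
                    f"expected {expected_rows} row(s), matched {cur.rowcount}"
                )

# e.g. apply_update_fix(conn, "geonames", "name",
#                       "<misspelled name>", "Simpson Bay Village", 1)
```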
| path="export/dump/cities500.zip", | ||
| zip_name="cities500.zip", | ||
| ) | ||
| self.geonames_old: URLConfig = URLConfig( |
This and the others are being temporarily maintained while the desired data sources are decided in #33.
| zip_name="countries10m.zip", | ||
| headers=[ | ||
| Header(key="Referer", value="https://www.naturalearthdata.com/"), | ||
| Header(key="User-Agent", value="curl/8.7.1"), |
Oddly, they block Python's User-Agent but allow curl's.
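As a small illustration of the workaround, assuming the download goes through `requests` (the path below is a placeholder, not the real archive URL):

```python
import requests

resp = requests.get(
    "https://www.naturalearthdata.com/" + "placeholder/archive.zip",
    headers={
        "Referer": "https://www.naturalearthdata.com/",
        "User-Agent": "curl/8.7.1",  # curl's UA passes where Python's is blocked
    },
    timeout=60,
)
resp.raise_for_status()
```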
```python
    def _create_spatial_index(self, conn):
        """Create spatial index for efficient processing."""
        self.logger.info("Creating spatial index on geometry")

    def _cleanup_geonames_db(self, conn):
```
Ideally the fixes would also be added to the upstream data source, of course.
```python
        self.logger.info(f" - {cities_file}")
        self.logger.info(f" - {voronoi_file}")

    def _vacuum_full_and_analyze_db(self, conn):
```
This was added because repeated runs were causing bloat in the DB, as Postgres' default autovacuum won't kick in until far more dead tuples exist than we're likely to add.

WARNING: VACUUM FULL locks the table while running (on my Mac, it takes ~10 seconds to complete for both country and geocoding), but I assume that if you're actively importing data, there won't be traffic hitting it? If there is, this can be removed and another method found, or it can be scheduled to run during off-peak times. A sketch of the approach is below.
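A minimal sketch of what such a method could look like with psycopg 3 (function and table names are illustrative; note that VACUUM cannot run inside a transaction block, hence the autocommit switch):

```python
import psycopg
from psycopg import sql

def vacuum_full_and_analyze(conn: psycopg.Connection, tables: list[str]) -> None:
    """Reclaim dead-tuple bloat and refresh planner statistics."""
    conn.autocommit = True  # VACUUM is not allowed inside a transaction block
    with conn.cursor() as cur:
        for table in tables:
            # VACUUM FULL rewrites the table and holds an exclusive lock;
            # ANALYZE updates statistics once the rewrite finishes.
            cur.execute(
                sql.SQL("VACUUM FULL ANALYZE {}").format(sql.Identifier(table))
            )

# e.g. vacuum_full_and_analyze(conn, ["country", "geocoding"])
```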
```python
GeoTestCase(-6.3390, 54.1751, "Newry", "GB"),
GeoTestCase(55.478017, -21.297475, "Saint-Pierre", "RE"),
GeoTestCase(-6.271183, 55.687669, "Bowmore", "GB"),
GeoTestCase(88.136284, 26.934422, "Mirik", "IN"),
```
I added this one since it was the original cause of the issue being raised. Feel free to add any others you'd like to check!
Wow, this is huge & looks like seriously great work! I'll check it out in detail as soon as possible 😄

Also, I tried to make atomic commits with logical splits, if that makes it easier.

Haha, huge in a good way, don't worry, the code is nicely modular! 🙌 Sorry for the delay reviewing this - hoping to find time soon!
```python
    domain: the domain portion of the URL (i.e. after http[s]:// and before the first /)
    path: the path portion of the URL (i.e. everything after the domain)
    alpha3_column: the name of the column in the file with an ISO 3166-1 alpha-3 code
```
This is quite specific, unlike the rest. Perhaps it should be refactored into a base class with an inherited class, or handled some other way; a rough sketch of that idea follows.
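As a rough sketch of that refactor (all names and defaults here are illustrative, not the PR's actual code):

```python
from dataclasses import dataclass, field

@dataclass
class Header:
    key: str
    value: str

@dataclass
class URLConfig:
    """Generic download source: just a URL plus optional extras."""
    domain: str
    path: str
    zip_name: str | None = None
    headers: list[Header] = field(default_factory=list)

@dataclass
class CountryURLConfig(URLConfig):
    """Only the country-boundary source needs the alpha-3 column name."""
    alpha3_column: str = "GID_0"  # assumed column name, for illustration
```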
You could be right that perhaps we no longer need voronois! Let's test this other approach and see 😄
This fixes #33.

It does so by adding a `geom` column to `country` which contains that country's boundaries. Unfortunately, the only accurate dataset I could find is GADM, which, as a comment mentions, doesn't have ADM_0 available on its own. I downloaded the global dataset, converted the GeoPackage to a Shapefile (roughly as sketched below), and then imported ADM_0 as had been done with other datasets previously.

The geonames are also replaced with their higher-resolution version.
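For reference, the conversion can be driven from Python along these lines (the layer and file names are assumptions, not the exact ones used):

```python
import subprocess

# Extract only the country-level layer from the ~2.6 GB GeoPackage
# and write it out as a Shapefile that the importer can consume.
subprocess.run(
    [
        "ogr2ogr",
        "-f", "ESRI Shapefile",  # output driver
        "adm0.shp",              # destination dataset
        "gadm_global.gpkg",      # source GeoPackage (assumed file name)
        "ADM_0",                 # layer to extract (assumed layer name)
    ],
    check=True,
)
```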
Additionally, as much as possible has been made modular.