Gazetteer model #3979

Open
wants to merge 273 commits into
base: development


kleintom
Contributor

I'm submitting a first draft of

  • a somewhat reworked GeographicItem model
  • a new Gazetteer model
  • a 'New Gazetteer' task

All of those are unfinished (and possibly somewhat broken), but I wanted to put something out to make sure I'm on the right track. All/any feedback welcome, but I'm not looking for a full review at this point - I consider this all to be in the early stages, so 'big' changes are fine with me.

Some notes/issues/questions I've encountered so far:

  • I doubt all the specs pass, though GeographicItem currently should. GeographicItem specs have been duplicated, but I haven't changed the second copy at all yet to cover the new geography column.
  • Some changes in GeographicItem:
    • shape_column_sql(shape) returns, e.g. the polygon column if that's present, or the geography column if that column holds a polygon. This will replace many of the existing CASE statements once everything switches over to a single geography column.
    • More st_* methods for working with st_* statements (feedback welcome), see also select_self.
  • In GeographicItem I changed all ST_Contains to ST_Covers. I couldn't tell why one was meant over the other in particular cases, and ST_Covers seems easier to reason about (to me). (The difference is ST_Contains doesn't contain things in its boundary, so e.g. a polygon wouldn't contain its vertices, etc.) That could be a breaking change, and would be easy to revert.
    • One question would be which should be used in new code (maybe it depends on situation).
    • If there's no built-in reason for one over the other then a couple of other spots in the code could be changed to ST_Covers instead of ST_Contains, if we wanted to settle on one over the other.
  • There are nearly 500 lines of code in GeographicItem that are either
    • not referenced at all
    • only used in specs to test themselves
    • only used in specs to test other things
      I put all of those in a separate GeographicItem::Deprecated module for now, but it's still included in GeographicItem (for now) and I don't have any real opinion on whether that's the way forward or not.
    • Personally my thought was to put the used-only-for-specs methods somewhere else where they would only be visible in specs, but I wasn't seeing any clean way to do that.
    • I am now assuming that PostGIS >= 3.0 so that we should be able to handle all GeometryCollections in ST_* functions.
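
To make the ST_Contains / ST_Covers boundary distinction above concrete, here is a minimal toy sketch in plain Ruby. It does not call PostGIS; `Rect`, `covers?` and `contains?` are hypothetical names that only mimic the two predicates for a point tested against an axis-aligned rectangle.

```ruby
# Toy model only: an axis-aligned rectangle and a point, no PostGIS involved.
Rect = Struct.new(:xmin, :ymin, :xmax, :ymax)

# Analogue of ST_Covers(rect, point): true for interior AND boundary points.
def covers?(rect, x, y)
  x.between?(rect.xmin, rect.xmax) && y.between?(rect.ymin, rect.ymax)
end

# Analogue of ST_Contains(rect, point): true for interior points only,
# because a geometry does not contain things lying in its boundary.
def contains?(rect, x, y)
  x > rect.xmin && x < rect.xmax && y > rect.ymin && y < rect.ymax
end

square = Rect.new(0, 0, 10, 10)

covers?(square, 5, 5)    # interior point: covered
contains?(square, 5, 5)  # interior point: contained
covers?(square, 0, 0)    # vertex: covered
contains?(square, 0, 0)  # vertex: NOT contained
```

The two predicates only disagree on boundary points, which is why switching from one to the other can change results for points that sit exactly on an edge or vertex.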

The 'New Gazetteer' task allows the user to add multiple shapes to a new gazetteer before saving (i.e. they can draw multiple shapes in leaflet, for example, for one gazetteer) - they all get bundled into a single collection behind the scenes. That differs from the georeference model I think, but fits more closely with GeographicArea shapes (I think) - this could also be a path to combining existing GeographicItems into a Gazetteer, depending on how we want to store that. Leaflet shapes and WKT shapes (and eventually other input methods) can be combined in one gazetteer - I don't keep track of which shapes came from which input types on the backend.

  • One question is how to deal with intersections of multiple shapes. I'm not sure if that may cause issues or not. One leaflet polygon drawn inside another combines to a donut (inside leaflet iirc), which is probably a useful thing, allowing users to create donuts, but maybe in other cases it's unexpected/more tricky/doesn't-combine-well-with-wkt, I'm not sure.
  • I've turned off circle creation in leaflet for now at least - we'd need to approximate by a polygon on the backend I think, which is a whole thing (is it already covered in a library somewhere?).
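
For reference, one common way to approximate a circle by a polygon is to walk around the center at evenly spaced bearings using the spherical destination-point formula. The sketch below is only an illustration of that idea, not what this PR does: `circle_to_ring`, `ring_to_wkt` and `EARTH_RADIUS_M` are hypothetical names, the Earth is treated as a sphere, and circles near a pole or across the anti-meridian are not handled.

```ruby
# Mean Earth radius in meters (spherical approximation, an assumption).
EARTH_RADIUS_M = 6_371_008.8

# Returns a closed ring of [lon, lat] pairs (degrees) approximating a circle
# of radius_m meters around (lon, lat). Vertices run counter-clockwise.
def circle_to_ring(lon, lat, radius_m, segments: 32)
  lat1 = lat * Math::PI / 180
  lon1 = lon * Math::PI / 180
  d = radius_m / EARTH_RADIUS_M # angular distance in radians

  ring = (0...segments).map do |i|
    bearing = -2 * Math::PI * i / segments # negative: counter-clockwise
    lat2 = Math.asin(Math.sin(lat1) * Math.cos(d) +
                     Math.cos(lat1) * Math.sin(d) * Math.cos(bearing))
    lon2 = lon1 + Math.atan2(Math.sin(bearing) * Math.sin(d) * Math.cos(lat1),
                             Math.cos(d) - Math.sin(lat1) * Math.sin(lat2))
    [lon2 * 180 / Math::PI, lat2 * 180 / Math::PI]
  end
  ring << ring.first # close the ring
end

# Formats a ring as a WKT polygon with a single exterior ring.
def ring_to_wkt(ring)
  coords = ring.map { |x, y| format('%.6f %.6f', x, y) }.join(', ')
  "POLYGON ((#{coords}))"
end

puts ring_to_wkt(circle_to_ring(0, 0, 111_195, segments: 8))
```

The number of segments trades accuracy for vertex count; the unhandled pole and anti-meridian cases are part of why this is "a whole thing".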

As @mjy suggested, I've gone with a write once read/replace pattern for the geographic item shape of a gazetteer.

I've yet to take any position on meaning of Gazetteer parent in the code.

Most of the crud create/list/show stuff should be mostly working, but it's not all finished yet.

kleintom added 30 commits June 25, 2024 12:06
…eographicItem

The distinction being that SHAPE_TYPES includes point, polygon, etc. but NOT `geography` (which is a column name in DATA_TYPES).

The point of doing that is that if you want to perform some geometric comparison on a given x = geographic_item, say you want to find all GeographicItems of shape polygon that intersect x, then you need to check all GeographicItems whose polygon column intersects x and all GeographicItems whose geography column contains a polygon that intersects x.

The decision here was to push that reality as far back as possible, by pretending we're unaware of the distinction between polygon and geography-with-polygon-shape as long as possible.
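
A rough sketch of the kind of SQL fragment this implies is below. It is an assumption-laden illustration, not the code in this PR: `shape_column_sql_sketch`, the `SHAPES` list and the `geographic_items` table name are placeholders, and it assumes the shape-specific columns and the `geography` column share one SQL type so the CASE branches are compatible.

```ruby
# Hypothetical list of shape names; the real SHAPE_TYPES may differ.
SHAPES = %i[point line_string polygon multi_point multi_line_string
            multi_polygon geometry_collection].freeze

TABLE = 'geographic_items' # assumed table name

# Builds a CASE expression that yields the shape-specific column when it is
# set, or the geography column when that column holds the same kind of shape.
def shape_column_sql_sketch(shape)
  raise ArgumentError, "unknown shape: #{shape}" unless SHAPES.include?(shape)

  column = "#{TABLE}.#{shape}"
  geography = "#{TABLE}.geography"
  # PostGIS GeometryType() returns e.g. 'POLYGON' or 'MULTILINESTRING'.
  wkt_type = shape.to_s.upcase.delete('_')

  <<~SQL.strip
    CASE
      WHEN #{column} IS NOT NULL THEN #{column}
      WHEN GeometryType(#{geography}::geometry) = '#{wkt_type}' THEN #{geography}
    END
  SQL
end

puts shape_column_sql_sketch(:polygon)
```

Callers can then compare against the returned expression without caring which physical column the shape lives in, which is the "pretend we're unaware" idea described above.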
…_id with are_contained_in_item

Is there any reason not to do this? It avoids an extra GeographicItem load, and are_contained_in_item was just a thin wrapper around are_contained_in_item_by_id.

I changed all of the are_contained_in_item_by_id specs over to are_contained_in_item - some of the old are_contained_in_item specs were marked as deprecated but those all existed in the form of are_contained_in_item_by_id specs (which again is the same as are_contained_in_item), so I kept them all.
…phicItem

As far as I can tell these seem to be currently unused, though I haven't marked them deprecated in the other shapes or among the private methods of GeographicItem.
…ir own module

There are 350 lines worth. Some are used only in specs, mostly testing only themselves - feels like there should be a particular place for those, but I'm not finding it.
… to not use GEOMETRY_SQL

Don't want to be using GEOMETRY_SQL outside of GeographicItem.

I may well be missing something, but in regards to the comment about not wanting to load the entire GeographicItem, my thinking is that the size of the shape dwarfs the rest of a geoitem in general, and the rewrite here fetches that shape as WKB instead of as GeoJSON (which I think would be larger?).

Also I copied the snippet here from the same function in Georeference :)
…methods

I think keeping the naming (st_covers) for the new methods would have been more confusing since there's now only one parameter; hopefully the new names `within(shape_sql)` and `covering(shape_sql)` are easily readable/meaning-guessable.

The lack of `covering_union_of_sql` is due to the 'covering' case requiring that the input shape(s) are not included in the result (only an issue when there's only one shape/all shapes are the same). In the 'within' case input shapes are all included in the result. I'm not sure why the difference (since ST_Covers(A, B) iff ST_CoveredBy(B, A)).
D'oh, I factored out st_distance_item_to_shape and then realized where I was using it was only used in specs.

My feeling has been that it's better to select the geography column of a fixed geo_item from the database (once, as a subquery) rather than sending its geo_object, which could be "very" large, over the network and then having the database read it into memory. Thoughts?
…rnals to GeographicItem

Note the external use was incorrect for the new geography column.
…aphicItem

There are two left that I'm not going to move
…vering/coveredby usage in GeographicItem

Use covering instead of contains since "Generally this function [ST_Covers] should be used instead of ST_Contains, since it has a simpler definition which does not have the quirk that 'geometries do not contain their boundary'." In other words, vertices/points in the boundary of a polygon are not contained in the polygon (but are covered by the polygon), and likewise a line does not contain its endpoints (but does cover them).

Previously there was a mix of ST_Covers and ST_Contains in GeographicItem.rb, so I'm not sure if the behavior of ST_Contains was specifically desired? If so, why and in what cases? Which one should I use in new code here?

THIS COULD BE A SUBTLE BREAKING CHANGE.

One of the specs added one additional intersecting point because it was on the edge of one of the spec's context polygons - which wouldn't have counted before the change! The point was introduced indirectly in a parent context because it's the geoitem of a georeference included there - that's separate from the variables introduced and used in the spec's immediate context, so that seems a little sketchy to me - nonetheless it does end up testing the edge case!

If ST_Covers is intended in general then there are other places where it could be changed.

This also changes a couple method names that I think were backwards relative to the ST_ naming convention, namely ST_X(A, B) means A X B, not B X A.
… draw

Also disallow drawing a circle; at a glance it looks like Leaflet returns a point and a radius, which we could convert into a polygon, but then there are issues with that - maybe revisit later.
Change my mind (back to my original thinking) - you may want multiple shapes for things like a chain of islands or lakes, collecting sites, etc.
I think it's used in three places now, maybe it's time to move it to the general components folder somewhere?
…select_one

Simpler, cleaner, more explicit calls
…sing shape

Not knowing what the caller might do with the circle (or if they'll be aware of the issue), I feel like this is prudent (though in the only current case where circle is used it means make_valid_non_anti_meridian_crossing_shape will be called twice).
…crosses_anti_meridian?

The xspecify for the mysterious (to me) intermittent failures of crosses_anti_meridian? when one of the vertices has longitude exactly 0 is disturbing, and I don't know exactly when the failure happens.

If it turns out to cause problems, one option would be to implement crosses_anti_meridian? with the following pseudo-code (written here for polygons):

def crosses_anti_meridian?(poly: p)
  prev = nil
  for each |vertex of exterior ring of p: v|
    if prev and line(prev, v) intersects anti-meridian
      # work around a false positive when an endpoint has longitude exactly 0:
      unless v.lon == 0 || prev.lon == 0
        return true
      end
    end
    # always advance to the current vertex, even when the segment was skipped
    prev = v
  end
  return false
end
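
A runnable version of that pseudo-code might look like the sketch below. The names `Vertex` and `ring_crosses_anti_meridian?` are hypothetical, and the actual "does this segment intersect the anti-meridian" check is passed in as a callable, since in practice it would be the PostGIS query. The `naive_crossing` lambda is only a stand-in for demonstration (a longitude jump of more than 180 degrees), not the database behavior.

```ruby
Vertex = Struct.new(:lon, :lat)

# ring: array of Vertex for the exterior ring, in order.
# segment_crosses: callable taking (a, b) and returning true when the segment
# a-b is reported as intersecting the anti-meridian (e.g. by the database).
def ring_crosses_anti_meridian?(ring, segment_crosses)
  ring.each_cons(2).any? do |prev, v|
    # Work around the suspected false positive: ignore segments that have an
    # endpoint at longitude exactly 0.
    next false if prev.lon == 0 || v.lon == 0

    segment_crosses.call(prev, v)
  end
end

# Stand-in predicate for demonstration only (an assumption, not PostGIS).
naive_crossing = ->(a, b) { (a.lon - b.lon).abs > 180 }

ring = [Vertex.new(170, 0), Vertex.new(-170, 0), Vertex.new(-170, 10),
        Vertex.new(170, 10), Vertex.new(170, 0)]
puts ring_crosses_anti_meridian?(ring, naive_crossing)
```

Note that the workaround would also hide a genuine crossing on a segment with an endpoint at longitude 0, so it trades one rare error for another.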
@kleintom
Contributor Author

kleintom commented Nov 18, 2024

I'll have more to say, probably tomorrow, on the last new commit, but I wanted to get the first new commit out today since it fixes my previous attempt to restore GI type, which I think actually causes an exception on any read of an existing GA that references a GI with a non-null type (specs all passed since no spec references such a GA (apparently)).

@kleintom
Contributor Author

I finally got some notes written up (and figured out, I think) for the Ecoregions2017 shapefile - I put them here for now: https://gist.github.com/kleintom/c4c851de52380b0d08b6950e198aa8ba
People just wanting a version of that shapefile that imports cleanly into TaxonWorks can just check the first couple paragraphs and skip everything else, which is more technical regarding the issues I encountered when importing the original shapefile (not all having to do with problems with the shapefile itself). There's a summary at the end so the longer account can be skipped. If there are any issues I missed please let me know.

I don't see a way around the first two RGeo-related issues (given our factory). That said, I think those issues are both going to be quite rare, involving shapes crossing +/- 85.051127 latitude (two shapes involving Antarctica in the Ecoregions case) and lines intended to take the long way around the globe.

The third issue regarding what happens when imported shapes are made_valid has more wiggle room I think, maybe more reporting/warnings would be helpful there.

Lastly I'll point out the following postgis oddity (x)spec'ed in my last commit above:

select ST_Intersects(ST_GeographyFromText('LINESTRING (91 10, 0 0)'), ST_GeographyFromText('LINESTRING (180 89.0, 180 -89.0)'));

select ST_Intersects(ST_GeographyFromText('LINESTRING (91 0, 0 10)'), ST_GeographyFromText('LINESTRING (180 89.0, 180 -89.0)'));

The first returns false, as expected (for me at least): the line LINESTRING (91 10, 0 0) does not cross the anti-meridian. The second returns true, i.e. PostGIS reports that the line LINESTRING (91 0, 0 10) does cross the anti-meridian, which I did not expect. The issue only occurs when one of the vertices is at exactly longitude 0, and the other vertex seems to have to be more than 90 degrees of longitude away from 0; other than that I'm not seeing the pattern (feel free to shout if you know what's going on :-). Again, I think this is unlikely to be run into, but if it does become an issue I think the workaround outlined in the last commit message might work and not be too bad (or maybe it can get fixed in PostGIS if it is a bug).

I'm ready for more shapefile issues if anybody runs into some :-)

@mjy
Member

mjy commented Nov 19, 2024

@kleintom Thanks very much. I just got my hands on some shapefiles and hope to hit the interfaces a little today. We will also collectively explore the UI during tomorrow's help sessions, so a first wave of feedback should be coming shortly.

@LocoDelAssembly
Contributor

@kleintom can you please add import_gazetteers queue in exe/delayed_job and any other new queue this PR is adding?

No longer autoloaded with 7.2 (I guess, I'm not sure I understand how it was being found in filter.rb prior to 7.2).
…zetteer

I'm still getting the hang of where stuff goes ... In this case it seems to me that there's a lot of processing specific to rails and GZs there, the only rgeo_shapefile relation is to call shapefile.open. (I think I placed it originally to take up less space in Gazetteer, now I've changed my mind.)
…quire it to be WGS84

I don't actually know how good this is at determining unknown EPSG, there's at least room for improvement.
… or transform

Might still be nice to know what we think the source epsg is though.
…equired

If this isn't helpful enough, the next step I'm thinking would be some kind of red/green/yellow light scheme for each extension that updates as documents are added using dropzone (and not from another browser tab).

The best solution in my opinion is to get people to upload all of their shapefile files at once using shift/ctrl+click in their file picker - the dropzone issues with doing multiple at once seem to have been fixed upstream since development on gz began.
…ot been added to New

Only checks extensions, not basenames
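
As an illustration of an extension-only check like this, the sketch below reports which required shapefile parts are missing from a list of file names. `REQUIRED_EXTENSIONS` and `missing_shapefile_extensions` are hypothetical names; the list assumes `.shp`, `.shx` and `.dbf` (mandatory in the shapefile format) plus `.prj` (needed here for reprojection to WGS84).

```ruby
# Assumed set of required parts: the three mandatory shapefile files plus .prj.
REQUIRED_EXTENSIONS = %w[.shp .shx .dbf .prj].freeze

# Returns the required extensions not present among the given file names.
# Like the check described above, this ignores basenames entirely.
def missing_shapefile_extensions(filenames)
  present = filenames.map { |name| File.extname(name).downcase }
  REQUIRED_EXTENSIONS - present
end

p missing_shapefile_extensions(['ecoregions.shp', 'ecoregions.DBF'])
```

Because basenames are ignored, `a.shp` together with `b.shx` would pass this check even though the files don't belong to the same shapefile.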
@kleintom
Contributor Author

Can we scope choosing Gaz shapes in union to some set of projects (maybe, bonus)

Yes, this is doable; we've also mentioned adding the ability to clone from one project to another. Thoughts on which/either/both of those to add?

IIF there are two shapes can we have the option to make the operation INTERSECT

You can now choose Union or Intersection as the method to combine; both operations work on any number of inputs.
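
One way to combine any number of shapes with a two-argument SQL function is to fold over the inputs, nesting one call per additional shape. The sketch below shows only that folding idea and is not necessarily how this PR does it; `combine_shapes_sql` is a hypothetical helper, and the inputs are assumed to be trusted SQL fragments that each evaluate to a geometry.

```ruby
# Maps the user's choice to the two-argument PostGIS function to nest.
COMBINE_FUNCTIONS = { union: 'ST_Union', intersection: 'ST_Intersection' }.freeze

# shape_sqls: array of SQL fragments, each evaluating to one geometry.
# operation: :union or :intersection.
def combine_shapes_sql(shape_sqls, operation)
  raise ArgumentError, 'need at least one shape' if shape_sqls.empty?

  fn = COMBINE_FUNCTIONS.fetch(operation) # KeyError on an unknown operation
  # reduce without an initial value starts from the first fragment, so a
  # single input is returned unchanged.
  shape_sqls.reduce { |acc, shape| "#{fn}(#{acc}, #{shape})" }
end

puts combine_shapes_sql(%w[a b c], :intersection)
```

For unions, PostGIS also offers array and aggregate forms of ST_Union, which may be preferable to deep nesting when there are many inputs.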

Bug- gazetteer_id is likely not permitted in filter params

Yes, filter otus had a bug (the others should have been working), thanks!

Upload shapes
UI - clarify which files might be missing in drag/drop options

Added a visual indicator to the New tab of the document selector to help with this, see what you think.

Keep warning messages live rather than popup?

I reworded a couple messages; if there are still issues here let me know specifics and I can make a change.

Clarify projection translation options if any

You should now be able to import any shapefile that has a .prj file (... and the prj wkt is valid and proj4 knows how to convert that projection to WGS84), for both projected and geographic CRSs. More testing here would be appreciated.

Resetting new needs tweaks

I cleared radio selections from the document selector on submission, if there's anything else I missed feel free to let me know.

Clear failed option

Field names that are the wrong type are now cleared on failed submission, if there's anything else I missed feel free to let me know.

Thanks for the feedback!

There were two issues here:
* the autocomplete call was populating project_id from params, which doesn't have a project_id
* the query itself wasn't scoping to project_id
5 participants