Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CoverageUnion - improve performance for huge amount of geometries #1100

Open
jandam opened this issue Dec 11, 2024 · 5 comments
Open

CoverageUnion - improve performance for huge amount of geometries #1100

jandam opened this issue Dec 11, 2024 · 5 comments

Comments

@jandam
Copy link

jandam commented Dec 11, 2024

Input: 1M of rectangular polygons

org.locationtech.jts.coverage.CoverageUnion is slow because it internally calls GeometryCollection::getDimension multiple times. GeometryCollection::getDimension process all internal polygons O(n).

Simple test.

  • Not suggested fix.
  • Performance improvement using org.locationtech.jts.operation.overlayng.CoverageUnion.
  • !!! Package is "overlayng" !!!

current code:
CoverageUnion.union(factory.createGeometryCollection(coverage))

  • union from GeometryCollection - O(n) for getDimension

improved performance:
CoverageUnion.union(factory.buildGeometry(coverage))

  • union from MultiPolygon - O(1) for getDimension

Suggested fix:

  • cache dimensions in org.locationtech.jts.operation.overlayng.InputGeometry
  • replace all access to Geometry::getDimension with method call
class InputGeometry {
 
  //private static final PointLocator ptLocator = new PointLocator();

  private Geometry[] geom = new Geometry[2];  // redundant array allocation
  private PointOnGeometryLocator ptLocatorA;
  private PointOnGeometryLocator ptLocatorB;
  private boolean[] isCollapsed = new boolean[2];

  private int[] dimension = new int[2]; // new code

  public InputGeometry(Geometry geomA, Geometry geomB) {
    geom = new Geometry[] { geomA, geomB };
    dimension[0] = geomA == null ? -1  : geomA.getDimension();  // new code
    dimension[1] = geomB == null ? -1  : geomB.getDimension(); 
  }
...
    public int getDimension(int index) {
        return this.dimension[index];
    }

    public boolean isArea(int geomIndex) {
        return this.getDimension(geomIndex) == 2;
    }

    public boolean isAllPoints() {
        return this.getDimension(0) == 0 && this.getDimension(1) == 0; // redundant: this.geom[1] != null 
    }

    public boolean hasEdges(int geomIndex) {
        return this.getDimension(geomIndex) > 0;
    }
...
}

Thank you

@dr-jts
Copy link
Contributor

dr-jts commented Dec 11, 2024

Thanks for pointing this out. Even the little things can hurt performance, and it's sometimes surprising where they lurk. This was an issue in RelateNG as well (but worse there, since hasDimension was the problem, which is harder to cache.

I agree that caching the dimension in OverlayNG.InputGeometry is a good idea. It's also possible to add a short-circuit in GeometryCollection.getDimension when a dim 2 geometry has been found. I'll add both of these optimizations.

@dbaston
Copy link
Contributor

dbaston commented Dec 11, 2024

Would it make sense to cache all of has0, has1, has2, hasZ, hasM in GeometryCollection, rather than dealing with them where they are used? They could be populated eagerly on construction or lazily on access.

@dr-jts
Copy link
Contributor

dr-jts commented Dec 11, 2024

Would it make sense to cache all of has0, has1, has2, hasZ, hasM in GeometryCollection, rather than dealing with them where they are used? They could be populated eagerly on construction or lazily on access.

That occurred to me too. Not sure about this - seems like a lot of extra machinery. But then again, it's being built anyway in client classes.

An intermediate option is to create a helper class which caches Geometry dimension information. Then that can be used only where it is critical (so far, this has only been an issue in OverlayNG and RelateNG).

@dbaston
Copy link
Contributor

dbaston commented Dec 11, 2024

It doesn't seem too painful...see dbaston/libgeos@8a447c1

@dr-jts
Copy link
Contributor

dr-jts commented Dec 11, 2024

There's storage overhead, though. And the storage cost is incurred by every collection object, even homogeneous collections which don't need the cache.

On balance I prefer a separate cache, which is only needed by the algorithms where the overhead is significant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants