XCOORD PATTERNS

  • Author: Marc C. Ubaldino, MITRE Corporation
  • Date: 2014-June; updated 2017-August
  • Copyright MITRE Corporation, 2012-2017

XCoord is a geographic coordinate extractor. It finds the most common coordinate patterns in free text. If you want to geocode documents, chat messages, bulletins, etc. that contain degrees/minutes/seconds, decimal degrees, or military grid references (MGRS), you will want to use something like XCoord.

Try the XCoord demo in the Examples source folder or in the distribution, as below.
XCoord is fully integrated with the Xponents SDK and REST API by default.


    Input files that contain various coordinate patterns to see how the extraction behaves.
    
    ./script/demo.sh xcoord -h    

        TestXCoord  -f       -- system tests
        TestXCoord  -i TEXT  -- user test 
        TestXCoord  -t FILE  -- user test with file; one test per line
        TestXCoord  -t FILE  -- user test with file
        TestXCoord  -a       -- adhoc tests, e.g., recompiling code and testing
        TestXCoord  -h       -- help. 

Coordinate Rule Library

The main program and class is XCoord; the accompanying patterns and rules file is geocoord_patterns.cfg.

For reference, review the XCoord DEFINES as you review RULES. There are subtle variations in field definitions.

For brevity, only true-positive tests are included; "FAIL" tests (true negatives) are omitted.
One test case per RULE is provided to illustrate each pattern. The patterns are derived from federal research projects performed by the MITRE Corporation.

These five families of patterns are supported: MGRS, UTM, DMS (degree-minute-second), DM (degree-minute), and DD (decimal degree).

Conventions in pattern IDs: each pattern is enumerated within its family. Additional nomenclature includes:

  • a = trailing hemisphere
  • b = prefix hemisphere
  • v = variable field length
  • dot = use of period separator
  • fs = fractional second variant
  • deg = has explicit use of degree symbol, and others

Table 1. Sample Listing of XCoord Patterns and Example Targets for Extraction

| Family | Pattern ID | Example | Notes |
|--------|------------|---------|-------|
| MGRS pattern | | | |
| MGRS | MGRS-01 | 38SMB4611036560 | |
| UTM pattern | | | |
| UTM | UTM-01 | 17N 699990 3333335 | Zone/latitude band + easting + northing; optionally with units "m" for meters and/or an N/E marker |
| Degree-Minute-Second patterns | | | |
| DMS | DMS-01fs-a, DMS-01fs-b | 01°44'55.5"N 101°22'33.0"E | Fractional-second resolution, with hash marks, with hemisphere |
| | | N01°44'55.5" E101°22'33.0" | |
| DMS | DMS-01fs-deg | 01°44'55.5" 101°22'33.0" | Fractional-second resolution, with hash marks, NO hemisphere |
| DMS | DMS-01dot-a, DMS-01dot-b | 01.44.55N 055.44.33E | Explicit dot separator |
| | | N01.44.55 E055.44.33 | |
| DMS | DMS-02 | N42 18' 00" W102 24' 00" | Variable-length fields with separators and hemisphere |
| DMS | DMS-01a, DMS-02a | 421800N 1022400W | No field separators, D/M/S |
| | | N421800 W1022400 | |
| DMS | DMS-03a, DMS-03b | 4218001234N 10224001234W | No field separators; D/M/S.ss assumed |
| | | N4218001234 W10224001234 | |
| Degree-Minute patterns | | | |
| DM | DM-00 | 4218N-009 10224W-003 | Obscure fractional-minute notation |
| DM | DM-01a, DM-01a-dash, DM-01a-dot; DM-01b, DM-01b-dash, DM-01b-dot | 42 18-009N 102 24-003W | Ambiguous fractional-minute separator is handled with distinct patterns |
| | | 42-18-009N; 102-24-003W | |
| | | 42.18.009N 102.24.003W | |
| | | N4218.009W10224.003 | |
| | | N42 18-005 x W102 24-008 | |
| | | N42.18.005 x W102.24.008 | |
| DM | DM-02a, DM-02b, DM-02b-dash | 4218.009N 10224.003W | 02a/b allows for fixed-width D/M without separators |
| | | N4218.0 W10224.0 | |
| | | N4218-0018 W10224-0444 | |
| DM | DM-03a, DM-03b | 4218009N10224003W | Fixed-width pattern for D/M.mmm |
| | | N4218009W10224003 | |
| DM | DM-03-av, DM-03-av-deg, DM-03-av-decdm | N42 18' W102 24' | D/M pattern with explicit hash marks and separators; 03-av-decdm is a pattern with NO hemisphere |
| | | 42° 18' 102° 24' | |
| | | 42° 18.44' 102° 24.11' | |
| DM | DM-03-bv | 42° 18'N 102° 24'W | Trailing hemisphere, minute resolution |
| DM | DM-04a, DM-04b | N4218 W10224 | Trivial DMH or HDM pattern |
| | | 4218N 10224W | |
| DM | DM-05 | /4218N4/10224W5/ | Rare military format with checksum value |
| DM | DM-06 | | OBE |
| DM | DM-07 | 42 DEG 18.0N 102 DEG 24.0W | 'DEG' spelled out; fractional-minute resolution |
| DM | DM-08 | +42 18.0 x -102 24.0 | |
| Decimal Degree patterns | | | |
| DD | DD-01 | N42.3, W102.4 | |
| DD | DD-02 | 42.3N; 102.4W | |
| DD | DD-03 | +42.3°;-102.4° | Explicit degree notation required; otherwise it is just a pair of floating-point numbers |
| DD | DD-04 | Latitude: N42.3° x Longitude: W102.3° | Lat/Lon fields in text, decimal-degree resolution |
| DD | DD-05 | N42°, W102° | |
| DD | DD-06 | 42° N, 102° W | |
| DD | DD-07 | N42, W102 | |
END
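
Once the fields of a DMS or DM match are captured, they all reduce to the same arithmetic: decimal degrees = D + M/60 + S/3600, negated for the S and W hemispheres. The following is a minimal sketch of that conversion for the trailing-hemisphere form shown for DMS-01fs-a. The class name, method names, and regular expression are illustrative assumptions; they are not taken from XCoord or geocoord_patterns.cfg.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DmsSketch {
    // Illustrative regex only (not an XCoord rule): degrees, minutes,
    // optionally fractional seconds, then a trailing hemisphere letter.
    // \u00B0 is the degree symbol.
    private static final Pattern DMS_TRAILING_HEMI =
        Pattern.compile("(\\d{1,3})\u00B0(\\d{1,2})'(\\d{1,2}(?:\\.\\d+)?)\"([NSEW])");

    /** Decimal degrees = D + M/60 + S/3600; S and W hemispheres are negative. */
    public static double toDecimalDegrees(int deg, int min, double sec, char hemisphere) {
        double value = deg + min / 60.0 + sec / 3600.0;
        return (hemisphere == 'S' || hemisphere == 'W') ? -value : value;
    }

    /** Parses one ordinate such as 01°44'55.5"N; throws if the text does not match. */
    public static double parseOrdinate(String text) {
        Matcher m = DMS_TRAILING_HEMI.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a DMS ordinate: " + text);
        }
        return toDecimalDegrees(
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Double.parseDouble(m.group(3)),
            m.group(4).charAt(0));
    }
}
```

Under this arithmetic the DM example N42 18' W102 24' and the DD example N42.3, W102.4 describe the same point, since 18/60 = 0.3 and 24/60 = 0.4.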

XCOORD RELEASE NOTES

XCoord 2.1 through 2.5 June 2014

  • Numerous javadoc API updates
  • MGRS filters for well-known dates/months, lowercase text (the default is to filter out lowercase), and line endings in the latitude band/GZD
  • MGRS exception reporting calmed down
  • Version number for XCoord now carries with the rest of Xponents

XCoord 2.0 2013-July Independence Release

  • Xponents is now fully Maven-built, set apart from the "opensextant" super project.
  • Improved concept of precision/resolution
  • Added more MGRS filters and pattern refinements
  • Unbalanced pattern matches are now parsed better, while allowing more flexibility in punctuation. Managing false positives from looser patterns is important. E.g., a pattern with +/-DEG DEG MIN SEC would be an unbalanced lat/lon pair where longitude is specified to seconds but latitude is degrees only. Such a case is very much a false positive.
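
One way to picture the balance check is to count the fields present in each ordinate (1 = degrees, 2 = degrees and minutes, 3 = degrees, minutes and seconds) and compare the counts. The sketch below assumes that a difference of more than one field marks a pair as unbalanced. That threshold is an assumption chosen to agree with the example above, and the class and method names are hypothetical; XCoord's actual rule may differ.

```java
public class BalanceSketch {
    /**
     * Assumed rule, not XCoord's documented logic: a lat/lon pair is balanced
     * when the number of fields in each ordinate differs by at most one.
     * Field counts: 1 = degrees, 2 = degrees+minutes, 3 = degrees+minutes+seconds.
     */
    public static boolean isBalanced(int latFields, int lonFields) {
        return Math.abs(latFields - lonFields) <= 1;
    }
}
```

With this rule, latitude in degrees only paired with longitude to seconds (1 versus 3 fields) is rejected, while a pair that differs by a single field is still accepted.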

XCoord 1.6 2013-MAR St Patrick's release

  • MGRS date filtering for various common dates in recent time; for example 03 MAR 12
  • Allowing for "dashes" as separators between lat & lon, where in some cases the dash may appear to be a hemisphere sign.
  • Introduction of Maven
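
A date such as 03 MAR 12 has the same shape as a short MGRS grid: two digits, a letter that reads as a latitude band, two letters that read as a 100 km square, then digits. The sketch below shows one way such a filter could work, assuming a plain day/month-abbreviation/year test is enough. The class name, regular expression, and day-range check are illustrative assumptions, not the filter XCoord ships.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MgrsDateFilterSketch {
    // Day, three-letter month abbreviation, two- or four-digit year,
    // with optional whitespace between the fields. Uppercase only.
    private static final Pattern DATE_LIKE = Pattern.compile(
        "(\\d{1,2})\\s*(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\\s*(\\d{2}|\\d{4})");

    /** Returns true when a candidate MGRS match reads as a calendar date. */
    public static boolean looksLikeDate(String candidate) {
        Matcher m = DATE_LIKE.matcher(candidate.trim());
        if (!m.matches()) {
            return false;
        }
        // Only treat the text as a date when the leading number is a plausible day.
        int day = Integer.parseInt(m.group(1));
        return day >= 1 && day <= 31;
    }
}
```

A caller would drop any MGRS candidate for which looksLikeDate returns true and keep the rest, such as 38SMB4611036560.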

XCoord v1.5 2013-JAN MLK release

Major improvements

  • Support for asymmetric coordinates:
    • valid lat/lon formats that have varying resolution, e.g., 34:45N 44:55:10W
    • valid MGRS grids with typos, e.g., 483QR 443 55 (easting has 3 digits; northing should be "550")
    • grids with line breaks or interrupting whitespace, e.g., 483QR 44\n3 55
    • grids with no whitespace but invalid easting/northing, e.g., 483QR44355; both 443 / 550 and 440 / 355 are emitted as potentially correct interpretations
  • Allow for multiple interpretations of a coordinate pattern, e.g., MGRS case above where easting/northing is ambiguous due to typos.
  • Testing data included patterns from Wikipedia geotagging conventions for wiki pages.
  • Expanded test cases
  • Improved date filtering for DMS patterns with odd punctuation, e.g. 2012-12-11 10:45:00 is a time not a coordinate. Two-digit year date/time patterns matched often.
  • Consolidated patterns library primarily due to consideration of asymmetry in resolution. More smarts put into place in parsing/normalizing.
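
The multiple-interpretation case for MGRS can be sketched as follows. When the digits after the 100 km square have odd length, either the easting or the northing is missing a digit, so two candidate splits are produced, each padding the short half with a trailing zero. For 44355 this gives 443 / 550 and 440 / 355, as in the example above. The class and method names are hypothetical, and the sketch expects whitespace and line breaks to have been removed already; it is not XCoord's parser.

```java
import java.util.ArrayList;
import java.util.List;

public class MgrsOffsetSketch {
    /**
     * Splits the digits that follow the 100 km square into {easting, northing} pairs.
     * Even length: one split down the middle.
     * Odd length: two candidates, padding whichever half is short with a trailing zero.
     */
    public static List<String[]> interpretations(String digits) {
        List<String[]> out = new ArrayList<>();
        int n = digits.length();
        if (n % 2 == 0) {
            out.add(new String[] { digits.substring(0, n / 2), digits.substring(n / 2) });
        } else {
            int longer = (n + 1) / 2;
            int shorter = n / 2;
            // Candidate 1: the easting is complete and the northing lost a digit.
            out.add(new String[] { digits.substring(0, longer), digits.substring(longer) + "0" });
            // Candidate 2: the easting lost a digit and the northing is complete.
            out.add(new String[] { digits.substring(0, shorter) + "0", digits.substring(shorter) });
        }
        return out;
    }
}
```

Returning every candidate leaves the choice to the caller, which matches the idea of emitting both readings as potentially correct.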

XCoord v1.3, 2012-10-27 THANKSGIVING release

Minor improvements:

  • added more DM/DMS patterns to ensure coordinates with DMS symbols only and no indication of hemisphere (-,+, ENSW) are detected and parsed
  • added such tests to Truth data; see GeocoderEval project

XCoord v1.2, 2012-10-31 HALLOWEEN release

Major improvements:

  • Precision
  • Test & Evaluation
  • Rule order -- it is preserved and is currently MGRS UTM DMS DM DD
  • Fixed many rules, added others

Details:

  • Concepts added/adjusted (org.mitre.xcoord pkg):
    • GeocoordPrecision -- tracks number of digits mainly in DD, DM, DMS patterns, and the associated precision.
    • Hemisphere -- tracks hemisphere if it is +/-, Alpha or null.
    • PrecisionScales -- much improved assessment of inherent precision in any pattern
  • Testing added:
    • TestCase, TestScript added to capture input tests
    • TestUtility improved to report truth data if given, and to carry both truth and test results along with output for evaluation later.
  • UTMParser hemisphere detection fixed. "S" was allowed to be a Northern lat band, when it usually means South.
  • DMSOrdinate now handles all logic for DD, DM, DMS parsing and calculation. Hemisphere parsing moved here (out of PatternManager)
  • PatternManager --- ensure Rules order-of-appearance is preserved from configuration file
  • GeocoordMatch -- formatting was a major concern: in Java, printing a Float or Double can produce far more decimal places than intended. Where you want the string "35.01", Java might print "35.00999989082911237".
    • Use GeocoordMatch.formatLatitude() or formatLongitude() to get a printable string version of the calculated lat or lon
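
The formatting concern can be illustrated with standard Java alone. The sketch below rounds an ordinate to a fixed number of decimal places with String.format. The class and method are hypothetical helpers written for illustration; they are not the implementation of GeocoordMatch.formatLatitude() or formatLongitude(), whose internals are not described here.

```java
import java.util.Locale;

public class CoordFormatSketch {
    /**
     * Formats an ordinate with a fixed number of decimal places.
     * Locale.ROOT keeps the decimal separator a period on every system.
     */
    public static String format(double ordinate, int decimalPlaces) {
        return String.format(Locale.ROOT, "%." + decimalPlaces + "f", ordinate);
    }
}
```

Calling format(35.00999989082911237, 2) yields the string "35.01". In practice the number of decimal places would follow from the precision detected in the matched text.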