Regular expression (REGEX) patterns are a common way to detect relatively concrete sequences of characters and symbols. REGEX alone, however, does not validate what it finds: a 10-digit sequence of numbers is not necessarily a phone number, and a tuple of digits separated by slashes is not necessarily a date. Additional validation is important ... usually.
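To make the detection-versus-validation point concrete, here is a minimal Python sketch. It is not FlexPat code; the pattern names, the helper `is_calendar_date`, and the sample text are all made up for illustration. The REGEX step flags every candidate, and a separate check decides which candidates are plausible dates.

```python
import re
from datetime import datetime

# Detection only: these patterns know nothing about meaning.
TEN_DIGITS = re.compile(r"\b\d{10}\b")
SLASH_TRIPLE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def is_calendar_date(month: str, day: str, year: str) -> bool:
    """Validation step (hypothetical helper): keep only real calendar dates."""
    try:
        datetime(int(year), int(month), int(day))
        return True
    except ValueError:
        return False


text = "Order 1234567890 shipped 09/22/2017; the ratio was 45/99/2017."

# The 10-digit match is an order number here, not a phone number.
digit_runs = TEN_DIGITS.findall(text)

# Both slash-separated tuples are detected, but only one survives validation.
candidates = SLASH_TRIPLE.findall(text)
dates = [c for c in candidates if is_calendar_date(*c)]

print(digit_runs)
print(candidates)
print(dates)
```

The second tuple matches the REGEX just as well as the first; only the validation step tells them apart.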
Xponents FlexPat is a general methodology for developing REGEX-based extractors. Consider alternative solutions, such as YARA -- very powerful, but much more intricate and tailored specifically toward malware/cyberforensics. FlexPat provides a more general abstraction around REGEX, specifically:
- to improve the testability of patterns,
- to iterate on variations,
- to streamline validation, and
- to externalize the patterns, concepts and test cases for all to see
FlexPat currently operates in Java or Python with the same patterns file syntax. On the last point above, the intent of FlexPat was to get the patterns out of language-specific source code and into a readable form, so that all team members can understand the patterns and weigh in on test cases.
To date, Xponents FlexPat extractors include XCoord for geo-coordinate patterns, XTemporal for date/time patterns, and PoLi for general tutorial/demonstration purposes using simple patterns like email and telephone numbers. They are described in more detail with the actual files and supporting material below.
XCoord is a geographic coordinate extractor and normalizer that finds latitude/longitude pairs or grids such as MGRS or UTM. The patterns cover decimal degrees, degrees/minutes/seconds, and/or fractional parts, along with hemisphere symbology. Patterns include these: Coord Patterns, 2013-2017, as implemented in the patterns definition geocoord_patterns.cfg. This was drafted and operationalized in Java and has not yet been ported to Python. (In fact, the first extractor implementation, before OpenSextant, was in Python but did not use the FlexPat approach.) Coordinate patterns include:
- DD (decimal degrees) ~ 39.0N, -117.2W
- DM, DMS (degrees/minutes/seconds) ~ N39º 0' x W117º 12'
- MGRS ~ Quad/Zone/Easting/Northing
- UTM ~ Zone/Band/Easting/Northing
FILES: Coord Patterns, Patterns Config: geocoord_patterns.cfg.
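As a rough illustration of what "extract and normalize" means for the DD case, the following Python sketch parses a hemisphere-suffixed decimal-degree pair and converts it to signed values. It is an assumption-laden toy, not the XCoord implementation and not the pattern from geocoord_patterns.cfg: the REGEX, the `normalize_dd` function, and the choice to let the hemisphere letter decide the sign are all hypothetical.

```python
import re

# Hypothetical, simplified DD pattern: <lat><N|S>, <lon><E|W>
# An optional leading sign is tolerated but ignored; the hemisphere decides the sign.
DD = re.compile(
    r"(?<![\d.])[-+]?(?P<lat>\d{1,2}(?:\.\d+)?)\s*(?P<lat_hemi>[NS])\s*,\s*"
    r"[-+]?(?P<lon>\d{1,3}(?:\.\d+)?)\s*(?P<lon_hemi>[EW])"
)


def normalize_dd(text):
    """Return (lat, lon) as signed floats, or None if absent or out of range."""
    m = DD.search(text)
    if m is None:
        return None
    lat = float(m.group("lat"))
    lon = float(m.group("lon"))
    # Validation: the pattern can match digits that are not valid coordinates.
    if lat > 90 or lon > 180:
        return None
    if m.group("lat_hemi") == "S":
        lat = -lat
    if m.group("lon_hemi") == "W":
        lon = -lon
    return (lat, lon)


print(normalize_dd("Site located at 39.0N, -117.2W per the report"))
print(normalize_dd("95.0N, 117.2W"))
```

The second call returns None: the text matches the pattern, but 95 degrees of latitude fails the range check.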
XTemporal is a date/time extractor and normalizer that finds dates and date/time patterns, implemented with the patterns definition datetime_patterns.cfg. Patterns include:
- MDY ~ Month/Day/Year, e.g., Sept 22nd, 2017 or 09/22/2017
- DMY, DMYT ~ Day/Month/Year/Time ~ 22 SEPT 2017 0700Z
- YMD ~ Year/Month/Day ~ 2017-09-22
- DTM ~ Date+Time ~ 2017-09-22T0700-0500
FILES: Patterns Config: datetime_patterns.cfg
PoLi, or patterns-of-life demonstration, includes well-known patterns like telephone numbers, email addresses, and money. Those patterns are contained in poli_patterns.cfg. As a demonstration of FlexPat, this set of patterns is provided only to illustrate the process of developing additional patterns. On other projects we have implemented such patterns in much more depth, although that work is not always open source. These patterns include:
- Telephone patterns - with country code prefixes; validation includes confirming valid country + exchange pairings.
- Email and usernames - any user handle of the form "user@domain"
- URLs and IP Addresses - Internet locations are parsed for protocol and domain, and addresses are resolved to a city or ISP if possible.
FILES: Patterns Config: poli_patterns.cfg
While XCoord and XTemporal above involve complex parsing, they are relatively well-contained problems. The more general patterns demonstrated in PoLi show that REGEX detection is just the first part of the problem; the user has to supply the validation.
That validation is implemented (in Python) by subclassing opensextant.FlexPat.PatternMatch and implementing a normalize() function to validate the detected pattern and groups. We'll get into that more below with the optional CLASS directive.
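The subclass-and-normalize idea can be sketched in plain Python as follows. To keep the example self-contained it does not import opensextant; `PatternMatchStub` is a hypothetical stand-in whose attributes (`groups`, `is_valid`, `normalized`) are assumptions and may not match the real PatternMatch class. The email REGEX and `extract` function are likewise illustrative only.

```python
import re


class PatternMatchStub:
    """Hypothetical stand-in for a PatternMatch-like base class (not the real API)."""

    def __init__(self, text, groups):
        self.text = text
        self.groups = groups  # named REGEX groups, e.g. {"user": ..., "domain": ...}
        self.is_valid = True
        self.normalized = None

    def normalize(self):
        """Subclasses override this to validate and normalize the match."""


class EmailMatch(PatternMatchStub):
    def normalize(self):
        domain = self.groups.get("domain", "")
        labels = domain.split(".")
        # Validation: require at least two non-empty domain labels.
        if len(labels) < 2 or any(not label for label in labels):
            self.is_valid = False
            return
        # Normalization: domains are case-insensitive, so lowercase them.
        self.normalized = f'{self.groups["user"]}@{domain.lower()}'


EMAIL = re.compile(r"(?P<user>[\w.+-]+)@(?P<domain>[\w.-]+)")


def extract(text):
    """Detect with the REGEX, then let normalize() accept or reject each match."""
    results = []
    for m in EMAIL.finditer(text):
        match = EmailMatch(m.group(0), m.groupdict())
        match.normalize()
        if match.is_valid:
            results.append(match.normalized)
    return results


print(extract("Write to Jane.Doe@Example.ORG or root@localhost today"))
```

Both handles are detected by the REGEX, but `root@localhost` is filtered out in normalize() because its domain has a single label.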
First, here is the outline of the standard FlexPat patterns configuration file, which should be language independent (that is, until you specify the optional CLASS, which is the name of your custom class and may vary depending on your programming language).
FlexPat uses a "patterns configuration" file, which contains the clauses DEFINE, RULE, TEST, and CLASS -- the essential ingredients for a pattern extractor pipeline. Outlining these more:
- DEFINE: define discrete groups or "sub-patterns" that recur in your patterns
- RULE: an actual REGEX, composed of named groups (identified only by your DEFINEs) and any other valid REGEX syntax
- pattern family: a logical grouping of like patterns that may have subtle variations, e.g., MDY has about 6 total month-day-year patterns. But all such variants are easily referenced by naming the pattern family MDY.
- TEST: a single example of the pattern+variant to be detected, parsed and/or normalized by the RULE. By convention a FAIL comment trailing the test is used to denote a test case that should NOT be detected by the RULE. Alternatively, the RULE may detect a match that the CLASS.normalize() implementation then filters out. In short, there are two chances to succeed here -- (a) the RULE detects or does not detect the match, and/or (b) the CLASS normalization validates or invalidates the match. This is the value of filling out as many TEST cases as possible to touch on variants.
- CLASS: an optional custom class that subclasses PatternMatch to carry REGEX subgroups and allow for additional normalization, validation, etc.
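The interplay of DEFINE, RULE, and TEST can be sketched with a toy loader in Python. This is not the FlexPat parser, and the line syntax in `CONFIG` is a simplified, hypothetical one invented for this example; consult poli_patterns.cfg for the actual syntax. The sketch only shows the idea: DEFINEs become named groups inside a RULE, and each TEST records whether the RULE should or should not match.

```python
import re

# Hypothetical, simplified config syntax (NOT the real FlexPat file format).
CONFIG = r"""
DEFINE areaCode  \d{3}
DEFINE exchange  \d{3}
DEFINE line      \d{4}
RULE PHONE 01 <areaCode>-<exchange>-<line>
TEST PHONE 01 call 703-555-1234 now
TEST PHONE 01 ratio 70-3555-1234  # FAIL
"""


def load(config):
    """Parse the toy config into compiled rules and a list of test cases."""
    defines, rules, tests = {}, {}, []
    for raw in config.splitlines():
        if not raw.strip():
            continue
        kind, rest = raw.split(None, 1)
        if kind == "DEFINE":
            name, pattern = rest.split(None, 1)
            defines[name] = pattern.strip()
        elif kind == "RULE":
            family, variant, pattern = rest.split(None, 2)
            # Expand each <name> into a named group built from its DEFINE.
            expanded = re.sub(
                r"<(\w+)>",
                lambda m: f"(?P<{m.group(1)}>{defines[m.group(1)]})",
                pattern.strip(),
            )
            rules[(family, variant)] = re.compile(expanded)
        elif kind == "TEST":
            family, variant, example = rest.split(None, 2)
            should_fail = example.rstrip().endswith("# FAIL")
            tests.append((family, variant, example.split("#")[0], should_fail))
    return rules, tests


def run_tests(rules, tests):
    """A test passes when detection agrees with the FAIL annotation."""
    return [
        (rules[(family, variant)].search(text) is not None) != should_fail
        for family, variant, text, should_fail in tests
    ]


rules, tests = load(CONFIG)
print(run_tests(rules, tests))
```

Because the rule and its test cases live in the same text, a reviewer who does not read Python or Java can still add a TEST line and see whether the RULE behaves as expected.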
The PoLi example is provided as a template for starting your own set of patterns: [poli_patterns.cfg](https://github.com/OpenSextant/Xponents/blob/master/Core/src/main/resources/poli_patterns.cfg)
- Java:
- Python:
- Testing:
- XCoord example: Invoked using ./script/xponents-demo.sh, TestXCoord.java - FlexPat example
- XCoord example: Invoked
using