easIE

easIE (easy Information Extraction) is an easy-to-use information extraction framework that extracts data about companies from heterogeneous Web sources in a semi-automatic manner. Admin users only need to define a configuration file; from it, the framework quickly generates Web information extractors and wrappers. easIE offers a set of wrappers for obtaining content from static and dynamic HTML pages by pointing to the HTML elements with CSS selectors.

Getting started

Each extractor extends AbstractHTMLExtractor and implements the extractFields(List<ScrapableField> fields) and extractTable(String table_selector, List<ScrapableField> fields) methods. Four classes extend AbstractHTMLExtractor:

StaticHTMLExtractor is responsible for extracting content from static HTML pages:

   StaticHTMLExtractor extractor = new StaticHTMLExtractor(base_url, relative_url);
   extractor.extractFields(fields);

DynamicHTMLExtractor is responsible for executing a number of events on a dynamic HTML page and then extracting the defined content:

   DynamicHTMLExtractor extractor = new DynamicHTMLExtractor(base_url, relative_url, chrome_driver_path);
   extractor.browser_emulator.clickEvent(css_selector);
   extractor.extractFields(fields);

GroupHTMLExtractor is responsible for extracting content from a group of static HTML pages with similar structure:

   GroupHTMLExtractor extractor = new GroupHTMLExtractor(group_of_pages);
   extractor.extractFields(fields);

PaginationIterator is responsible for extracting data that is spread across multiple pages:

   PaginationIterator extractor = new PaginationIterator(base_url, relative_url, next_page_selector);
   extractor.extractFields(fields);
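
The contract shared by the four extractors can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: `FieldSketch`, `ExtractorSketch`, and `InMemoryExtractor` are hypothetical stand-ins, not easIE classes, and the return types are guesses, since this README does not state what `extractFields` and `extractTable` return. The sketch resolves selectors against in-memory maps rather than a parsed HTML page, so it only shows the shape of the two methods.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for ScrapableField: a label plus a CSS selector.
final class FieldSketch {
    final String label;
    final String selector;

    FieldSketch(String label, String selector) {
        this.label = label;
        this.selector = selector;
    }
}

// Hypothetical contract mirroring the two method names from this README.
// The return types are assumptions, not easIE's actual signatures.
abstract class ExtractorSketch {
    // One record: label -> text found at the field's selector.
    abstract Map<String, String> extractFields(List<FieldSketch> fields);

    // One record per row matched by tableSelector.
    abstract List<Map<String, String>> extractTable(String tableSelector, List<FieldSketch> fields);
}

// Toy implementation backed by in-memory maps instead of a parsed HTML page.
final class InMemoryExtractor extends ExtractorSketch {
    private final Map<String, String> page;                       // selector -> text
    private final Map<String, List<Map<String, String>>> tables;  // table selector -> rows

    InMemoryExtractor(Map<String, String> page, Map<String, List<Map<String, String>>> tables) {
        this.page = page;
        this.tables = tables;
    }

    @Override
    Map<String, String> extractFields(List<FieldSketch> fields) {
        return resolve(page, fields);
    }

    @Override
    List<Map<String, String>> extractTable(String tableSelector, List<FieldSketch> fields) {
        List<Map<String, String>> records = new ArrayList<>();
        for (Map<String, String> row : tables.getOrDefault(tableSelector, List.of())) {
            records.add(resolve(row, fields));
        }
        return records;
    }

    // Looks up each field's selector in the source and keeps the matches.
    private static Map<String, String> resolve(Map<String, String> source, List<FieldSketch> fields) {
        Map<String, String> record = new LinkedHashMap<>();
        for (FieldSketch field : fields) {
            String text = source.get(field.selector);
            if (text != null) {
                record.put(field.label, text); // selectors that match nothing are skipped
            }
        }
        return record;
    }
}

class ExtractorSketchDemo {
    public static void main(String[] args) {
        // Illustrative data only; the selectors and values are made up.
        ExtractorSketch extractor = new InMemoryExtractor(
                Map.of("h1.company", "Example Corp"),
                Map.of("table.scores tr", List.of(
                        Map.of("td.year", "2015", "td.score", "71"),
                        Map.of("td.year", "2016", "td.score", "74"))));

        System.out.println(extractor.extractFields(
                List.of(new FieldSketch("name", "h1.company"))));
        System.out.println(extractor.extractTable("table.scores tr",
                List.of(new FieldSketch("year", "td.year"), new FieldSketch("score", "td.score"))));
    }
}
```

In this sketch `extractFields` yields a single record for the page, while `extractTable` yields one record per row under the table selector; whether easIE draws the same distinction should be checked against the source in src/main.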
