A Watir Ruby script that scrapes business registration data from the ARBK website.
- MongoDB: to persist the scraped data.
- Ruby: to run the script.
- ruby-dev: needed to install the Ruby mongo driver.
- Make: to install Ruby gems.
- zlib: required by the watir-nokogiri gem; without it the installation fails with the error "zlib is missing; necessary for building libxml2".
- ChromeDriver (WebDriver for Chrome): lets the watir gem control the Chrome browser.
- rubygems.
- mongo-ruby-driver: the MongoDB driver for Ruby.
- watir: an interface for scripting interactions with the Chrome browser.
- nokogiri: an HTML, XML, SAX, and Reader parser. Among Nokogiri's many features is the ability to search documents via XPath or CSS3 selectors.
Errors can occur during the scraping process. The following is a list of possible errors.
- no such window: target window already closed\nfrom unknown error: web view not found\n (Session info: chrome=57.0.2987.110)\n (Driver info: chromedriver=2.28.455517 (2c6d2707d8ea850c862f04ac066724273981e88f),platform=Mac OS X 10.12.3 x86_64).
- unknown error: Element is not clickable at point (93, 334). Other element would receive the click: <li class="sf-megamenu-wrapper odd sf-item-1 sf-depth-1 sf-total-children-5 sf-parent-children-5 sf-single-children-0 menuparent">... \n (Session info: chrome=57.0.2987.110)\n (Driver info: chromedriver=2.28.455517 (2c6d2707d8ea850c862f04ac066724273981e88f),platform=Mac OS X 10.12.3 x86_64).
- Net::ReadTimeout.
- undefined local variable or method `browser' for main:Object.
- unexpected alert open: {Alert text : [object Object]}\n (Session info: chrome=57.0.2987.110)\n (Driver info: chromedriver=2.28.455517 (2c6d2707d8ea850c862f04ac066724273981e88f),platform=Mac OS X 10.12.3 x86_64).
- timed out after 30 seconds, waiting for #<Watir::TextField: located: false; {:id=>"MainContent_ctl00_txtNumriBiznesit", :tag_name=>"input"}> to be located.
- no such session\n (Driver info: chromedriver=2.28.455517 (2c6d2707d8ea850c862f04ac066724273981e88f),platform=Mac OS X 10.12.3 x86_64).
- Too many failed attempts to load search page: Net::ReadTimeout.
- timed out after 30 seconds, waiting for #<Watir::Anchor: located: false; {:xpath=>"//table[@class='views-table cols-4']/tbody//td/a", :tag_name=>"a"}> to be located.
- Too many failed attempts to load page via anchor click: timed out after 30 seconds, waiting for #<Watir::Anchor: located: false; {:xpath=>"//table[@class='views-table cols-4']/tbody//td/a", :tag_name=>"a"}> to be located.
- unknown error: Element <input name="ctl00$MainContent$ctl00$Submit1" type="submit" id="MainContent_ctl00_Submit1" value="Kërko"> is not clickable at point (93, 275). Other element would receive the click: ...\n (Session info: chrome=57.0.2987.110)\n (Driver info: chromedriver=2.28.455517 (2c6d2707d8ea850c862f04ac066724273981e88f),platform=Mac OS X 10.12.3 x86_64).
- browser window was closed.
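Several of the messages above ("Too many failed attempts to load search page: ...") wrap an underlying error after repeated attempts. The following is a minimal sketch of how such a retry wrapper could look; it is not taken from the script itself. The helper name `with_retries` and its parameters `max_attempts` and `wait_seconds` are assumptions made for illustration.

```ruby
# Hypothetical helper (not part of the original script): runs the given block
# and retries it when it raises. Net::ReadTimeout and the Watir/Selenium
# errors listed above are all StandardError subclasses, so they are caught here.
def with_retries(description, max_attempts: 3, wait_seconds: 0)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError => e
    if attempts < max_attempts
      sleep(wait_seconds) # give the site or the browser time to recover
      retry               # re-run the begin block
    end
    # Out of attempts: wrap the last error in a message like the ones logged above.
    raise "Too many failed attempts to #{description}: #{e.message}"
  end
end

# Usage sketch (browser and search_url are assumed to exist in the script):
#   with_retries('load search page') { browser.goto(search_url) }
```

Errors such as "no such session" or "browser window was closed" mean the browser itself is gone, so retrying the same call is unlikely to help; in that case the block would also need to start a new browser before trying again.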
You can count how many times each error type occurs with the following query:
    db.errors.aggregate([
      { $group:
        { _id: '$errorMsg', count: { $sum: 1 } }
      }
    ]).pretty()
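The same count can also be obtained from Ruby through the mongo driver. The sketch below assumes the `errors` collection and `errorMsg` field used in the query above; the helper name `count_errors`, the constant `ERROR_COUNT_PIPELINE`, and the database name `arbk` in the usage comment are assumptions, not names from the script.

```ruby
# The shell query above, written as a Ruby aggregation pipeline.
ERROR_COUNT_PIPELINE = [
  { '$group' => { '_id' => '$errorMsg', 'count' => { '$sum' => 1 } } }
].freeze

# Hypothetical helper: returns a hash of { error message => occurrences }.
# `collection` is expected to respond to #aggregate like Mongo::Collection.
def count_errors(collection)
  collection.aggregate(ERROR_COUNT_PIPELINE).each_with_object({}) do |doc, counts|
    counts[doc['_id']] = doc['count']
  end
end

# Usage sketch (host and database name are assumptions):
#   require 'mongo'
#   client = Mongo::Client.new(['127.0.0.1:27017'], database: 'arbk')
#   count_errors(client[:errors]).each { |msg, n| puts "#{n}\t#{msg}" }
```

Because the grouping is on the exact message text, messages that differ only in details such as click coordinates are counted as separate error types.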