The OpenAlboPretorio project consists in a scraper library able to extract data from the Albo Pretorio Web archive present on Italian city institutional websites. The Albo Pretorio is the public archive containing all administrative acts that concern the Municipality administrive life. Extracted data can be exported in JSON (default) and Feed (rss2, atom) formats.
Clone the github repo in your machine
$ git clone https://github.com/gpirrotta/OpenAlboPretorio.git
$ cd OpenAlboPretorioAnd run these two commands to install it:
$ wget http://getcomposer.org/composer.phar
$ php composer.phar installNow you can add the autoloader, and you will have access to the library:
<?php
require 'vendor/autoload.php';You're done.
The OpenAlboPretorio class is the entry point of the library.
<?php
$albo = new OpenAlboPretorio();
$results = $albo->city(AlboPretorioScraperFactory::TERME_VIGLIATORE);
->open();
print $results // JSON format as defaultYou can also customize the scraper manually:
<?php
$scraper = new BarcellonaPGScraper(new BarcellonaPGMasterPageScraper(), new BarcellonaPGDetailPageScraper());
// you can also customize the Master and Detail scraper objects using i.e. different HttpAdapter objects
$scraper->setItemType(BarcellonaPGScraper::TIPOLOGIA_DETERMINAZIONE_DEL_SINDACO);
$formatter = new FeedFormatter(FeedFormatter::ATOM_FEED_TYPE); // RSS2 default
$albo = new OpenAlboPretorio();
$results = $albo->scrapeUsing($scraper)
->formatUsing($formatter)
->maxNumberItems(10)
->open();
print $results;city($city): set the city to scrape. The$cityparameter is theidcity of the Web page to scrape.
Examples:
<?php
$albo->city(AlboPretorioScraperFactory::TERME_VIGLIATORE);Alternatively to the city method you can set your customized scraper using
scrapeUsing(AlboPretorioScraperInterface $scraper)method.
You can customized the scraped results with:
-
maxNumberItems($maxNumberItems): set the maximum number of items to scrape; -
formatUsing(FormatterInterface $formatter): set the output format. The default is JSON format but you can choose also the Feed (rss2, atom) format; -
open(): it starts the game returning the scraped data.
Currently the following scrapers are implemented:
TermeVigliatoreScraper(Scrapes data from Terme Vigliatore (ME))BarcellonaPGScraper(Scrapes data from Barcellona Pozzo di Gotto (ME)
Formatters available:
-
JSONFormatter: (default) formats the scraped results in JSON format; -
FeedFormatter($type): formats the scraped results in Feed format; type parameter can be one of these:FeedFormatter::RSS_FEED_TYPE(default)FeedFormatter::ATOM_FEED_TYPE
###Extending the OpenAlboPretorio project
If you want to extend the Albo Pretorio project for your city you have to implement the AlboPretorioScraperInterface interface.
Generally scraping an Albo Pretorio Web page means extract data from two pages:
-
the
Master page- the Web page containing the list of all Albo Pretorio items, i.e. all administrative acts of the Municipality, where you can find the summary of the last item published including the URL of each item; -
the
Detail page- the single Web page item where you can find the detail of each administrative act.
To manage correctly the above described scraping logic the OpenAlboPretorio library provides the
AbstractMasterDetailTemplateScraper abstract class implementing the AlboPretorioScraperInterface interface.
The abstract class uses the following scraper interfaces:
MasterPageScraperInterface- retrieves the list of item URLs to scrape;DetailPageScraperInterface- retrieves the detail of each single administrative act, i.e. an item. - it must return anAlboPretorioItemclass
Obviously if the extraction logic of your Albo Pretorio Web page is different from the Master-Detail you are free to implement the one that meets your needs.
-
= PHP 5.4
$ phpunit- Albo Pretorio Terme Vigliatore JSON RSS2 ATOM
- Albo Pretorio Barcellona Pozzo di Gotto JSON RSS2 ATOM
- Improve test coverage
- Add scrapers
- Add formatters
- Giovanni Pirrotta giovanni.pirrotta@gmail.com
OpenAlboPretorio is released under the MIT License. See the bundled LICENSE file for details.
