powerplantmatching

A toolset for cleaning, standardizing and combining multiple power plant databases. See the documentation for a more extensive insight.

WARNING (2017/03/08): To remove non-public (copyrighted) data that had inadvertently been introduced into our repository, we were forced to substantially rewrite the git commit history. This means that merges with descendants of earlier commits will NOT apply cleanly and instead have to be processed manually. The usual 'pull' procedure will also fail. To continue working on top of the new history, save your changes and then call either git reset origin/master or even git reset --hard origin/master (note that this removes the commits on your current branch). We used the opportunity to move the continuously updating database bits to git large files, which you will have to install.

This package simplifies the collection of power plant data. Information on power plants, particularly European ones, is scattered over a few different projects and databases that each introduce their own standards. Thus, we first provide functions to vertically clean databases and convert them into one coherent standard, which does not distinguish between the units of a power plant. Second, we provide functions to horizontally merge different databases in order to check their consistency and improve their reliability.

powerplantmatching was initially developed by the Renewable Energy Group at FIAS to build power plant data inputs to PyPSA-based models for carrying out simulations for the CoNDyNet project, financed by the German Federal Ministry for Education and Research (BMBF) as part of the Stromnetze Research Initiative.

What it can do

clean and standardize power plant data sets
merge power plant units to one power plant
compare and combine different data sets
create lookups and give statistical insight to power plant goodness
provide cleaned data from different sources
provide an already merged data set of five different data-sources

Installation

Make sure that git lfs is installed, in case of doubt just run git lfs install
Copy or clone the repository to your preferred directory
Install the package via 'pip install -e /path/to/powerplantmatching'

Optional but recommended:

Download the ESE dataset. To integrate the data into powerplantmatching, add the path of the downloaded file to the powerplantmatching/data/additional_data_config file under the keyword 'ese_path' (the default is 'Downloads/projects.xls').
Add your ENTSO-E security token to the powerplantmatching/data/catching_data_config file to easily update your ENTSO-E database. The token can be obtained by referring to section 2 of the RESTful API documentation of the ENTSO-E Transparency Platform.

Processed Data

If you are only interested in the power plant data, we provide our current merged dataset as a csv-file. This set combines the data of all the data sources listed in Data-Sources and provides the following information:

Power plant name - claim of each database
Fueltype - {Bioenergy, Geothermal, Hard Coal, Hydro, Lignite, Nuclear, Natural Gas, Oil, Solar, Wind, Other}
Classification - {CCGT, OCGT, Steam Turbine, Combustion Engine, Run-Of-River, Pumped Storage, Reservoir}
Set - {Power Plant (PP), Combined Heat and Power (CHP)}
Capacity - [MW]
Geo-position - Latitude, Longitude
Country - EU-27 + CH + NO (+ UK) minus Cyprus and Malta
YearCommissioned - Commissioning year of the power plant
File - Source file of the data entry
projectID - Identifier of the power plant in the source file

The following picture compares the total capacities per fuel type between the different data sources and our merged dataset.

The merged dataset is available in two versions: The bigger dataset links the entries of the matched power plant and lists all the related claims by the different data-sources. The smaller merged dataset reduces the former by applying a set of aggregation rules (shown below) for deciding the power plant parameters.

Argument	Rule
Name	Every name of the different databases
Fueltype	Most frequently claimed one
Classification	All different Classification in a row
Country	Take the uniquely stated country
Capacity	Mean
lat	Mean
lon	Mean
File	All files in a row
projectID	Python dictionary referencing all original powerplants that are included

Note that the claims for the country cannot differ, otherwise the power plants cannot match.
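The rules in the table can be illustrated with a minimal sketch in plain Python. The function name `reduce_claims` and the dictionary layout are hypothetical and not part of powerplantmatching, and the package itself may implement the reduction differently. The File and projectID columns are left out for brevity. The input values are the Abono entries of the two example datasets shown in the horizontal matching section.

```python
from collections import Counter
from statistics import mean


def reduce_claims(claims):
    """Reduce the claims of several sources about one plant to a single record.

    `claims` maps a source name to a dict with the keys Name, Fueltype,
    Classification, Country, Capacity, lat and lon. This layout is an
    assumption made for the example.
    """
    records = list(claims.values())

    countries = {r["Country"] for r in records}
    if len(countries) != 1:
        # Matched plants must agree on the country.
        raise ValueError("claims disagree on the country")

    # Classification: all different non-missing values in a row.
    classifications = sorted(
        {r["Classification"] for r in records if r["Classification"] is not None}
    )

    return {
        # Name: keep every name, one per source.
        "Name": {source: r["Name"] for source, r in claims.items()},
        # Fueltype: most frequently claimed value.
        "Fueltype": Counter(r["Fueltype"] for r in records).most_common(1)[0][0],
        "Classification": ", ".join(classifications),
        "Country": countries.pop(),
        # Capacity, lat, lon: arithmetic mean of the claims.
        "Capacity": mean(r["Capacity"] for r in records),
        "lat": mean(r["lat"] for r in records),
        "lon": mean(r["lon"] for r in records),
    }


abono = reduce_claims({
    "Dataset 1": {"Name": "Abono", "Fueltype": "Coal", "Classification": None,
                  "Country": "Spain", "Capacity": 921.7,
                  "lat": 43.5588, "lon": -5.72287},
    "Dataset 2": {"Name": "Abono", "Fueltype": "Coal", "Classification": "Thermal",
                  "Country": "Spain", "Capacity": 921.7,
                  "lat": 43.5528, "lon": -5.7231},
})
print(abono["Fueltype"], abono["Classification"], round(abono["lat"], 4))
# Coal Thermal 43.5558
```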

The merged dataset is also available as a further version that uses heuristics to fill the gaps.

Unmatched power plants from the OPSD data source are added so that the aggregated capacities per country and fueltype correspond closely to the ENTSOe statistics (except for Wind and Solar).
A learning algorithm fills the information about missing hydro classification (Run-of-River, Pumped Storage and Reservoir)
Additionally, a function that can be activated with a switch is provided that scales the hydro power plant capacities in order to fulfill all country totals.
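The idea behind the hydro scaling switch can be sketched as follows. This is only an illustration under assumptions: `rescale_hydro` is a hypothetical helper, the record layout is simplified, and the numbers are invented. The package's own scaling may work differently in detail.

```python
def rescale_hydro(plants, hydro_totals):
    """Return a copy of `plants` with hydro capacities scaled per country.

    `plants` is a list of dicts with the keys Country, Fueltype and Capacity.
    `hydro_totals` maps a country to its reference hydro capacity in MW.
    """
    # Sum the current hydro capacity per country.
    current = {}
    for plant in plants:
        if plant["Fueltype"] == "Hydro":
            country = plant["Country"]
            current[country] = current.get(country, 0.0) + plant["Capacity"]

    scaled = []
    for plant in plants:
        plant = dict(plant)  # do not modify the caller's records
        target = hydro_totals.get(plant["Country"])
        total = current.get(plant["Country"], 0.0)
        if plant["Fueltype"] == "Hydro" and target is not None and total > 0:
            # Apply one common factor per country so the total hits the target.
            plant["Capacity"] *= target / total
        scaled.append(plant)
    return scaled


# Invented example values, not taken from any of the data sources.
plants = [
    {"Country": "Austria", "Fueltype": "Hydro", "Capacity": 100.0},
    {"Country": "Austria", "Fueltype": "Hydro", "Capacity": 300.0},
    {"Country": "Austria", "Fueltype": "Wind", "Capacity": 50.0},
]
rescaled = rescale_hydro(plants, {"Austria": 500.0})
```

A single factor per country keeps the relative sizes of the hydro plants unchanged and leaves all other fuel types untouched.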

The database is available using the python command

import powerplantmatching as pm
pm.collection.MATCHED_dataset()

or

import powerplantmatching as pm
pm.collection.MATCHED_dataset(rescaled_hydros=True)

if you want to scale hydro power plants.

Module Structure

The package consists of ten modules. For creating a new dataset you can make most use of the modules data, clean and match, which provide you with functions for data supply, vertical cleaning and horizontal matching, respectively.

Combining Data From Different Sources - Horizontal Matching

Whereas single databases such as CARMA or GEO provide non-standardized and incomplete information, the datasets can complement each other and improve their reliability. The merged dataset combines five different databases (see below) by keeping only power plants that appear in more than one source.

The matching process relies heavily on DUKE, a Java application specialized in deduplicating and linking data. It provides many built-in comparators such as numerical, string or geoposition comparators. The engine does a detailed comparison for each single argument (power plant name, fuel type etc.) using adjusted comparators and weights. From the individual scores for each column it computes a compound score for the likelihood that the two power plant records refer to the same power plant. If the score exceeds a given threshold, the two records of the power plant are linked and merged into one data set.
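To give a feeling for how per-column scores can be turned into a compound score, here is a minimal sketch in plain Python. It assumes a naive Bayesian combination of per-column probabilities. The comparators, the probability bounds and the threshold are invented for the example and are not DUKE's or the package's actual configuration.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def capacity_similarity(a, b):
    """Ratio of the smaller to the larger capacity, in [0, 1]."""
    largest = max(a, b)
    return min(a, b) / largest if largest > 0 else 1.0


# column -> (comparator, probability at similarity 0, probability at similarity 1)
# These bounds act as weights and are invented for the example.
COMPARATORS = {
    "Name": (name_similarity, 0.1, 0.9),
    "Capacity": (capacity_similarity, 0.3, 0.7),
}
THRESHOLD = 0.8  # invented


def compound_score(rec_a, rec_b):
    """Combine the per-column probabilities into one score in (0, 1)."""
    score = 0.5  # neutral prior
    for column, (compare, low, high) in COMPARATORS.items():
        similarity = compare(rec_a[column], rec_b[column])
        p = low + similarity * (high - low)
        # Bayesian update of the running score with the column probability.
        score = score * p / (score * p + (1 - score) * (1 - p))
    return score


aberthaw_1 = {"Name": "Aberthaw", "Capacity": 1552.5}
aberthaw_2 = {"Name": "Aberthaw", "Capacity": 1500.0}
abono_2 = {"Name": "Abono", "Capacity": 921.7}

same = compound_score(aberthaw_1, aberthaw_2)
different = compound_score(aberthaw_1, abono_2)
print(same > THRESHOLD, different > THRESHOLD)  # True False
```

Keeping the bounds strictly between 0 and 1 means that no single column can force or veto a match on its own, so a column with a narrow range such as Capacity has less influence than the name.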

Let's make that a bit more concrete by giving a quick example. Consider the following two data sets

Dataset 1:

	Name	Fueltype	Classification	Country	Capacity	lat	lon	File
0	Aarberg	Hydro	nan	Switzerland	14.609	47.0444	7.27578	nan
1	Abbey mills pumping	Oil	nan	United Kingdom	6.4	51.687	-0.0042057	nan
2	Abertay	Other	nan	United Kingdom	8	57.1785	-2.18679	nan
3	Aberthaw	Coal	nan	United Kingdom	1552.5	51.3875	-3.40675	nan
4	Ablass	Wind	nan	Germany	18	51.2333	12.95	nan
5	Abono	Coal	nan	Spain	921.7	43.5588	-5.72287	nan

and

Dataset 2:

	Name	Fueltype	Classification	Country	Capacity	lat	lon	File
0	Aarberg	Hydro	nan	Switzerland	15.5	47.0378	7.272	nan
1	Aberthaw	Coal	Thermal	United Kingdom	1500	51.3873	-3.4049	nan
2	Abono	Coal	Thermal	Spain	921.7	43.5528	-5.7231	nan
3	Abwinden asten	Hydro	nan	Austria	168	48.248	14.4305	nan
4	Aceca	Oil	CHP	Spain	629	39.941	-3.8569	nan
5	Aceca fenosa	Natural Gas	CCGT	Spain	400	39.9427	-3.8548	nan

Apparently, entries 0, 3 and 5 of Dataset 1 relate to the same power plants as entries 0, 1 and 2 of Dataset 2. Applying the matching algorithm to the two datasets, we obtain the following set:

	Dataset 1	Dataset 2	Country	Fueltype	Classification	Capacity	lat	lon	File
0	Aarberg	Aarberg	Switzerland	Hydro	nan	15.5	47.0411	7.27389	nan
1	Aberthaw	Aberthaw	United Kingdom	Coal	Thermal	1552.5	51.3874	-3.40583	nan
2	Abono	Abono	Spain	Coal	Thermal	921.7	43.5558	-5.72299	nan

Note that the names from the different sources are kept for ease of referencing, whereas the claims about the other plant parameters have been reduced to an aggregate value using the rules described in Processed Data. The intermediary, unreduced dataset with all the claims is, of course, also available to provide a basis for your own reduction.
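The linking step for the two example datasets can be sketched as follows. DUKE's fuzzy, weighted comparison is replaced here by an exact match on name and country, which is a deliberate simplification, and `link_records` is a hypothetical helper rather than a function of the package.

```python
def link_records(left, right):
    """Return index pairs (i, j) whose name and country agree exactly."""
    index = {}
    for j, rec in enumerate(right):
        index.setdefault((rec["Name"].lower(), rec["Country"]), []).append(j)
    return [
        (i, j)
        for i, rec in enumerate(left)
        for j in index.get((rec["Name"].lower(), rec["Country"]), [])
    ]


# Names and countries as listed in Dataset 1 and Dataset 2.
dataset_1 = [
    {"Name": "Aarberg", "Country": "Switzerland"},
    {"Name": "Abbey mills pumping", "Country": "United Kingdom"},
    {"Name": "Abertay", "Country": "United Kingdom"},
    {"Name": "Aberthaw", "Country": "United Kingdom"},
    {"Name": "Ablass", "Country": "Germany"},
    {"Name": "Abono", "Country": "Spain"},
]
dataset_2 = [
    {"Name": "Aarberg", "Country": "Switzerland"},
    {"Name": "Aberthaw", "Country": "United Kingdom"},
    {"Name": "Abono", "Country": "Spain"},
    {"Name": "Abwinden asten", "Country": "Austria"},
    {"Name": "Aceca", "Country": "Spain"},
    {"Name": "Aceca fenosa", "Country": "Spain"},
]

links = link_records(dataset_1, dataset_2)
print(links)  # [(0, 0), (3, 1), (5, 2)]
```

Real data rarely agrees this cleanly on spelling, which is why the actual matching compares several columns with tolerant comparators instead of requiring identical names.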

Vertical Cleaning

In order to compare and combine information from multiple databases, uniform standards must be guaranteed. That is, the datasets should be based on the same set of arguments with consistent formats. The module cleaning.py makes data alignment easy: after renaming the basic columns of an unprocessed dataset, one simply has to apply several provided functions. Furthermore, you can aggregate the units of the same power plant into one entry.
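The two cleaning steps, renaming columns and aggregating units, can be sketched in plain Python. The raw column names, the helper names `standardize` and `aggregate_units`, and the example rows are all hypothetical. Summing unit capacities and averaging coordinates are assumptions about a sensible aggregation, not a description of the package's functions.

```python
from statistics import mean

# Hypothetical raw column names mapped to the standard column names.
COLUMN_MAP = {
    "plant_name": "Name",
    "fuel": "Fueltype",
    "country": "Country",
    "capacity_mw": "Capacity",
    "latitude": "lat",
    "longitude": "lon",
}


def standardize(row, column_map=COLUMN_MAP):
    """Rename known columns and drop everything else."""
    return {column_map[key]: value for key, value in row.items() if key in column_map}


def aggregate_units(rows):
    """Merge rows that share name, country and fuel type into one plant."""
    groups = {}
    for row in rows:
        key = (row["Name"], row["Country"], row["Fueltype"])
        groups.setdefault(key, []).append(row)

    plants = []
    for (name, country, fueltype), units in groups.items():
        plants.append({
            "Name": name,
            "Country": country,
            "Fueltype": fueltype,
            # Assumption: a plant's capacity is the sum of its units.
            "Capacity": sum(unit["Capacity"] for unit in units),
            "lat": mean(unit["lat"] for unit in units),
            "lon": mean(unit["lon"] for unit in units),
        })
    return plants


# Invented raw rows: two units of one plant plus one separate plant.
raw = [
    {"plant_name": "Example A", "fuel": "Hydro", "country": "Austria",
     "capacity_mw": 200.0, "latitude": 48.0, "longitude": 14.0, "unit": 1},
    {"plant_name": "Example A", "fuel": "Hydro", "country": "Austria",
     "capacity_mw": 300.0, "latitude": 48.0, "longitude": 14.0, "unit": 2},
    {"plant_name": "Example B", "fuel": "Wind", "country": "Germany",
     "capacity_mw": 18.0, "latitude": 51.0, "longitude": 13.0, "unit": 1},
]
plants = aggregate_units([standardize(row) for row in raw])
```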

Data-Sources:

OPSD - Open Power System Data publish their data under a free license
GEO - Global Energy Observatory, the data is not directly available on the website, but can be obtained from an sqlite scraper
WRI - World Resources Institute provide their data under a free license on their powerwatch repository
CARMA - Carbon Monitoring for Action
ESE - Energy Storage Exchange provide a database for storage units. Especially the hydro storage data is of great use for a combined power plant database. Since the data is not free, it is optional and can be downloaded separately.
ENTSOe - European Network of Transmission System Operators for Electricity, annually provides statistics about aggregated power plant capacities, which are available here. Their data can be used as a validation reference. We further use their annual energy generation report from 2010 as an input for the hydro power plant classification.

Planned changes

Figuring out how to distinguish and deal with capacities given as net and gross values.
Add additional information like build year or efficiencies where available

and most importantly

We would welcome it if a third-party institute with access to a commercial European power plant database like PLATTS were interested in collaborating on validating the dataset against it. Please get in touch if you are!

Acknowledgements

The development of powerplantmatching was helped considerably by in-depth discussions and exchanges of ideas and code with

Chris Davis from University of Groningen and
Johannes Friedrich, Roman Hennig and Colin McCormick of the World Resources Institute

Licence

powerplantmatching is released as free software under the GPLv3, see LICENSE.
