A toolset for cleaning, standardizing and combining multiple power plant databases. See the for a more extensive insight.
WARNING (2017/03/08): To prune inadvertently introduced non-public
(copyrighted) data from our repository, we were forced to
substantially rewrite the git commit history. This means that merges
with descendants of earlier commits will NOT apply cleanly and
have to be resolved manually. The usual 'pull' procedure will also
fail. To start working from the new history, save your
changes and then call either git reset origin/master
or even git reset --hard origin/master
(note that this removes the commits on
your current branch). We used the opportunity to move the
continuously updated database files to Git Large File Storage (git lfs), which you will
have to install.
This package simplifies the collection of power plant data. Information on power plants, particularly European ones, is scattered over several projects and databases, each of which introduces its own standards. We therefore first provide functions to vertically clean databases and convert them into one coherent standard, which does not distinguish between the units of a power plant. Second, we provide functions to horizontally merge different databases in order to check their consistency and improve their reliability.
powerplantmatching was initially developed by the Renewable Energy Group at FIAS to build power plant data inputs to PyPSA-based models for carrying out simulations for the CoNDyNet project, financed by the German Federal Ministry for Education and Research (BMBF) as part of the Stromnetze Research Initiative.
- clean and standardize power plant data sets
- merge power plant units to one power plant
- compare and combine different data sets
- create lookups and give statistical insight into the quality of the power plant data
- provide cleaned data from different sources
- provide an already merged data set of five different data-sources
- Make sure that git lfs is installed; if in doubt, just run
git lfs install
- Copy or clone the repository to your preferred directory
- Install the package via 'pip install -e /path/to/powerplantmatching'
Optional but recommended:
- Download the ESE dataset. To integrate the data into powerplantmatching, add the path of the downloaded file to the powerplantmatching/data/additional_data_config file under the keyword 'ese_path' (the default is 'Downloads/projects.xls').
- Add your ENTSOE security token to the powerplantmatching/data/catching_data_config file to easily update your ENTSOE database. The token can be obtained by referring to section 2 of the RESTful API documentation of the ENTSO-E Transparency Platform.
If you are only interested in the power plant data, we provide our current merged dataset as a csv-file. This set combines the data of all the data sources listed in Data-Sources and provides the following information:
- Power plant name - claim of each database
- Fueltype - {Bioenergy, Geothermal, Hard Coal, Hydro, Lignite, Nuclear, Natural Gas, Oil, Solar, Wind, Other}
- Classification - {CCGT, OCGT, Steam Turbine, Combustion Engine, Run-Of-River, Pumped Storage, Reservoir}
- Set - {Power Plant (PP), Combined Heat and Power (CHP)}
- Capacity - [MW]
- Geo-position - Latitude, Longitude
- Country - EU-27 + CH + NO (+ UK) minus Cyprus and Malta
- YearCommissioned - Commissioning year of the power plant
- File - Source file of the data entry
- projectID - Identifier of the power plant in the source file
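As a small illustration of how these columns can be used, the following sketch sums the capacity per fuel type with Python's standard library. The column names `Name`, `Fueltype`, `Country` and `Capacity` and the inline sample rows (borrowed from the matching example further down) are assumptions made for this sketch; the header of the actual csv-file may differ.

```python
import csv
import io
from collections import defaultdict

# Inline stand-in for the merged csv-file; the column names are assumed.
SAMPLE_CSV = """Name,Fueltype,Country,Capacity
Aarberg,Hydro,Switzerland,15.5
Aberthaw,Coal,United Kingdom,1552.5
Abono,Coal,Spain,921.7
"""


def capacity_per_fueltype(csv_file):
    """Return the summed capacity in MW for each fuel type."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals[row["Fueltype"]] += float(row["Capacity"])
    return dict(totals)


totals = capacity_per_fueltype(io.StringIO(SAMPLE_CSV))
for fueltype, capacity in sorted(totals.items()):
    print(f"{fueltype}: {capacity:.1f} MW")
```

To run this against a real file, pass an open file handle instead of the `io.StringIO` object and adjust the column names if necessary.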
The following picture compares the total capacities per fuel type between the different data sources and our merged dataset.
The merged dataset is available in two versions: The bigger dataset links the entries of the matched power plant and lists all the related claims by the different data-sources. The smaller merged dataset reduces the former by applying a set of aggregation rules (shown below) for deciding the power plant parameters.
Argument | Rule |
---|---|
Name | Every name of the different databases |
Fueltype | Most frequently claimed one |
Classification | All distinct classifications in a row |
Country | Take the uniquely stated country |
Capacity | Mean |
lat | Mean |
lon | Mean |
File | All files in a row |
projectID | Python dictionary referencing all original powerplants that are included |
Note that the claims for the country cannot differ, otherwise the power plants cannot match.
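The rules in the table can be sketched in a few lines of plain Python. This is only an illustration, not the package's own reduction code: the function name `reduce_claims` is hypothetical, the `File` and `projectID` columns are left out for brevity, and the two claims are the Abono rows from the matching example further down.

```python
from collections import Counter
from statistics import mean

# Two claims about the same plant (Abono rows of the matching example).
claims = [
    {"Name": "Abono", "Fueltype": "Coal", "Classification": None,
     "Country": "Spain", "Capacity": 921.7, "lat": 43.5588, "lon": -5.72287},
    {"Name": "Abono", "Fueltype": "Coal", "Classification": "Thermal",
     "Country": "Spain", "Capacity": 921.7, "lat": 43.5528, "lon": -5.7231},
]


def reduce_claims(claims):
    """Hypothetical helper: reduce all claims about one plant to one record."""
    countries = {c["Country"] for c in claims}
    if len(countries) != 1:
        # Differing country claims cannot belong to a matched plant.
        raise ValueError("claims with differing countries cannot be matched")
    classifications = sorted(
        {c["Classification"] for c in claims if c["Classification"]}
    )
    return {
        "Name": [c["Name"] for c in claims],          # every name is kept
        "Fueltype": Counter(c["Fueltype"] for c in claims).most_common(1)[0][0],
        "Classification": ", ".join(classifications),  # all distinct values
        "Country": countries.pop(),
        "Capacity": mean(c["Capacity"] for c in claims),
        "lat": mean(c["lat"] for c in claims),
        "lon": mean(c["lon"] for c in claims),
    }


plant = reduce_claims(claims)
print(plant)
```

Raising an error on differing countries mirrors the note above; a real pipeline would never produce such a group in the first place.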
The merged dataset is also available as a further version that uses heuristics to fill the gaps.
- Unmatched power plants from the OPSD data source are added so that the aggregated capacities per country and fueltype correspond closely to the ENTSOe statistics (except for Wind and Solar).
- A learning algorithm fills in missing information about the hydro classification (Run-of-River, Pumped Storage and Reservoir).
- Additionally, a function that can be activated with a switch scales the hydro power plant capacities in order to fulfill all country totals.
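The idea behind the capacity scaling can be sketched as follows. The sketch assumes a simple proportional scaling per country, which may differ from what the package actually does; the function name `rescale_capacities`, the plant names and all numbers are hypothetical.

```python
# Hypothetical hydro plants (capacity in MW) and a hypothetical country total.
plants = [
    {"Name": "Plant A", "Country": "Austria", "Capacity": 100.0},
    {"Name": "Plant B", "Country": "Austria", "Capacity": 300.0},
]
country_totals = {"Austria": 500.0}


def rescale_capacities(plants, country_totals):
    """Scale capacities so that each country sum equals its given total."""
    # Sum the current capacity per country.
    current = {}
    for plant in plants:
        country = plant["Country"]
        current[country] = current.get(country, 0.0) + plant["Capacity"]
    # Apply one common factor per country (assumed proportional scaling).
    scaled = []
    for plant in plants:
        factor = country_totals[plant["Country"]] / current[plant["Country"]]
        scaled.append({**plant, "Capacity": plant["Capacity"] * factor})
    return scaled


scaled = rescale_capacities(plants, country_totals)
print([p["Capacity"] for p in scaled])
```

Proportional scaling preserves the relative sizes of the plants within a country while matching the total.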
The database is available using the python command
import powerplantmatching as pm
pm.collection.MATCHED_dataset()
or
import powerplantmatching as pm
pm.collection.MATCHED_dataset(rescaled_hydros=True)
if you want to scale hydro power plants.
The package consists of ten modules. For creating a new dataset, the most useful modules are data, clean and match, which provide functions for data supply, vertical cleaning and horizontal matching, respectively.
Whereas single databases such as CARMA or GEO provide non-standardized and incomplete information, the datasets can complement each other and improve their reliability. The merged dataset combines five different databases (see below) by keeping only power plants which appear in more than one source.
The matching process relies heavily on DUKE, a Java application specialized in deduplicating and linking data. It provides many built-in comparators, such as numerical, string or geoposition comparators. The engine does a detailed comparison for each single argument (power plant name, fuel type etc.) using adjusted comparators and weights. From the individual scores for each column it computes a compound score for the likelihood that the two power plant records refer to the same power plant. If the score exceeds a given threshold, the two records are linked and merged into one data set.
Let's make that a bit more concrete with a quick example. Consider the following two data sets:
| | Name | Fueltype | Classification | Country | Capacity | lat | lon | File |
---|---|---|---|---|---|---|---|---|
0 | Aarberg | Hydro | nan | Switzerland | 14.609 | 47.0444 | 7.27578 | nan |
1 | Abbey mills pumping | Oil | nan | United Kingdom | 6.4 | 51.687 | -0.0042057 | nan |
2 | Abertay | Other | nan | United Kingdom | 8 | 57.1785 | -2.18679 | nan |
3 | Aberthaw | Coal | nan | United Kingdom | 1552.5 | 51.3875 | -3.40675 | nan |
4 | Ablass | Wind | nan | Germany | 18 | 51.2333 | 12.95 | nan |
5 | Abono | Coal | nan | Spain | 921.7 | 43.5588 | -5.72287 | nan |
and
| | Name | Fueltype | Classification | Country | Capacity | lat | lon | File |
---|---|---|---|---|---|---|---|---|
0 | Aarberg | Hydro | nan | Switzerland | 15.5 | 47.0378 | 7.272 | nan |
1 | Aberthaw | Coal | Thermal | United Kingdom | 1500 | 51.3873 | -3.4049 | nan |
2 | Abono | Coal | Thermal | Spain | 921.7 | 43.5528 | -5.7231 | nan |
3 | Abwinden asten | Hydro | nan | Austria | 168 | 48.248 | 14.4305 | nan |
4 | Aceca | Oil | CHP | Spain | 629 | 39.941 | -3.8569 | nan |
5 | Aceca fenosa | Natural Gas | CCGT | Spain | 400 | 39.9427 | -3.8548 | nan |
Evidently, the entries 0, 3 and 5 of data set 1 relate to the same power plants as the entries 0, 1 and 2 of data set 2. Applying the matching algorithm to the two data sets, we obtain the following set:
| | Dataset 1 | Dataset 2 | Country | Fueltype | Classification | Capacity | lat | lon | File |
---|---|---|---|---|---|---|---|---|---|
0 | Aarberg | Aarberg | Switzerland | Hydro | nan | 15.5 | 47.0411 | 7.27389 | nan |
1 | Aberthaw | Aberthaw | United Kingdom | Coal | Thermal | 1552.5 | 51.3874 | -3.40583 | nan |
2 | Abono | Abono | Spain | Coal | Thermal | 921.7 | 43.5558 | -5.72299 | nan |
Note that the names from the different sources are kept for ease of referencing, whereas the claims about the other plant parameters have been reduced to an aggregate value using the rules described in Processed data. The intermediary, unreduced dataset with all the claims is, of course, also available to provide a basis for your own reduction.
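The linking step of this example can be imitated in plain Python. The sketch below is a simplified stand-in and not DUKE itself: it combines per-column similarities with a weighted average, whereas DUKE has its own comparators and its own way of combining scores. The weights, the threshold and the 10 km distance cut-off are hypothetical values chosen only for this illustration.

```python
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt

# Records from the two example tables:
# (Name, Fueltype, Country, Capacity [MW], lat, lon)
DATASET_1 = [
    ("Aarberg", "Hydro", "Switzerland", 14.609, 47.0444, 7.27578),
    ("Abbey mills pumping", "Oil", "United Kingdom", 6.4, 51.687, -0.0042057),
    ("Abertay", "Other", "United Kingdom", 8.0, 57.1785, -2.18679),
    ("Aberthaw", "Coal", "United Kingdom", 1552.5, 51.3875, -3.40675),
    ("Ablass", "Wind", "Germany", 18.0, 51.2333, 12.95),
    ("Abono", "Coal", "Spain", 921.7, 43.5588, -5.72287),
]
DATASET_2 = [
    ("Aarberg", "Hydro", "Switzerland", 15.5, 47.0378, 7.272),
    ("Aberthaw", "Coal", "United Kingdom", 1500.0, 51.3873, -3.4049),
    ("Abono", "Coal", "Spain", 921.7, 43.5528, -5.7231),
    ("Abwinden asten", "Hydro", "Austria", 168.0, 48.248, 14.4305),
    ("Aceca", "Oil", "Spain", 629.0, 39.941, -3.8569),
    ("Aceca fenosa", "Natural Gas", "Spain", 400.0, 39.9427, -3.8548),
]

# Hypothetical weights and limits, chosen only for this illustration.
WEIGHTS = {"name": 0.4, "fueltype": 0.2, "capacity": 0.2, "position": 0.2}
THRESHOLD = 0.8
MAX_DISTANCE_KM = 10.0


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dlam = radians(lon2 - lon1)
    a = sin((phi2 - phi1) / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def compound_score(a, b):
    """Weighted average of the per-column similarities of two records."""
    if a[2] != b[2]:
        return 0.0  # records from different countries never match
    distance = distance_km(a[4], a[5], b[4], b[5])
    scores = {
        "name": SequenceMatcher(None, a[0].lower(), b[0].lower()).ratio(),
        "fueltype": 1.0 if a[1] == b[1] else 0.0,
        "capacity": min(a[3], b[3]) / max(a[3], b[3]),
        "position": max(0.0, 1.0 - distance / MAX_DISTANCE_KM),
    }
    return sum(WEIGHTS[key] * scores[key] for key in WEIGHTS)


# Link every pair of records whose compound score exceeds the threshold.
links = [
    (i, j)
    for i, a in enumerate(DATASET_1)
    for j, b in enumerate(DATASET_2)
    if compound_score(a, b) > THRESHOLD
]
print(links)
```

With these assumed settings the sketch links entries 0, 3 and 5 of the first set to entries 0, 1 and 2 of the second, in line with the table above. Similar names alone are not enough: Abertay and Aberthaw differ in fuel type, capacity and position and therefore stay well below the threshold.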
In order to compare and combine information from multiple databases, uniform standards must be guaranteed. That is, the datasets should be based on the same set of arguments with consistent formats. The module cleaning.py handles this data alignment: after renaming the basic columns of an unprocessed dataset, one simply has to apply several of the provided functions. Furthermore, you can aggregate the units of the same power plant into one entry.
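To illustrate the idea of unit aggregation, here is a minimal sketch in plain Python. It is not the code of the cleaning module: the helper `plant_name`, the naming pattern it strips and the sample records are hypothetical, and summing the unit capacities is an assumption about how units are combined.

```python
import re
from collections import defaultdict

# Hypothetical unit-level records: (Name, Capacity [MW])
units = [
    ("Example Plant Unit 1", 300.0),
    ("Example Plant 2", 300.0),
    ("Other Plant", 150.0),
]


def plant_name(unit_name):
    """Strip a trailing unit label such as ' Unit 1', ' Block 2' or ' 3'."""
    return re.sub(r"\s+(?:unit|block)?\s*\d+$", "", unit_name, flags=re.IGNORECASE)


def aggregate_units(units):
    """Sum the capacities of all units that share a cleaned plant name."""
    capacities = defaultdict(float)
    for name, capacity in units:
        capacities[plant_name(name)] += capacity
    return dict(capacities)


plants = aggregate_units(units)
print(plants)
```

Real data sets label their units in many more ways, so a production version would need a broader set of patterns and a check that merged units are also geographically close.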
- OPSD - Open Power System Data publish their data under a free license
- GEO - Global Energy Observatory, the data is not directly available on the website, but can be obtained from an sqlite scraper
- WRI - World Resources Institute provide their data under a free license on their powerwatch repository
- CARMA - Carbon Monitoring for Action
- ESE - Energy Storage Exchange provide a database for storage units. The hydro storage data is especially useful for a combined power plant database. Since the data is not free, it is optional and can be downloaded separately.
- ENTSOe - European Network of Transmission System Operators for Electricity, annually provides statistics about aggregated power plant capacities. Their data can be used as a validation reference. We further use their annual energy generation report from 2010 as an input for the hydro power plant classification.
- Figure out how to distinguish and deal with capacities given as net and gross values
- Add additional information like build year or efficiencies where available
and most importantly
We would welcome a collaboration with a third-party institute that has access to a commercial European power plant database like PLATTS, in order to validate the dataset against it. Please get in touch if you are interested!
The development of powerplantmatching was helped considerably by in-depth discussions and exchanges of ideas and code with
- Chris Davis from University of Groningen and
- Johannes Friedrich, Roman Hennig and Colin McCormick of the World Resources Institute
powerplantmatching is released as free software under the GPLv3, see LICENSE.