powerplantmatching

A toolset for cleaning, standardizing and combining multiple power plant databases. See the Documentation for more detail.

WARNING (2017/03/08): To remove non-public (copyrighted) data that had been inadvertently added to our repository, we were forced to substantially rewrite the git commit history. This means that merges with descendants of earlier commits will NOT apply cleanly and have to be processed manually. The usual 'pull' procedure will also fail on top of the rewritten history: you have to save your changes and then call either git reset origin/master or even git reset --hard origin/master (note that this removes the commits on your current branch). We used the opportunity to move the continuously updated database files to git large files (git lfs), which you will have to install.

This package simplifies the collection of power plant data. Information on power plants, particularly European ones, is scattered over several projects and databases, each of which introduces its own standards. We therefore first provide functions to vertically clean databases and convert them into one coherent standard, which does not distinguish between the units of a power plant. Second, we provide functions to horizontally merge different databases in order to check their consistency and improve their reliability.

Map of power plants in Europe

powerplantmatching was initially developed by the Renewable Energy Group at FIAS to build power plant data inputs to PyPSA-based models for carrying out simulations for the CoNDyNet project, financed by the German Federal Ministry for Education and Research (BMBF) as part of the Stromnetze Research Initiative.

What it can do

  • clean and standardize power plant data sets
  • merge power plant units to one power plant
  • compare and combine different data sets
  • create lookups and give statistical insight into the quality of power plant data
  • provide cleaned data from different sources
  • provide an already merged data set of five different data-sources

Installation

  1. Make sure that git lfs is installed; if in doubt, just run git lfs install
  2. Copy or clone the repository to your preferred directory
  3. Install the package via 'pip install -e /path/to/powerplantmatching'

Optional but recommended:

  1. Download the ESE dataset. To integrate the data into powerplantmatching, add the path of the downloaded file to the powerplantmatching/data/additional_data_config file under the keyword 'ese_path' (the default is 'Downloads/projects.xls').
  2. Add your ENTSO-E security token to the powerplantmatching/data/catching_data_config file to make updating your ENTSO-E database easy. The token can be obtained by following section 2 of the RESTful API documentation of the ENTSO-E Transparency Platform.

Processed Data

If you are only interested in the power plant data, we provide our current merged dataset as a csv-file. This set combines the data of all the data sources listed in Data-Sources and provides the following information:

  • Power plant name - claim of each database
  • Fueltype - {Bioenergy, Geothermal, Hard Coal, Hydro, Lignite, Nuclear, Natural Gas, Oil, Solar, Wind, Other}
  • Classification - {CCGT, OCGT, Steam Turbine, Combustion Engine, Run-Of-River, Pumped Storage, Reservoir}
  • Set - {Power Plant (PP), Combined Heat and Power (CHP)}
  • Capacity - [MW]
  • Geo-position - Latitude, Longitude
  • Country - EU-27 + CH + NO (+ UK) minus Cyprus and Malta
  • YearCommissioned - Commissioning year of the power plant
  • File - Source file of the data entry
  • projectID - Identifier of the power plant in the source file

The following picture compares the total capacities per fuel type between the different data sources and our merged dataset.

Total capacities per fuel type for the different data sources and the merged dataset.

The merged dataset is available in two versions. The bigger dataset links the entries of the matched power plants and lists all the related claims made by the different data sources. The smaller merged dataset reduces the former by applying a set of aggregation rules (shown below) to decide the power plant parameters.

Argument        Rule
Name            Every name of the different databases
Fueltype        Most frequent claimed one
Classification  All different Classification in a row
Country         Take the uniquely stated country
Capacity        Mean
lat             Mean
lon             Mean
File            All files in a row
projectID       Python dictionary referencing all original powerplants that are included

Note that the claims for the country cannot differ, otherwise the power plants cannot match.
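The aggregation rules above can be sketched in plain Python. This is only an illustration of the rules as stated in the table, not the package's own implementation; the function name reduce_claims, the record layout and all values in the example are assumptions made up for this sketch.

```python
from collections import Counter
from statistics import mean


def reduce_claims(claims):
    """Reduce the claims of several sources about one plant to one record."""
    countries = {claim["Country"] for claim in claims}
    if len(countries) != 1:
        # Records with differing countries should never have been matched.
        raise ValueError("claims disagree on Country")

    fueltype_counts = Counter(claim["Fueltype"] for claim in claims)

    # Collect every distinct classification, keeping first-seen order.
    classifications = []
    for claim in claims:
        value = claim["Classification"]
        if value is not None and value not in classifications:
            classifications.append(value)

    return {
        "Name": [claim["Name"] for claim in claims],
        "Fueltype": fueltype_counts.most_common(1)[0][0],
        "Classification": ", ".join(classifications),
        "Country": countries.pop(),
        "Capacity": mean(claim["Capacity"] for claim in claims),
        "lat": mean(claim["lat"] for claim in claims),
        "lon": mean(claim["lon"] for claim in claims),
        "File": [claim["File"] for claim in claims],
        "projectID": {claim["File"]: claim["projectID"] for claim in claims},
    }


# Invented claims about a single plant, for illustration only.
claims = [
    {"Name": "Example plant", "Fueltype": "Hydro", "Classification": "Reservoir",
     "Country": "Austria", "Capacity": 14.0, "lat": 47.0, "lon": 13.0,
     "File": "source_a", "projectID": "A-1"},
    {"Name": "Example", "Fueltype": "Hydro", "Classification": None,
     "Country": "Austria", "Capacity": 15.0, "lat": 47.5, "lon": 13.5,
     "File": "source_b", "projectID": "B-7"},
    {"Name": "Example power plant", "Fueltype": "Other", "Classification": "Reservoir",
     "Country": "Austria", "Capacity": 16.0, "lat": 48.0, "lon": 14.0,
     "File": "source_c", "projectID": "C-3"},
]
reduced = reduce_claims(claims)
print(reduced)
```

Raising an error on differing countries mirrors the note above: such records cannot belong to the same match in the first place.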

Power plant coverage

The merged dataset is also available as a further version that uses heuristics to fill the gaps.

  • Unmatched power plants from the OPSD data source are added so that the aggregated capacities per country and fueltype correspond closely to the ENTSO-E statistics (except for Wind and Solar).

  • A learning algorithm fills the information about missing hydro classification (Run-of-River, Pumped Storage and Reservoir)

  • Additionally, an optional function, activated with a switch, scales the hydro power plant capacities so that they meet all country totals.
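One way to picture the hydro scaling is as a per-country factor applied to every hydro plant. The sketch below is an assumption about how such a rescaling could work, not the code used by the package; scale_hydro, the record layout and the target totals are hypothetical, and the numbers are invented.

```python
def scale_hydro(plants, country_totals):
    """Scale hydro capacities so each country's sum meets its target total.

    plants: list of dicts with "Country", "Fueltype" and "Capacity" (MW).
    country_totals: dict mapping country -> target hydro capacity (MW).
    Returns a new list; the input records are left unchanged.
    """
    # Current hydro capacity per country.
    current = {}
    for plant in plants:
        if plant["Fueltype"] == "Hydro":
            country = plant["Country"]
            current[country] = current.get(country, 0.0) + plant["Capacity"]

    scaled = []
    for plant in plants:
        record = dict(plant)
        target = country_totals.get(plant["Country"])
        total = current.get(plant["Country"], 0.0)
        if plant["Fueltype"] == "Hydro" and target is not None and total > 0:
            record["Capacity"] = plant["Capacity"] * target / total
        scaled.append(record)
    return scaled


# Invented plants and an invented country total, for illustration only.
plants = [
    {"Country": "Austria", "Fueltype": "Hydro", "Capacity": 100.0},
    {"Country": "Austria", "Fueltype": "Hydro", "Capacity": 300.0},
    {"Country": "Austria", "Fueltype": "Natural Gas", "Capacity": 400.0},
]
scaled = scale_hydro(plants, {"Austria": 500.0})
print([p["Capacity"] for p in scaled])
```

A uniform factor keeps the relative sizes of the plants within a country and leaves non-hydro plants untouched.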

The database is available via the Python command

import powerplantmatching as pm
pm.collection.MATCHED_dataset() 

or

import powerplantmatching as pm
pm.collection.MATCHED_dataset(rescaled_hydros=True)

if you want to scale hydro power plants.

Module Structure

The package consists of ten modules. For creating a new dataset, the most useful modules are data, clean and match, which provide functions for data supply, vertical cleaning and horizontal matching, respectively.

Modular package structure

Combining Data From Different Sources - Horizontal Matching

Whereas single databases such as CARMA or GEO provide non-standardized and incomplete information, the datasets can complement each other and thereby improve reliability. The merged dataset combines five different databases (see below) and keeps only power plants that appear in more than one source.

The matching process relies heavily on DUKE, a Java application specialized in deduplicating and linking data. It provides many built-in comparators, such as numerical, string or geoposition comparators. The engine performs a detailed comparison of each individual argument (power plant name, fuel type, etc.) using adjusted comparators and weights. From the individual scores for each column it computes a compound score for the likelihood that the two power plant records refer to the same power plant. If the score exceeds a given threshold, the two records are linked and merged into one data set.
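The idea of a compound score can be illustrated with a simplified Python sketch. It is not DUKE's implementation: each comparison yields a probability, the probabilities are combined with a naive Bayes update, and the pair is linked if the result exceeds a threshold. The comparators, the probability ranges and the threshold of 0.8 are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.8  # assumed value, for illustration only


def name_similarity(a, b):
    """String similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numeric_similarity(a, b):
    """Ratio of the smaller to the larger value, in [0, 1]."""
    largest = max(a, b)
    return min(a, b) / largest if largest > 0 else 0.0


def to_probability(similarity, low, high):
    """Map a similarity in [0, 1] linearly onto the range [low, high]."""
    return low + similarity * (high - low)


def combine(p1, p2):
    """Naive Bayes combination of two probabilities assumed independent."""
    return p1 * p2 / (p1 * p2 + (1 - p1) * (1 - p2))


def match_score(rec_a, rec_b):
    """Compound score that two records describe the same power plant."""
    score = 0.5  # neutral starting point
    name_sim = name_similarity(rec_a["Name"], rec_b["Name"])
    # The name is given a wider range, so it weighs more than the capacity.
    score = combine(score, to_probability(name_sim, 0.1, 0.9))
    cap_sim = numeric_similarity(rec_a["Capacity"], rec_b["Capacity"])
    score = combine(score, to_probability(cap_sim, 0.3, 0.7))
    return score


aarberg_1 = {"Name": "Aarberg", "Capacity": 14.609}
aarberg_2 = {"Name": "Aarberg", "Capacity": 15.5}
abwinden = {"Name": "Abwinden asten", "Capacity": 168.0}

same = match_score(aarberg_1, aarberg_2)
different = match_score(aarberg_1, abwinden)
print(same > THRESHOLD, different > THRESHOLD)
```

The width of each probability range plays the role of a weight: a property with a range close to 0.5 can shift the compound score only slightly.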

Let's make that a bit more concrete with a quick example. Consider the following two data sets:

Dataset 1:

   Name                 Fueltype  Classification  Country         Capacity  lat      lon         File
0  Aarberg              Hydro     nan             Switzerland     14.609    47.0444  7.27578     nan
1  Abbey mills pumping  Oil       nan             United Kingdom  6.4       51.687   -0.0042057  nan
2  Abertay              Other     nan             United Kingdom  8         57.1785  -2.18679    nan
3  Aberthaw             Coal      nan             United Kingdom  1552.5    51.3875  -3.40675    nan
4  Ablass               Wind      nan             Germany         18        51.2333  12.95       nan
5  Abono                Coal      nan             Spain           921.7     43.5588  -5.72287    nan

and

Dataset 2:

   Name            Fueltype     Classification  Country         Capacity  lat      lon      File
0  Aarberg         Hydro        nan             Switzerland     15.5      47.0378  7.272    nan
1  Aberthaw        Coal         Thermal         United Kingdom  1500      51.3873  -3.4049  nan
2  Abono           Coal         Thermal         Spain           921.7     43.5528  -5.7231  nan
3  Abwinden asten  Hydro        nan             Austria         168       48.248   14.4305  nan
4  Aceca           Oil          CHP             Spain           629       39.941   -3.8569  nan
5  Aceca fenosa    Natural Gas  CCGT            Spain           400       39.9427  -3.8548  nan

Evidently, entries 0, 3 and 5 of Dataset 1 refer to the same power plants as entries 0, 1 and 2 of Dataset 2. Applying the matching algorithm to the two data sets, we obtain the following set:

   Dataset 1  Dataset 2  Country         Fueltype  Classification  Capacity  lat      lon       File
0  Aarberg    Aarberg    Switzerland     Hydro     nan             15.5      47.0411  7.27389   nan
1  Aberthaw   Aberthaw   United Kingdom  Coal      Thermal         1552.5    51.3874  -3.40583  nan
2  Abono      Abono      Spain           Coal      Thermal         921.7     43.5558  -5.72299  nan

Note that the names from the different sources are kept for ease of referencing, whereas the claims about the other plant parameters have been reduced to an aggregate value using the rules described in Processed Data. The intermediate, unreduced dataset with all the claims is, of course, also available as a basis for your own reduction.
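The linking step of this example can be retraced with a few lines of Python. The sketch uses exact equality of name and country as a stand-in for the fuzzy matching, which is a strong simplification and an assumption of this sketch, and it only reduces the coordinates (by their mean). The input values are the ones from the two tables above.

```python
# (Name, Country, lat, lon) taken from Dataset 1 and Dataset 2 above.
dataset_1 = [
    ("Aarberg", "Switzerland", 47.0444, 7.27578),
    ("Abbey mills pumping", "United Kingdom", 51.687, -0.0042057),
    ("Abertay", "United Kingdom", 57.1785, -2.18679),
    ("Aberthaw", "United Kingdom", 51.3875, -3.40675),
    ("Ablass", "Germany", 51.2333, 12.95),
    ("Abono", "Spain", 43.5588, -5.72287),
]
dataset_2 = [
    ("Aarberg", "Switzerland", 47.0378, 7.272),
    ("Aberthaw", "United Kingdom", 51.3873, -3.4049),
    ("Abono", "Spain", 43.5528, -5.7231),
    ("Abwinden asten", "Austria", 48.248, 14.4305),
    ("Aceca", "Spain", 39.941, -3.8569),
    ("Aceca fenosa", "Spain", 39.9427, -3.8548),
]

# Index Dataset 2 by (name, country). Exact equality replaces the
# fuzzy comparison here, purely to keep the sketch short.
lookup = {(name, country): (lat, lon) for name, country, lat, lon in dataset_2}

merged = []
for name, country, lat, lon in dataset_1:
    partner = lookup.get((name, country))
    if partner is None:
        continue  # plants found in only one source are dropped
    merged.append({
        "Name": name,
        "Country": country,
        "lat": (lat + partner[0]) / 2,
        "lon": (lon + partner[1]) / 2,
    })

for row in merged:
    print(row["Name"], row["Country"], round(row["lat"], 4), round(row["lon"], 5))
```

Only the three plants present in both sources survive, and their averaged coordinates agree with the lat and lon columns of the merged table.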

Vertical Cleaning

In order to compare and combine information from multiple databases, uniform standards must be guaranteed. That is, the datasets should be based on the same set of arguments with consistent formats. The module cleaning.py handles this data alignment: after renaming the basic columns of an unprocessed dataset, one simply has to apply several of the provided functions. Furthermore, you can aggregate the units of a power plant into a single entry.
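The following sketch shows what such a vertical cleaning could look like in plain Python. It does not use the functions of cleaning.py; the raw column names, the fuel type mapping, the rule for stripping unit numbers and the sample rows are all hypothetical and only illustrate the three steps of renaming, standardizing and aggregating units.

```python
import re

# Hypothetical mapping from raw column names to the common standard.
COLUMN_MAP = {"plant": "Name", "fuel": "Fueltype", "mw": "Capacity", "country": "Country"}
# Hypothetical mapping from raw fuel labels to standard fuel types.
FUEL_MAP = {"gas": "Natural Gas", "water": "Hydro", "coal": "Hard Coal"}


def clean(raw_rows):
    """Rename columns and bring names, fuel types and capacities into one format."""
    cleaned = []
    for row in raw_rows:
        record = {COLUMN_MAP[key]: value for key, value in row.items() if key in COLUMN_MAP}
        # Drop a trailing unit number such as " 1" or " unit 2".
        name = re.sub(r"\s+(unit\s*)?\d+$", "", record["Name"].strip(), flags=re.IGNORECASE)
        record["Name"] = name.capitalize()
        record["Fueltype"] = FUEL_MAP.get(record["Fueltype"].strip().lower(), "Other")
        record["Capacity"] = float(record["Capacity"])
        cleaned.append(record)
    return cleaned


def aggregate_units(records):
    """Sum the capacities of units that belong to the same power plant."""
    plants = {}
    for record in records:
        key = (record["Name"], record["Country"], record["Fueltype"])
        if key in plants:
            plants[key]["Capacity"] += record["Capacity"]
        else:
            plants[key] = dict(record)
    return list(plants.values())


# Invented raw rows, for illustration only.
raw = [
    {"plant": "aberthaw 1", "fuel": "Coal", "mw": "500", "country": "United Kingdom"},
    {"plant": "ABERTHAW unit 2", "fuel": "coal ", "mw": "600", "country": "United Kingdom"},
    {"plant": "Ablass", "fuel": "wind", "mw": "18", "country": "Germany"},
]
plants = aggregate_units(clean(raw))
print(plants)
```

Unknown fuel labels fall back to "Other" in this sketch, so every record ends up with a value from a fixed list.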

Data-Sources:

Planned changes

  • Figuring out how to distinguish and deal with capacities given as net and gross values.
  • Add additional information like build year or efficiencies where available

and most importantly

We would welcome it if a third-party institute with access to a commercial European power plant database such as PLATTS were interested in collaborating on validating the dataset against it. Please get in touch if you are!

Acknowledgements

The development of powerplantmatching was helped considerably by in-depth discussions and exchanges of ideas and code with

  • Chris Davis from University of Groningen and
  • Johannes Friedrich, Roman Hennig and Colin McCormick of the World Resources Institute

Licence

powerplantmatching is released as free software under the GPLv3, see LICENSE.
