
PostgreSQL PGCon 2020 Full Videos

Author

digoal

Date

2021-07-02

Tags

PostgreSQL , pgcon , 2020


Background

https://av.tib.eu/media/52153

https://av.tib.eu/series/1052/pgcon+2020

Advanced Data Modeling Techniques in PostgreSQL

2:09:41 8 Travers, Christopher
This tutorial focuses on how to use advanced features in PostgreSQL to simplify/improve data modeling. Functional indexes, table methods, and custom types are all covered, as is table inheritance as a data modeling (instead of data partitioning) tool.
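As a taste of the techniques the abstract lists, here is a minimal sketch (hypothetical schema, not the tutorial's own material) of a functional index, a "table method" via attribute notation, and inheritance used for modeling:

```sql
-- Functional (expression) index: indexes lower(email), not the raw column.
CREATE TABLE users (id int PRIMARY KEY, email text);
CREATE INDEX users_email_lower_idx ON users ((lower(email)));

-- "Table method": a function over the row type, callable with attribute notation.
CREATE FUNCTION email_domain(u users) RETURNS text
    AS $$ SELECT split_part(u.email, '@', 2) $$ LANGUAGE sql;
SELECT u.email_domain FROM users u;

-- Inheritance as a modeling (not partitioning) tool: a vehicle "is an" asset.
CREATE TABLE assets (id int, label text);
CREATE TABLE vehicles (vin text) INHERITS (assets);
SELECT * FROM assets;  -- also sees rows inserted into vehicles
```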

Asynchronous IO for PostgreSQL

47:38 5 Freund, Andres
For many workloads PostgreSQL currently cannot take full advantage of modern storage systems such as good SSDs. But even for plain old spinning disks, we can achieve higher throughput. One of the major reasons is that the majority of storage IO postgres performs is done synchronously (see e.g. slide 8ff in https://anarazel.de/talks/2019-10-16-pgconf-milan-io/io.pdf for an illustration of why that is a problem). This talk will discuss the outcome of a prototype adding asynchronous IO support to PostgreSQL:

* What would async IO support for PG look like architecturally?
* Performance numbers
* What sub-problems exist that can be integrated separately
* Currently the prototype uses Linux's new io_uring asynchronous IO support
* Which other OSs can be supported? Do we need to provide emulation of OS-level async IO?

Note that support for asynchronous IO is not the same as support for direct IO. This talk will focus mainly on asynchronicity, and on direct IO only secondarily.

Avoiding, Detecting, and Recovering From Data Corruption

54:18 8 Haas, Robert
PostgreSQL databases can become corrupted for a variety of reasons, including hardware failure, software failure, and user error. In this talk, I'll talk about some of my experiences with database corruption. In particular, I'll mention some of the things which seem to be common causes of database corruption, such as procedural errors in taking or restoring backups; some of the ways that database corruption most often manifests when it does occur, such as errors indicating inconsistencies between a table and its indexes or a table and its TOAST table; and a little bit about techniques that I have seen used to repair databases or recover from corruption, including some experiences with pg_resetxlog. This talk will be based mostly on my experiences working with EnterpriseDB customers; I hope that it will be useful to hackers from the point of view of thinking about possible improvements to PostgreSQL, and to end users from the point of view of helping them avoid, diagnose, and cope with corruption.
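On the detection side, one stock tool (not necessarily the one the talk uses) is the amcheck contrib module, which verifies B-tree index invariants; a minimal sketch with a hypothetical index name:

```sql
CREATE EXTENSION amcheck;

-- Lightweight check of a B-tree's internal invariants (AccessShareLock only):
SELECT bt_index_check('some_index'::regclass);

-- Stronger check that also verifies parent/child relationships,
-- at the cost of heavier locking:
SELECT bt_index_parent_check('some_index'::regclass);
```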

Building automatic adviser & performance tuning tools

27:49 4 Rouhaud, Julien et al.
PostgreSQL is a mature and robust RDBMS with 30 years of history. Over the years, its query optimizer has been enhanced and usually produces good query plans. However, can it always come up with good query plans? The optimization process has to use some assumptions to produce plans fast enough. Some of those assumptions are relatively easy to check (e.g. statistics are up to date), some harder (e.g. the correct indexes are created), and some nearly impossible (e.g. making sure that the statistic samples are representative enough even for skewed data distributions). Given those various caveats, DBAs can't always easily realize that they are missing a chance to get a meaningful performance improvement. To help DBAs get a truly good query plan, we'll present some tools that can help fix some of those problems by providing a missing-index adviser, looking for extended statistics to create, and row estimation error correction information to get appropriate join orders and join methods automatically.

* pg_qualstats: gathers predicate statistics on the production workload and suggests new indexes and extended statistics.
* pg_plan_advsr: analyzes iterative query executions to fix row estimation errors and automatically provides alternative, better query plans.

In this talk, we will explain how those tools work under the hood, what can be done, and how they can work together. We will also mention other tools that exist for related problems. The talk will be useful for DBAs who are interested in improving query performance or want to check whether their current indexes and statistics are adequate.
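A minimal sketch of the pg_qualstats side, assuming the extension is installed and preloaded (the advisor function ships with pg_qualstats 2.x; its exact signature may vary by version):

```sql
-- postgresql.conf: shared_preload_libraries = 'pg_qualstats'
CREATE EXTENSION pg_qualstats;

-- Predicate statistics sampled from the live workload:
SELECT * FROM pg_qualstats;

-- Global index suggestions derived from those predicates (pg_qualstats 2.x):
SELECT pg_qualstats_index_advisor();
```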

Building Better Benchmarks

29:56 4 Wong, Mark
TPC, SSBM, YCSB... These are just some of the database benchmarks that are out in the wild. Yet there are many derivatives, implementations and variations of these and other benchmarks. The differences between them vary in ease of use and effectiveness at testing at large scales. End-user licensing agreements and licensing of the software being used can be just as prohibitive as the test itself for measuring the performance of your software. "Better" is certainly subjective. From the point of view of open source software developers, we want something that:

* can push the limits of our software and hardware
* we are allowed to modify and change to better suit our needs
* can be shared with colleagues and the world, along with the achievements made

The current state of the art is surveyed for what is available to help us now and what hurdles remain for us to be able to do more. These hurdles remain only as opportunities for us to overcome with open source solutions.

Choose Your Own Adventure Postgres 12 By Example

48:57 3 Treat, Robert
You know Postgres 12 is available, but you just haven't had time to look into it. Sure you want to learn more, but no one wants to just sit through a bunch of slides. What you really want is to see the features in Postgres 12 so you can figure out what will help make your users (and you!) happier with as little work as possible. I'm Robert Treat, and I've put together a whirlwind tour of new features available in Postgres 12. I think I know what will help you get the most bang for the buck, but if I am wrong, we can choose the features that you want to see most. That's right, there are no slides in this presentation, not even an about-me slide; we're going straight into the database to help you learn what options will help you make the best case with management to get you upgraded.
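For a flavor of what such a tour might show, here is one PG12 feature, generated columns (a sketch with a hypothetical table; the talk itself picks features live):

```sql
-- PG12: a column computed from other columns and stored on write.
CREATE TABLE people (
    height_cm numeric,
    height_in numeric GENERATED ALWAYS AS (height_cm / 2.54) STORED
);

INSERT INTO people (height_cm) VALUES (180);
SELECT * FROM people;  -- height_in is filled in automatically
```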

Community roadmap to sharding

38:26 2 Korotkov, Alexander
Sharding is one of the most wanted PostgreSQL features. Vertical scalability is limited by hardware. Replication doesn't provide scalability in read-write performance or database size; only sharding promises this. However, such a brilliant feature is hard to implement, and here is the point where different community parties should work together. The PostgreSQL community has done great work on foreign data wrappers and continues improving them. Postgres Pro has experience in distributed transactions, snapshots and query planning/execution. In this talk we will cover existing advances in sharding and present a roadmap for transforming them into a comprehensive sharding solution suitable for the main use cases.
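The FDW-based building blocks the roadmap starts from can already be combined today; a minimal sketch (hypothetical host names and credentials) of a hash-partitioned table with one remote shard:

```sql
CREATE EXTENSION postgres_fdw;
CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1.example.com', dbname 'app');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
    OPTIONS (user 'app', password 'secret');

CREATE TABLE measurements (id bigint, payload jsonb)
    PARTITION BY HASH (id);

-- One partition lives on the remote server (PG11+ allows foreign partitions):
CREATE FOREIGN TABLE measurements_p0
    PARTITION OF measurements FOR VALUES WITH (MODULUS 2, REMAINDER 0)
    SERVER shard1;

-- The other stays local:
CREATE TABLE measurements_p1
    PARTITION OF measurements FOR VALUES WITH (MODULUS 2, REMAINDER 1);
```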

CREATE STATISTICS: What is it for

48:48 6 Vondra, Tomas
One of the new features in PostgreSQL 10 is the ability to create multi-column statistics, helping the optimizer understand dependencies between columns. PostgreSQL 12 further improved the feature by supporting more complicated statistics types. This talk will answer a bunch of important questions - why we need this capability at all for some queries, how it works, which cases it can currently address, and which improvements are in the queue for PostgreSQL 13.
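A minimal sketch of the feature (hypothetical table; zip codes functionally determine the city, a dependency per-column statistics cannot see):

```sql
CREATE TABLE addresses (city text, zip text);

-- PG10+: functional-dependency and ndistinct statistics across columns.
CREATE STATISTICS addr_dep (dependencies, ndistinct) ON city, zip FROM addresses;

-- PG12 added multi-column most-common-value (MCV) lists:
CREATE STATISTICS addr_mcv (mcv) ON city, zip FROM addresses;

ANALYZE addresses;  -- extended statistics are collected by ANALYZE
```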

Deep Thoughts: Betting on Security

1:02:58 1 Conway, Joe
For all combinations of who/which/what, list who currently has which type of permissions for what objects in your most important database(s)? It shouldn’t be that hard to figure out, right? Would you be willing to bet your job that you have given a complete and correct answer, even if this were an open book test with access to the server and ample time? This talk seeks to dive deep into the weeds on the topic of how roles interact with Postgres default behaviors, role attributes, and object privileges, resulting in a particular discretionary access control (DAC) security posture.
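The starting point of that "open book test" can be sketched in plain SQL (role and table names hypothetical); note that role membership, ownership, PUBLIC grants and default privileges all layer on top of what this shows:

```sql
-- Explicitly granted table privileges, excluding system schemas:
SELECT grantee, table_schema, table_name, privilege_type
FROM information_schema.table_privileges
WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
ORDER BY grantee, table_schema, table_name;

-- Point check for one role/object/privilege combination:
SELECT has_table_privilege('some_role', 'some_table', 'UPDATE');
```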

Distributed snapshots and global deadlock detection

42:59 1 Praveen, Asim Rama et al.
In this talk we would like to share our experiences in implementing MVCC in an open source scale-out environment based on PostgreSQL. A scale-out environment is characterised by multiple PostgreSQL servers, with one master PostgreSQL server designated as the entry point for clients. Transactions may update data residing on more than one PostgreSQL instance. Isolating two or more such transactions running concurrently is a major challenge in a scale-out system; different isolation levels have their own challenges, serialisable being the hardest to implement efficiently. Distributed snapshots enable individual PostgreSQL instances within a scale-out system to determine the status (in-progress, committed, aborted) of a transaction and to decide whether the effects of a transaction are visible using a given snapshot. The talk will go over this distributed snapshot mechanism in detail, including several corner cases that we found tricky to implement right. Note that each PostgreSQL instance in the scale-out system continues to create local snapshots and local transactions. Distributed deadlock occurs when the wait cycle spans multiple PostgreSQL instances. To detect it, wait graphs from each PostgreSQL instance need to be aggregated and cycle detection performed on the aggregated data. The talk will describe how we model vertices and edges in such a graph and the method used to aggregate this information, while being mindful of the performance of the distributed system.
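The per-node ingredient of such an aggregated wait graph can be seen in stock PostgreSQL (a sketch, not the speakers' implementation):

```sql
-- Local wait edges: which backends are blocked, and by whom.
SELECT pid, pg_blocking_pids(pid) AS blocked_by, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```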

Find your slow queries, and fix them!

50:33 7 Frost, Stephen
Where, oh where, is all that time going? What in the world is that database thing doing?! This talk will help you understand what's happening (and why) and how to analyze poor query performance. We'll also go over steps and strategies to take to improve them and get the performance (and scalability!) you need. It all starts with figuring out what queries are slow, we'll do that by going into the various PostgreSQL configuration options for logging queries and a few helpful modules for getting even more information about ongoing queries. Next we'll go over EXPLAIN and EXPLAIN ANALYZE output for select queries, what the EXPLAIN output means in terms of how the query is being executed. Lastly (this is the good part- you have to stay til the end to get it!) we'll go over ways to improve the queries, including index creation, rewriting the query to allow PG to use a different plan, and how to tune parameters for specific queries.
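A minimal sketch of that workflow (hypothetical table and threshold):

```sql
-- Step 1: capture slow statements in the log.
ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();

-- Step 2: dissect a captured query; BUFFERS shows I/O next to the timings.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;
```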

Hacking the Query Planner, Again

33:05 3 Guo, Richard
This talk will focus on how the planner works from a developer's view and elaborate on the process of converting a query tree to a plan tree in detail. We can divide the planning process into 4 phases: preprocessing, scan/join planning, post scan/join planning and postprocessing. In this talk, I will explain what work is performed in each phase and the motivation for performing it. Topics will include: transforming ANY/EXISTS SubLinks into joins, flattening sub-selects, preprocessing expressions, reducing outer joins to inner joins, distributing quals to rels, collecting join ordering restrictions, removing useless joins, the join search process, upper planner path-ification, cost estimation, etc. Tom Lane's 2011 talk "Hacking the Query Planner" gave an overview of the query planner. In this talk, I will cover the internals of the query planner in more detail and in a way that stays closer to the planner code. I hope this will be helpful in understanding the internals of PostgreSQL's planner and in hacking the planner code to improve it.
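One of the listed transformations is easy to observe from SQL (hypothetical tables): the IN sublink below is planned as a join rather than a per-row subplan.

```sql
CREATE TABLE orders (id int, customer_id int);
CREATE TABLE vip_customers (id int);

-- The ANY SubLink is pulled up into the join tree; the plan typically
-- shows a Hash Semi Join instead of a SubPlan node.
EXPLAIN
SELECT * FROM orders
WHERE customer_id IN (SELECT id FROM vip_customers);
```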

Implementing System Versioned Temporal Table

22:00 6 Temesgen Mamo, Surafel
The SQL:2011 standard introduces the concept of temporal tables, which store data with respect to time. It has two parts: application-time period tables and system-versioned temporal tables. Application-time periods meet the requirements of applications based on application-specific time periods, i.e. the time during which the data is valid in the business world, rather than the time it was entered into the database. System-versioned temporal tables concern the automatic retention of old data alongside current data by the database management system, and the ability to query both current and historical data. A table can be both a system-versioned table and an application-time period table.
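PostgreSQL does not yet ship this, but the system-versioned half can be emulated; a minimal sketch (hypothetical schema) that archives old row versions into a history table from a trigger:

```sql
CREATE TABLE accounts (
    id         int PRIMARY KEY,
    balance    numeric NOT NULL,
    sys_period tstzrange NOT NULL DEFAULT tstzrange(now(), NULL)
);
CREATE TABLE accounts_history (LIKE accounts);

CREATE FUNCTION accounts_versioning() RETURNS trigger AS $$
BEGIN
    -- Archive the outgoing row version with its period closed at now().
    INSERT INTO accounts_history
        SELECT OLD.id, OLD.balance, tstzrange(lower(OLD.sys_period), now());
    IF TG_OP = 'UPDATE' THEN
        NEW.sys_period := tstzrange(now(), NULL);
        RETURN NEW;
    END IF;
    RETURN OLD;  -- let the DELETE proceed
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER accounts_versioning_trg
    BEFORE UPDATE OR DELETE ON accounts
    FOR EACH ROW EXECUTE FUNCTION accounts_versioning();
```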

Improve of Parallel Query when getting large scale records

32:02 3 Hirose, Shigeo
IoT systems have many edge computers and sensor devices. They are growing into the thousands and beyond, and they produce a lot of data. PostgreSQL provides Parallel Query to collect such data efficiently, but the Gather node is a bottleneck because it cannot process rows in parallel. We tried to improve the speed of the Gather node. I will explain some solutions, such as Gather node parallelization and bulk copying of records in the Gather node.
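To see the funnel in question (hypothetical table, which must be large enough for the planner to choose a parallel plan):

```sql
SET max_parallel_workers_per_gather = 4;

-- Every row the parallel workers emit passes through the single
-- Gather node at the top of the plan, which runs in one process.
EXPLAIN (ANALYZE)
SELECT * FROM sensor_data WHERE value > 0;
```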

Index: Do It Yourself

36:56 3 Borodin, Andrey
In Postgres, we already have the infrastructure for building an index as an extension, but there are not so many such extensions to date. Yet there are many discussions on how to make core indexes better. This is a talk about extracting an index from core into an extension and what can be done with the usual indexes. Some of these optimizations are discussed on pgsql-hackers and can be expected in core; others will never be more than an extension. We will discuss ideas from academic research and the corresponding industrial response from developers, communities, and companies. There will be a short live-coding session on creating a DIY index in Postgres. I'll show how to extract an access method from core into an extension in 5 minutes and talk about ideas for enhancing indexes: learned indexes, removing opclasses in favour of specialized indexes, cache prefetches, advanced generalized search (a GiST alternative) and some others.
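The contrib module bloom is a stock example of exactly this pattern: a complete index access method shipped as an extension.

```sql
CREATE EXTENSION bloom;

CREATE TABLE events (a int, b int, c int);
-- The bloom access method plugs into CREATE INDEX like any built-in one:
CREATE INDEX events_bloom_idx ON events USING bloom (a, b, c);
```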

Introducing LSM-tree into PostgreSQL, making it as a data gobbler

34:37 3 Jin, Shichao
The data storage engine is always key to the performance of a database system. Traditionally, storage engines are implemented based on the B+-tree or B-tree. Compared to the traditional B+-tree or B-tree, the LSM-tree, another important data structure, has recently attracted great attention from developers due to its wide usage in many open source projects, including web browsers and many data processing systems. There are some well-known incarnations of the LSM-tree, such as LevelDB and RocksDB, that make the index practical in industry. Now, we bring it to the world of PostgreSQL: users can utilize the LSM-tree via Postgres' FDW through our implementation. In the talk, we will cover:

1. The motivation for introducing the LSM-tree
2. How does the LSM-tree work?
3. How do specific implementations of the LSM-tree, such as LevelDB and RocksDB, work? What is their architecture?
4. How do we incorporate these existing LSM-tree implementations?
5. Some experimental results

JSON[b] Roadmap

51:53 3 Bartunov, Oleg
JSONB in PostgreSQL is one of the most attractive features for modern application developers, no matter what some RDBMS purists are thinking. PG12 and the upcoming PG13 complete the implementation of the SQL/JSON standard, which is a great feature in itself, but now it is time to think about improving the implementation in both directions - functionality and performance. For example, PostgreSQL is recognized for its extensibility, and it is natural to add support for users' data types and operators to JSONPATH, make use of JSONPATH to index JSON data, optimize access to specific key:value pairs in large JSON documents, and so on. Another important topic I want to discuss is a single JSON data type. Historically, we have two JSON data types - textual and binary (better); the latter, called JSONB, is actually the most usable and popular data type. Now, looking forward to the expected specification of the JSON data type in a future SQL standard, we have to decide how to make the JSON data type generic and cover the old JSON and JSONB data types.
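A small taste of the SQL/JSON path support that landed in PG12 (hypothetical document):

```sql
-- Filter array elements with a jsonpath expression:
SELECT jsonb_path_query(
    '{"items": [{"price": 5}, {"price": 12}]}'::jsonb,
    '$.items[*] ? (@.price > 10)'
);
-- Returns {"price": 12}
```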

Mechanical Sympathy for Elephants: Reducing I/O and Memory Stalls

30:28 2 Munro, Thomas
This talk looks at the mechanics of memory and storage, and discusses a selection of opportunities for PostgreSQL to reduce stalls and improve performance. These include experimental and committed work done in the PostgreSQL and OS communities, along with some relevant ideas and observations found in academic papers. The following topics will be covered:

* avoiding I/O stalls in recovery, index scans and joins
* limiting I/O streams for parallel queries and concurrent queries
* avoiding memory stalls for hash joins, sequential scans, index searches
* avoiding branches through inlining and specialisation
* reducing TLB misses for data and code

pgagroal: High-performance connection pool for PostgreSQL.

25:53 4 Pedersen, Jesper
pgagroal is a high-performance, protocol-native connection pool for PostgreSQL. pgagroal is built upon libev, a high-performance network library, and uses a shared-memory process model together with atomic operations for tracking state to achieve its high performance. In this session we will explore pgagroal's:

* Architecture
* Features
* Deployment
* Performance
* Roadmap

PostgreSQL HA Cluster with Query Load Balancing on Kubernetes

40:59 8 Peng, Bo
Kubernetes is an open source container orchestration platform for automating deployment, scaling and management of application containers. Nowadays, more and more applications are being deployed in containers on Kubernetes. However, managing databases on Kubernetes is difficult because data redundancy can have very specific requirements. There are several solutions that can help us to run a PostgreSQL HA cluster on Kubernetes. However, these solutions don't provide query load balancing functionality, so users need to modify the application to distribute read queries among multiple PostgreSQL containers. In this talk, I would like to give you an overview of Kubernetes and explain how to deploy a PostgreSQL HA cluster with query load balancing using Pgpool-II on Kubernetes. Although Kubernetes provides a way to monitor the basic cluster status, this is not sufficient for practical cluster management. In this talk, I will introduce how to implement custom Prometheus metrics to monitor PostgreSQL cluster status and performance for extensive cluster monitoring.

PostgreSQL on K8s at Zalando: Two years in production

46:52 5 Kukushkin, Alexander
Many DBAs avoid any kind of cloud offering and prefer to run their databases on dedicated hardware. At the same time, companies demand that Postgres run at scale, efficiently, automated, and well integrated into the infrastructure landscape. The arrival of Kubernetes provided good building blocks and an API to interact with, and with it a way to solve many problems at the infrastructure level. The database team at Zalando started running highly-available PostgreSQL clusters on Kubernetes more than two years ago. In this talk I am going to share how we automate all routine operations, providing developers with easy-to-use tools to create, manage and monitor their databases, avoiding commercial lock-in and saving costs, and show the open-source tools we have built to deploy and manage PostgreSQL clusters on Kubernetes by writing short manifests describing a few essential properties of the result. Operating a few hundred PostgreSQL clusters in a containerized environment has also generated observations and learnings which we want to share: infrastructure problems (AWS), how engineers use our Postgres setup, and what happens when the load becomes critical.

* Zalando Postgres-Operator
* Zalando Patroni

Progress adding SQL:2011 valid time to Postgres

51:43 2 Jungwirth, Paul A.
I have a patch in progress to add SQL:2011 valid time support to Postgres, so that you can more easily record a history of things that change over time. It lets you define temporal primary and foreign keys and issue temporal UPDATE and DELETE commands. Although the SQL:2011 standard introduces a new concept called PERIODs, I've based the Postgres features on our own built-in ranges---and plan to support PERIODs too (hopefully before I get to PGCon).
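Until those patches land, the range-based foundation the work builds on can already express a temporal key; a minimal sketch (hypothetical table):

```sql
-- btree_gist lets the GiST exclusion constraint combine scalar equality
-- with range overlap: at most one price per room for any point in time.
CREATE EXTENSION btree_gist;

CREATE TABLE room_rates (
    room_id  int,
    valid_at daterange,
    price    numeric,
    EXCLUDE USING gist (room_id WITH =, valid_at WITH &&)
);
```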

Protect Your PostgreSQL Passwords: How SCRAM Works & Why You Need It

58:27 3 Katz, Jonathan S.
Passwords: they just seem to work. You connect to your PostgreSQL database and you are prompted for your password. You type in the correct character combination, and presto! you're in, safe and sound. But what if I told you that all was not as it seemed. What if I told you there was a better, safer way to use passwords with PostgreSQL? What if I told you it was imperative that you upgraded, too?
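The upgrade itself is small; a sketch (hypothetical role name), keeping in mind that pg_hba.conf entries must also move from md5 to scram-sha-256:

```sql
-- PG10+; scram-sha-256 became the default in PG14.
ALTER SYSTEM SET password_encryption = 'scram-sha-256';
SELECT pg_reload_conf();

-- Existing MD5 verifiers stay until each password is set again:
ALTER ROLE app_user PASSWORD 'correct horse battery staple';
```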

Ptrack 2.0: yet another block-level incremental backup engine

42:41 3 Kondratov, Alexey
Ptrack was developed a few years ago to provide a way to track page-level changes to PostgreSQL database data. This information was used to implement block-level incremental backups, bundled with pg_probackup. However, that implementation of the in-core Ptrack engine had a number of major drawbacks:

* the requirement to keep a lot of additional files (one extra fork per relation);
* extremely invasive in-core changes;
* tricky workarounds to avoid races when taking a backup.

In this talk I am going to discuss block-level incremental backups and present a new incarnation of Ptrack: Ptrack 2.0. Rewritten from scratch, it now uses a single shared hash table, which is mmap'ed into memory from a file on disk. All operations are made using atomics, so the map is completely lockless during normal PostgreSQL operation. The Ptrack map is written to disk at the end of a checkpoint, atomically, block by block, with a CRC32 checksum calculation that is verified on the next whole-map re-read after a crash or restart. Due to the fixed size of the map there may be false positives (when some block is marked as changed without actually being modified), but no false negatives. This approach helps us build a simple yet durable and fast block-level incremental backup solution.

Seamless SQL optimization

2:15:26 3 Samokhvalov, Nikolay
There are two types of analysis and optimization. The first, macro-analysis, analyzes the workload as a whole; usually it is done using pg_stat_statements or pgBadger. The second is micro-analysis, and the central tool here is the EXPLAIN command. There is a huge gap between them, partially covered by the auto_explain and pg_qualstats extensions. In this tutorial, we learn how to establish a smooth and seamless SQL optimization process in your organization. Topics we'll cover:

* What are the pros and cons of using pg_stat_statements compared to the log analysis performed by pgBadger?
* What are the key metrics in macro-analysis, and how to choose the most applicable in each case (is it the total time consumed by an SQL query group? or average timing? or maybe shared buffers hit and read by the query group per second?).
* Closing the gap: how to switch from macro-analysis to micro-analysis (you identified a "bad" SQL group; how do you start optimizing it, and what is needed to make this process faster / more automated?).
* How to simplify the adoption of the EXPLAIN command by a wide range of backend developers; what metrics matter and how (timing vs buffers involved).
* EXPLAIN visualization techniques and their pros and cons (PEV, and FlameGraphs for EXPLAIN).
* How to accumulate knowledge about SQL optimization, share it with teammates and improve collaboration.
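A minimal sketch of the macro-analysis entry point (requires the extension to be preloaded; the timing column is total_time up to PG12 and total_exec_time from PG13):

```sql
-- postgresql.conf: shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION pg_stat_statements;

-- Which normalized query groups consume the most total time?
SELECT queryid, calls, total_time, rows,
       shared_blks_hit, shared_blks_read,
       left(query, 60) AS query
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
```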

The "default" postgresql.conf, step by step

59:44 2 Kosmodemiansky, Ilya
If you ever wanted to find out what can be configured in postgresql.conf, you can easily find it in the official Postgres documentation. So, all you need to do is read it - yes, all of it - and you will have a perfectly configured database. Sounds easy, right? The reality, however, tends to differ from the laboratory conditions the documentation describes. In addition, there is never enough time to go through the documentation, especially considering that not every parameter makes sense for your database. There are also subtle differences one should be aware of, and new releases require constant adaptation of your config settings. In this tutorial, I will take you through all the settings that, in my experience as a consultant working with a variety of databases, should be adjusted. I will look into the reasoning for each, and review the chain reaction that a change in each of these settings will trigger. We will review some typical workloads, and I will also share some recommended configurations which our DBAs follow when setting up for our clients and which have been proven over and over with different databases and under a variety of our clients' requirements. Following this tutorial you will have the knowledge required to understand the major postgresql.conf parameters, know what role they play in overall database performance, and be able to set up your own version that will work for your database. If you are a DBA who is just starting to work with Postgres and would like to make sure that you have a reliable base to build your work on, this tutorial is for you.
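A sketch of how such adjustments are applied (the values are hypothetical illustrations; picking the right ones for a given workload is the subject of the tutorial):

```sql
ALTER SYSTEM SET shared_buffers = '8GB';           -- needs a restart
ALTER SYSTEM SET effective_cache_size = '24GB';    -- planner hint, no allocation
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
ALTER SYSTEM SET random_page_cost = 1.1;           -- e.g. for SSD storage

SELECT pg_reload_conf();  -- applies the reloadable subset
```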

The Way for Updating Materialized Views Rapidly

39:18 2 Nagata, Yugo et al.
Materialized views are a feature that stores the results of view definition queries in the database in order to achieve faster query response. However, after the base relations are modified, view maintenance is needed to keep the contents up to date. The REFRESH MATERIALIZED VIEW command exists for this purpose, but it has to recompute the contents from scratch, which is not efficient in cases where only a small part of a base table is modified. Incremental View Maintenance (IVM) is a technique to maintain materialized views efficiently: it computes and applies only the incremental changes to the materialized views rather than recomputing them. This feature is not implemented in PostgreSQL yet. We have proposed a patch to implement IVM in PostgreSQL, and it is now under discussion. Since the first submission last year, we have made great progress on this patch. For example, some aggregates, subqueries, self-joins, and outer joins are supported now. These operations are important since they are commonly used in real applications. In this talk, we will explain the current status of our IVM implementation. The talk covers what problems arose during implementation, their solutions, what the current implementation can do, and what limitations and problems remain.
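For contrast, the maintenance story PostgreSQL offers today, which IVM aims to improve on (hypothetical schema):

```sql
CREATE MATERIALIZED VIEW sales_summary AS
    SELECT product_id, sum(amount) AS total
    FROM sales
    GROUP BY product_id;

-- Full recomputation is the only built-in option. CONCURRENTLY keeps the
-- view readable during the refresh, but requires a unique index:
CREATE UNIQUE INDEX ON sales_summary (product_id);
REFRESH MATERIALIZED VIEW CONCURRENTLY sales_summary;
```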

Time Series Databases: A Deeper Look

59:39 3 Eisentraut, Peter
The term "time series" is popular (again) in database circles. What is it and what's the point? Clearly, a traditional relational database like PostgreSQL can deal with time and with series. So why is time series a special use case? In this presentation, I want to look beyond the marketing a bit and analyze what the true technical characteristics of a time series database are and what use cases drive it. Then we can analyze how PostgreSQL handles these requirements, which workarounds are handy, and where improvements would be necessary. In PGCon tradition, this presentation is both guidance for users and a call to action for developers.

Toward full ACID distributed transaction support with Foreign Data Wrapper

34:02 1 Sawada, Masahiko
PostgreSQL has the Foreign Data Wrapper feature, a powerful feature for accessing distributed data across heterogeneous data stores. FDW became writable in PostgreSQL 9.3, so PostgreSQL with FDW has the potential to become a distributed database supporting reads and writes. However, one of the biggest missing pieces is transaction management for distributed transactions. Currently the atomicity and consistency of the ACID properties are missing, but they are essential for fully ACID distributed transactions. Some proposals have been made but are still under discussion. This talk covers the current status of FDW and the problems regarding atomicity and isolation, and introduces the proposed solutions as well as solutions employed by other distributed databases. I'll also explain use cases such as database sharding and federation.
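The missing atomicity is easy to state concretely (ft_a and ft_b are hypothetical postgres_fdw foreign tables on two different servers):

```sql
BEGIN;
INSERT INTO ft_a VALUES (1);
INSERT INTO ft_b VALUES (1);
-- On COMMIT, postgres_fdw commits the two remote transactions one after
-- another. A crash between those two remote commits leaves one applied
-- and one rolled back: the atomicity gap this talk addresses.
COMMIT;
```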

What's Missing For Postgres Monitoring

38:26 4 Fittl, Lukas
Monitoring your Postgres database is important, but it's actually not as easy as it sounds. Postgres itself does not expose enough information, or sometimes data is exposed that can be misinterpreted. In this session we'll spend time looking at specific cases that are blank spots on the map right now, where one either doesn't know what Postgres is up to, or where you have to use system-level tools and other creative methods to get information. Some examples of what we'll take a look at: the deceptive value of buffer hit counters, the multiple efforts to bring planning data to pg_stat_statements, and why you often have to resort to "perf" to identify performance bottlenecks. You'll leave this session with more ideas on how to work around these shortcomings, as well as an understanding of where development effort is going to fix some of the issues.
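The buffer-hit example is worth spelling out: the classic query below only sees PostgreSQL's shared_buffers, so reads served from the OS page cache count as "reads" and the ratio can look bad on a system running entirely from RAM.

```sql
SELECT datname,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2)
           AS cache_hit_pct
FROM pg_stat_database;
```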

work_mem warriors

41:19 1 Davis, Jeff et al.
Recent work on the Postgres executor has made improvements to memory management - from accounting for the memory used to responding to memory pressure. It is important to bound the memory usage of the database with the appropriate execution mechanisms and to choose those during planning based on cost, in order to meet users' expectations and ensure predictable performance. This talk will cover three such improvements: the addition of memory accounting information to MemoryContexts, the memory-bounding of HashAgg (spilling HashAgg to disk), and the adaptive hash join fallback to nested hash-loop join. The talk will also include an interactive session to solicit feedback from users on their expectations and experiences with work_mem and the memory behavior of Postgres.
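The boundary in question is easy to watch for sorts, which already spill today (hypothetical data); before the HashAgg spilling described in this talk, grouping had no equivalent disk fallback:

```sql
CREATE TABLE t AS
    SELECT g AS id, md5(g::text) AS payload
    FROM generate_series(1, 1000000) g;

SET work_mem = '1MB';
EXPLAIN ANALYZE SELECT * FROM t ORDER BY payload;
-- expect "Sort Method: external merge  Disk: ..."

SET work_mem = '256MB';
EXPLAIN ANALYZE SELECT * FROM t ORDER BY payload;
-- expect "Sort Method: quicksort  Memory: ..."
```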

Zedstore: In-core Column Store for Postgres

42:52 3 Wang, Alexandra et al.
Zedstore is a column store for PostgreSQL that aims to reduce I/O for queries that only need access to a few columns of a wide table, to improve performance for analytics workloads, and to enable better compression. It is under development and uses the table AM APIs introduced in PostgreSQL v12. This talk will cover the use cases for a column store in Postgres, the architecture and design decisions of Zedstore, performance and size comparisons between Zedstore and heap tables for both OLTP and OLAP queries, changes required in the table AM APIs to accommodate table AMs other than heap, and the latest development status of Zedstore.
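Thanks to the v12 table AM interface, switching a table to Zedstore becomes just a storage clause (assuming Zedstore has been built from its development branch; the AM name below reflects that assumption):

```sql
-- Standard PG12 table AM syntax; 'zedstore' is provided by the
-- separately built Zedstore branch, not by stock PostgreSQL.
CREATE TABLE facts (
    id bigint,
    a  int,
    b  int
) USING zedstore;

-- The default AM equivalent: CREATE TABLE ... USING heap;
```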

Your wishes will be passed on to PG kernel hackers, database vendors, and others, helping to improve database product quality and features; maybe the next PG release will include a feature you proposed. The best proposals will be rewarded with limited-edition PG T-shirts, souvenirs, stickers, popular PG books, and more. The prizes are plentiful, so come make a wish. Exciting, right?

digoal's wechat