Skip to content
This repository has been archived by the owner on Sep 27, 2019. It is now read-only.

Modified pg_column_stats initialization #1352

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

poojanilangekar
Copy link
Contributor

@poojanilangekar poojanilangekar commented May 9, 2018

This PR modifies the ColumnStatsCatalog constructor to use a predefined schema instead of using DDL. Additionally, it creates the pg_column_stats table to a per database basis. Initially this was not done because of the dependencies of the StatsStorage on the view of "Global Stats".

Additionally, it changes the AnalyzeStatsForAllTables to AnalyzeStatsForAllTablesWithDatabaseOid. Now that we maintain ColumnStats on a per database basis, it makes sense to also collect the stats on a per database basis.

@apavlo
Copy link
Member

apavlo commented May 9, 2018

@poojanilangekar Nice!

@poojanilangekar
Copy link
Contributor Author

@mengranwo @GustavoAngulo We need to merge this in ASAP as its blocking the performance benchmarks. Please take a look at it when you get the time.

@coveralls
Copy link

coveralls commented May 18, 2018

Coverage Status

Coverage increased (+0.01%) to 77.565% when pulling 7dfbc59 on poojanilangekar:column_stats_fix into 5686479 on cmu-db:master.

@GustavoAngulo
Copy link
Member

I think it might be good to keep AnalyzeStatsForAllTables for now, my group in the class this semester noticed that we don't support the default ANALYZE behavior like Postgres does, where not specifying a table analyzes all tables. We might want to support this in the future, so that function/logic might come in handy.

As for the the schema, I'm not sure which inconsistency you're referring to @poojanilangekar ? @chenboy do you see an inconsistency?

@poojanilangekar
Copy link
Contributor Author

@GustavoAngulo I have modified the function to AnalyzeStatsForAllTablesWithDatabaseOid. I checked the Postgres documentation, the ANALYZE command is run on a per database basis. So I think this should solve the problem.

Regarding the inconsistency, the declaration creates two SKEY indexes and the table has no primary key. While the header file states that the table should contain only one index with the primary key.

@chenboy
Copy link
Contributor

chenboy commented May 25, 2018

The ColumnStatsCatalog use both indexes.

oid_t index_offset = IndexId::SECONDARY_KEY_0; // Secondary key index

oid_t index_offset = IndexId::SECONDARY_KEY_1; // Secondary key index

So I think we should stick to the declaration, not the header.

@poojanilangekar poojanilangekar requested a review from camellyx June 10, 2018 23:04
@poojanilangekar
Copy link
Contributor Author

@GustavoAngulo @camellyx Can you please review these changes?

@tli2 tli2 requested a review from pervazea June 18, 2018 14:43
@tli2
Copy link
Contributor

tli2 commented Jun 18, 2018

@pervazea Can you take a look since these other people are out of town?

@apavlo
Copy link
Member

apavlo commented Jun 21, 2018

This is an important PR that we should merge. We don't want to lose track of this.

Copy link
Contributor

@pervazea pervazea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.
Please add function headers / documentation as requested (and elsewhere too if I've missed anything).

Also, needs merge conflicts resolved :-( due to recent re-factors.


//===--------------------------------------------------------------------===//
// write Related API
//===--------------------------------------------------------------------===//
bool InsertColumnStats(oid_t database_id, oid_t table_id, oid_t column_id,
int num_rows, double cardinality, double frac_null,
bool InsertColumnStats(oid_t table_id, oid_t column_id, int num_rows,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment documenting function purpose and arguments.

std::string most_common_vals,
std::string most_common_freqs,
std::string histogram_bounds, std::string column_name,
bool has_index, type::AbstractPool *pool,
concurrency::TransactionContext *txn);
bool DeleteColumnStats(oid_t database_id, oid_t table_id, oid_t column_id,
bool DeleteColumnStats(oid_t table_id, oid_t column_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment documenting function purpose and arguments.

concurrency::TransactionContext *txn);

//===--------------------------------------------------------------------===//
// Read-only Related API
//===--------------------------------------------------------------------===//
std::unique_ptr<std::vector<type::Value>> GetColumnStats(
oid_t database_id, oid_t table_id, oid_t column_id,
concurrency::TransactionContext *txn);
oid_t table_id, oid_t column_id, concurrency::TransactionContext *txn);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment document function, args and return information.

@@ -59,6 +60,13 @@ class SystemCatalogs {
return pg_attribute_;
}

ColumnStatsCatalog *GetColumnStatsCatalog() {
if (!pg_column_stats_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a recoverable error? If not, this should be a PELOTON_ASSERT.

@@ -71,8 +67,8 @@ class StatsStorage {

/* Functions for triggerring stats collection */

ResultType AnalyzeStatsForAllTables(
concurrency::TransactionContext *txn = nullptr);
ResultType AnalyzeStatsForAllTablesWithDatabaseOid(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add function header describing this function, args, etc.

database_id, table_id, column_id, num_rows, cardinality, frac_null,
most_common_vals, most_common_freqs, histogram_bounds, column_name,
has_index, pool_.get(), txn);
pg_column_stats->DeleteColumnStats(table_id, column_id, txn);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DeleteColumnStats if they exist, I assume. May be helpful to add comment stating that, if correct.

@@ -108,14 +110,18 @@ TEST_F(StatsStorageTests, InsertAndGetTableStatsTest) {
table_stats_collector.get());

VerifyAndPrintColumnStats(data_table.get(), 4);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this function VerifyAndPrint... does print but very little verify. Acknowledge that this wasn't modified, but it does mean some of the old tests are not particularly useful.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants