Skip to content
This repository has been archived by the owner on Sep 27, 2019. It is now read-only.

Syntax-Based Query Rewriter, Code Review #2 #1495

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

esargent28
Copy link

New additions:

  • More rules (transitivity of tuple-tuple and tuple-constant expressions such as "A=B and B=3", boolean short-circuiting)
  • Basic catalog-based rules
  • More documentation

Our full design doc and review guide can be found here.

17zhangw and others added 9 commits April 1, 2019 18:53
- pattern
- rule
- ruleset
- group
- groupexpression
- binding
- memo
- optimize_context
- optimizer_task (TopDownRewrite/BottomUpRewrite)

Templates generally followed:
template <class Node, class OperatorType, class OperatorExpr>

The template instantiation associated with:
Node = Operator, OperatorType = OpType, OperatorExpr = OperatorExpression
is used primarily by the core Optimizer. All references to the templated
files/classes from core optimizer files were instantiated to that.

Note worth mentioning:
Operator class defines a public interface wrapper around BaseOperatorNode,
basically defines a single logical/physical operator.

OpType class defines the various logical/physical operations
OperatorExpression class is essentially a tree of Operator
Possibly annoying problems w.r.t Peloton/terrier:
(1) Use of unique_ptr/raw pointer as opposed to shared_ptr in AbstractExpression
(2) AbstractExpression equality comparison method

Additional components needed:
- Dynamic/template/strategy rule evaluation (particularly comparison)
- Repeated/multi-level application of rules
- Layer to convert from memo -> AbstractExpression
- Some refactoring w.r.t templated code
- Better AbsExpr_Container/Expression indirection layer
  (intended to present a similar interface exposed by
   Operator/OperatorExpression relied upon by core logic)
- Proper memory management strategy (tightly coupled to problem #1)
What still doesn't work/don't care about yet/not done
- proper memory management (terrier uses shared_ptr anyways)

- other 1-level rewrites, multi-layer rewrites, other expr rewrites

- how can we define a grammar to programmatically create these rewrites?
  (the one we have is way too static...)

- in relation to logical equivalence:
  (1) how do we generate logically equivalent expressions:
      - multi-pass using generating rules (similar to ApplyRule) OR
      - from Pattern A, generate logically equivalent set of patterns P OR
      - transform input expression to match certain specification OR
      - ???
  (2) what operators do we support translating?
      - probably (a AND b) ====> (b AND a)
      - probably (a OR b) ====> (b OR a)
      - probably (a = b) ====> (b = a)
      - maybe more???
  (3) do we want multi level translations?
      - i.e (a AND b) AND c ====> (a AND (b AND c))
      - what order do we do these in?
  May have to modify these operations:
  - Some assertions in TopDownRewrite/BottomUpRewrite w.r.t to the iterator
  - Possibly binding.cpp / optimizer_metadata.h / optimizer_task.cpp

Issues still pending:
- Comparing Values (Matt email/discussion)
- r.rule->Check (terrier issue cmu-db#332)
AbstractNode will provide interface for Operator and eventually
AbstractExpressions as well.

Note there are a few road blocks before the rest of the rewriter can be
changed to cleanly use abstract classes:
(1) Similarly abstract OperatorExpressions.
(2) We will have to find a good place to hide OpType, which is currently
an enum type (cannot be abstracted) and pervades the code base. This may
be solved by abstracting at the group level, but will have to look into
it.
(3) Need to clean up and separate interfaces between AbstractNode,
OperatorNode, and Operator classes.
Abstract nodes were implemented in 209c46a. This is essentially just
refactoring and plugging in abstract nodes throughout the optimizer.
The abstract interface exposes OpType and ExpressionType for now,
which ideally will be fixed later. Work remaining for abstracting
OperatorExpression.
Still need to make fixes to Pattern to support both OpType and ExpType
without templatizing. Will also need to clean up code after before build
will work.
@17zhangw 17zhangw force-pushed the equiv-experimental branch from c55babe to e3ac1ba Compare May 5, 2019 07:47
@@ -250,6 +252,21 @@ void BindNodeVisitor::Visit(expression::TupleValueExpression *expr) {
expr->SetColName(col_name);
expr->SetValueType(value_type);
expr->SetBoundOid(col_pos_tuple);

// TODO(esargent): Uncommenting the following code makes AddressSanitizer get mad at me with a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a good issue to file, so that we don't make the same mistake when we write a new binder.

//
// absexpr_expression.h
//
// Identification: src/include/optimizer/absexpr_expression.h
Copy link
Contributor

@lmwnshn lmwnshn May 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, if I understand correctly, it is more convenient for you to be operating on something that is not directly the parser's expressions. We are bringing in logical operators or something like that (PR 232) in terrier, which are the outputs of the binder and inputs to the optimizer. This seems like it might belong there, then. I see that you note this in your design doc.

class Rule {
public:
virtual ~Rule(){};

// TODO(ncx): pattern
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate on this TODO?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolving.

MEDIUM = 2,
LOW = 1
};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a good place to use macros. All the header stubs look the same.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolving.


// Additional assertion checks for AddExpression() with AST rewriting
// TODO(ncx): get group expression type
// if (std::is_same<Node, AbsExpr_Container>::value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason these are currently commented out?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolving.

Copy link

@amlatyrngom amlatyrngom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding comments in .h files (like rule_rewrite.h) could be useful to someone just reviewing the code without going deep into the implementation.

//
// This is done to export the correct interface from the wrapped AbstractExpression
// to the rest of the core rule/optimizer code/logic.
class AbsExpr_Container: public AbstractNode {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow naming convention: AbsExprContainer.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolving.

return expr->GetExpressionName();
}

return "Undefined";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this case supposed to happen? If not, throw an exception instead?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolving.

};


class AbsExpr_Expression: public AbstractNodeExpression {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about the naming convention.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolving.

src/include/optimizer/abstract_node_expression.h Outdated Show resolved Hide resolved
void RewriteLoop(int root_group_id);

std::shared_ptr<AbsExpr_Expression> ConvertToAbsExpr(const expression::AbstractExpression *expr);
std::shared_ptr<GroupExpression> RecordTreeGroups(const expression::AbstractExpression *expr);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing documentation on these

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved.


DISALLOW_COPY_AND_MOVE(Rewriter);

OptimizerMetadata &GetMetadata() { return metadata_; }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doxygen will complain about a lack of doxygen documentation on public functions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved.


~AbstractNode() {}

virtual void Accept(OperatorVisitor *v) const = 0;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation on this would be really useful

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved.

Rewriter *rewriter = new Rewriter();
auto rewrote = rewriter->RewriteExpression(common);

delete rewriter;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convention that I think we usually use is to use smart pointers with a custom scope to achieve the same thing.

TEST_F(RewriterTests, SingleCompareEqualRewritePassTrue) {
type::Value leftValue = type::ValueFactory::GetIntegerValue(4);
type::Value rightValue = type::ValueFactory::GetIntegerValue(4);
auto left = new expression::ConstantValueExpression(leftValue);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Write in comments what this one is doing even though it may be relatively obvious.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved.

std::vector<std::shared_ptr<AbstractNodeExpression>> &transformed,
UNUSED_ATTRIBUTE OptimizeContext *context) const {
(void)transformed;
(void)context;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You already marked context as UNUSED_ATTRIBUTE on line 49. Take that macro out then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved.

@@ -19,6 +19,8 @@
#include "optimizer/rule.h"
#include "settings/settings_manager.h"

#include <memory>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some documentation of what OptimizerMetadata contains in optimizer_metadata would be nice as well.

@@ -111,6 +111,24 @@ class AbstractExpression : public Printable {
children_[index].reset(expr);
}

void ClearChildren() {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great documentation for this function and how it fits in with the query rewriting process as a whole. Is this function only used once during the rewriting phase for a single query? If not, have you guys looked at how the rewriter’s performance is affected?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is only used once during the rewriting phase in Peloton. As mentioned I believe in rewriter.cpp, this function is solely intended for ensuring that our rewriter manages state (i.e memory, pointers) correctly.

return false;
}

bool operator==(const AbsExpr_Container &r) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So does this == basically check if at least one of the AbsExpr_Container nodes is not a nullptr? Would it make sense to have this function call the more extensive checks on its contents? Or would this significantly affect your performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We believe extensive checks on the contents are important for performance reasons but not necessary for correctness. Due to Peloton's design, we could not use operator== directly (i.e ConstantValueExpression does not check the Value). We believe "terrier" implements operator== correctly.

// GroupMarkerExpression
//===----------------------------------------------------------------------===//

class GroupMarkerExpression : public AbstractExpression {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some brief high-level documentation about GroupMarkerExpression and how it is used?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants