Calcite is the most widely used SQL parsing and optimization engine in the big data field; large projects such as Flink, Hive, Kylin, and Druid all build their SQL support on it. If you want to study the Flink SQL source code in depth, Calcite is a prerequisite skill and well worth learning.
We also use it internally in a self-developed SQL engine that supports correlated queries across any number of heterogeneous data sources through a single SQL dialect (e.g., joining a MySQL table to an HBase table and running an aggregation on the result).
Because Calcite covers a lot of ground, this article walks through the source code of its main execution flow, focusing on the stages of Calcite SQL execution and in particular on the VolcanoPlanner optimizer.
To sum up, the stages are:
1. The Parser parses the incoming SQL string into a syntax tree whose nodes are SqlNode.
2. The validator checks the tree: lexical validation, type checking, metadata (catalog) verification, etc.
3. The validated SqlNode tree is converted into the corresponding relational algebra expression, which is also a tree, with RelNode as its node type.
4. The RelNode tree is optimized by the two built-in planners, VolcanoPlanner and HepPlanner, yielding an optimal logical algebra tree, still made of RelNodes.
5. The optimal logical expression (RelNode) is converted into an executable physical plan; the conversion logic varies by framework, e.g. Flink translates it into its Operators.
Let's look at each stage in detail.
1. Parsing SQL into a syntax tree (SQL -> SqlNode)
This stage is not actually implemented by Calcite itself: Calcite defines a set of SQL grammar templates, and the parser is generated from them by the JavaCC framework.
Pull the code and take a look. Parser.jj in the source tree is Calcite's core grammar template; for example, when we want to add new syntax to Flink SQL, such as a count window, this is the file to modify. It defines which SQL tokens are recognized and the concrete logic for turning them into SqlNodes.
Take an example:
select ID, NAME from MYHBASE.MYHBASE where ID = '1'
will be parsed into a SqlNode tree like the following.
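To make the shape of that tree concrete, here is a simplified stand-in for it. These are toy classes, not Calcite's real SqlNode hierarchy (the real one has SqlSelect, SqlIdentifier, SqlBasicCall, and many more fields), but the nesting mirrors what the parser produces for the query above:

```java
import java.util.List;

// Simplified model of the SqlNode tree for
//   SELECT ID, NAME FROM MYHBASE.MYHBASE WHERE ID = '1'
// (toy classes, not Calcite's real SqlNode hierarchy).
public class SqlNodeTreeSketch {
    interface SqlNode {}
    record SqlIdentifier(List<String> names) implements SqlNode {}
    record SqlLiteral(String value) implements SqlNode {}
    record SqlBasicCall(String operator, List<SqlNode> operands) implements SqlNode {}
    record SqlSelect(List<SqlNode> selectList, SqlNode from, SqlNode where) implements SqlNode {}

    public static SqlSelect example() {
        return new SqlSelect(
            List.of(new SqlIdentifier(List.of("ID")),
                    new SqlIdentifier(List.of("NAME"))),               // SELECT ID, NAME
            new SqlIdentifier(List.of("MYHBASE", "MYHBASE")),          // qualified table name
            new SqlBasicCall("=", List.of(new SqlIdentifier(List.of("ID")),
                                          new SqlLiteral("1"))));      // WHERE ID = '1'
    }
}
```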
JavaCC itself is not covered in detail here; see the official site (https://javacc.github.io/javacc/).
2. Syntax validation (validator) stage
Here the validator checks the tree. This stage is not the focus, so it is not expanded here.
3. Converting the SqlNode tree into a logical expression tree of RelNodes (SqlNode -> RelNode)
Calcite's default converter for this is org.apache.calcite.sql2rel.SqlToRelConverter. This stage is not expanded here either.
4. Relational algebra tree optimization (RelNode -> RelNode)
This is the real focus!!! The reason so many frameworks choose Calcite is its SQL optimization. After the first three stages we have a RelNode tree, but that tree is not yet optimal; Calcite derives an optimized best tree through its two planners. This is the core of the whole project. Calcite provides two optimizers:
HepPlanner, the rule-based optimizer: it simply defines many rules (Rule), and whenever a tree node matches an optimization rule it is rewritten according to that rule, producing a rule-optimized tree. This one is relatively simple.
VolcanoPlanner, the cost-based optimizer: based on cost, it iterates the tree with the rules, continually recomputing the cost of the root RelNode to find the optimal tree.
Let's first look at what the RelNode tree converted from
select ID, NAME from a where ID = '1'
looks like. You can see that many nodes are named Logical*, because they are the logical nodes produced by Calcite's default converter (SqlToRelConverter) in stage 3; logical nodes carry no physical traits and cannot run as-is.
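The shape of that logical tree can be sketched like this. Again these are toy classes (real Calcite RelNodes carry row types, trait sets, and a planner cluster); only the LogicalProject -> LogicalFilter -> LogicalTableScan nesting is taken from the text:

```java
import java.util.List;

// Toy model of the logical RelNode tree produced by SqlToRelConverter for
//   SELECT ID, NAME FROM a WHERE ID = '1'
// Real Calcite nodes carry row types, traits, and a cluster; here we keep
// only the shape: LogicalProject -> LogicalFilter -> LogicalTableScan.
public class RelNodeTreeSketch {
    interface RelNode { List<RelNode> inputs(); }
    record LogicalTableScan(String table) implements RelNode {
        public List<RelNode> inputs() { return List.of(); }
    }
    record LogicalFilter(RelNode input, String condition) implements RelNode {
        public List<RelNode> inputs() { return List.of(input); }
    }
    record LogicalProject(RelNode input, List<String> fields) implements RelNode {
        public List<RelNode> inputs() { return List.of(input); }
    }

    public static RelNode example() {
        return new LogicalProject(
            new LogicalFilter(new LogicalTableScan("a"), "ID = '1'"),
            List.of("ID", "NAME"));
    }
}
```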
Next we step into Calcite's cost-based optimizer, VolcanoPlanner, which returns the plan with the best cost. Let's unpack how Calcite optimizes.
First, Calcite sets the RelNode obtained from the previous stage as the root of the VolcanoPlanner, which brings us to the org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl() method. Setting a breakpoint there, you can see that during registration it first registers the RelNode's inputs via the ensureRegistered() method, and inside that there is a call looping back into registerImpl(). In other words, the tree's child nodes are registered first, in a depth-first traversal.
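A minimal sketch of that bottom-up registration, with toy classes (the real registerImpl() also canonicalizes each node into an equivalence set and fires rule matching, which is omitted here):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of VolcanoPlanner-style registration: registerImpl() first ensures
// every input (child) is registered, so the tree is registered bottom-up.
public class RegistrationSketch {
    record Node(String name, List<Node> inputs) {}

    private final List<String> registrationOrder = new ArrayList<>();
    private final Set<Node> registered = new HashSet<>();

    void registerImpl(Node node) {
        for (Node input : node.inputs()) {
            ensureRegistered(input);          // recurse into children first
        }
        registered.add(node);
        registrationOrder.add(node.name());   // parent is registered last
        // (real planner: match rules against the newly registered node here)
    }

    void ensureRegistered(Node node) {
        if (!registered.contains(node)) registerImpl(node);
    }

    public static List<String> demo() {
        Node scan = new Node("Scan", List.of());
        Node filter = new Node("Filter", List.of(scan));
        Node project = new Node("Project", List.of(filter));
        RegistrationSketch planner = new RegistrationSketch();
        planner.registerImpl(project);
        return planner.registrationOrder;     // children before parents
    }
}
```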
Since registration is a depth-first traversal, let's go back to the VolcanoPlanner.registerImpl() method we just looked at and see what the registration process actually does.
You can see that rules are about to be fired, so here we need to introduce a concept: the Rule in Calcite. From the class description we know that a rule converts one expression into another. What does that mean? Look at its abstract methods; to sum up, there are two core ones:
matches() returns whether the current RelNode matches this rule.
onMatch() is called when the rule matches; inside it you can call transformTo(), which converts one RelNode into another.
Rules are the heart of Calcite: essentially all SQL optimizations are composed of rules. You implement the optimization logic as a rule, let the matching RelNode tree nodes be transformed accordingly, and the result is the optimal execution plan.
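The two core methods can be sketched as a tiny interface. The types and the FILTER_PUSH_DOWN rule below are hypothetical (the real API is RelOptRule with a RelOptRuleCall parameter); the sketch only shows the matches()/onMatch() contract:

```java
// Minimal sketch of Calcite's rule contract: matches() says whether a rule
// applies to a node, onMatch() performs the rewrite (via transformTo() in
// real Calcite). Toy types; the real API is RelOptRule / RelOptRuleCall.
public class RuleSketch {
    record RelNode(String op, double cost) {}

    interface Rule {
        boolean matches(RelNode rel);     // does this rule apply to this node?
        RelNode onMatch(RelNode rel);     // return the equivalent rewritten node
    }

    // Hypothetical example rule: rewrite a LogicalFilter into a cheaper
    // "HBaseFilter" that pushes the predicate down into the storage layer.
    static final Rule FILTER_PUSH_DOWN = new Rule() {
        public boolean matches(RelNode rel) { return rel.op().equals("LogicalFilter"); }
        public RelNode onMatch(RelNode rel) { return new RelNode("HBaseFilter", rel.cost() / 2); }
    };

    public static RelNode apply(Rule rule, RelNode rel) {
        return rule.matches(rel) ? rule.onMatch(rel) : rel;
    }
}
```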
OK, back to the main flow. Continue into the VolcanoPlanner.fireRules() method mentioned above to see how rules are triggered. The logic here is relatively simple: when a RelNode satisfies a rule, the match() method of VolcanoRuleCall is invoked. One thing to note is that classOperands maps a RelNode class to all the rules that might match it, and the node may appear at the top of a rule's operand pattern or further down.
Say I have a LogicalFilter RelNode and define two rules:
RuleA: operand(LogicalFilter.class, operand(TableScan.class))
RuleB: operand(LogicalProject.class, operand(LogicalFilter.class))
Then both rules enter the classOperands mapping as possible matches for LogicalFilter.
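A sketch of that index. The rule names RuleA/RuleB come from the example above; everything else is a toy stand-in for Calcite's internal classOperands multimap. The point is that LogicalFilter appears as the top operand of RuleA and a child operand of RuleB, so both rules are candidates when a LogicalFilter is registered:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the classOperands idea: every operand of every rule is indexed
// by the RelNode class it matches, so one node class (LogicalFilter) can be
// a root operand of one rule and a child operand of another.
public class ClassOperandsSketch {
    record Operand(String ruleName, Class<?> matchedClass) {}
    static class LogicalFilter {}
    static class LogicalProject {}
    static class TableScan {}

    public static Map<Class<?>, List<String>> buildIndex() {
        // RuleA: operand(LogicalFilter, operand(TableScan))
        // RuleB: operand(LogicalProject, operand(LogicalFilter))
        List<Operand> operands = List.of(
            new Operand("RuleA", LogicalFilter.class),
            new Operand("RuleA", TableScan.class),
            new Operand("RuleB", LogicalProject.class),
            new Operand("RuleB", LogicalFilter.class));
        Map<Class<?>, List<String>> index = new HashMap<>();
        for (Operand o : operands) {
            index.computeIfAbsent(o.matchedClass(), k -> new ArrayList<>()).add(o.ruleName());
        }
        return index;  // LogicalFilter maps to both RuleA and RuleB
    }
}
```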
Now continue with the code that runs after a rule matches.
We then arrive at the onMatch() of VolcanoPlanner.DeferringRuleCall, which adds the rule match to the ruleQueue in IterativeRuleDriver. This queue is dedicated to storing matched rules; note that at this point the matches merely sit in the queue, and none of the rules has been executed yet. So when do they run?
Back in the main flow: once all the RelNode children have been registered in setRoot(), the planner's findBestExp() method is called, and as the name suggests, it finds the optimal expression.
One thing to state up front: Calcite's optimization principle assumes that if an expression is optimal, its sub-expressions are also optimal. So to find the best plan for the current RelNode we only need to consider:
1. the sum of the best plans of its child nodes;
2. each alternative produced by a rule it matches, plus the best of the remaining parts.
Comparing these candidates yields the optimal plan for the current RelNode. To borrow a diagram: if node A can only match two rules, then we only need to enumerate those few cases when searching for the optimal solution.
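That "optimal substructure" assumption can be shown with a toy memoization-free recursion. The equivalence classes and costs below are invented numbers purely for illustration: each node name maps to its set of equivalent alternatives (e.g. produced by rules), and the best cost of a node is the cheapest alternative plus the best costs of its inputs:

```java
import java.util.List;
import java.util.Map;

// Toy illustration of the Volcano assumption "an optimal plan is built from
// optimal subplans": best(node) = min over all equivalent alternatives of
// (the alternative's own cost + the best cost of each of its inputs).
public class BestCostSketch {
    record Alt(double selfCost, List<String> inputs) {}

    // Equivalence classes: each name maps to its alternative implementations.
    // "A" has two rule-produced variants; both read from "B".
    static final Map<String, List<Alt>> EQUIV = Map.of(
        "A", List.of(new Alt(10, List.of("B")), new Alt(4, List.of("B"))),
        "B", List.of(new Alt(5, List.of()), new Alt(3, List.of())));

    public static double best(String name) {
        double best = Double.POSITIVE_INFINITY;
        for (Alt alt : EQUIV.get(name)) {
            double cost = alt.selfCost();
            for (String input : alt.inputs()) cost += best(input); // children's bests
            best = Math.min(best, cost);
        }
        return best;   // best("A") = 4 + best("B") = 4 + 3 = 7
    }
}
```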
If you are not familiar with the underlying theory, this article is a good primer: https://io-meter.com/2018/11/01/sql-query-optimization-volcano/
Now look at findBestExp(). It contains the main loop of the whole optimization, which searches for the optimal solution bestExp: it keeps taking rules from the queue and running them, and does not exit until every rule has been executed.
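The driver loop can be sketched as follows. The rule names and trigger map are hypothetical; the structural point is that firing one rule can register new nodes whose matches are enqueued, so the loop naturally runs until the queue drains:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;

// Sketch of the findBestExp() driver loop: keep popping matched rules off
// the ruleQueue and firing them; firing a rule may register new equivalent
// nodes, which enqueue further rule matches. Stop when the queue is empty.
public class RuleDriverSketch {
    public static int drive(Deque<String> ruleQueue,
                            Map<String, List<String>> triggers) {
        int fired = 0;
        while (!ruleQueue.isEmpty()) {               // loop until no rules remain
            String rule = ruleQueue.poll();
            fired++;
            // Firing this rule creates new nodes that match further rules:
            ruleQueue.addAll(triggers.getOrDefault(rule, List.of()));
        }
        return fired;
    }

    public static int demo() {
        Deque<String> queue = new ArrayDeque<>(List.of("FilterPushDown"));
        // Hypothetical: FilterPushDown produces a node that ProjectMerge matches.
        Map<String, List<String>> triggers = Map.of("FilterPushDown", List.of("ProjectMerge"));
        return drive(queue, triggers);
    }
}
```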
The queue here is exactly the one mentioned earlier: when a RelNode is registered, its matching rules are put into this queue. A natural question arises: as mentioned before, running a rule changes the RelNodes, i.e. it adds equivalent nodes, and that structural change can make rules match that previously could not. Will such rules be missed? To answer that, look at transformTo(), the method a rule's onMatch() uses to produce equivalent nodes.
Inside transformTo(), the newly created node goes through the default RelNode registration process again. When the new node is registered as an equivalent node, any new rule matches are added to the ruleQueue to await the next round of execution, so nothing is missed. In addition, when a rule transforms a RelNode, the generated RelNode is added to the original RelNode's set of equivalent nodes, and the propagateCostImprovement() method then checks whether the new equivalent node lowers the cost of the current RelNode; if it does, the RelNode's best cost is updated and the improvement bubbles up to update the best cost of its parent RelNodes.
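That bubbling can be sketched with a toy parent chain. The node names and costs are invented; the sketch only models the idea that a cost drop at a child propagates the same delta up through its parents:

```java
// Sketch of propagateCostImprovement(): when a new equivalent node lowers a
// node's best cost, update it and bubble the improvement up to the parent,
// whose best cost may now also drop. Toy model with a single parent chain.
public class CostPropagationSketch {
    static class Node {
        final String name;
        final Node parent;       // null for the root of the tree
        double bestCost;         // own cost + best cost of the child subtree
        Node(String name, Node parent, double selfCost, double childBest) {
            this.name = name; this.parent = parent;
            this.bestCost = selfCost + childBest;
        }
    }

    // A cheaper equivalent was found for `node`: try to improve and bubble up.
    static void propagateCostImprovement(Node node, double newCost) {
        if (newCost >= node.bestCost) return;        // no improvement, stop here
        double delta = node.bestCost - newCost;
        node.bestCost = newCost;
        if (node.parent != null)                     // parent's best shrinks too
            propagateCostImprovement(node.parent, node.parent.bestCost - delta);
    }

    public static double demo() {
        Node project = new Node("Project", null, 2, 15);    // best = 17
        Node filter  = new Node("Filter", project, 5, 10);  // best = 15
        Node scan    = new Node("Scan",   filter, 10, 0);   // best = 10
        propagateCostImprovement(scan, 4);  // a rule produced a cheaper scan
        return project.bestCost;            // 17 - 6 = 11
    }
}
```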
So the loop runs while (true): keep pulling rules from the ruleQueue and firing them until the queue is empty; each fired rule may add a new equivalent node; if that node's cost is better, the best RelNode of the whole tree is updated; the new equivalent RelNode may match new rules, which are added to the ruleQueue; then the next iteration begins. When no rules are left to match, findBestExp() returns the optimized, optimal RelNode, and each framework then translates that RelNode into its own API.
That wraps up Calcite; now you can start digging into the Flink SQL source code.
Original article: https://www.cnblogs.com/ljygz/p/15421973.html
end