external

package
v0.4.2 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Aug 28, 2020 License: Apache-2.0 Imports: 20 Imported by: 0

README

Integrating with Parsers of SQL Engines

The Challenge

Typical SQLFlow statements are like the following.

SELECT * FROM training_data TO TRAIN DNNClassifier LABEL kind INTO my_model;

SELECT * FROM testing_data TO PREDICT testing_data.predicted_kind USING my_model;

The point is, assuming someone already has a SELECT statement for data cleaning and augmentation, (s)he could add a TO TRAIN or TO PREDICT clause to enable AI, no matter how complex the statement is or if it is hundreds of lines of code including nested SELECT.

A significant challenge for SQLFlow here is to parse the SQL. Many SQL engines, such as MySQL, Oracle, Hive, SparkSQL, Flink, claim they are compatible with ANSI SQL but most have unique features.

Think about the following facts that (1) each parser might be very complicated, for example, MySQL's parser file sql_yacc.yy has >16,000 lines of code, and (2) some dialects have contradictory uses of the same keyword, it is intractable to merge multiple dialect parsers in the "SQLFlow parser" and keeps it compatible with the development of the dialects.

An alternative is to re-implement SQLFlow for each SQL engine as their extensions or UDF (user-defined function). This approach is equally intractable as there are so many engines to support.

Overall Design

We design SQLFlow as a wrapper of these SQL engines, which means that SQL statements come directly to SQLFlow and SQLFlow decides if it follows the dialect syntax of the specified SQL engine. If so, SQLFlow proxies the statement to the SQL engine, otherwise, SQLFlow translates it into a Python program, a.k.a., the submitter program, which calls the SQL engine and the AI engine to train or to predict.

With this design, the key challenge becomes to judge if a SQL statement is acceptable by a SQL engine's parser. We propose the following solution:

func Parse(sql string) (acceptable bool, err error) {
    // Where SQL is acceptable by the original engine, no error occurred.
    err := sql_engine.Parse(sql)
    if err == nil {
        return true, ni
    }

    // Error message from a parser should contain error position.
    pos := parseErrorPosition(err)
    leftPart = sql[:pos]
    rightPart = sql[pos:]

    errLeft := sql_engine.Parse(leftPart)
    // In this case, the SQL is not acceptable due to the syntax error
    if err != nil {
       return false, err
    }

    // If leftPart is acceptable, it is a legitimate  SELECT statement. We then try right part with SQLFlow parser.
    errRight := SQLFlow.Parse(rightPart)
    if err != nil {
        return false, errRight
    }

    // The left part is a SELECT and the right part is TO TRAIN or TO PREDICT.
    return false, nil
}

This procedure has two assumptions: (1) a SQL engine has a parser, denoted by sql_engine.Parse, and (2) the parser reports the position of syntax errors if there is any. Both assumptions are reasonable. However, a key question here is how to call parsers of various SQL engines. For open source SQL engines, this doesn't seem an issue. Here we list parsers used by some well-known engines:

SQL Engine Parser
Beam Calcite
Flink Calcite
Storm Calcite
Hive Hplsql.g4
MySQL sql_parse.{h,cc}

It is notable that MySQL doesn't provide sufficient documentation on how to call its parser, so we call a Go implementation of MySQL's SQL parser by PingCap https://github.com/pingcap/parser instead. For proprietary engines, we have a plan to ask the licensing of their parsers.

The External Parser Abstraction

The above example program calls sql_engine.Parse twice before calling SQLFlow.Parse. In practice, because Calcite parser is in Java, SQLFlow is a Go program, to enable SQLFlow calling Calcite parser, we have to wrap Calcite parser up into a command line tool which outputs the parsing result in JSON format. It is time-consuming to make a Java command line call, so we pack the two calls to sql_engine.Parse into one call. Each call will output and result in JSON format and SQLFlow can parse the output into the following Go struct.

type ParseResult struct {
	SQL      []string `json:"sql"`
	Position int      `json:"position"`
	Error    string   `json:"error"`
}

Some external parsers are in Go and don't need command line call. For example, TiDB parser is in Go, and Hive parser is a set of AntLR grammar rules, and AntLR can generate Go code. However, we would unify the interface to external parsers:

func external_parser(kind, sql string) ([]string, int, error) {
   switch(kind) {
   case "mysql":
       return tidb.Parse(sql)
   case "calcite":
       return calcite.Parse(sql)
   ...
   }
}

Now we can change the above example program to call external_parser and SQLFlow.Parse.

func Parse(sql_program string) (nodes, error) {
    allNodes := make([]nodes, 0)
    while len(sql_program) > 0 {
        // Start parsing by the third party parser
        nodes, err := parser.Parse(sql_program)
        if err != nil {
            // Error message from a parser should contain error position.
            pos := parseErrorPosition(err)
            leftPart = sql[:pos]
            rightPart = sql[pos:]

            nodes, errLeft := parser.Parse(leftPart)
            // In this case, the SQL is not acceptable due to the syntax error
            if errLeft != nil {
               return nil, err
            }

            // If leftPart is acceptable, it is a legitimate  SELECT statement.
            // We then try right part with SQLFlow parser using the extended syntax parser.
            node, errRight := esp.Parse(rightPart)
            if errRight != nil {
                return false, err
            }

            // Combine the select statement and the ML clause
            nodes[-1] = combineNode(nodes[-1], node)
        }
        allNodes = append(allNodes, nodes)
        sql_program = update(sql_program, nodes)

    return allNodes, nil
}

Documentation

Index

Constants

This section is empty.

Variables

View Source
var (
	// SelectCases is a list of SELECT statements that must pass
	// the parser.
	SelectCases = []string{
		`select a from b`,
		`select * from my_table`,
		`-- this is a comment
select
a from b`,
		`SELECT
    customerNumber,
    checkNumber,
    amount
FROM
    payments
WHERE
    amount = (SELECT MAX(amount) FROM payments)`,
		`SELECT
    orderNumber,
    SUM(priceEach * quantityOrdered) total
FROM
    orderdetails
        INNER JOIN
    orders USING (orderNumber)
GROUP BY orderNumber
HAVING SUM(priceEach * quantityOrdered) > 60000`,
		`SELECT
    customerNumber,
    customerName
FROM
    customers
WHERE
    EXISTS( SELECT
            orderNumber, SUM(priceEach * quantityOrdered)
        FROM
            orderdetails
                INNER JOIN
            orders USING (orderNumber)
        WHERE
            customerNumber = customers.customerNumber
        GROUP BY orderNumber
        HAVING SUM(priceEach * quantityOrdered) > 60000) LIMIT 0`,
		`select -- comment 1
* -- comment 2
from /* comment 3 */
table_a /* comment
4*/ -- end
`,
	}
)

Functions

This section is empty.

Types

type Parser

type Parser interface {
	Parse(program string) ([]*Statement, int, error)
}

Parser abstract a parser of a SQL engine, for example, Hive, MySQL, TiDB, MaxCompute.

func NewParser

func NewParser(dialect string) (Parser, error)

NewParser instantiates a parser.

type Statement

type Statement struct {
	String             string
	Inputs             []string
	Outputs            []string
	IsUnfinishedSelect bool
}

Statement a parsed SQL statement string and it's input tables and output tables.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL