parser

package
v0.5.0
Published: Mar 28, 2024 License: Apache-2.0 Imports: 22 Imported by: 0

README

Generated Documentation and Embedded Help Texts

Many parts of the parser package include special consideration for generation or other production of user-facing documentation. This includes interactive help messages, generated documentation of the set of available functions, or diagrams of the various expressions.

Generated documentation is produced and maintained at compile time, while the interactive, contextual help is returned at runtime.

We equip the generated parser with the ability to report contextual help in two circumstances:

  • when the user explicitly requests help with the HELPTOKEN (current syntax: standalone "??")
  • when the user makes a grammatical mistake (e.g. INSERT sometable INTO(x, y) ...)

We use the docgen tool to produce the generated documentation files that are then included in the broader (handwritten) published documentation.

Help texts embedded in the grammar

The help is embedded in the grammar using special markers in yacc comments, for example:

// %Help: HELPKEY - shortdescription
// %Category: SomeCat
// %Text: whatever until next %marker at start of line, or non-comment.
// %SeeAlso: whatever until next %marker at start of line, or non-comment.
// %End (optional)

The "HELPKEY" becomes the map key in the generated Go map.

These texts are extracted automatically by help.awk and converted into a Go data structure in help_messages.go.
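
To make the marker format concrete, here is a minimal sketch of the extraction step written in Go. It is not the real help.awk script: the names helpBody and extractHelp are hypothetical, the sample help text in main is made up, and the handling of continuation lines is an assumption based on the marker description above.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// helpBody is a hypothetical type holding the fields carried by the markers.
type helpBody struct {
	Category, ShortDescription, Text, SeeAlso string
}

// extractHelp scans grammar source and collects one entry per %Help marker.
func extractHelp(src string) map[string]helpBody {
	out := make(map[string]helpBody)
	var key, field string // current help key and the multi-line field being filled
	var cur helpBody
	flush := func() {
		if key != "" {
			out[key] = cur
		}
		key, field, cur = "", "", helpBody{}
	}
	sc := bufio.NewScanner(strings.NewReader(src))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "//") {
			flush() // a non-comment line ends the current entry
			continue
		}
		body := strings.TrimPrefix(strings.TrimPrefix(line, "//"), " ")
		switch {
		case strings.HasPrefix(body, "%Help:"):
			flush()
			rest := strings.TrimSpace(strings.TrimPrefix(body, "%Help:"))
			k, desc, _ := strings.Cut(rest, " - ")
			key, cur.ShortDescription = strings.TrimSpace(k), strings.TrimSpace(desc)
		case strings.HasPrefix(body, "%Category:"):
			cur.Category = strings.TrimSpace(strings.TrimPrefix(body, "%Category:"))
			field = ""
		case strings.HasPrefix(body, "%Text:"):
			cur.Text = strings.TrimSpace(strings.TrimPrefix(body, "%Text:"))
			field = "text"
		case strings.HasPrefix(body, "%SeeAlso:"):
			cur.SeeAlso = strings.TrimSpace(strings.TrimPrefix(body, "%SeeAlso:"))
			field = "seealso"
		case strings.HasPrefix(body, "%End"):
			flush()
		case field == "text":
			cur.Text += "\n" + body // continuation of a multi-line %Text
		case field == "seealso":
			cur.SeeAlso += "\n" + body // continuation of a multi-line %SeeAlso
		}
	}
	flush()
	return out
}

func main() {
	// Made-up help text, used only to exercise the sketch.
	src := "// %Help: BACKUP - example short description\n" +
		"// %Category: SomeCat\n" +
		"// %Text: BACKUP <targets> TO <location>\n" +
		"// %SeeAlso: RESTORE\n" +
		"backup_stmt:\n"
	fmt.Printf("%+v\n", extractHelp(src)["BACKUP"])
}
```

The map key is the HELPKEY, which matches the statement above that the key becomes the map key in the generated Go map.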

Support in the parser

Primary mechanism - LALR error recovery

The primary mechanism is leveraging error recovery in LALR parsers using the special error token [1] [2]: when an unexpected token is encountered, the LALR parser will pop tokens on the stack until the prefix matches a grammar rule with the special "error" token (if any). If such a rule exists, its action is used to reduce and the erroneous tokens are discarded.

This mechanism is used both when the user makes a mistake, and when the user inserts the HELPTOKEN in the middle of a statement. When present in the middle of a statement, HELPTOKEN is considered an error and triggers the error recovery.

We use this for contextual help by providing error rules that generate a contextual help text during LALR error recovery.

For example:

backup_stmt:
  BACKUP targets TO string_or_placeholder opt_as_of_clause opt_incremental opt_with_options
  {
    $$.val = &Backup{Targets: $2.targetList(), To: $4.expr(), IncrementalFrom: $6.exprs(), AsOf: $5.asOfClause(), Options: $7.kvOptions()}
  }
| BACKUP error { return helpWith(sqllex, `BACKUP`) }

In this example, the grammar specifies that if the BACKUP keyword is followed by input tokens to which the first (valid) grammar rule doesn't apply, the parser will "recover from the error" by backtracking until it sees only BACKUP on the stack followed by non-parsable tokens, at which point it takes the error rule and executes its action.
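
The backtracking can be pictured with a toy model in Go. This is a deliberate simplification: a real LALR parser pops parser states from generated tables rather than raw symbols, and the errorRules table and recoverHelpKey function below are hypothetical names invented for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// errorRules maps a stack prefix to the help key of its "prefix error" rule.
// The entries are illustrative and not extracted from the real grammar.
var errorRules = map[string]string{
	"BACKUP":      "BACKUP",
	"ALTER TABLE": "ALTER TABLE",
}

// recoverHelpKey pops symbols off the end of the stack until the remaining
// prefix has an error rule, then returns that rule's help key.
func recoverHelpKey(stack []string) (string, bool) {
	for n := len(stack); n > 0; n-- {
		if key, ok := errorRules[strings.Join(stack[:n], " ")]; ok {
			return key, true
		}
	}
	return "", false
}

func main() {
	// The parser shifted BACKUP and a target, then hit an unexpected token.
	key, ok := recoverHelpKey([]string{"BACKUP", "foo", "INTO"})
	fmt.Println(key, ok) // prints "BACKUP true"
}
```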

The action is return helpWith(...). What this does is:

  • halts parsing (the generated parser executes all actions in a big loop; a return interrupts this loop);
  • makes the parser return with an error (the helpWith function returns non-zero);
  • extends the parsing error message with a help text; this help text can subsequently be exploited in a client to display the help message in a friendly manner.
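
The three effects listed above can be sketched in a few lines of Go. Everything here is a stand-in: toyLexer, toyHelpWith, helpTexts and parse are hypothetical names, the help text is a placeholder, and the real generated parser inlines its actions in one large loop instead of calling closures.

```go
package main

import (
	"errors"
	"fmt"
)

// toyLexer stands in for the lexer that actions receive as sqllex.
type toyLexer struct {
	err error
}

// helpTexts is a stand-in for the generated help map.
var helpTexts = map[string]string{
	"BACKUP": "BACKUP - back up data (placeholder text)",
}

// toyHelpWith records a syntax error extended with the help text for key
// and returns a non-zero status so that the parse loop stops.
func toyHelpWith(lex *toyLexer, key string) int {
	msg := "syntax error"
	if text, ok := helpTexts[key]; ok {
		msg += "\nHELP: " + text
	}
	lex.err = errors.New(msg)
	return 1
}

// parse mimics the action loop: each action runs in turn, and an action
// that returns non-zero ends parsing immediately.
func parse(lex *toyLexer, actions []func(*toyLexer) int) int {
	for _, act := range actions {
		if rc := act(lex); rc != 0 {
			return rc
		}
	}
	return 0
}

func main() {
	lex := &toyLexer{}
	rc := parse(lex, []func(*toyLexer) int{
		func(l *toyLexer) int { return toyHelpWith(l, "BACKUP") },
		func(l *toyLexer) int { panic("never reached") },
	})
	fmt.Println(rc)
	fmt.Println(lex.err)
}
```

A client can then look for the help portion of the error message and display it separately from the syntax error itself.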

Code generation

Since the pattern "{ return helpWith(sqllex, ...) }" is common, we also implement a shorthand syntax based on comments, for example:

backup_stmt:
   ...
| BACKUP error // SHOW HELP: BACKUP

The special comment syntax "SHOW HELP: XXXX" is substituted by means of an auxiliary script (replace_help_rules.awk) into the form explained above.
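
The substitution itself is a plain text rewrite. The sketch below shows the idea in Go; the real work is done by replace_help_rules.awk, and the regular expression and the name expandShowHelp are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"regexp"
)

// showHelpRe matches the shorthand comment at the end of a line and
// captures the help key.
var showHelpRe = regexp.MustCompile(`// SHOW HELP: (.+?)\s*$`)

// expandShowHelp rewrites one grammar line from the shorthand comment form
// into the explicit error action form. Lines without the comment are
// returned unchanged.
func expandShowHelp(line string) string {
	return showHelpRe.ReplaceAllString(line, "{ return helpWith(sqllex, `${1}`) }")
}

func main() {
	fmt.Println(expandShowHelp("| BACKUP error // SHOW HELP: BACKUP"))
}
```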

Secondary mechanism - explicit help token

The mechanism described above works both when the user makes a grammatical error and when they place the HELPTOKEN in the middle of a statement, rendering it invalid.

However for contextual help this is not sufficient: what happens if the user requests HELPTOKEN at a position in the grammar where everything before is a complete, valid SQL input?

For example: DELETE FROM foo ??

When encountering this input, the LALR parser will see DELETE FROM foo first, then reduce using the DELETE action because everything up to this point is a valid DELETE statement. When the HELPTOKEN is encountered, the statement has already been completed and the LALR parser doesn't 'know' any more that it was in the context of a DELETE statement.

If we try to place an error-based recovery rule at the top-level:

stmt:
  alter_stmt
| backup_stmt
| ...
| delete_stmt
| ...
| error { ??? }

This wouldn't work: the code inside the error action cannot "observe" the tokens seen so far, so there would be no way to know whether the error should be about DELETE, or instead about ALTER, BACKUP, etc.

So in order to handle HELPTOKEN after a valid statement, we must place it in a rule where the context is still available, that is before the statement's grammar rule is reduced.

Where would that be? Suppose we had a simple statement rule:

somesimplestmt:
  SIMPLE DO SOMETHING { $$ = reduce(...) }
| SIMPLE error { help ... }

We could extend with:

somesimplestmt:
  SIMPLE DO SOMETHING { $$ = reduce(...) }
| SIMPLE DO SOMETHING HELPTOKEN { help ... }
| SIMPLE error { help ... }

(the alternative also works:

somesimplestmt:
  SIMPLE DO SOMETHING { $$ = reduce(...) }
| SIMPLE DO SOMETHING error { help ... }
| SIMPLE error { help ... }

)

That is all fine and dandy, but in SQL we have statements with many alternate forms, for example:

alter_rename_table_stmt:
  ALTER TABLE relation_expr RENAME TO qualified_name { ... }
| ALTER TABLE IF EXISTS relation_expr RENAME TO qualified_name { ... }
| ALTER TABLE relation_expr RENAME opt_column name TO name { ... }
| ALTER TABLE IF EXISTS relation_expr RENAME opt_column name TO name { ... }

To add complementary handling of the help token at the end of valid statements we could, but would hate to, duplicate all the rules:

alter_rename_table_stmt:
  ALTER TABLE relation_expr RENAME TO qualified_name { ... }
| ALTER TABLE relation_expr RENAME TO qualified_name HELPTOKEN { help ... }
| ALTER TABLE IF EXISTS relation_expr RENAME TO qualified_name { ... }
| ALTER TABLE IF EXISTS relation_expr RENAME TO qualified_name HELPTOKEN { help ... }
| ALTER TABLE relation_expr RENAME opt_column name TO name { ... }
| ALTER TABLE relation_expr RENAME opt_column name TO name HELPTOKEN { help ... }
| ALTER TABLE IF EXISTS relation_expr RENAME opt_column name TO name { ... }
| ALTER TABLE IF EXISTS relation_expr RENAME opt_column name TO name HELPTOKEN { help ... }

This duplication is horrendous (not to mention hard to maintain), so instead we should attempt to factor the help token in a context where it is still known that we are dealing just with that statement.

The following works:

alter_rename_table_stmt:
  real_alter_rename_table_stmt { $$ = $1 }
| real_alter_rename_table_stmt HELPTOKEN { help ... }

real_alter_rename_table_stmt:
  ALTER TABLE relation_expr RENAME TO qualified_name { ... }
| ALTER TABLE IF EXISTS relation_expr RENAME TO qualified_name { ... }
| ALTER TABLE relation_expr RENAME opt_column name TO name { ... }
| ALTER TABLE IF EXISTS relation_expr RENAME opt_column name TO name { ... }

Or does it? Without anything else, yacc complains with a "shift/reduce conflict". The conflict comes from an ambiguity: when the parsing stack contains everything needed to match a real_alter_rename_table_stmt, there is a choice between reducing the simple form alter_rename_table_stmt: real_alter_rename_table_stmt, or shifting into the more complex form alter_rename_table_stmt: real_alter_rename_table_stmt HELPTOKEN.

This is another form of the textbook situation when yacc is used to parse if-else statements in a programming language: the rule stmt: IF cond THEN body | IF cond THEN body ELSE body is ambiguous (and yields a shift/reduce conflict) for exactly the same reason.

The solution here is also straight out of a textbook: one simply informs yacc of the relative priority between the two candidate rules. In this case, when faced with a neutral choice, we encourage yacc to shift. The particular mechanism is to tell yacc that one rule has a higher priority than another.

It just so happens, however, that the yacc language only allows us to set relative priorities of tokens, not rules. And here we have a problem: of the two rules that need to be prioritized, only one has a token to work with (the one with HELPTOKEN). Which token should we prioritize for the other?

Conveniently yacc knows about this trouble and offers us an awkward, but working solution: we can tell it "use for this rule the same priority level as an existing token, even though the token is not part of the rule". The syntax for this is rule %prec TOKEN.

We can then use this as follows:

alter_rename_table_stmt:
  real_alter_rename_table_stmt           %prec LOWTOKEN { $$ = $1 }
| real_alter_rename_table_stmt HELPTOKEN %prec HIGHTOKEN { help ... }

We could create two new pseudo-tokens for this (called LOWTOKEN and HIGHTOKEN); however, we can conveniently also reuse otherwise valid tokens that have known relative priorities. In our case we settled on VALUES (low priority) and UMINUS (high priority).

Code generation

With the latter mechanism presented above the pattern

rule:
  somerule           %prec VALUES
| somerule HELPTOKEN %prec UMINUS { help ... }

becomes super common, so we automate it with the following special syntax:

rule:
  somerule // EXTEND WITH HELP: XXX

And the code replacement in replace_help_rules.awk expands this to the form above automatically.
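
As a rough illustration of this expansion, here is a Go sketch of the rewrite. The actual implementation lives in replace_help_rules.awk; the function expandExtendWithHelp and its regular expression are hypothetical, and the text of the generated help action is assumed to follow the helpWith form shown earlier for error rules.

```go
package main

import (
	"fmt"
	"regexp"
)

// extendRe matches "<rule> // EXTEND WITH HELP: <key>", keeping any leading
// indentation or "|" separator in the first group.
var extendRe = regexp.MustCompile(`^(\s*\|?\s*)(\S+)\s*// EXTEND WITH HELP: (.+?)\s*$`)

// expandExtendWithHelp turns one shorthand line into the two alternatives
// with explicit precedence. Other lines are returned unchanged.
func expandExtendWithHelp(line string) []string {
	m := extendRe.FindStringSubmatch(line)
	if m == nil {
		return []string{line}
	}
	prefix, rule, key := m[1], m[2], m[3]
	return []string{
		fmt.Sprintf("%s%s %%prec VALUES", prefix, rule),
		// Assumed action text, modeled on the error rule form.
		fmt.Sprintf("| %s HELPTOKEN %%prec UMINUS { return helpWith(sqllex, `%s`) }", rule, key),
	}
}

func main() {
	for _, l := range expandExtendWithHelp("  somerule // EXTEND WITH HELP: XXX") {
		fmt.Println(l)
	}
}
```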

Generated Documentation

Documentation of the SQL functions and operators is generated by the docgen utility, using make generate PKG=./docs/.... The markdown-formatted files are kept in docs/generated/sql and should be regenerated whenever the functions or operators they document change. If regenerating produces a diff, a CI failure is expected.

References

  1. https://www.gnu.org/software/bison/manual/html_node/Error-Recovery.html
  2. http://stackoverflow.com/questions/9796608/error-handling-in-yacc

Documentation

Constants

This section is empty.

Variables

var AllHelp = func(h map[string]HelpMessageBody) string {

	cmds := make(map[string][]string)
	for c, details := range h {
		if details.Category == "" {
			continue
		}
		cmds[details.Category] = append(cmds[details.Category], c)
	}

	// Ensure the result is deterministic.
	var categories []string
	for c, l := range cmds {
		categories = append(categories, c)
		sort.Strings(l)
	}
	sort.Strings(categories)

	// Compile the final help index.
	var buf bytes.Buffer
	w := tabwriter.NewWriter(&buf, 0, 0, 1, ' ', 0)
	for _, cat := range categories {
		fmt.Fprintf(w, "%s:\n", cases.Title(language.English).String(cat))
		for _, item := range cmds[cat] {
			fmt.Fprintf(w, "\t\t%s\t%s\n", item, h[item].ShortDescription)
		}
		fmt.Fprintln(w)
	}
	_ = w.Flush()
	return buf.String()
}(helpMessages)

AllHelp contains an overview of all statements with help messages. It is displayed, for example, in the CLI shell by \h without additional parameters.

var HelpMessages = func(h map[string]HelpMessageBody) map[string]HelpMessageBody {
	appendSeeAlso := func(newItem, prevItems string) string {

		if prevItems != "" {
			return newItem + "\n  " + prevItems
		}
		return newItem
	}
	reformatSeeAlso := func(seeAlso string) string {

		return strings.Replace(seeAlso, ", ", "\n  ", -1)
	}
	srcMsg := h["<SOURCE>"]
	srcMsg.SeeAlso = reformatSeeAlso(strings.TrimSpace(srcMsg.SeeAlso))
	selectMsg := h["<SELECTCLAUSE>"]
	selectMsg.SeeAlso = reformatSeeAlso(strings.TrimSpace(selectMsg.SeeAlso))
	for k, m := range h {
		m = h[k]
		m.ShortDescription = strings.TrimSpace(m.ShortDescription)
		m.Text = strings.TrimSpace(m.Text)
		m.SeeAlso = strings.TrimSpace(m.SeeAlso)

		if strings.Contains(m.Text, "<source>") && k != "<SOURCE>" {
			m.Text = strings.TrimSpace(m.Text) + "\n\n" + strings.TrimSpace(srcMsg.Text)
			m.SeeAlso = appendSeeAlso(srcMsg.SeeAlso, m.SeeAlso)
		}

		if strings.Contains(m.Text, "<selectclause>") && k != "<SELECTCLAUSE>" {
			m.Text = strings.TrimSpace(m.Text) + "\n\n" + strings.TrimSpace(selectMsg.Text)
			m.SeeAlso = appendSeeAlso(selectMsg.SeeAlso, m.SeeAlso)
		}

		if strings.Contains(m.Text, "<tablename>") {
			m.SeeAlso = appendSeeAlso("SHOW TABLES", m.SeeAlso)
		}
		m.SeeAlso = reformatSeeAlso(m.SeeAlso)
		h[k] = m
	}
	return h
}(helpMessages)

HelpMessages is the registry of all help messages, keyed by the top-level statement that they document. The key is intended for use via the \h client-side command.
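
To see what the two helper closures in the initializer do to a SeeAlso value, the sketch below lifts them into standalone functions and runs them on made-up input. The function bodies are copied from the source above; the sample strings are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// reformatSeeAlso puts each comma-separated item on its own indented line.
func reformatSeeAlso(seeAlso string) string {
	return strings.Replace(seeAlso, ", ", "\n  ", -1)
}

// appendSeeAlso places newItem before any previously listed items.
func appendSeeAlso(newItem, prevItems string) string {
	if prevItems != "" {
		return newItem + "\n  " + prevItems
	}
	return newItem
}

func main() {
	// A message whose text mentions <tablename> gets SHOW TABLES prepended.
	seeAlso := appendSeeAlso("SHOW TABLES", "INSERT, UPDATE")
	fmt.Printf("%q\n", reformatSeeAlso(seeAlso)) // prints "SHOW TABLES\n  INSERT\n  UPDATE"
}
```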

Functions

func HasMultipleStatements

func HasMultipleStatements(sql string) bool

HasMultipleStatements returns true if the sql string contains more than one statement.

func LastLexicalToken

func LastLexicalToken(sql string) (lastTok int, ok bool)

LastLexicalToken returns the last lexical token. If the string has no lexical tokens, returns 0 and ok=false.

func ParseExpr

func ParseExpr(sql string) (tree.Expr, error)

ParseExpr is a short-hand for parseExprs([]string{sql})

func ParseExprs

func ParseExprs(sql []string) (tree.Exprs, error)

ParseExprs is a short-hand for parseExprs(sql)

func ParseQualifiedTableName

func ParseQualifiedTableName(sql string) (*tree.TableName, error)

ParseQualifiedTableName parses a SQL string of the form `[ database_name . ] [ schema_name . ] table_name`.

func ParseTableName

func ParseTableName(sql string) (*tree.UnresolvedObjectName, error)

ParseTableName parses a table name.

func ParseTableNameWithQualifiedNames

func ParseTableNameWithQualifiedNames(sql string) (*tree.UnresolvedObjectName, error)

ParseTableNameWithQualifiedNames can be used to parse an input table name that might be prefixed with an unquoted qualified name. The standard ParseTableName cannot do this due to limitations with our parser. In particular, the parser can't parse different productions individually -- it must parse them as part of a top level statement. This causes qualified names that contain keywords to require quotes, which are not required in some cases due to Postgres compatibility (in particular, as arguments to pg_dump). This function gets around this limitation by parsing the input table name as a column name with a fake non-keyword prefix, and then shifting the result down into an UnresolvedObjectName.

func ParseType

func ParseType(sql string) (tree.ResolvableTypeReference, error)

ParseType parses a column type.

func RunShowSyntax

func RunShowSyntax(
	ctx context.Context,
	stmt string,
	report func(ctx context.Context, field, msg string),
	reportErr func(ctx context.Context, err error),
)

RunShowSyntax analyzes the syntax and reports its structure as data for the client. Even an error is reported as data.

Because errors are reported to the client as a result rather than as an error, the usual code path to capture and record errors is not triggered. The caller can instead pass a reportErr closure to capture errors; reportErr may be nil.

func SplitFirstStatement

func SplitFirstStatement(sql string) (pos int, ok bool)

SplitFirstStatement returns the length of the prefix of the string up to and including the first semicolon that separates statements. If there is no semicolon, returns ok=false.

Types

type HelpMessage

type HelpMessage struct {
	// Command is set if the message is about a statement.
	Command string
	// Function is set if the message is about a built-in function.
	Function string

	// HelpMessageBody contains the details of the message.
	HelpMessageBody
}

HelpMessage describes a contextual help message.

func (*HelpMessage) Format

func (h *HelpMessage) Format(w io.Writer)

Format prints out details about the message onto the specified output stream.

func (*HelpMessage) String

func (h *HelpMessage) String() string

String implements the fmt.Stringer interface.

type HelpMessageBody

type HelpMessageBody struct {
	Category         string
	ShortDescription string
	Text             string
	SeeAlso          string
}

HelpMessageBody defines the body of a help text. The messages are structured to facilitate future help navigation functionality.

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

Parser wraps a scanner, parser and other utilities present in the parser package.

func (*Parser) Parse

func (p *Parser) Parse(sql string) (Statements, error)

Parse parses the sql and returns a list of statements.

func (*Parser) ParseWithInt

func (p *Parser) ParseWithInt(sql string, nakedIntType *types.T) (Statements, error)

ParseWithInt parses a sql statement string and returns a list of Statements. The INT token will result in the specified TInt type.

type Statement

type Statement struct {
	// AST is the root of the AST tree for the parsed statement.
	AST tree.Statement

	// SQL is the original SQL from which the statement was parsed. Note that this
	// is not appropriate for use in logging, as it may contain passwords and
	// other sensitive data.
	SQL string

	// NumPlaceholders indicates the number of arguments to the statement (which
	// are referenced through placeholders). This corresponds to the highest
	// argument position (i.e. the x in "$x") that appears in the query.
	//
	// Note: where there are "gaps" in the placeholder positions, this number is
	// based on the highest position encountered. For example, for `SELECT $3`,
	// NumPlaceholders is 3. These cases are malformed and will result in a
	// type-check error.
	NumPlaceholders int

	// NumAnnotations indicates the number of annotations in the tree. It is equal
	// to the maximum annotation index.
	NumAnnotations tree.AnnotationIdx
}

Statement is the result of parsing a single statement. It contains the AST node along with other information.
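
The NumPlaceholders rule (the highest position wins, even when there are gaps) can be illustrated with a naive text scan. This is only a sketch of the counting rule, not how the parser computes the field: highestPlaceholder is a hypothetical helper, and unlike a real lexer it would also count $N sequences inside string literals or comments.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// placeholderRe matches positional placeholders such as $1 or $12.
var placeholderRe = regexp.MustCompile(`\$(\d+)`)

// highestPlaceholder returns the largest N among all $N occurrences in sql,
// or 0 if there are none.
func highestPlaceholder(sql string) int {
	highest := 0
	for _, m := range placeholderRe.FindAllStringSubmatch(sql, -1) {
		if n, err := strconv.Atoi(m[1]); err == nil && n > highest {
			highest = n
		}
	}
	return highest
}

func main() {
	fmt.Println(highestPlaceholder("SELECT $3"))     // prints 3
	fmt.Println(highestPlaceholder("SELECT $1, $2")) // prints 2
}
```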

func ParseOne

func ParseOne(sql string) (Statement, error)

ParseOne parses a sql statement string, ensuring that it contains only a single statement, and returns that Statement. ParseOne will always interpret the INT and SERIAL types as 64-bit types, since this is used in various internal-execution paths where we might receive bits of SQL from other nodes. In general, we expect that all user-generated SQL has been run through the ParseWithInt() function.

type Statements

type Statements []Statement

Statements is a list of parsed statements.

func Parse

func Parse(sql string) (Statements, error)

Parse parses a sql statement string and returns a list of Statements.

func (Statements) String

func (stmts Statements) String() string

String returns the AST formatted as a string.

func (Statements) StringWithFlags

func (stmts Statements) StringWithFlags(flags tree.FmtFlags) string

StringWithFlags returns the AST formatted as a string (with the given flags).

type TokenString

type TokenString struct {
	TokenID int32
	Str     string
}

TokenString is the unit value returned by Tokens.

func Tokens

func Tokens(sql string) (tokens []TokenString, ok bool)

Tokens decomposes the input into lexical tokens.
