sequence: github.com/strace/sequence

package sequence

import "github.com/strace/sequence"

Sequence is a high performance sequential log scanner, analyzer and parser. It goes through a log message sequentially and parses out the meaningful parts without the use of regular expressions. It can parse over 100,000 messages per second without the need to separate parsing rules by log source type.

Documentation and other information are available at sequencer.io

Index

Package Files

analyzer.go config.go doc.go message.go parser.go reqmethods.go scanner.go sequence.go time.go tokens.go

Variables

var (
    TagTypesCount   int
    TokenTypesCount = int(token__END__) + 1
)
var (
    ErrNoMatch = errors.New("sequence: no pattern matched for this message")
)

func ReadConfig

func ReadConfig(file string) error

type Analyzer

type Analyzer struct {
    // contains filtered or unexported fields
}

Analyzer builds an analysis tree that represents all the Sequences from messages. It can be used to determine all of the unique patterns for a large body of messages.

It's based on a single basic concept: for multiple log messages, if the tokens in the same position share a common parent and a common child, then the token in that position is likely a variable string, which means it's something we can extract. For example, take a look at the following two messages:

Jan 12 06:49:42 irc sshd[7034]: Accepted password for root from 218.161.81.238 port 4228 ssh2
Jan 12 14:44:48 jlz sshd[11084]: Accepted publickey for jlz from 76.21.0.16 port 36609 ssh2

The first token of each message is a timestamp, and the 3rd token of each message is the literal "sshd". The literals "irc" and "jlz" both share a common parent, which is a timestamp. They also both share a common child, which is "sshd". This means the token in between these, the 2nd token in each message, likely represents a variable token in this message type. In this case, "irc" and "jlz" happen to represent the syslog host.

Looking further down the message, the literals "password" and "publickey" also share a common parent, "Accepted", and a common child, "for". So that means the token in this position is also a variable token (of type TokenString).

You can find several tokens that share a common parent and child in these two messages, which means each of these tokens can be extracted. Finally, we can determine that the single pattern that will match both is:

%time% %string% sshd [ %integer% ] : Accepted %string% for %string% from %ipv4% port %integer% ssh2

If later we add another message to this mix:

Jan 12 06:49:42 irc sshd[7034]: Failed password for root from 218.161.81.238 port 4228 ssh2

The Analyzer will determine that the literals "Accepted" in the 1st message, and "Failed" in the 3rd message share a common parent ":" and a common child "password", so it will determine that the token in this position is also a variable token. After all three messages are analyzed, the final pattern that will match all three messages is:

%time% %string% sshd [ %integer% ] : %string% %string% for %string% from %ipv4% port %integer% ssh2

func NewAnalyzer

func NewAnalyzer() *Analyzer

func (*Analyzer) Add

func (this *Analyzer) Add(seq Sequence) error

Add adds a single message sequence to the analysis tree. It will not determine if the tokens share a common parent or child at this point. After all the sequences are added, then Finalize() should be called.

func (*Analyzer) Analyze

func (this *Analyzer) Analyze(seq Sequence) (Sequence, error)

Analyze analyzes the message sequence supplied, and returns the unique pattern that will match this message.

func (*Analyzer) Finalize

func (this *Analyzer) Finalize() error

Finalize will go through the analysis tree and determine which tokens share a common parent and child, merge all the nodes that share at least 1 parent and 1 child, and finally compact the tree and remove all dead nodes.

type Message

type Message struct {
    Data string
    // contains filtered or unexported fields
}

func (*Message) Tokenize

func (this *Message) Tokenize() (Token, error)

Tokenize returns one token at a time.

type Parser

type Parser struct {
    // contains filtered or unexported fields
}

Parser is a tree-based parsing engine for log messages. It builds a parsing tree based on the pattern sequences supplied, and for each message sequence, returns the matching pattern sequence. Each of the message tokens will be marked with the semantic tag types.

func NewParser

func NewParser() *Parser

func (*Parser) Add

func (this *Parser) Add(seq Sequence) error

Add will add a single pattern sequence to the parser tree. This effectively builds the parser tree so it can be used for parsing later.

func (*Parser) Parse

func (this *Parser) Parse(seq Sequence) (Sequence, error)

Parse will take the message sequence supplied and go through the parser tree to find the matching pattern sequence. If found, the pattern sequence is returned.
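To make the tree-based matching concrete, here is a minimal standalone sketch of a pattern tree. It does not use this package's types: token, node, add, parse and errNoMatch are hypothetical names, patterns are plain strings split on spaces, and the matching order (exact literal first, then a typed placeholder such as %integer%, then %string% as a catch-all) is an assumption made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errNoMatch is a hypothetical sentinel error for this sketch.
var errNoMatch = errors.New("no pattern matched")

// token is a scanned message token: a type name plus the original text.
type token struct {
	typ   string // e.g. "literal", "integer", "ipv4"
	value string
}

// node is one level of the pattern tree. Edges are keyed by either a
// literal word or a placeholder such as "%integer%".
type node struct {
	children map[string]*node
	pattern  string // non-empty when a complete pattern ends here
}

func newNode() *node { return &node{children: map[string]*node{}} }

// add inserts one space-separated pattern into the tree.
func (n *node) add(pattern string) {
	cur := n
	for _, p := range strings.Fields(pattern) {
		next, ok := cur.children[p]
		if !ok {
			next = newNode()
			cur.children[p] = next
		}
		cur = next
	}
	cur.pattern = pattern
}

// parse walks the tree with the message tokens and returns the pattern
// that matches all of them, backtracking when a branch dead-ends.
func (n *node) parse(toks []token) (string, error) {
	if len(toks) == 0 {
		if n.pattern == "" {
			return "", errNoMatch
		}
		return n.pattern, nil
	}
	t := toks[0]
	// Try the exact literal edge, then the typed edge, then %string%.
	for _, key := range []string{t.value, "%" + t.typ + "%", "%string%"} {
		if child, ok := n.children[key]; ok {
			if p, err := child.parse(toks[1:]); err == nil {
				return p, nil
			}
		}
	}
	return "", errNoMatch
}

func main() {
	root := newNode()
	root.add("Failed password for %string% from %ipv4% port %integer%")
	msg := []token{
		{"literal", "Failed"}, {"literal", "password"}, {"literal", "for"},
		{"literal", "root"}, {"literal", "from"}, {"ipv4", "218.161.81.238"},
		{"literal", "port"}, {"integer", "4228"},
	}
	fmt.Println(root.parse(msg))
}
```

Because every pattern shares one tree, the cost of a lookup in this sketch depends on the number of tokens in the message rather than on the number of patterns loaded.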

type Scanner

type Scanner struct {
    // contains filtered or unexported fields
}

Scanner is a sequential lexical analyzer that breaks a log message into a sequence of tokens. It is sequential because it goes through the log message sequentially, tokenizing each part of the message, without the use of regular expressions. The scanner currently recognizes time stamps, IPv4 addresses, URLs, MAC addresses, integers and floating point numbers.

For example, the following message

Jan 12 06:49:42 irc sshd[7034]: Failed password for root from 218.161.81.238 port 4228 ssh2

returns the following Sequence:

Sequence{
	Token{TokenTime, TagUnknown, "Jan 12 06:49:42"},
	Token{TokenLiteral, TagUnknown, "irc"},
	Token{TokenLiteral, TagUnknown, "sshd"},
	Token{TokenLiteral, TagUnknown, "["},
	Token{TokenInteger, TagUnknown, "7034"},
	Token{TokenLiteral, TagUnknown, "]"},
	Token{TokenLiteral, TagUnknown, ":"},
	Token{TokenLiteral, TagUnknown, "Failed"},
	Token{TokenLiteral, TagUnknown, "password"},
	Token{TokenLiteral, TagUnknown, "for"},
	Token{TokenLiteral, TagUnknown, "root"},
	Token{TokenLiteral, TagUnknown, "from"},
	Token{TokenIPv4, TagUnknown, "218.161.81.238"},
	Token{TokenLiteral, TagUnknown, "port"},
	Token{TokenInteger, TagUnknown, "4228"},
	Token{TokenLiteral, TagUnknown, "ssh2"},
}

The following message

id=firewall time="2005-03-18 14:01:43" fw=TOPSEC priv=4 recorder=kernel type=conn policy=504 proto=TCP rule=deny src=210.82.121.91 sport=4958 dst=61.229.37.85 dport=23124 smac=00:0b:5f:b2:1d:80 dmac=00:04:c1:8b:d8:82

will return

Sequence{
	Token{TokenLiteral, TagUnknown, "id"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "firewall"},
	Token{TokenLiteral, TagUnknown, "time"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "\""},
	Token{TokenTime, TagUnknown, "2005-03-18 14:01:43"},
	Token{TokenLiteral, TagUnknown, "\""},
	Token{TokenLiteral, TagUnknown, "fw"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "TOPSEC"},
	Token{TokenLiteral, TagUnknown, "priv"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenInteger, TagUnknown, "4"},
	Token{TokenLiteral, TagUnknown, "recorder"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "kernel"},
	Token{TokenLiteral, TagUnknown, "type"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "conn"},
	Token{TokenLiteral, TagUnknown, "policy"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenInteger, TagUnknown, "504"},
	Token{TokenLiteral, TagUnknown, "proto"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "TCP"},
	Token{TokenLiteral, TagUnknown, "rule"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "deny"},
	Token{TokenLiteral, TagUnknown, "src"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenIPv4, TagUnknown, "210.82.121.91"},
	Token{TokenLiteral, TagUnknown, "sport"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenInteger, TagUnknown, "4958"},
	Token{TokenLiteral, TagUnknown, "dst"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenIPv4, TagUnknown, "61.229.37.85"},
	Token{TokenLiteral, TagUnknown, "dport"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenInteger, TagUnknown, "23124"},
	Token{TokenLiteral, TagUnknown, "smac"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenMac, TagUnknown, "00:0b:5f:b2:1d:80"},
	Token{TokenLiteral, TagUnknown, "dmac"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenMac, TagUnknown, "00:04:c1:8b:d8:82"},
}
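The following standalone sketch shows the general shape of regex-free, left-to-right tokenizing. It is a simplification and not the Scanner's code: the token type, scan and classify are hypothetical names, only spaces and the characters [ ] = " act as separators, a trailing colon is split off, and timestamps, URLs and floating point numbers are not handled at all. Classification relies on the standard library (strconv.Atoi, net.ParseIP, net.ParseMAC).

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// token pairs a type name with the text it was cut from (hypothetical type).
type token struct {
	typ   string
	value string
}

// classify decides the type of one chunk. A trailing ":" is split off into
// its own literal token first.
func classify(s string) []token {
	if len(s) > 1 && strings.HasSuffix(s, ":") {
		return append(classify(s[:len(s)-1]), token{"literal", ":"})
	}
	if _, err := strconv.Atoi(s); err == nil {
		return []token{{"integer", s}}
	}
	if ip := net.ParseIP(s); ip != nil && ip.To4() != nil && !strings.Contains(s, ":") {
		return []token{{"ipv4", s}}
	}
	if _, err := net.ParseMAC(s); err == nil {
		return []token{{"mac", s}}
	}
	return []token{{"literal", s}}
}

// scan walks the message once from left to right, cutting it into chunks
// at separators and classifying each chunk as it is completed.
func scan(msg string) []token {
	var toks []token
	start := -1 // byte offset where the current chunk began, -1 if none
	flush := func(end int) {
		if start >= 0 {
			toks = append(toks, classify(msg[start:end])...)
			start = -1
		}
	}
	for i, r := range msg {
		switch {
		case r == ' ' || r == '\t':
			flush(i)
		case strings.ContainsRune(`[]="`, r):
			flush(i)
			toks = append(toks, token{"literal", string(r)})
		default:
			if start < 0 {
				start = i
			}
		}
	}
	flush(len(msg))
	return toks
}

func main() {
	for _, t := range scan(`sshd[7034]: Failed password for root from 218.161.81.238 port 4228 ssh2`) {
		fmt.Printf("%-8s %q\n", t.typ, t.value)
	}
}
```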

func NewScanner

func NewScanner() *Scanner

func (*Scanner) Scan

func (this *Scanner) Scan(s string) (Sequence, error)

Scan returns a Sequence, or a list of tokens, for the data string supplied. Scan is not concurrent-safe, and the returned Sequence is only valid until the next time any Scan*() method is called. The best practice would be to create one Scanner for each goroutine.

func (*Scanner) ScanJson

func (this *Scanner) ScanJson(s string) (Sequence, error)

ScanJson returns a Sequence, or a list of tokens, for the JSON string supplied. ScanJson is not concurrent-safe, and the returned Sequence is only valid until the next time any Scan*() method is called. The best practice would be to create one Scanner for each goroutine.

ScanJson flattens a JSON string into key=value pairs, and it performs the following transformations:

- all {, }, [, ], and " characters are removed
- the colon between a key and its value is changed to "="
- nested objects have their keys concatenated with ".", so a JSON string like
		"userIdentity": {"type": "IAMUser"}
  will be returned as
		userIdentity.type=IAMUser
- arrays are flattened by appending an index number to the end of the key,
  starting with 0, so a JSON string like
		{"value":[{"open":"2014-08-16T13:00:00.000+0000"}]}
  will be returned as
		value.0.open = 2014-08-16T13:00:00.000+0000
- any key that has an empty value is skipped, so JSON strings like
		"reference":""		or		"filterSet": {}
  will not show up in the Sequence

type Sequence

type Sequence []Token

Sequence represents a list of tokens returned from the scanner, analyzer or parser.

func (Sequence) PrintTokens

func (this Sequence) PrintTokens() string

PrintTokens returns a multi-line representation of the tokens in the sequence.

func (Sequence) Signature

func (this Sequence) Signature() string

Signature returns a single line string that represents a common pattern for this type of message, basically stripping any strings or literals from the message.

func (Sequence) String

func (this Sequence) String() string

String returns a single line string that represents the pattern for the Sequence

type TagType

type TagType int

TagType is the semantic representation of a token.

var (
    TagUnknown    TagType = 0
    TagMsgId      TagType // The message identifier
    TagMsgTime    TagType // The timestamp that’s part of the log message
    TagSeverity   TagType // The severity of the event, e.g., Emergency, …
    TagPriority   TagType // The priority of the event
    TagAppHost    TagType // The hostname of the host where the log message is generated
    TagAppIP      TagType // The IP address of the host where the application that generated the log message is running on.
    TagAppVendor  TagType // The type of application that generated the log message, e.g., Cisco, ISS
    TagAppName    TagType // The name of the application that generated the log message, e.g., asa, snort, sshd
    TagSrcDomain  TagType // The domain name of the initiator of the event, usually a Windows domain
    TagSrcZone    TagType // The originating zone
    TagSrcHost    TagType // The hostname of the originator of the event or connection.
    TagSrcIP      TagType // The IPv4 address of the originator of the event or connection.
    TagSrcIPNAT   TagType // The natted (network address translation) IP of the originator of the event or connection.
    TagSrcPort    TagType // The port number of the originating connection.
    TagSrcPortNAT TagType // The natted port number of the originating connection.
    TagSrcMac     TagType // The mac address of the host that originated the connection.
    TagSrcUser    TagType // The user that originated the session.
    TagSrcUid     TagType // The user id that originated the session.
    TagSrcGroup   TagType // The group that originated the session.
    TagSrcGid     TagType // The group id that originated the session.
    TagSrcEmail   TagType // The originating email address
    TagDstDomain  TagType // The domain name of the destination of the event, usually a Windows domain
    TagDstZone    TagType // The destination zone
    TagDstHost    TagType // The hostname of the destination of the event or connection.
    TagDstIP      TagType // The IPv4 address of the destination of the event or connection.
    TagDstIPNAT   TagType // The natted (network address translation) IP of the destination of the event or connection.
    TagDstPort    TagType // The destination port number of the connection.
    TagDstPortNAT TagType // The natted destination port number of the connection.
    TagDstMac     TagType // The mac address of the destination host.
    TagDstUser    TagType // The user at the destination.
    TagDstUid     TagType // The user id at the destination.
    TagDstGroup   TagType // The group at the destination.
    TagDstGid     TagType // The group id at the destination.
    TagDstEmail   TagType // The destination email address
    TagProtocol   TagType // The protocol, such as TCP, UDP, ICMP, of the connection
    TagInIface    TagType // The incoming interface
    TagOutIface   TagType // The outgoing interface
    TagPolicyID   TagType // The policy ID
    TagSessionID  TagType // The session or process ID
    TagObject     TagType // The object affected.
    TagAction     TagType // The action taken
    TagCommand    TagType // The command executed
    TagMethod     TagType // The method in which the action was taken, for example, public key or password for ssh
    TagStatus     TagType // The status of the action taken
    TagReason     TagType // The reason for the action taken or the status returned
    TagBytesRecv  TagType // The number of bytes received
    TagBytesSent  TagType // The number of bytes sent
    TagPktsRecv   TagType // The number of packets received
    TagPktsSent   TagType // The number of packets sent
    TagDuration   TagType // The duration of the session
)

func (TagType) String

func (this TagType) String() string

func (TagType) TokenType

func (this TagType) TokenType() TokenType

type Token

type Token struct {
    Type  TokenType // Type is the type of token the Value represents.
    Tag   TagType   // Tag determines which tag the Value should be.
    Value string    // Value is the extracted string from the log message.
    // contains filtered or unexported fields
}

Token is a piece of information extracted from a log message. The Scanner will do its best to determine the TokenType which could be a time stamp, IPv4 or IPv6 address, a URL, a mac address, an integer or a floating point number. In addition, if the Scanner finds a token that's surrounded by %, e.g., %srcuser%, it will try to determine the correct tag type the token represents.
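A minimal sketch of the %name% lookup described above might look like the following. It is only an illustration: tagType, tagsByName and lookupTag are hypothetical, the table holds just three entries, and apart from %srcuser%, which appears in the text above, the lowercase names used as keys are assumptions rather than the package's actual tag names.

```go
package main

import (
	"fmt"
	"strings"
)

// tagType is a hypothetical stand-in for a semantic tag.
type tagType int

const (
	tagUnknown tagType = iota
	tagSrcUser
	tagSrcIP
	tagDstIP
)

// tagsByName maps the text between the % signs to a tag. Only "srcuser"
// comes from the documentation; the other names are assumed.
var tagsByName = map[string]tagType{
	"srcuser": tagSrcUser,
	"srcip":   tagSrcIP,
	"dstip":   tagDstIP,
}

// lookupTag reports the tag for a token of the form %name%. It returns
// tagUnknown and false for anything else, including unknown names.
func lookupTag(tok string) (tagType, bool) {
	if len(tok) < 3 || !strings.HasPrefix(tok, "%") || !strings.HasSuffix(tok, "%") {
		return tagUnknown, false
	}
	t, ok := tagsByName[strings.ToLower(tok[1:len(tok)-1])]
	return t, ok
}

func main() {
	for _, tok := range []string{"%srcuser%", "root", "%nosuchtag%"} {
		t, ok := lookupTag(tok)
		fmt.Println(tok, t, ok)
	}
}
```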

func (Token) String

func (this Token) String() string

type TokenType

type TokenType int

TokenType is the lexical representation of a token.

const (
    TokenUnknown TokenType = iota // Unknown token
    TokenLiteral                  // Token is a fixed literal
    TokenTime                     // Token is a timestamp, in the format listed in TimeFormats
    TokenIPv4                     // Token is an IPv4 address, in the form of a.b.c.d
    TokenIPv6                     // Token is an IPv6 address
    TokenInteger                  // Token is an integer number
    TokenFloat                    // Token is a floating point number
    TokenURI                      // Token is a URL, in the form of http://... or https://...
    TokenMac                      // Token is a mac address
    TokenString                   // Token is a string that represents multiple possible values
)

func (TokenType) String

func (this TokenType) String() string

Directories

Path          Synopsis
cmd/sequence  Sequence is a high performance sequential log scanner, analyzer and parser.

Package sequence imports 11 packages. Updated 2018-02-15.