mbox

Version: v1.0.1
Published: Jan 1, 2023 License: MIT Imports: 7 Imported by: 1

README

This package provides a ScanMessage function that can be used with bufio.Scanner. The function splits data on RFC 4155 "default" separator lines and includes each separator line in the token it returns. The implementation could be more efficient, but it's pretty fast right now.

Since emails are often larger than bufio.Scanner's default token size limit of 64kB, a custom mbox.Scanner is provided with a tunable MaxTokenSize (which defaults to 64MB). This Scanner is just a stripped-down version of the standard library version with a few changes (see below).

Example Usage

If you have any issues or questions, email the address below, or open an issue at: https://github.com/korylprince/mbox/issues

Changes from Standard Library scan.go

  • Removed unneeded SplitFuncs
  • Moved maxConsecutiveEmptyReads into scan.go
  • Changed MaxScanTokenSize to 64MB from 64kB
  • Made Scanner.MaxTokenSize public
  • Changed the default ScanFunc for NewScanner to ScanMessage
  • Changed the default Scanner.buf size to 1MB from 4kB (size of buffer at initialization)

scan.go and scan_test.go are modified files from the main Go distribution and thus retain the Go Programming Language License.

Test data is taken from http://mailman.postel.org/pipermail/touch-mm.mbox/touch-mm.mbox

All other code is Copyright 2022 Kory Prince (korylprince at gmail dot com) and licensed under the LICENSE provided with the code.

Documentation

Constants

const MaxScanTokenSize = 64 * 1024 * 1024

MaxScanTokenSize is the maximum size used to buffer a token. The actual maximum token size may be smaller as the buffer may need to include, for instance, a newline.

Variables

var (
	ErrTooLong         = errors.New("mbox.Scanner: token too long")
	ErrNegativeAdvance = errors.New("mbox.Scanner: SplitFunc returns negative advance count")
	ErrAdvanceTooFar   = errors.New("mbox.Scanner: SplitFunc returns advance count beyond input")
)

Errors returned by Scanner.

var ErrorUnexpectedEOF = fmt.Errorf("Expected separator line, Got EOF")

ErrorUnexpectedEOF signals that EOF was found before an expected separator line.

Functions

func FindSeparator

func FindSeparator(data []byte) (idx, size int)

FindSeparator returns the start index and length of the first RFC 4155 "default" compliant separator line: `From <RFC 2822 "addr-spec"> <timestamp in UNIX ctime format><EOL marker>`. idx is negative when no separator line is found.
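
As a rough illustration of the separator format, the following sketch checks whether a single line (without its EOL marker) has the shape `From <addr-spec> <ctime timestamp>`. It is not the package's FindSeparator, which searches a byte slice and returns an index and size. The helper name isSeparatorLine is hypothetical, and the address check is a deliberately loose assumption (non-empty, contains "@", no whitespace) rather than a full RFC 2822 addr-spec parser.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// isSeparatorLine reports whether line looks like
// "From <addr-spec> <ctime timestamp>". The line must not include its EOL.
func isSeparatorLine(line string) bool {
	const prefix = "From "
	if !strings.HasPrefix(line, prefix) {
		return false
	}
	rest := line[len(prefix):]

	// A ctime timestamp such as "Thu Jan  1 00:00:00 1970" is always 24 bytes.
	const tsLen = 24
	if len(rest) < tsLen+2 { // at least one address byte plus a space
		return false
	}
	cut := len(rest) - tsLen
	if rest[cut-1] != ' ' {
		return false
	}
	addr, ts := rest[:cut-1], rest[cut:]

	// Loose address check (an assumption of this sketch, not RFC 2822).
	if !strings.Contains(addr, "@") || strings.ContainsAny(addr, " \t") {
		return false
	}

	// time.ANSIC is Go's layout for the ctime format.
	_, err := time.Parse(time.ANSIC, ts)
	return err == nil
}

func main() {
	fmt.Println(isSeparatorLine("From someone@example.com Wed Jun 30 21:49:08 1993"))
	fmt.Println(isSeparatorLine("From the beginning, it was clear"))
}
```

Validating the timestamp is what keeps ordinary body lines that happen to start with "From " from being mistaken for separators.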

func ScanMessage

func ScanMessage(data []byte, atEOF bool) (advance int, token []byte, err error)

ScanMessage is a bufio.SplitFunc that splits the input on RFC 4155 "default" compliant separator lines, returning each separator line together with its message.

Types

type Scanner

type Scanner struct {
	MaxTokenSize int // Maximum size of a token; modified by tests.
	// contains filtered or unexported fields
}

Scanner provides an interface similar to bufio.Scanner, with the default SplitFunc set to ScanMessage.

Example
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"net/mail"
	"os"

	"github.com/korylprince/mbox"
)

func main() {
	f, err := os.Open("/path/to/mbox")
	if err != nil {
		// do something with err
		return
	}
	defer f.Close()

	s := mbox.NewScanner(f)
	s.MaxTokenSize = 1024 * 1024 * 1024 // 1GB max size, or whatever you want
	for s.Scan() {
		b := s.Bytes()
		// copy bytes to buffer, otherwise they will be overwritten by the next Scan
		buf := make([]byte, len(b))
		copy(buf, b)
		r := bufio.NewReader(bytes.NewReader(buf))

		// read in mbox separator line
		if _, _, err := r.ReadLine(); err != nil {
			// do something with err
			continue
		}

		msg, err := mail.ReadMessage(r)
		if err != nil {
			// do something with err
			continue
		}

		// do something with msg
		fmt.Println(msg.Header)
	}

	// check for scanning errors once the loop has finished
	if err := s.Err(); err != nil {
		// do something with err
	}
}
func NewScanner

func NewScanner(r io.Reader) *Scanner

NewScanner returns a new Scanner to read from r. The split function defaults to ScanMessage.

func (*Scanner) Bytes

func (s *Scanner) Bytes() []byte

Bytes returns the most recent token generated by a call to Scan. The underlying array may point to data that will be overwritten by a subsequent call to Scan. It does no allocation.

func (*Scanner) Err

func (s *Scanner) Err() error

Err returns the first non-EOF error that was encountered by the Scanner.

func (*Scanner) Scan

func (s *Scanner) Scan() bool

Scan advances the Scanner to the next token, which will then be available through the Bytes or Text method. It returns false when the scan stops, either by reaching the end of the input or an error. After Scan returns false, the Err method will return any error that occurred during scanning, except that if it was io.EOF, Err will return nil. Scan panics if the split function returns 100 empty tokens without advancing the input. This is a common error mode for scanners.

func (*Scanner) Split

func (s *Scanner) Split(split bufio.SplitFunc)

Split sets the split function for the Scanner. If called, it must be called before Scan. The default split function is ScanMessage.

func (*Scanner) Text

func (s *Scanner) Text() string

Text returns the most recent token generated by a call to Scan as a newly allocated string holding its bytes.
