go-dtxt
Package dtxt implements encoding and decoding of ASCII delimited text, for the Go programming language.
ASCII delimited text is similar to CSV, TSV, and other table and spreadsheet data formats, except that it uses some of the delimiter control code characters that Unicode inherited from ASCII.
ASCII delimited text could arguably also be called Unicode delimited text, especially when the Unicode is encoded as UTF-8.
Documentation
Online documentation, which includes examples, can be found at: http://godoc.org/github.com/reiver/go-dtxt
Encoding Example
This is a basic example of how to encode tabular data into ASCII delimited text using this package:
import "github.com/reiver/go-dtxt"
// ...
var writer io.Writer //@TODO: set to wherever you want the encoded ASCII Delimited Text data to go.
// ...
var encoder dtxt.Encoder = dtxt.EncoderWrap(writer)
err := encoder.Begin()
// ...
defer encoder.End()
// ...
// row 1
err = encoder.EncodeRow("ONCE", '۱', "1", "Ⅰ", "یکی")
// ...
// row 2
err = encoder.EncodeRow("TWICE", '۲', "2", "Ⅱ", "دو")
// ...
// row 3
err = encoder.EncodeRow("THRICE", '۳', "3", "Ⅲ", "سه")
// ...
// row 4
err = encoder.EncodeRow("FOURCE", '۴', "4", "Ⅳ", "چهار")
// ...
Decoding Example
This is a basic example of how to decode tabular data from ASCII delimited text using this package.
In this example it is known ahead of time how many columns there are in the data.
import "github.com/reiver/go-dtxt"
// ...
var reader io.Reader //@TODO: set to wherever you want the encoded ASCII Delimited Text data to come from.
// ...
var decoder dtxt.Decoder = dtxt.WrapDecoder(reader)
// ...
for {
	var key string
	var value string
	err := decoder.DecodeRow(&key, &value)
	if dtxt.GS == err {
		break
	}
	if nil != err {
		return err
	}
}
Delimiters
Unicode inherited 5 delimiter control code characters from ASCII:
| Symbol | Name | Alternative Name | Abbreviation | Hexadecimal | Decimal | Caret | UTF-8 |
|--------|------|------------------|--------------|-------------|---------|-------|-------|
| ␜ | File Separator | | FS | 0x1c | 28 | ^\ | 0b00011100 |
| ␝ | Group Separator | Table Terminator | GS | 0x1d | 29 | ^] | 0b00011101 |
| ␞ | Row Separator | Row Terminator | RS | 0x1e | 30 | ^^ | 0b00011110 |
| ␟ | Unit Separator | Field Terminator | US | 0x1f | 31 | ^_ | 0b00011111 |
| ␠ | Space | Word Separator | SP | 0x20 | 32 | ^` | 0b00100000 |
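As a small sketch of the table above, the four separator control codes can be written as Go constants. The constant names and the `singleByte` helper are chosen for illustration and are assumptions, not necessarily what this package exports. Because each code point is below 0x80, each one encodes to exactly one byte in UTF-8:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Illustrative constants; the names are assumptions, not this package's API.
const (
	FS rune = 0x1c // file separator
	GS rune = 0x1d // group separator (table terminator)
	RS rune = 0x1e // row terminator
	US rune = 0x1f // unit separator (field terminator)
)

// singleByte reports whether r encodes to exactly one byte in UTF-8.
func singleByte(r rune) bool {
	return utf8.RuneLen(r) == 1
}

func main() {
	for _, r := range []rune{FS, GS, RS, US} {
		fmt.Printf("%#x single byte in UTF-8: %t\n", r, singleByte(r))
	}
}
```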
Unit Separator (US) and Row Separator (RS) can be used to construct a table row.
For example, suppose we want a table row with 3 fields: “joe”, “blow”, and “root beer”.
Then the result would be this:
const US = 0x1f
const RS = 0x1e
[]byte{
'j','o','e',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
'b','l','o','w',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
'r','o','o','t',' ','b','e','e','r',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
RS, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Row Terminator
}
(Note that this is just a single row, not a whole table.
A whole table would have a GS control code character at the end of it.)
⚠️ Notice that we are using the US control code character in the Unix/Linux style: as a field terminator (and not just a field separator).
I.e., the last field gets a US after it too.
⚠️ Notice also that we are using the RS control code character in the Unix/Linux style too: as a row terminator (and not just a row separator).
I.e., the last row gets an RS after it too.
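The terminator rule for a single row can be sketched in a few lines of Go. Note that `encodeRow` is a hypothetical helper written for this illustration, not a function from this package, and it assumes the fields contain no control code characters (so no escaping is needed):

```go
package main

import "fmt"

const (
	RS = 0x1e // row terminator
	US = 0x1f // field terminator
)

// encodeRow is a hypothetical helper (not part of this package's API).
// It writes each field followed by US, then ends the row with RS.
// It does not handle escaping.
func encodeRow(fields ...string) []byte {
	var out []byte
	for _, field := range fields {
		out = append(out, field...)
		out = append(out, US) // every field is terminated, including the last
	}
	return append(out, RS) // every row is terminated, including the last
}

func main() {
	fmt.Printf("%q\n", encodeRow("joe", "blow", "root beer"))
}
```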
Let's make it more obvious how RS is used by showing a whole table encoded (and not just a row).
Let's encode this table:
| | | |
|---|---|---|
| joe | blow | root beer |
| john | doe | caramel apple |
| jane | doe | cotton candy |
const GS = 0x1d // table terminator
const RS = 0x1e // row terminator
const US = 0x1f // field terminator
[]byte{
'j','o','e',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
'b','l','o','w',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
'r','o','o','t',' ','b','e','e','r',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
RS, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Row Terminator
'j','o','h','n',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
'd','o','e',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
'c','a','r','a','m','e','l',' ','a','p','p','l','e',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
RS, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Row Terminator
'j','a','n','e',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
'd','o','e',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
'c','o','t','t','o','n',' ','c','a','n','d','y',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
RS, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Row Terminator
GS, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Table Terminator
}
⚠️ Notice that we are using the GS control code character in the Unix/Linux style: as a table terminator (and not just a table separator).
I.e., the last row gets a GS after it.
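Reading a table back follows directly from the terminator rule: US closes a field, RS closes a row, and GS closes the table. The following is only a minimal sketch of that idea. `decodeTable` is a hypothetical helper, not this package's Decoder, and it deliberately ignores escaping (see the Escaping section):

```go
package main

import (
	"errors"
	"fmt"
)

const (
	GS = 0x1d // table terminator
	RS = 0x1e // row terminator
	US = 0x1f // field terminator
)

// decodeTable is a hypothetical helper (not this package's Decoder).
// It reads one table from data, treating US, RS and GS as terminators.
// It deliberately ignores escaping.
func decodeTable(data []byte) ([][]string, error) {
	var rows [][]string
	var row []string
	var field []byte
	for _, b := range data {
		switch b {
		case US:
			row = append(row, string(field))
			field = field[:0]
		case RS:
			if len(field) > 0 {
				return nil, errors.New("field not terminated by US before RS")
			}
			rows = append(rows, row)
			row = nil
		case GS:
			if len(field) > 0 || len(row) > 0 {
				return nil, errors.New("row not terminated by RS before GS")
			}
			return rows, nil
		default:
			field = append(field, b)
		}
	}
	return nil, errors.New("table terminator (GS) not found")
}

func main() {
	data := []byte("joe\x1fblow\x1froot beer\x1f\x1ejohn\x1fdoe\x1fcaramel apple\x1f\x1e\x1d")
	rows, err := decodeTable(data)
	fmt.Println(rows, err)
}
```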
Escaping
One issue that can arise: what if the data inside of a unit contains a Unit Separator (US), a Row Separator (RS), a Group Separator (GS), or a File Separator (FS)‽
How is that situation handled‽
The answer is that Unicode inherited a control code character for escaping, the aptly named Escape (ESC) control code character:
| Name | Abbreviation | Hexadecimal | Decimal | Caret | UTF-8 |
|------|--------------|-------------|---------|-------|-------|
| Escape | ESC | 0x1b | 27 | ^[ | 0b00011011 |
An ESC character is stuffed before any Escape (ESC), Unit Separator (US), Row Separator (RS), Group Separator (GS), or File Separator (FS) that appears inside of a unit.
Here is an example.
Let's say that we want to encode this table:
| | | |
|---|---|---|
| []byte{'E','S','C'} | []byte{ESC} | []byte{'e','s','c','a','p','e'} |
| []byte{'F','S'} | []byte{FS} | []byte{'f','i','l','e',' ','t','e','r','m','i','n','a','t','o','r'} |
| []byte{'G','S'} | []byte{GS} | []byte{'t','a','b','l','e',' ','t','e','r','m','i','n','a','t','o','r'} |
| []byte{'R','S'} | []byte{RS} | []byte{'r','o','w',' ','t','e','r','m','i','n','a','t','o','r'} |
| []byte{'U','S'} | []byte{US} | []byte{'f','i','e','l','d',' ','t','e','r','m','i','n','a','t','o','r'} |
We would get:
const ESC = 0x1b // escape
const FS = 0x1c // file terminator
const GS = 0x1d // table terminator
const RS = 0x1e // row terminator
const US = 0x1f // field terminator
[]byte{
'E','S','C',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
ESC, // ⇚⇚⇚⇚⇚ Escape. Next character will be treated as data regardless of whether it is a control code character or not.
ESC,
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
'e','s','c','a','p','e',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
RS, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Row Terminator
'F','S',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
ESC, // ⇚⇚⇚⇚⇚ Escape. Next character will be treated as data regardless of whether it is a control code character or not.
FS,
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
'f','i','l','e',' ','t','e','r','m','i','n','a','t','o','r',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
RS, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Row Terminator
'G','S',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
ESC, // ⇚⇚⇚⇚⇚ Escape. Next character will be treated as data regardless of whether it is a control code character or not.
GS,
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
't','a','b','l','e',' ','t','e','r','m','i','n','a','t','o','r',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
RS, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Row Terminator
'R','S',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
ESC, // ⇚⇚⇚⇚⇚ Escape. Next character will be treated as data regardless of whether it is a control code character or not.
RS,
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
'r','o','w',' ','t','e','r','m','i','n','a','t','o','r',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
RS, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Row Terminator
'U','S',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
ESC, // ⇚⇚⇚⇚⇚ Escape. Next character will be treated as data regardless of whether it is a control code character or not.
US,
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
'f','i','e','l','d',' ','t','e','r','m','i','n','a','t','o','r',
US, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Field Terminator
RS, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Row Terminator
GS, // ⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚⇚ Table Terminator
}
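The escaping rule from this section can be sketched as a pair of small Go functions. `needsEscape`, `escapeUnit`, and `unescapeUnit` are hypothetical helpers written for this illustration and are not part of this package's API. How a trailing lone ESC should be treated is not stated here, so this sketch simply keeps it as data:

```go
package main

import "fmt"

const (
	ESC = 0x1b // escape
	FS  = 0x1c // file terminator
	GS  = 0x1d // table terminator
	RS  = 0x1e // row terminator
	US  = 0x1f // field terminator
)

// needsEscape reports whether b must be preceded by ESC inside a unit.
func needsEscape(b byte) bool {
	switch b {
	case ESC, FS, GS, RS, US:
		return true
	}
	return false
}

// escapeUnit is a hypothetical helper (not part of this package's API).
// It returns a copy of unit with ESC stuffed before each special byte.
func escapeUnit(unit []byte) []byte {
	out := make([]byte, 0, len(unit))
	for _, b := range unit {
		if needsEscape(b) {
			out = append(out, ESC)
		}
		out = append(out, b)
	}
	return out
}

// unescapeUnit reverses escapeUnit: the byte after each ESC is kept as data.
// A lone ESC at the very end is kept as-is (an assumption of this sketch).
func unescapeUnit(escaped []byte) []byte {
	out := make([]byte, 0, len(escaped))
	for i := 0; i < len(escaped); i++ {
		if escaped[i] == ESC && i+1 < len(escaped) {
			i++ // skip the ESC itself; keep the byte that follows it
		}
		out = append(out, escaped[i])
	}
	return out
}

func main() {
	escaped := escapeUnit([]byte{'a', US, 'b'})
	fmt.Printf("%q -> %q\n", escaped, unescapeUnit(escaped))
}
```

A full decoder would apply the same idea while scanning: a byte that follows ESC is always data, so it never terminates a field, row, or table.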