confusables

package module

v0.0.0-...-3f3c236 Published: May 15, 2020 License: BSD-2-Clause Imports: 9 Imported by: 0

README

confusables

Confusables is a library written in Go to normalize Unicode strings by swapping out potentially visually confusable characters (i.e. homoglyphs). It was inspired by a Python library of the same name.

Normalizing homoglyphs is useful for ensuring username uniqueness, finding malicious fake website names, detecting attempts to get past a profanity filter, and more.

In addition to swapping out homoglyphs, this library also uses norm under the hood to normalize strings using NFD, which fixes the problem of there being several Unicode ways to represent the same string. See this Go blog post for more details.
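To make the "several ways to represent the same string" problem concrete, here is a minimal standard-library sketch. It does not perform any normalization itself; it only shows that two spellings that render identically differ at the byte level. The helper name `sizes` is made up for this illustration. A decomposition form such as NFD would bring both strings to the second, decomposed spelling.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes returns the number of code points and the number of UTF-8 bytes in s.
// (Hypothetical helper, for illustration only.)
func sizes(s string) (runes, bytes int) {
	return utf8.RuneCountInString(s), len(s)
}

func main() {
	composed := "caf\u00e9"    // "é" as a single code point, U+00E9
	decomposed := "cafe\u0301" // "e" followed by U+0301 COMBINING ACUTE ACCENT

	// == compares bytes, not appearance, so these are different strings.
	fmt.Println(composed == decomposed) // false

	r1, b1 := sizes(composed)
	r2, b2 := sizes(decomposed)
	fmt.Println(r1, b1) // 4 5
	fmt.Println(r2, b2) // 5 6
}
```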

See below for why you should use this library to normalize Unicode strings. This library is not a complete solution for avoiding homoglyph attacks; it is merely an effort to fix the "low hanging fruit". Pull requests are welcome.

Usage

package main

import (
	"fmt"

	"github.com/Zamiell/confusables"
)

func main() {
	username1 := "Alice" // Uses all ASCII characters, like you would naively expect ("A" is U+0041).
	username2 := "Αlice" // Starts with the Greek capital letter alpha (U+0391).

	fmt.Println("Username 1 contains homoglyphs:", confusables.ContainsHomoglyphs(username1)) // Prints "false"
	fmt.Println("Username 2 contains homoglyphs:", confusables.ContainsHomoglyphs(username2)) // Prints "true"

	fmt.Println("No normalization - Usernames are equal:", username1 == username2) // Prints "false"
	// Normalize both usernames before comparing them.
	username1 = confusables.Normalize(username1)
	username2 = confusables.Normalize(username2)
	fmt.Println("After normalization - Usernames are equal:", username1 == username2) // Prints "true"
}

The Unicode Problem with Uniqueness

Most websites and applications enforce case-insensitive username uniqueness. For example, if someone has already created an account with a username of "Alice", then others would be prevented from creating accounts with a username of "alice". This is common sense; allowing that kind of thing would just be confusing for everyone involved. Furthermore, it would be a security risk, because "alice" could impersonate "Alice", allowing for effective phishing attacks. Fortunately, enforcing case-insensitive username uniqueness is relatively trivial (e.g. putting a case-insensitive UNIQUE constraint on a PostgreSQL username column).
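As a rough sketch of what a case-insensitive uniqueness check amounts to, the following uses an in-memory set keyed by the lowercased name. The `registry` type and `register` method are hypothetical stand-ins for a real database constraint. Note that case folding alone does nothing about look-alike letters from other scripts.

```go
package main

import (
	"fmt"
	"strings"
)

// registry is a hypothetical in-memory set of taken usernames,
// keyed by the lowercased form of each name.
type registry map[string]struct{}

// register records name and returns true, unless a name that differs
// only by case is already present, in which case it returns false.
func (r registry) register(name string) bool {
	key := strings.ToLower(name)
	if _, taken := r[key]; taken {
		return false
	}
	r[key] = struct{}{}
	return true
}

func main() {
	r := registry{}
	fmt.Println(r.register("Alice"))      // true
	fmt.Println(r.register("alice"))      // false: same name, different case
	fmt.Println(r.register("\u0391lice")) // true: Greek alpha lowercases to a Greek letter, not "a"
}
```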

Unfortunately, enforcing case-insensitive username uniqueness is only the first step. Out of the 1+ million code points that Unicode allows for, thousands look nearly identical to other characters. For example, the normal capital A is encoded in UTF-8 as "0x41", the Greek letter "Α" as "0xce 0x91", and the Cyrillic letter "А" as "0xd0 0x90". This means that the impersonation problem from before has gotten a lot worse. Instead of "alice" impersonating "Alice", we now have "Αlice" (with a Greek letter Α) impersonating "Alice". These look-alike characters are called homoglyphs, and the various homoglyphs for the capital letter A are just the tip of the iceberg.
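The byte values quoted above can be checked with a few lines of standard-library Go. The helper name `utf8Hex` is made up for this sketch; the letters are written as escapes so that the source itself contains no look-alikes.

```go
package main

import "fmt"

// utf8Hex returns the UTF-8 bytes of s as space-separated hex pairs.
// (Hypothetical helper, for illustration only.)
func utf8Hex(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	letters := []rune{
		'\u0041', // Latin capital letter A
		'\u0391', // Greek capital letter alpha
		'\u0410', // Cyrillic capital letter a
	}
	for _, r := range letters {
		// Prints the code point followed by its UTF-8 encoding.
		fmt.Printf("%U -> %s\n", r, utf8Hex(string(r)))
	}
}
```

All three render as what looks like the same letter, yet each is a distinct code point with a distinct encoding, which is why a plain string comparison treats them as different.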

The naive solution to this problem is to forgo Unicode entirely, allowing only ASCII input for usernames. But this is a non-starter for any modern project. Even if your website or application is written in English, you can still probably expect to have users from around the world. Japanese users will want to use kanji, Russian users will want to use Cyrillic, and so forth.

Naturally, the people at The Unicode Consortium are also aware of this problem, describing it in detail in Unicode Technical Report #36 (UTR #36). Notably, they provide "confusables.txt", a master list of all visually confusable characters in the Unicode spec. "confusables.txt" is handy because applications and libraries can use this list to determine whether user input contains any potentially misleading characters, or even to normalize a string. This library utilizes "confusables.txt" to do just that.
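For a feel of how such a list can be consumed, here is a minimal sketch that parses a single data line in the "source ; target ; type # comment" layout used by "confusables.txt", where the fields are hexadecimal code points. The function name `parseConfusableLine` is hypothetical and the sample line is written out by hand for illustration; this is not necessarily how this library reads the file.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseConfusableLine parses one "source ; target ; type # comment" line.
// ok is false for blank lines, comment-only lines, and malformed lines.
// (Hypothetical helper, for illustration only.)
func parseConfusableLine(line string) (source rune, target string, ok bool) {
	// Drop the trailing comment, if any.
	if i := strings.IndexByte(line, '#'); i >= 0 {
		line = line[:i]
	}
	fields := strings.Split(line, ";")
	if len(fields) < 2 {
		return 0, "", false
	}
	// The source field is a single hexadecimal code point.
	src, err := strconv.ParseUint(strings.TrimSpace(fields[0]), 16, 32)
	if err != nil {
		return 0, "", false
	}
	// The target field is one or more hexadecimal code points.
	var b strings.Builder
	for _, hex := range strings.Fields(fields[1]) {
		cp, err := strconv.ParseUint(hex, 16, 32)
		if err != nil {
			return 0, "", false
		}
		b.WriteRune(rune(cp))
	}
	if b.Len() == 0 {
		return 0, "", false
	}
	return rune(src), b.String(), true
}

func main() {
	// Sample line written by hand in the file's layout.
	line := "0391 ;\t0041 ;\tMA\t# ( Α → A ) GREEK CAPITAL LETTER ALPHA → LATIN CAPITAL LETTER A"
	src, dst, ok := parseConfusableLine(line)
	fmt.Printf("%U -> %q (ok=%v)\n", src, dst, ok) // U+0391 -> "A" (ok=true)
}
```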

For more information, see this talk from The Tarquin at DEFCON 2018 about some different kinds of Unicode homograph attacks. (Unfortunately, he recommends OCR as a solution, which is expensive, complicated, and a bit overboard for some use cases.)

Documentation

Constants

const (
	ConfusablesFileName = "confusables.txt"
)

Functions

func ContainsHomoglyphs

func ContainsHomoglyphs(s string) bool

func IndexOfFirstHomoglyph

func IndexOfFirstHomoglyph(s string) int

func Normalize

func Normalize(s string) string

Normalize returns a copy of the string in which: 1) the text has been normalized with Normalization Form Canonical Decomposition (NFD), and 2) common Unicode homoglyphs have been replaced with their more-standard versions.
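The second step can be pictured as a rune-by-rune lookup against a replacement table. The sketch below is only an illustration of that idea, not this library's implementation: the `table` contents and the `replaceHomoglyphs` name are made up for the example, the table holds just three hand-picked entries, and the NFD step is left out entirely.

```go
package main

import (
	"fmt"
	"strings"
)

// table is a tiny hand-written mapping for illustration only.
// A real table would be built from "confusables.txt".
var table = map[rune]string{
	'\u0391': "A", // Greek capital letter alpha
	'\u0410': "A", // Cyrillic capital letter a
	'\u0435': "e", // Cyrillic small letter ie
}

// replaceHomoglyphs returns a copy of s in which every rune found in
// table is swapped for its mapped replacement. Other runes are kept.
// (Hypothetical helper; it does not perform NFD normalization.)
func replaceHomoglyphs(s string) string {
	var b strings.Builder
	for _, r := range s {
		if repl, found := table[r]; found {
			b.WriteString(repl)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	spoofed := "\u0391lic\u0435" // Greek alpha + "lic" + Cyrillic ie
	fmt.Println(spoofed == "Alice")                    // false
	fmt.Println(replaceHomoglyphs(spoofed) == "Alice") // true
}
```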

Source Files

confusables.go