confusables

package module
v0.0.0-...-3f3c236 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 15, 2020 License: BSD-2-Clause Imports: 9 Imported by: 0

README

confusables

GoDoc

Confusables is a library written in Golang to normalize Unicode strings, swapping out any potentially visually confusable characters (e.g. homoglyphs). It was inspired from a Python library of the same name.

Normalizing homoglyphs is useful for ensuring username uniqueness, finding malicious fake website names, detecting attempts to get past a profanity filter, and more.

In addition to swapping out homoglyphs, this library also uses norm under the hood to normalize strings using NFD, which fixes the problem of there being several Unicode ways to represent the same string. See this Go blog post for more details.

See below on why you should use this library to normalize Unicode strings. This library is not a complete solution for avoiding homoglyphs attacks; it is merely an effort to fix the "low hanging fruit". Pull requests are welcome.


Usage

import (
	"fmt"

	"github.com/Zamiell/confusables"
)

func main() {
	username1 := "Alice" // Uses all ASCII characters, like you would naively expect ("A" is 0x41).
	username2 := "Αlice" // Uses a Greek letter A (0x391).

	fmt.Println("Username 1 contains homoglyphs:", confusables.ContainsHomoglyphs(username1)) // Prints "false"
	fmt.Println("Username 2 contains homoglyphs:", confusables.ContainsHomoglyphs(username2)) // Prints "true"

	fmt.Println("No normalization - Usernames are equal:", username1 == username2) // Prints "false"
	username1 = confusables.Normalize(username1)
	fmt.Println("After normalization - Usernames are equal:", username1 == username2) // Prints "true"
}

The Unicode Problem with Uniqueness

Most websites and applications enforce case-insensitive username uniqueness. For example, if someone has already created an account with a username of "Alice", then others would be prevented from creating accounts with a username of "alice". This is common sense; allowing that kind of thing would just be confusing for everyone involved. Furthermore, it would be a security risk, because "alice" could impersonate "Alice", allowing for effective phishing attacks. Good thing for us, enforcing case-insensitive username uniqueness is relatively trivial (e.g. putting a case-insensitive UNIQUE constraint on a PostgreSQL username column, for example).

Unfortunately, enforcing case-insensitive username uniqueness is only the first step. Out of the 1+ million characters that Unicode provides, thousands of them are extremely similar to existing characters. For example, the normal capital A is equal to "0x41", the Greek letter "Α" is equal to "0xce 0x91", and the Cyrillic letter "А" is equal to "0xd0 0x90". This means that the impersonation problem from before has gotten a lot worse. Instead of "alice" impersonating "Alice", we now have "Αlice" (with a Greek letter Α) impersonating "Alice". These look-alike characters are called homoglyphs, and the various homoglyphs for the capital letter A are just the tip of the iceberg.

The naive solution to this problem is to forgo Unicode entirely, allowing only ASCII input for usernames. But this is a non-starter for any modern project. Even if your website or application is written in English, you can still probably expect to have users from around the world. Japanese users will want to use kanji, Russian users will want to use Cyrillic, and so forth.

Naturally, the people at The Unicode Consortium are also aware of this problem, describing it in detail in the Unicode Technical Report #36 (UTR #36). Notably, they provide "confusables.txt", a master list of all visually confusable characters in the Unicode spec. "confusables.txt" is handy because applications and libraries can use this list to determine if user-input contains any potentially misleading characters, or even to normalize a string. This library utilizes "confusables.txt" to do just that.

For more information, see this talk from The Tarquin at DEFCON 2018 about some different kinds of Unicode homograph attacks. (Unfortunately, he recommends OCR as a solution, which is expensive, complicated, and a bit overboard for some use-cases.)


Documentation

Index

Constants

View Source
const (
	ConfusablesFileName = "confusables.txt"
)

Variables

This section is empty.

Functions

func ContainsHomoglyphs

func ContainsHomoglyphs(s string) bool

func IndexOfFirstHomoglyph

func IndexOfFirstHomoglyph(s string) int

func Normalize

func Normalize(s string) string

Normalize returns a copy of a string that is: 1) Normalized with Normalization Form Canonical Decomposition (NFD). 2) Has common Unicode homoglyphs replaced with their more-standard versions.

Types

This section is empty.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL