half

package

v0.9.2 Published: Apr 21, 2024 License: Apache-2.0 Imports: 3 Imported by: 0

Repository

git.andr3h3nriqu3s.com/andr3/gotch

Documentation ¶

Overview ¶

Package half defines support for half-precision floating-point numbers.

Index ¶

Constants
func BFloat16ToFloat32(i uint16) float32
func BFloat16ToFloat64(i uint16) float64
func Float32ToBFloat16(value float32) uint16
func Float64ToBFloat16(value float64) uint16
type BFloat16
type Float16
type Precision
- func PrecisionFromfloat32(f32 float32) Precision

Constants ¶

const ErrInvalidNaNValue = float16Error("float16: invalid NaN value, expected IEEE 754 NaN")

ErrInvalidNaNValue indicates a NaN was not received.

Variables ¶

This section is empty.

Functions ¶

func BFloat16ToFloat32 ¶

func BFloat16ToFloat32(i uint16) float32

func BFloat16ToFloat64 ¶

func BFloat16ToFloat64(i uint16) float64

func Float32ToBFloat16 ¶

func Float32ToBFloat16(value float32) uint16

Ref: https://github.com/starkat99/half-rs/blob/cabfc74e2a48b44b4556780f9d1550dd50a708be/src/bfloat/convert.rs#L5C1-L24C1

func Float64ToBFloat16 ¶

func Float64ToBFloat16(value float64) uint16

Types ¶

type BFloat16 ¶

type BFloat16 uint16

bfloat16, Google's "brain" floating-point format, is a truncated 16-bit version of the IEEE 754 binary32 format. Because it keeps all 8 exponent bits, bfloat16 has approximately the same dynamic range as float32 (up to about 3.4 × 10^38), at the cost of lower precision than float16: float16 has 10 fraction bits, while bfloat16 has only 7.

+------------+------------------------+----------------------------+
| 1-bit sign | 8-bit exponent (range) | 7-bit fraction (precision) |
+------------+------------------------+----------------------------+
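To make the layout concrete, the sketch below splits a bfloat16 bit pattern into its three fields and widens it to float32 by shifting it into the upper half of a 32-bit word. The helper names bf16Fields and bf16ToF32Sketch are hypothetical and only illustrate the format; they are not this package's API.

```go
package main

import (
	"fmt"
	"math"
)

// bf16Fields is a hypothetical helper that splits bfloat16 bits into the
// 1-bit sign, 8-bit exponent and 7-bit fraction fields.
func bf16Fields(b uint16) (sign, exponent, fraction uint16) {
	return b >> 15, (b >> 7) & 0xFF, b & 0x7F
}

// bf16ToF32Sketch is a hypothetical helper. Since bfloat16 is the upper
// 16 bits of a binary32 value, widening is a left shift by 16.
func bf16ToF32Sketch(b uint16) float32 {
	return math.Float32frombits(uint32(b) << 16)
}

func main() {
	const b = 0x3FC0
	s, e, f := bf16Fields(b)
	fmt.Println(s, e, f)            // 0 127 64
	fmt.Println(bf16ToF32Sketch(b)) // 1.5
}
```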

type Float16 ¶

type Float16 uint16

Float16 represents IEEE 754 half-precision floating-point numbers (binary16).

func FromNaN32ps ¶

func FromNaN32ps(nan float32) (Float16, error)

FromNaN32ps converts nan to IEEE binary16 NaN while preserving both signaling and payload. Unlike Fromfloat32(), which can only return qNaN because it sets quiet bit = 1, this can return both sNaN and qNaN. If the result is infinity (sNaN with empty payload), then the lowest bit of payload is set to make the result a NaN. Returns ErrInvalidNaNValue and 0x7c01 (sNaN) if nan isn't IEEE 754 NaN. This function was kept simple to be able to inline.

func Frombits ¶

func Frombits(u16 uint16) Float16

Frombits returns the float16 number corresponding to the IEEE 754 binary16 representation u16, with the sign bit of u16 and the result in the same bit position. Frombits(Bits(x)) == x.

func Fromfloat32 ¶

func Fromfloat32(f32 float32) Float16

Fromfloat32 returns a Float16 value converted from f32. Conversion uses IEEE default rounding (nearest int, with ties to even).

func Inf ¶

func Inf(sign int) Float16

Inf returns a Float16 with an infinity value with the specified sign. A sign >= 0 returns positive infinity. A sign < 0 returns negative infinity.

func NaN ¶

func NaN() Float16

NaN returns a Float16 of IEEE 754 binary16 not-a-number (NaN). Returned NaN value 0x7e01 has all exponent bits = 1 with the first and last bits = 1 in the significand. This is consistent with Go's 64-bit math.NaN(). Canonical CBOR in RFC 7049 uses 0x7e00.

func (Float16) Bits ¶

func (f Float16) Bits() uint16

Bits returns the IEEE 754 binary16 representation of f, with the sign bit of f and the result in the same bit position. Bits(Frombits(x)) == x.

func (Float16) Float32 ¶

func (f Float16) Float32() float32

Float32 returns a float32 converted from f (Float16). This is a lossless conversion.

func (Float16) IsFinite ¶

func (f Float16) IsFinite() bool

IsFinite returns true if f is neither infinite nor NaN.

func (Float16) IsInf ¶

func (f Float16) IsInf(sign int) bool

IsInf reports whether f is an infinity (inf). A sign > 0 reports whether f is positive inf. A sign < 0 reports whether f is negative inf. A sign == 0 reports whether f is either inf.

func (Float16) IsNaN ¶

func (f Float16) IsNaN() bool

IsNaN reports whether f is an IEEE 754 binary16 “not-a-number” value.

func (Float16) IsNormal ¶

func (f Float16) IsNormal() bool

IsNormal returns true if f is not zero, infinite, subnormal, or NaN.

func (Float16) IsQuietNaN ¶

func (f Float16) IsQuietNaN() bool

IsQuietNaN reports whether f is a quiet (non-signaling) IEEE 754 binary16 “not-a-number” value.
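All of these predicates reduce to tests on the 5 exponent bits (mask 0x7C00) and the 10 significand bits (mask 0x03FF). The sketch below shows one way to express them on raw uint16 bits; the function names are hypothetical and the bodies are assumptions based on the binary16 layout, not this package's source.

```go
package main

import "fmt"

const (
	expMask16   = 0x7C00 // 5 exponent bits
	fracMask16  = 0x03FF // 10 significand bits
	quietMask16 = 0x0200 // top significand bit marks a quiet NaN
)

// NaN: all exponent bits set and a non-zero significand.
func isNaN16(h uint16) bool { return h&expMask16 == expMask16 && h&fracMask16 != 0 }

// Quiet NaN: all exponent bits set and the quiet bit set.
func isQuietNaN16(h uint16) bool { return h&expMask16 == expMask16 && h&quietMask16 != 0 }

// Infinity: sign > 0 matches +Inf, sign < 0 matches -Inf, sign == 0 matches either.
func isInf16(h uint16, sign int) bool {
	return (h == 0x7C00 && sign >= 0) || (h == 0xFC00 && sign <= 0)
}

// Finite: exponent bits are not all set.
func isFinite16(h uint16) bool { return h&expMask16 != expMask16 }

// Normal: exponent is neither all zeros (zero, subnormal) nor all ones (Inf, NaN).
func isNormal16(h uint16) bool {
	e := h & expMask16
	return e != 0 && e != expMask16
}

func main() {
	fmt.Println(isNaN16(0x7E01), isQuietNaN16(0x7E01)) // true true
	fmt.Println(isNaN16(0x7C01), isQuietNaN16(0x7C01)) // true false
	fmt.Println(isInf16(0xFC00, 0), isInf16(0xFC00, 1)) // true false
	fmt.Println(isNormal16(0x0001), isFinite16(0x0001)) // false true
}
```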

func (Float16) Signbit ¶

func (f Float16) Signbit() bool

Signbit reports whether f is negative or negative zero.

func (Float16) String ¶

func (f Float16) String() string

String satisfies the fmt.Stringer interface.

type Precision ¶

type Precision int

Precision indicates whether the conversion to Float16 is exact, subnormal without dropped bits, inexact, underflow, or overflow.

const (

	// PrecisionExact is for non-subnormals that don't drop bits during conversion.
	// All of these can round-trip.  Should always convert to float16.
	PrecisionExact Precision = iota

	// PrecisionUnknown is for subnormals that don't drop bits during conversion but
	// not all of these can round-trip so precision is unknown without more effort.
	// Only 2046 of these can round-trip and the rest cannot round-trip.
	PrecisionUnknown

	// PrecisionInexact is for dropped significand bits and cannot round-trip.
	// Some of these are subnormals. Cannot round-trip float32->float16->float32.
	PrecisionInexact

	// PrecisionUnderflow is for Underflows. Cannot round-trip float32->float16->float32.
	PrecisionUnderflow

	// PrecisionOverflow is for Overflows. Cannot round-trip float32->float16->float32.
	PrecisionOverflow
)

func PrecisionFromfloat32 ¶

func PrecisionFromfloat32(f32 float32) Precision

PrecisionFromfloat32 returns Precision without performing the conversion. Conversions from both Infinity and NaN values will always report PrecisionExact even if NaN payload or NaN-Quiet-Bit is lost. This function is kept simple to allow inlining and run < 0.5 ns/op, to serve as a fast filter.
