half

package
v0.9.2 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 21, 2024 License: Apache-2.0 Imports: 3 Imported by: 0

Documentation

Overview

Package half defines support for half-precision floating-point numbers.

Index

Constants

View Source
const ErrInvalidNaNValue = float16Error("float16: invalid NaN value, expected IEEE 754 NaN")

ErrInvalidNaNValue indicates a NaN was not received.

Variables

This section is empty.

Functions

func BFloat16ToFloat32

func BFloat16ToFloat32(i uint16) float32

func BFloat16ToFloat64

func BFloat16ToFloat64(i uint16) float64

func Float64ToBFloat16

func Float64ToBFloat16(value float64) uint16

Types

type BFloat16

type BFloat16 uint16

The bfloat16 - Google 'brain' floating point format is a truncated 16-bit version of the IEEE 754 standard binary32. bfloat16 has approximately the same dynamic range as float32 (8 bits -> 3.4 × 10^38) by having a lower precision than float16. While float16 has a precision of 10 bits, bfloat16 has a precision of only 7 bits.

+------------+------------------------+----------------------------+ | 1-bit sign | 8-bit exponent (range) | 7-bit fraction (precision) | +------------+------------------------+----------------------------+

type Float16

type Float16 uint16

Float16 represents IEEE 754 half-precision floating-point numbers (binary16).

func FromNaN32ps

func FromNaN32ps(nan float32) (Float16, error)

FromNaN32ps converts nan to IEEE binary16 NaN while preserving both signaling and payload. Unlike Fromfloat32(), which can only return qNaN because it sets quiet bit = 1, this can return both sNaN and qNaN. If the result is infinity (sNaN with empty payload), then the lowest bit of payload is set to make the result a NaN. Returns ErrInvalidNaNValue and 0x7c01 (sNaN) if nan isn't IEEE 754 NaN. This function was kept simple to be able to inline.

func Frombits

func Frombits(u16 uint16) Float16

Frombits returns the float16 number corresponding to the IEEE 754 binary16 representation u16, with the sign bit of u16 and the result in the same bit position. Frombits(Bits(x)) == x.

func Fromfloat32

func Fromfloat32(f32 float32) Float16

Fromfloat32 returns a Float16 value converted from f32. Conversion uses IEEE default rounding (nearest int, with ties to even).

func Inf

func Inf(sign int) Float16

Inf returns a Float16 with an infinity value with the specified sign. A sign >= returns positive infinity. A sign < 0 returns negative infinity.

func NaN

func NaN() Float16

NaN returns a Float16 of IEEE 754 binary16 not-a-number (NaN). Returned NaN value 0x7e01 has all exponent bits = 1 with the first and last bits = 1 in the significand. This is consistent with Go's 64-bit math.NaN(). Canonical CBOR in RFC 7049 uses 0x7e00.

func (Float16) Bits

func (f Float16) Bits() uint16

Bits returns the IEEE 754 binary16 representation of f, with the sign bit of f and the result in the same bit position. Bits(Frombits(x)) == x.

func (Float16) Float32

func (f Float16) Float32() float32

Float32 returns a float32 converted from f (Float16). This is a lossless conversion.

func (Float16) IsFinite

func (f Float16) IsFinite() bool

IsFinite returns true if f is neither infinite nor NaN.

func (Float16) IsInf

func (f Float16) IsInf(sign int) bool

IsInf reports whether f is an infinity (inf). A sign > 0 reports whether f is positive inf. A sign < 0 reports whether f is negative inf. A sign == 0 reports whether f is either inf.

func (Float16) IsNaN

func (f Float16) IsNaN() bool

IsNaN reports whether f is an IEEE 754 binary16 “not-a-number” value.

func (Float16) IsNormal

func (f Float16) IsNormal() bool

IsNormal returns true if f is neither zero, infinite, subnormal, or NaN.

func (Float16) IsQuietNaN

func (f Float16) IsQuietNaN() bool

IsQuietNaN reports whether f is a quiet (non-signaling) IEEE 754 binary16 “not-a-number” value.

func (Float16) Signbit

func (f Float16) Signbit() bool

Signbit reports whether f is negative or negative zero.

func (Float16) String

func (f Float16) String() string

String satisfies the fmt.Stringer interface.

type Precision

type Precision int

Precision indicates whether the conversion to Float16 is exact, subnormal without dropped bits, inexact, underflow, or overflow.

const (

	// PrecisionExact is for non-subnormals that don't drop bits during conversion.
	// All of these can round-trip.  Should always convert to float16.
	PrecisionExact Precision = iota

	// PrecisionUnknown is for subnormals that don't drop bits during conversion but
	// not all of these can round-trip so precision is unknown without more effort.
	// Only 2046 of these can round-trip and the rest cannot round-trip.
	PrecisionUnknown

	// PrecisionInexact is for dropped significand bits and cannot round-trip.
	// Some of these are subnormals. Cannot round-trip float32->float16->float32.
	PrecisionInexact

	// PrecisionUnderflow is for Underflows. Cannot round-trip float32->float16->float32.
	PrecisionUnderflow

	// PrecisionOverflow is for Overflows. Cannot round-trip float32->float16->float32.
	PrecisionOverflow
)

func PrecisionFromfloat32

func PrecisionFromfloat32(f32 float32) Precision

PrecisionFromfloat32 returns Precision without performing the conversion. Conversions from both Infinity and NaN values will always report PrecisionExact even if NaN payload or NaN-Quiet-Bit is lost. This function is kept simple to allow inlining and run < 0.5 ns/op, to serve as a fast filter.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL