pospop

package module
v1.3.5 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Feb 6, 2024 License: BSD-2-Clause Imports: 1 Imported by: 1

README

High-performance vectorised positional popcount routines for Go
===============================================================

This repository contains implementations of the positional population
count functions for Go.  Details on the algorithms used will be
published in a future research paper.

To use this library, import it as follows:

    import "github.com/clausecker/pospop"

You can then count populations using the Count8, Count16, Count32,
and Count64 functions:

    var counts [8]int
    pospop.Count8(&counts, buf)

The positional population count for buf is added to the contents of
counts.

Supported Platforms
-------------------

The kernels works on a block size of 240 or 480 bytes (depending on
whether AVX2 is available or not).  A buffer size that is a multiple
of 480 bytes and at least 10 kB in size is recommended.

Implementations are provided for the following SIMD extensions:

 * AVX-512 F/BW (amd64)
 * AVX2 (amd64, 386)
 * SSE2 (amd64, 386)
 * NEON (arm64)
 * generic kernel (all architectures)

The three kernels for amd64 correspond to the v4, v3, and v1 values
of the upcoming GOAMD64 environment variable.

Due to some required improvements in the assembler, the NEON kernel will
only be available on Go 1.16 or newer.  When building with earlier
versions of the tool chain, only the generic kernel is available.

The library automatically chooses the fastest available kernel for
the system it is running on.

Performance
-----------

As all functions
(Count8, Count16, Count32, Count64) of one set are based on the
same kernel with a different accumulation function, they all perform
equally well.  This does not apply to the generic implementations
whose performance is therefore given for every function individually.

The following performance table is grouped by the instruction set used
and the architecture it runs on.  A buffer size of 100 kB was used to
find these results.


		amd64		386		arm64		arm
avx512		82.1 GB/s	---		---		---
avx2		34.8 GB/s	31.6 GB/s	---		---
sse2		16.0 GB/s	15.6 GB/s	---		---
neon		---		---		36.9 GB/s	---
generic8	1.02 GB/s	297 MB/s	1.68 GB/s	49.0 MB/s
generic16	1.71 GB/s	1.36 GB/s	3.03 GB/s	67.1 MB/s
generic32	2.66 GB/s	2.21 GB/s	3.83 GB/s	105 MB/s
generic64	3.43 GB/s	1.89 GB/s	6.56 GB/s	82.9 MB/s

The following systems were used for benchmarks, all using Go 1.16:

 * amd64, 386: Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
 * arm64: Apple M1
 * arm: ARM Cortex-A72 r0p3 (Raspberry Pi 4B)

Remaining Work
--------------

 * provide assembly kernels for arm, ppcle, and others
   (hardware donations appreciated for further targets)
 * provide variants of Count16, Count32, and Count64 working on byte
   arrays

(c) 2020--2024 Robert Clausecker <fuz@fuz.su>.  All Rights Reserved.

This code is published under a 2-clause BSD license.  See the file
COPYING for details.

Documentation

Overview

Positional population counts.

This package contains a set of functions to compute positional population counts for arrays of uint8, uint16, uint32, or uint64. Optimised assembly optimisations are provided for amd64 (AVX-512, AVX2, SSE2), 386 (AVX2, SSE2), and ARM64 (NEON). An optimal implementation constrainted by the instruction set extensions available on your CPU is chosen automatically at runtime. If no assembly implementation exists, a generic fallback implementation will be used. The pospop package thus works on all architectures supported by the Go toolchain.

The kernels works on a block size of 240, 480, or 960 bytes. A buffer size that is a multiple of 64 bytes and at least 10 kB in size is recommended. The author's benchmarks show that a buffer size around 100 kB appears optimal.

See the example on the Count8 function for what the positional population count operation does.

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func Count16

func Count16(counts *[16]int, buf []uint16)

Count the number of corresponding set bits of the values in buf and add the results to counts. Each element of counts keeps track of a different place; counts[0] for 0x0001, counts[1] for 0x0002, and so on to counts[15] for 0x8000.

func Count32

func Count32(counts *[32]int, buf []uint32)

Count the number of corresponding set bits of the values in buf and add the results to counts. Each element of counts keeps track of a different place; counts[0] for 0x0000001, counts[1] for 0x00000002, and so on to counts[31] for 0x80000000.

func Count64

func Count64(counts *[64]int, buf []uint64)

Count the number of corresponding set bits of the values in buf and add the results to counts. Each element of counts keeps track of a different place; counts[0] for 0x000000000000001, counts[1] for 0x0000000000000002, and so on to counts[63] for 0x8000000000000000.

func Count8

func Count8(counts *[8]int, buf []uint8)

Count the number of corresponding set bits of the bytes in buf and add the results to counts. Each element of counts keeps track of a different place; counts[0] for 0x01, counts[1] for 0x02, and so on to counts[7] for 0x80.

Example

This example illustrates the positional population count operation. For each number in the input, Count8() checks which of its bits are set and increments the corresponding counters. In this example, four numbers (1, 3, 5, 9) have bit 0 set; three numbers (2, 3, 6) have bit 1 set, two numbers (5, 6) have bit 2 set and only the number 9 has bit 3 set.

var counts [8]int
numbers := []uint8{
	1, // bit 0 set
	2, // bit 1 set
	3, // bits 0 and 1 set
	5, // bits 0 and 2 set
	6, // bits 1 and 2 set
	9, // bits 0 and 3 set
}

Count8(&counts, numbers)
fmt.Println(counts)
Output:

[4 3 2 1 0 0 0 0]

Types

This section is empty.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL