topologyaware

package
v0.2.1-0...-d3ae3b1
Published: Feb 4, 2020 License: Apache-2.0 Imports: 17 Imported by: 0

README

Topology-Aware Policy

Overview

The topology-aware builtin policy splits the node into a tree of pools from which resources are then allocated to Containers. The tree of pools is currently constructed automatically using runtime-discovered hardware topology information about the node. The pools correspond to the topologically relevant HW components: sockets, NUMA nodes, and CPUs/cores. The root of the tree corresponds to the full HW available in the system, the next level to the individual sockets in the system, and the one below that to the individual NUMA nodes.

The main goal of the topology-aware policy is to distribute Containers among the pools (tree nodes) in a way that both maximizes Container performance and minimizes interference between the Containers of different Pods. This is accomplished by considering

  • topological characteristics of the Container's devices (topology hints)
  • potential hints provided by the user (in the form of policy-specific annotations)
  • current availability of hardware resources
  • other colocated Containers running on the node

Features

  • aligning workload CPU and memory with respect to the locality of devices used
  • exclusive CPU allocation from pools
  • discovering and using kernel-isolated CPUs for exclusive allocations
  • shared CPU allocation from pools
  • mixed (both exclusive and shared) allocation from pools
  • exposing the allocated CPU to Containers
  • notifying Containers about changes in allocation

Activating the Topology-Aware Policy

You can activate the topology-aware policy by setting the --policy option of cri-resmgr to topology-aware. For instance:

cri-resmgr --policy topology-aware --reserved-resources cpu=750m

Configuration

Command-line Options

There are a number of options specific to this policy:

  • --topology-aware-pin-cpu: Whether to pin Containers to the CPUs of the assigned pool.

  • --topology-aware-pin-memory: Whether to pin Containers to the memory of the assigned pool.

  • --topology-aware-prefer-isolated-cpus: Whether to try to allocate kernel-isolated CPUs for exclusive usage unless the Pod or Container is explicitly annotated otherwise.

  • --topology-aware-prefer-shared-cpus: Whether to allocate shared CPUs unless the Pod or Container is explicitly annotated otherwise.

Dynamic Configuration

The topology-aware policy can be configured dynamically using the node agent. It takes a JSON configuration with the following keys, corresponding to the above-mentioned command-line options:

  • PinCPU
  • PinMemory
  • PreferIsolatedCPUs
  • PreferSharedCPUs
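As a sketch, a dynamic configuration fragment setting all four keys might look like this (the values shown are illustrative and not necessarily the built-in defaults):

```json
{
  "PinCPU": true,
  "PinMemory": true,
  "PreferIsolatedCPUs": true,
  "PreferSharedCPUs": false
}
```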

See the documentation for information about dynamic configuration.

See the sample ConfigMap spec for an example which configures the topology-aware policy with the built-in defaults.

Container / Pod Allocation Policy Hints

The topology-aware policy recognizes a number of policy-specific annotations that can be used to provide hints and preferences about how resources should be allocated to the Containers. These hints are:

  • cri-resource-manager.intel.com/prefer-isolated-cpus: isolated exclusive CPU preference
  • cri-resource-manager.intel.com/prefer-shared-cpus: shared allocation preference
Isolated Exclusive CPUs

When kernel-isolated CPUs are available, the topology-aware policy will prefer to allocate those to any Container of a Pod in the Guaranteed QoS class if the Container's resource requirements ask for exactly 1 CPU. If multiple CPUs are requested, exclusive CPUs will be sliced off from the shared CPU set of the pool.

This default behavior can be changed using the --topology-aware-prefer-isolated-cpus boolean configuration option.

The global default behavior can also be overridden, per Pod or per Container, using the cri-resource-manager.intel.com/prefer-isolated-cpus annotation. Setting the value to true asks the policy to prefer isolated CPUs for exclusive allocation even if the Container asks for multiple CPUs, and to fall back to slicing off shared CPUs only when there is insufficient free isolated capacity. Similarly, setting the value of the annotation to false opts every Container in the Pod out of taking any isolated CPUs.

The same mechanism can be used to opt in to or out of isolated CPU usage per Container within the Pod by setting the value of the annotation to the string representation of a JSON object where each key is the name of a Container and each value is either true or false.
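Put together, a hypothetical Pod annotation opting one Container in and another out of isolated CPU usage could look like this (the container names are illustrative):

```yaml
metadata:
  annotations:
    cri-resource-manager.intel.com/prefer-isolated-cpus: |
      {"container-1": true, "container-2": false}
```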

Shared CPU Allocation

The topology-aware policy assumes a mixed exclusive+shared CPU allocation preference by default. Under this assumption, every Container of a Pod in the Guaranteed QoS class gets exclusive CPUs allocated worth the integer part of its CPU request, plus a portion of the pool's shared CPU set proportional to the fractional part of its CPU request. So for instance, a Container requesting 2.5 CPUs (2500 milli-CPUs) will by default get two exclusive CPUs allocated plus half a CPU's worth of capacity from the pool's CPU set shared with the other Containers in the same pool.
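The integer/fractional split described above can be sketched in Go. Note that splitCPURequest is a hypothetical helper for illustration, not part of the policy's actual API:

```go
package main

import "fmt"

// splitCPURequest splits a milli-CPU request into the number of full CPUs
// to allocate exclusively and the fractional milli-CPU remainder to serve
// from the pool's shared CPU set.
func splitCPURequest(milliCPU int) (fullCPUs, sharedMilliCPU int) {
	return milliCPU / 1000, milliCPU % 1000
}

func main() {
	// A request of 2.5 CPUs (2500 milli-CPUs) yields 2 exclusive CPUs
	// and 500 milli-CPUs served from the shared set.
	full, shared := splitCPURequest(2500)
	fmt.Printf("exclusive=%d sharedMilliCPU=%d\n", full, shared)
}
```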

This default behavior can be changed using the --topology-aware-prefer-shared-cpus boolean configuration option.

Pods or Containers can opt-out of this assumption using the cri-resource-manager.intel.com/prefer-shared-cpus annotation. Setting its value to true will cause the policy to always allocate the entire requested capacity for all Containers of the Pod from the shared CPUs of a pool. Setting the value to false will cause the policy to allocate any integer portion of the CPU request exclusively and any fractional part from the shared CPUs.

The same thing can be accomplished per Container by using as the value a JSON object, similarly to the isolated CPU preference annotation: the Container name as a key, and true or false as the value. Moreover, if a negative integer is used as the value, it is interpreted as true, with the absolute value indicating how many levels upward in the tree the Container should be displaced. For instance, setting the annotation value to

  "{\"container-1\": -1, \"container-2\": true}" (or `0` instead of `true`)

requests container-1 to be placed to the parent of the pool with the best fitting score and container-2 to be placed in the best fitting pool itself.
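Expressed as a Pod annotation, the example above could be written like this:

```yaml
metadata:
  annotations:
    cri-resource-manager.intel.com/prefer-shared-cpus: |
      {"container-1": -1, "container-2": true}
```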

Intra-Pod Container Affinity/Anti-affinity

Containers within a Pod can be annotated with affinity or anti-affinity rules, using the cri-resource-manager.intel.com/affinity and cri-resource-manager.intel.com/anti-affinity annotations.

Affinity indicates a soft pull preference, while anti-affinity indicates a soft push preference. The topology-aware policy will try to colocate Containers with affinity in the same pool, and place Containers with anti-affinity in different pools.

Here is an example snippet of a Pod Spec with

  • container3 having affinity to container1 and anti-affinity to container2,
  • container4 having anti-affinity to container2 and container3

metadata:
  annotations:
    cri-resource-manager.intel.com/affinity: |
      container3: [ container1 ]
    cri-resource-manager.intel.com/anti-affinity: |
      container3: [ container2 ]
      container4: [ container2, container3 ]

This is actually a shorthand notation for the following, as key defaults to io.kubernetes.container.name, and operator defaults to In.

metadata:
  annotations:
    cri-resource-manager.intel.com/affinity: |+
      container3:
      - match:
          key: io.kubernetes.container.name
          operator: In
          values:
          - container1
    cri-resource-manager.intel.com/anti-affinity: |+
      container3:
      - match:
          key: io.kubernetes.container.name
          operator: In
          values:
          - container2
      container4:
      - match:
          key: io.kubernetes.container.name
          operator: In
          values:
          - container2
          - container3

Affinity and anti-affinity can have weights assigned as well. If omitted, affinity weights default to 1 and anti-affinity weights to -1. The above example is actually represented internally by something equivalent to the following.

metadata:
  annotations:
    cri-resource-manager.intel.com/affinity: |+
      container3:
      - match:
          key: io.kubernetes.container.name
          operator: In
          values:
          - container1
        weight: 1
      - match:
          key: io.kubernetes.container.name
          operator: In
          values:
          - container2
        weight: -1
      container4:
      - match:
          key: io.kubernetes.container.name
          operator: In
          values:
          - container2
          - container3
        weight: -1

For a more detailed description see the documentation of annotations.

Documentation

Index

Constants

const (
	// PolicyName is the symbol used to pull us in as a builtin policy.
	PolicyName = "topology-aware"
	// PolicyDescription is a short description of this policy.
	PolicyDescription = "A policy for HW-topology aware workload placement."
	// PolicyPath is the path of this policy in the configuration hierarchy.
	PolicyPath = "policy." + PolicyName
)
const (
	IndentDepth = 4
)

indent produces an indentation string for the given level.

const (
	// OverfitPenalty is the per layer penalty for overfitting in the node tree.
	OverfitPenalty = 0.9
)

Variables

This section is empty.

Functions

func CreateTopologyAwarePolicy

func CreateTopologyAwarePolicy(opts *policyapi.BackendOptions) policyapi.Backend

CreateTopologyAwarePolicy creates a new policy instance.

Types

type CPUGrant

type CPUGrant interface {
	// GetContainer returns the container CPU capacity is granted to.
	GetContainer() cache.Container
	// GetNode returns the node that granted CPU capacity to the container.
	GetNode() Node
	// ExclusiveCPUs returns the exclusively granted non-isolated cpuset.
	ExclusiveCPUs() cpuset.CPUSet
	// SharedCPUs returns the shared granted cpuset.
	SharedCPUs() cpuset.CPUSet
	// SharedPortion returns the amount of CPUs in milli-CPU granted.
	SharedPortion() int
	// IsolatedCPUs returns the exclusively granted isolated cpuset.
	IsolatedCPUs() cpuset.CPUSet
	// String returns a printable representation of this grant.
	String() string
}

CPUGrant represents CPU capacity allocated to a container from a node.

type CPURequest

type CPURequest interface {
	// GetContainer returns the container requesting CPU capacity.
	GetContainer() cache.Container
	// String returns a printable representation of this request.
	String() string

	// FullCPUs return the number of full CPUs requested.
	FullCPUs() int
	// CPUFraction returns the amount of fractional milli-CPU requested.
	CPUFraction() int
	// Isolate returns whether isolated CPUs are preferred for this request.
	Isolate() bool
	// Elevate returns the requested elevation/allocation displacement for this request.
	Elevate() int
}

CPURequest represents the CPU resources requested by a container.

type CPUScore

type CPUScore interface {
	// Calculate the actual score from the collected parameters.
	Eval() float64
	// CPUSupply returns the supply associated with this score.
	CPUSupply() CPUSupply
	// CPURequest returns the request associated with this score.
	CPURequest() CPURequest

	IsolatedCapacity() int
	SharedCapacity() int
	Colocated() int
	HintScores() map[string]float64

	String() string
}

CPUScore represents how well a supply can satisfy a request.

type CPUSupply

type CPUSupply interface {
	// GetNode returns the node supplying this capacity.
	GetNode() Node
	// Clone creates a copy of this CPUSupply.
	Clone() CPUSupply
	// IsolatedCPUs returns the isolated cpuset in this supply.
	IsolatedCPUs() cpuset.CPUSet
	// SharableCPUs returns the sharable cpuset in this supply.
	SharableCPUs() cpuset.CPUSet
	// Granted returns the locally granted capacity in this supply.
	Granted() int
	// Cumulate cumulates the given supply into this one.
	Cumulate(CPUSupply)
	// AccountAllocate accounts for (removes) allocated exclusive capacity from the supply.
	AccountAllocate(CPUGrant)
	// AccountRelease accounts for (reinserts) released exclusive capacity into the supply.
	AccountRelease(CPUGrant)
	// GetScore calculates how well this supply fits/fulfills the given request.
	GetScore(CPURequest) CPUScore
	// Allocate allocates CPU capacity from this supply and returns it as a grant.
	Allocate(CPURequest) (CPUGrant, error)
	// Release releases a previously allocated grant.
	Release(CPUGrant)
	// String returns a printable representation of this supply.
	String() string
}

CPUSupply represents the available CPU capacity of a node.

type Node

type Node interface {
	// IsNil tests if this node is nil.
	IsNil() bool
	// Name returns the name of this node.
	Name() string
	// Kind returns the type of this node.
	Kind() NodeKind
	// NodeID returns the (enumerated) node id of this node.
	NodeID() int
	// Parent returns the parent node of this node.
	Parent() Node
	// Children returns the child nodes of this node.
	Children() []Node
	// LinkParent sets the given node as the parent node, and appends this node as its child.
	LinkParent(Node)
	// AddChildren appends the nodes to the children, *WITHOUT* updating their parents.
	AddChildren([]Node)
	// IsSameNode returns true if the given node is the same as this one.
	IsSameNode(Node) bool
	// IsRootNode returns true if this node has no parent.
	IsRootNode() bool
	// IsLeafNode returns true if this node has no children.
	IsLeafNode() bool
	// RootDistance returns the distance of this node from the root node.
	RootDistance() int
	// NodeHeight returns the height of this node (inverse of depth: tree depth - node depth).
	NodeHeight() int
	// System returns the policy sysfs instance.
	System() discoveredSystem
	// Policy returns the policy back pointer.
	Policy() *policy
	// DiscoverCPU discovers the CPU supply of this node.
	DiscoverCPU() CPUSupply
	// GetCPU returns the full CPU at this node.
	GetCPU() CPUSupply
	// FreeCPU returns the available CPU supply of this node.
	FreeCPU() CPUSupply
	// GrantedCPU returns the amount of granted shared CPU capacity of this node.
	GrantedCPU() int
	// GetMemset returns the memory node set of this node.
	GetMemset() system.IDSet
	// DiscoverMemset discovers the memory node set of this node.
	DiscoverMemset() system.IDSet
	// DepthFirst traverses the tree rooted at this node, calling the function at each node.
	DepthFirst(func(Node) error) error
	// BreadthFirst traverses the tree rooted at this node, calling the function at each node.
	BreadthFirst(func(Node) error) error
	// Dump dumps the state of the node.
	Dump(string, ...int)

	GetScore(CPURequest) CPUScore
	HintScore(system.TopologyHint) float64
	// contains filtered or unexported methods
}

Node is the abstract interface our partition tree nodes implement.

type NodeKind

type NodeKind string

NodeKind represents a unique node type.

const (
	// NilNode is the type of a nil node.
	NilNode NodeKind = ""
	// UnknownNode is the type of an unknown node.
	UnknownNode NodeKind = "unknown"
	// SocketNode represents a physical CPU package/socket in the system.
	SocketNode NodeKind = "socket"
	// NumaNode represents a NUMA node in the system.
	NumaNode NodeKind = "numa node"
	// VirtualNode represents a virtual node, currently the root node in multi-socket setups.
	VirtualNode NodeKind = "virtual node"
)
