ecsobserver

package module

Version: v0.99.0 (latest) | Published: Apr 22, 2024 | License: Apache-2.0 | Imports: 23 | Imported by: 5

Repository

github.com/open-telemetry/opentelemetry-collector-contrib

README ¶

Amazon Elastic Container Service Observer

Status
Stability	beta
Distributions	contrib
Issues
Code Owners	@dmitryax, @rmfitzpatrick

The ecsobserver uses the ECS/EC2 API to discover Prometheus scrape targets from all running tasks and filters them based on service names, task definitions and container labels.

NOTE: If you run the collector as a sidecar, you should consider using the ECS resource detector instead. However, it does not provide service, EC2 instance and similar information because it only queries the local API.

Config

The configuration is based on the existing cloudwatch agent ECS discovery. A full collector config looks like the following:

extensions:
  ecs_observer:
    refresh_interval: 60s # format is https://golang.org/pkg/time/#ParseDuration
    cluster_name: 'Cluster-1' # cluster name must be configured manually
    cluster_region: 'us-west-2' # region can be configured directly or via the AWS_REGION env var
    result_file: '/etc/ecs_sd_targets.yaml' # the directory for the file must already exist
    services:
      - name_pattern: '^retail-.*$'
    docker_labels:
      - port_label: 'ECS_PROMETHEUS_EXPORTER_PORT'
    task_definitions:
      - job_name: 'task_def_1'
        metrics_path: '/metrics'
        metrics_ports:
          - 9113
          - 9090
        arn_pattern: '.*:task-definition/nginx:[0-9]+'

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "ecs-task"
          file_sd_configs:
            - files:
                - '/etc/ecs_sd_targets.yaml' # MUST match the file name in ecs_observer.result_file
          relabel_configs: # Relabel here because label with __ prefix will be dropped by receiver.
            - source_labels: [ __meta_ecs_cluster_name ] # ClusterName
              action: replace
              target_label: ClusterName
            - source_labels: [ __meta_ecs_service_name ] # ServiceName
              action: replace
              target_label: ServiceName
            - action: labelmap # Convert docker labels on container to metric labels
              regex: ^__meta_ecs_container_labels_(.+)$ # Capture the key using regex, e.g. __meta_ecs_container_labels_Java_EMF_Metrics -> Java_EMF_Metrics
              replacement: '$$1'

processors:
  batch:

# Use awsemf for CloudWatch Container Insights Prometheus. The extension does not have requirement on exporter.
exporters:
  awsemf:

service:
  pipelines:
    metrics:
      receivers: [ prometheus ]
      processors: [ batch ]
      exporters: [ awsemf ]
  extensions: [ ecs_observer ]

Name	Required	Description
cluster_name	Mandatory	target ECS cluster name for service discovery
cluster_region	Mandatory	target ECS cluster's AWS region name
refresh_interval	Optional	how often to look for changes in endpoints (default: 10s)
result_file	Mandatory	path of YAML file to write scrape target results. NOTE: the observer always returns empty in initial implementation
services	Optional	list of service name patterns (see "ECS Service Name based filter Configuration")
task_definitions	Optional	list of task definition ARN patterns (see "ECS Task Definition based filter Configuration")
docker_labels	Optional	list of docker labels (see "Docker Label based filter Configuration")

Output configuration

result_file specifies where to write the discovered targets. It MUST match one of the files defined in file_sd_configs for the prometheus receiver. See Output Format for the detailed format.

Filters configuration

There are three types of filters, and they share the following common optional properties.

job_name
metrics_path
metrics_ports: an array of port numbers

Example

ecs_observer:
  job_name: 'ecs-sd-job'
  services:
    - name_pattern: ^retail-.*$
      container_name_pattern: ^java-api-v[12]$
    - name_pattern: game
      metrics_path: /v3/343
      job_name: guilty-spark
  task_definitions:
    - arn_pattern: '.*memcached.*'
    - arn_pattern: '^proxy-.*$'
      metrics_ports:
        - 9113
        - 9090
      metrics_path: /internal/metrics
  docker_labels:
    - port_label: ECS_PROMETHEUS_EXPORTER_PORT
    - port_label: ECS_PROMETHEUS_EXPORTER_PORT_V2
      metrics_path_label: ECS_PROMETHEUS_EXPORTER_METRICS_PATH

ECS Service Name based filter Configuration

Name	Required	Description
name_pattern	Mandatory	Regex pattern to match against ECS service name
metrics_ports	Mandatory	list of container ports. Only containers that expose these ports will be discovered
container_name_pattern	Optional	ECS task container name regex pattern

ECS Task Definition based filter Configuration

Name	Required	Description
arn_pattern	Mandatory	Regex pattern to match against ECS task definition ARN
metrics_ports	Mandatory	list of container ports. Only containers that expose these ports will be discovered
container_name_pattern	Optional	ECS task container name regex pattern

Docker Label based filter Configuration

Specify the label keys whose values are looked up on each container.

Name	Required	Description
port_label	Mandatory	container's docker label name that specifies the metrics port
metrics_path_label	Optional	container's docker label name that specifies the metrics path. (Default: "")
job_name_label	Optional	container's docker label name that specifies the scrape job name. (Default: "")

Authentication

The extension uses the default AWS credential chain. On ECS, using an ECS task role is advised. Deploy the collector as an ECS task/service with the following permissions.

EC2 access is required to get the private IP for the ECS EC2 launch type. The EC2 permission can be removed if you only use Fargate, because the task IP comes from awsvpc instead of the host.

ec2:DescribeInstances
ecs:ListTasks
ecs:ListServices
ecs:DescribeContainerInstances
ecs:DescribeServices
ecs:DescribeTasks
ecs:DescribeTaskDefinition

Discovery mechanism

The extension polls the ECS API periodically to get all running tasks and filters them down to scrape targets. There are three types of filters for discovering targets; targets that match a filter are kept. Targets from different filters are merged based on address/metrics_path before updating/creating the receiver.
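
The merge step can be pictured with a minimal sketch. The `target` type and `mergeTargets` function are hypothetical names for illustration, not the extension's own types, and keeping the first target seen for a given address/metrics_path pair is an assumption; the text above only says that the pair is the merge key.

```go
package main

import "fmt"

// target is a hypothetical, simplified scrape target (not the extension's type).
type target struct {
	Address     string
	MetricsPath string
	Job         string
}

// mergeTargets flattens the per-filter results and keeps one target per
// (address, metrics_path) pair. Keeping the first occurrence is an assumption.
func mergeTargets(groups ...[]target) []target {
	type key struct{ address, path string }
	seen := make(map[key]bool)
	var merged []target
	for _, group := range groups {
		for _, t := range group {
			k := key{t.Address, t.MetricsPath}
			if seen[k] {
				continue // already discovered by an earlier filter
			}
			seen[k] = true
			merged = append(merged, t)
		}
	}
	return merged
}

func main() {
	byService := []target{{Address: "10.0.0.1:9090", MetricsPath: "/metrics", Job: "svc"}}
	byLabel := []target{
		{Address: "10.0.0.1:9090", MetricsPath: "/metrics", Job: "label"},
		{Address: "10.0.0.2:9113", MetricsPath: "/metrics", Job: "label"},
	}
	for _, t := range mergeTargets(byService, byLabel) {
		fmt.Println(t.Address, t.MetricsPath, t.Job)
	}
}
```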

ECS Service Name based filter

An ECS Service is a deployment that manages multiple tasks with the same definition (like Deployment and DaemonSet in k8s).

The service configuration matches both service name and container name (if not empty).

NOTE: name of the service is added as label value with key ServiceName.

# Example 1: Matches all containers that are started by retail-* service
name_pattern: ^retail-.*$
---
# Example 2: Matches all containers with name java-api in cash-app service
name_pattern: ^cash-app$
container_name_pattern: ^java-api$
---
# Example 3: Override default metrics_path (i.e. /metrics)
name_pattern: ^log-replay-worker$
metrics_path: /v3/metrics

ECS Task Definition based filter

An ECS task definition contains one or more containers (like Pod in k8s). Long running applications normally use a service, while short running (batch) jobs can be created from task definitions directly.

The task definition filter matches both the task definition ARN and the container name (if not empty). Optional config like metrics_path, metrics_ports, job_name can override the default values.

# Example 1: Matches all the tasks created from task definition that contains memcached in its arn
arn_pattern: ".*memcached.*"
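
The sketch below shows how an ARN pattern behaves when evaluated with Go's `regexp` package, which is an assumption about the matcher. The ARN and account ID are made up for illustration, and `matchARN` is a hypothetical helper. Two things are worth noticing: matching is unanchored unless the pattern anchors itself, and Go rejects a pattern that begins with a bare `*`, so a "contains" match has to be written as `.*name.*`.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchARN compiles pattern and reports whether it matches anywhere in arn.
func matchARN(pattern, arn string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	return re.MatchString(arn), nil
}

func main() {
	// Made-up task definition ARN, used only for this example.
	arn := "arn:aws:ecs:us-west-2:123456789012:task-definition/nginx:7"
	patterns := []string{
		`.*:task-definition/nginx:[0-9]+`, // matches family nginx at any revision
		`^nginx`,                          // anchored at the start of the ARN, so no match
		`*nginx.*`,                        // invalid: nothing precedes the leading *
	}
	for _, p := range patterns {
		ok, err := matchARN(p, arn)
		fmt.Printf("%-35q matched=%v err=%v\n", p, ok, err)
	}
}
```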

Docker Label based filter

Docker labels can be specified in the task definition. Only port_label is used when checking whether the container should be included. Optional config like metrics_path_label, job_name_label can override the default values.

# Example 1: Matches all the containers that have label ECS_PROMETHEUS_EXPORTER_PORT_NGINX
port_label: 'ECS_PROMETHEUS_EXPORTER_PORT_NGINX'
---
# Example 2: Override job name based on label MY_APP_JOB_NAME
port_label: 'ECS_PROMETHEUS_EXPORTER_PORT_MY_APP'
job_name_label: 'MY_APP_JOB_NAME'
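
A minimal sketch of the label lookup follows. The type and method names are hypothetical, and treating a non-numeric port label as "no match" is an assumption; the text above only states that port_label decides inclusion and that the other labels override defaults.

```go
package main

import (
	"fmt"
	"strconv"
)

// dockerLabelFilter is a hypothetical stand-in for one entry under `docker_labels`.
type dockerLabelFilter struct {
	PortLabel        string
	MetricsPathLabel string
	JobNameLabel     string
}

// labelTarget holds the scrape settings derived from a container's labels.
type labelTarget struct {
	Port        int
	MetricsPath string
	Job         string
}

// match returns the scrape settings when the container carries the port label.
func (f dockerLabelFilter) match(labels map[string]string) (labelTarget, bool) {
	raw, ok := labels[f.PortLabel]
	if !ok {
		return labelTarget{}, false
	}
	port, err := strconv.Atoi(raw)
	if err != nil {
		return labelTarget{}, false // assumption: unparsable port means no target
	}
	result := labelTarget{Port: port, MetricsPath: "/metrics"}
	if f.MetricsPathLabel != "" && labels[f.MetricsPathLabel] != "" {
		result.MetricsPath = labels[f.MetricsPathLabel]
	}
	if f.JobNameLabel != "" && labels[f.JobNameLabel] != "" {
		result.Job = labels[f.JobNameLabel]
	}
	return result, true
}

func main() {
	f := dockerLabelFilter{
		PortLabel:    "ECS_PROMETHEUS_EXPORTER_PORT_MY_APP",
		JobNameLabel: "MY_APP_JOB_NAME",
	}
	labels := map[string]string{
		"ECS_PROMETHEUS_EXPORTER_PORT_MY_APP": "9404",
		"MY_APP_JOB_NAME":                     "my-app",
	}
	fmt.Println(f.match(labels))
}
```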

Notify Prometheus Receiver of discovered targets

There are three ways to notify a receiver:

Use file based service discovery in the prometheus config and update the file.
Use the receiver creator framework to create a new receiver for new endpoints.
Register as a prometheus discovery plugin.

Generate target file for file based discovery

Status: implemented

This is the current approach used by cloudwatch-agent and is also recommended by Prometheus. It is easier to debug; the main drawback is that it only works for Prometheus. Another minor issue is that fsnotify may occasionally not work properly and delay the update.

Receiver creator framework

Status: pending

This is a generic approach that creates a new receiver at runtime based on discovered endpoints. The main problem is performance, as described in this issue.

Register as prometheus discovery plugin

Status: pending

Because both the collector and Prometheus are written in Go, we can call discovery.RegisterConfig to make it a valid config for Prometheus (like other in-tree plugins such as kubernetes). The drawback is that the configuration then lives under prometheus instead of the extension, which can cause confusion.

Output Format

Example in unit test.

The format is based on cloudwatch agent, ec2 sd and kubernetes sd. Task and labels from the task definition are always included. EC2 info is only included when the task is running on ECS EC2 (i.e. not on Fargate).

Unlike cloudwatch agent, all the additional labels start with the __meta_ecs_ prefix. If they are not renamed during relabeling, they are all dropped by the prometheus receiver and are not passed down the pipeline.
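
The effect of relabeling can be sketched as follows. This is not the Prometheus relabel engine; it is a simplified, hypothetical `relabel` function that mimics the three rules from the full collector config earlier (two `replace` rules and one `labelmap` rule) and then drops every label that still starts with `__`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Same regex as the labelmap rule in the example collector config.
var containerLabelRe = regexp.MustCompile(`^__meta_ecs_container_labels_(.+)$`)

// relabel renames selected __meta_ecs_ labels and drops the remaining
// labels that carry the reserved "__" prefix.
func relabel(discovered map[string]string) map[string]string {
	out := make(map[string]string)
	for name, value := range discovered {
		// labelmap: the captured key becomes the new label name.
		if m := containerLabelRe.FindStringSubmatch(name); m != nil {
			out[m[1]] = value
		}
		// replace: copy the value into a label without the reserved prefix.
		switch name {
		case "__meta_ecs_cluster_name":
			out["ClusterName"] = value
		case "__meta_ecs_service_name":
			out["ServiceName"] = value
		}
		// Labels without the "__" prefix survive unchanged.
		if !strings.HasPrefix(name, "__") {
			out[name] = value
		}
	}
	return out
}

func main() {
	discovered := map[string]string{
		"__meta_ecs_cluster_name":                      "Cluster-1",
		"__meta_ecs_task_launch_type":                  "FARGATE",
		"__meta_ecs_container_labels_Java_EMF_Metrics": "true",
		"job": "ecs-task",
	}
	fmt.Println(relabel(discovered))
}
```

In this example `__meta_ecs_task_launch_type` has no rule, so it does not appear in the result.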

The number of dimensions supported by the AWS EMF exporter is limited by its backend. The labels can be modified/filtered at different stages: prometheus receiver relabel, Metrics Transform Processor and EMF exporter Metric Declaration.

Essential Labels

Required for prometheus to scrape the target.

Label Name	Source	Type	Description
`__address__`	ECS Task and TaskDefinition	string	`host:port` `host` is private ip from ECS Task, `port` is the mapped port
`__metrics_path__`	ECS TaskDefinition or Config	string	Default is `/metrics`, changes based on config/label
`job`	ECS TaskDefinition or Config	string	Name for scrape job

Additional Labels

Additional information from ECS and EC2.

Label Name	Source	Type	Description
`__meta_ecs_task_definition_family`	ECS TaskDefinition	string	Name for registered task definition
`__meta_ecs_task_definition_revision`	ECS TaskDefinition	int	Version of the task definition being used to run the task
`__meta_ecs_task_launch_type`	ECS Task	string	`EC2` or `FARGATE`
`__meta_ecs_task_group`	ECS Task	string	Task Group is `service:my-service-name` or specified when launching task directly
`__meta_ecs_task_tags_<tagkey>`	ECS Task	string	Tags specified in CreateService and RunTask
`__meta_ecs_task_container_name`	ECS Task	string	Name of container
`__meta_ecs_task_container_label_<labelkey>`	ECS TaskDefinition	string	Docker label specified in task definition
`__meta_ecs_task_health_status`	ECS Task	string	`HEALTHY` or `UNHEALTHY`. `UNKNOWN` if not configured
`__meta_ecs_ec2_instance_id`	EC2	string	EC2 instance id for `EC2` launch type
`__meta_ecs_ec2_instance_type`	EC2	string	EC2 instance type e.g. `t3.medium`, `m6g.xlarge`
`__meta_ecs_ec2_tags_<tagkey>`	EC2	string	Tags specified when creating the EC2 instance
`__meta_ecs_ec2_vpc_id`	EC2	string	ID of VPC e.g. `vpc-abcdefeg`
`__meta_ecs_ec2_private_ip`	EC2	string	Private IP
`__meta_ecs_ec2_public_ip`	EC2	string	Public IP, if available

Serialization

Labels: all label values are encoded as strings (e.g. strconv.Itoa(123)).
Go struct: all the non-string types are converted. Labels and tags are passed as map[string]string instead of []KeyValue.
Prometheus target, each target

// PrometheusECSTarget contains address and labels extracted from a running ECS task 
// and its underlying EC2 instance (if available).
// 
// For serialization
// - FromLabels and ToLabels converts it between map[string]string.
// - FromTargetYAML and ToTargetYAML converts it between prometheus file discovery format in YAML. 
// - FromTargetJSON and ToTargetJSON converts it between prometheus file discovery format in JSON. 
type PrometheusECSTarget struct {
	Address                string            `json:"address"`
	MetricsPath            string            `json:"metrics_path"`
	Job                    string            `json:"job"`
	TaskDefinitionFamily   string            `json:"task_definition_family"`
	TaskDefinitionRevision int               `json:"task_definition_revision"`
	TaskLaunchType         string            `json:"task_launch_type"`
	TaskGroup              string            `json:"task_group"`
	TaskTags               map[string]string `json:"task_tags"`
	ContainerName          string            `json:"container_name"`
	ContainerLabels        map[string]string `json:"container_labels"`
	HealthStatus           string            `json:"health_status"`
	EC2InstanceId          string            `json:"ec2_instance_id"`
	EC2InstanceType        string            `json:"ec2_instance_type"`
	EC2Tags                map[string]string `json:"ec2_tags"`
	EC2VPCId               string            `json:"ec2_vpc_id"`
	EC2PrivateIP           string            `json:"ec2_private_ip"`
	EC2PublicIP            string            `json:"ec2_public_ip"`
}
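
As an illustration of the "everything becomes a string" rule, the sketch below converts a reduced, hypothetical subset of the struct into a label map. `miniTarget` and `toLabels` are illustrative names and do not reproduce the real ToLabels method; the label names are taken from the tables above.

```go
package main

import (
	"fmt"
	"strconv"
)

// miniTarget is a hypothetical subset of PrometheusECSTarget.
type miniTarget struct {
	Address                string
	TaskDefinitionFamily   string
	TaskDefinitionRevision int
	TaskTags               map[string]string
}

// toLabels flattens the struct into string labels.
func (m miniTarget) toLabels() map[string]string {
	labels := map[string]string{
		"__address__":                       m.Address,
		"__meta_ecs_task_definition_family": m.TaskDefinitionFamily,
		// Non-string values are encoded as strings.
		"__meta_ecs_task_definition_revision": strconv.Itoa(m.TaskDefinitionRevision),
	}
	// Each tag becomes its own label under a common prefix.
	for key, value := range m.TaskTags {
		labels["__meta_ecs_task_tags_"+key] = value
	}
	return labels
}

func main() {
	m := miniTarget{
		Address:                "10.0.0.1:9090",
		TaskDefinitionFamily:   "nginx",
		TaskDefinitionRevision: 7,
		TaskTags:               map[string]string{"team": "web"},
	}
	fmt.Println(m.toLabels())
}
```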

Delta

Delta is not supported because there is no watch API in ECS (unlike k8s, see known issues). The output always contains all the targets. Callers/consumers need to implement their own logic to calculate the target diff if they only want to process new targets.
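
A consumer that only cares about changes could compute the diff itself. The following is a minimal sketch, assuming targets are identified by a unique address string; `diffTargets` is a hypothetical helper and not part of this package.

```go
package main

import "fmt"

// diffTargets compares two full snapshots and returns the addresses that
// appeared and disappeared. Addresses are assumed unique within a snapshot.
func diffTargets(prev, curr []string) (added, removed []string) {
	prevSet := make(map[string]struct{}, len(prev))
	for _, addr := range prev {
		prevSet[addr] = struct{}{}
	}
	currSet := make(map[string]struct{}, len(curr))
	for _, addr := range curr {
		currSet[addr] = struct{}{}
		if _, ok := prevSet[addr]; !ok {
			added = append(added, addr)
		}
	}
	for _, addr := range prev {
		if _, ok := currSet[addr]; !ok {
			removed = append(removed, addr)
		}
	}
	return added, removed
}

func main() {
	prev := []string{"10.0.0.1:9090", "10.0.0.2:9090"}
	curr := []string{"10.0.0.2:9090", "10.0.0.3:9090"}
	added, removed := diffTargets(prev, curr)
	fmt.Println("added:", added, "removed:", removed)
}
```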

Known issues

There is no list watch API in ECS (unlike k8s), so we fetch ALL the tasks and filter them locally. If the poll interval is too short or there are multiple instances doing discovery, you may hit the (undocumented) API rate limit. In-memory caching is implemented to reduce calls for task definitions and EC2.
A single collector may not be able to handle a large cluster; you can use hashmod in relabel_config to do static sharding. However, too many collectors may trigger the rate limit on the AWS API, because each shard fetches ALL the tasks during discovery regardless of the number of shards.
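
The idea behind hashmod sharding can be sketched in a few lines: hash the target address, take the result modulo the number of shards, and let each collector keep only the targets whose result equals its own shard number. The hash below (MD5, last 8 bytes, big endian) is written to resemble what Prometheus does for `hashmod`, but treat that detail as an assumption; `hashMod` and `ownedBy` are hypothetical helpers.

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

// hashMod maps a label value to a bucket in [0, modulus).
func hashMod(value string, modulus uint64) uint64 {
	sum := md5.Sum([]byte(value))
	return binary.BigEndian.Uint64(sum[8:]) % modulus
}

// ownedBy reports whether the collector with the given shard number should
// scrape the target, mirroring a hashmod rule followed by a keep rule.
func ownedBy(address string, shard, shards uint64) bool {
	return hashMod(address, shards) == shard
}

func main() {
	addresses := []string{"10.0.0.1:9090", "10.0.0.2:9090", "10.0.0.3:9113"}
	const shards = 2
	for _, addr := range addresses {
		fmt.Printf("%s -> shard %d of %d\n", addr, hashMod(addr, shards), shards)
	}
}
```

Every collector still discovers the full target list; sharding only reduces what each one scrapes.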

Implementation

The implementation has two parts: the core ECS service discovery logic and an adapter for notifying discovery results.

Packages

extension/observer/ecsobserver main logic
internal/ecsmock mock ECS cluster
internal/errctx structured error wrapping

Flow

The following pseudocode shows the overall flow.

NewECSSD() {
  session := awsconfig.NewSession()
  ecsClient := awsecs.NewClient(session)
  filters := config.NewFilters()
  decorator := awsec2.NewClient(session)
  for {
    select {
    case <- timer:
      // Fetch ALL
      tasks := ecsClient.FetchAll()
      // Filter
      filteredTasks := filters.Apply(tasks)
      // Add EC2 info
      decorator.Apply(filteredTasks)
      // Generate output
      if writeResultFile {
         writeFile(filteredTasks, "/etc/ecs_sd.yaml")
      } else {
          notifyObserver()
      }
    }
  }
}

Metrics

The following metrics are logged at debug level. TODO(pingleig): Is there a way for otel plugins to export custom metrics to otel's own /metrics?

Name	Type	Description
`discovered_targets`	int	Number of targets exported
`discovered_tasks`	int	Number of tasks that contain scrape targets, should be smaller than targets unless each task only contains one target
`ignored_tasks`	int	Tasks ignored by filter, `discovered_tasks` and `ignored_tasks` should add up to `api_ecs_list_task_results`, one exception is when API paging failed in the middle
`targets_matched_by_service`	int	ECS Service name based filter
`targets_matched_by_task_definition`	int	ECS TaskDefinition based filter
`targets_matched_by_docker_label`	int	ECS DockerLabel based filter
`target_error_noip`	int	Export failures because private ip not found
`api_ecs_list_task_results`	int	Total number of tasks returned from ECS ListTask API
`api_ecs_list_service_results`	int	Total number of services returned from ECS ListService API
`api_error_auth`	int	Total number of errors triggered by permission
`api_error_rate_limit`	int	Total number of errors triggered by rate limit
`cache_size_container_instances`	int	Cached ECS ContainerInstance
`cache_hit_container_instance`	int	Cache hit during the latest polling
`cache_size_ec2_instance`	int	Cached EC2 Instance
`cache_hit_ec2_instance`	int	Cache hit during the latest polling

Error Handling

Auth and cluster-not-found errors cause the extension to stop (calling ReportStatus). Although an IAM role can be updated at runtime without restarting the collector, it is better to fail so that the problem is obvious. The same applies to cluster not found. In the future we can add config to downgrade those errors if users want to monitor an ECS cluster with a collector running outside the cluster; the collector can run anywhere as long as it can reach the scrape targets and the AWS API.
If we hit a non-critical error, we overwrite the existing file with whatever targets we have; we might not have all the targets due to throttling etc.

Unit Test

A mock ECS and EC2 server is in internal/ecsmock, see fetcher_test for its usage.

Integration Test

Will be implemented in AOT Testing Framework to run against actual ECS service on both EC2 and Fargate.

Changelog

2021-06-02 first version that actually works on ECS by @pingleig, thanks @anuraaga @Aneurysm9 @jrcamp @mxiamxia for reviewing (all the PRs ...)
2021-02-24 Updated doc by @pingleig
2020-12-29 Initial implementation by Raphael in #1920

Documentation ¶

Index ¶

func NewFactory() extension.Factory
type CommonExporterConfig
type Config
- func (c *Config) Validate() error
type DockerLabelConfig
type ServiceConfig
type TaskDefinitionConfig

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func NewFactory ¶

func NewFactory() extension.Factory

NewFactory creates a factory for ECSObserver extension.

Types ¶

type CommonExporterConfig ¶

type CommonExporterConfig struct {
	JobName      string `mapstructure:"job_name" yaml:"job_name"`
	MetricsPath  string `mapstructure:"metrics_path" yaml:"metrics_path"`
	MetricsPorts []int  `mapstructure:"metrics_ports" yaml:"metrics_ports"`
}

CommonExporterConfig should be embedded into filter config. They set labels like job, metrics_path etc. that can override prometheus default.

type Config ¶

type Config struct {

	// ClusterName is the target ECS cluster name for service discovery.
	ClusterName string `mapstructure:"cluster_name" yaml:"cluster_name"`
	// ClusterRegion is the target ECS cluster's AWS region.
	ClusterRegion string `mapstructure:"cluster_region" yaml:"cluster_region"`
	// RefreshInterval determines the frequency at which the observer
	// needs to poll for collecting information about new processes.
	RefreshInterval time.Duration `mapstructure:"refresh_interval" yaml:"refresh_interval"`
	// ResultFile is the output path of the discovered targets YAML file (optional).
	// This is mainly used in conjunction with the Prometheus receiver.
	ResultFile string `mapstructure:"result_file" yaml:"result_file"`
	// JobLabelName is the override for prometheus job label, using `job` literal will cause error
	// in otel prometheus receiver. See https://github.com/open-telemetry/opentelemetry-collector/issues/575
	JobLabelName string `mapstructure:"job_label_name" yaml:"job_label_name"`
	// Services is a list of service name patterns for filtering tasks.
	Services []ServiceConfig `mapstructure:"services" yaml:"services"`
	// TaskDefinitions is a list of task definition arn patterns for filtering tasks.
	TaskDefinitions []TaskDefinitionConfig `mapstructure:"task_definitions" yaml:"task_definitions"`
	// DockerLabels is a list of docker labels for filtering containers within tasks.
	DockerLabels []DockerLabelConfig `mapstructure:"docker_labels" yaml:"docker_labels"`
}

func (*Config) Validate ¶ added in v0.28.0

func (c *Config) Validate() error

Validate overrides the embedded noop validation so that load config can trigger our own validation logic.

type DockerLabelConfig ¶

type DockerLabelConfig struct {
	CommonExporterConfig `mapstructure:",squash" yaml:",inline"`

	// PortLabel is mandatory, empty string means docker label based match is skipped.
	PortLabel        string `mapstructure:"port_label" yaml:"port_label"`
	JobNameLabel     string `mapstructure:"job_name_label" yaml:"job_name_label"`
	MetricsPathLabel string `mapstructure:"metrics_path_label" yaml:"metrics_path_label"`
}

DockerLabelConfig matches all tasks based on their docker label.

NOTE: it's possible to make DockerLabelConfig part of CommonExporterConfig and use it in both ServiceConfig and TaskDefinitionConfig. However, based on existing users, few people mix different types of filters. If that use case arises in the future, we can rewrite the top level docker label filter using a task definition filter with arn_pattern:*.

type ServiceConfig ¶

type ServiceConfig struct {
	CommonExporterConfig `mapstructure:",squash" yaml:",inline"`

	// NamePattern is mandatory.
	NamePattern string `mapstructure:"name_pattern" yaml:"name_pattern"`
	// ContainerNamePattern is optional, empty string means all containers in that service would be exported.
	// Otherwise both service and container name patterns need to match.
	ContainerNamePattern string `mapstructure:"container_name_pattern" yaml:"container_name_pattern"`
}

type TaskDefinitionConfig ¶

type TaskDefinitionConfig struct {
	CommonExporterConfig `mapstructure:",squash" yaml:",inline"`

	// ArnPattern is mandatory, empty string means arn based match is skipped.
	ArnPattern string `mapstructure:"arn_pattern" yaml:"arn_pattern"`
	// ContainerNamePattern is optional, empty string means all containers in that task definition would be exported.
	// Otherwise both task definition and container name patterns need to match.
	ContainerNamePattern string `mapstructure:"container_name_pattern" yaml:"container_name_pattern"`
}

Directories ¶

Path	Synopsis
internal
ecsmock	Package ecsmock implements mock server for ECS service API.
errctx	Package errctx allow attaching values to an error in a structural way using WithValue and read the value out using ValueFrom.
metadata