EKS Node Monitoring Agent

The EKS Node Monitoring Agent detects health issues on Amazon EKS worker nodes by parsing system logs and surfacing status information through Kubernetes NodeConditions. When paired with Amazon EKS node auto repair, detected issues can trigger automatic node replacement or reboot.

For detailed configuration options and usage documentation, refer to the Amazon EKS Node Health documentation.

Overview

The agent runs as a DaemonSet on each node and monitors for issues across several categories:

  • Kernel - Process limits, kernel bugs, soft lockups
  • Networking - VPC CNI (IPAMD) issues, interface problems, connectivity
  • Storage - EBS throughput/IOPS limits, I/O delays
  • Container Runtime - Pod termination issues, probe failures
  • Accelerated Hardware - NVIDIA GPU errors (XID codes), AWS Neuron issues, DCGM diagnostics

For each category, the agent applies a dedicated NodeCondition to worker nodes (e.g., KernelReady, NetworkingReady, StorageReady, AcceleratedHardwareReady). These conditions integrate with Amazon EKS node auto repair to automatically remediate unhealthy nodes.
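To check what the agent has reported for a node, the conditions can be read from the Node object. The following is a minimal sketch, not part of this project: it decodes the JSON that `kubectl get node <node-name> -o json` produces and lists the conditions named above whose status is not "True". It assumes that "True" means healthy for these "...Ready" conditions. The `sample` document, its reason, and its message are hypothetical placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition mirrors the NodeCondition fields used in this sketch.
type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// node holds only the part of a Node object that carries conditions.
type node struct {
	Status struct {
		Conditions []condition `json:"conditions"`
	} `json:"status"`
}

// agentConditions contains the condition types named in this README.
// The agent may set others; extend the set as needed.
var agentConditions = map[string]bool{
	"KernelReady":              true,
	"NetworkingReady":          true,
	"StorageReady":             true,
	"AcceleratedHardwareReady": true,
}

// unhealthy returns the agent conditions whose status is not "True".
// Assumption: for these "...Ready" conditions, "True" means healthy.
func unhealthy(nodeJSON []byte) ([]condition, error) {
	var n node
	if err := json.Unmarshal(nodeJSON, &n); err != nil {
		return nil, fmt.Errorf("decode node: %w", err)
	}
	var out []condition
	for _, c := range n.Status.Conditions {
		if agentConditions[c.Type] && c.Status != "True" {
			out = append(out, c)
		}
	}
	return out, nil
}

// sample is a hypothetical, trimmed Node document used only for illustration.
const sample = `{"status":{"conditions":[
  {"type":"Ready","status":"True"},
  {"type":"KernelReady","status":"True"},
  {"type":"NetworkingReady","status":"False","reason":"ExampleReason","message":"example message"}
]}}`

func main() {
	// In practice, replace sample with the output of:
	//   kubectl get node <node-name> -o json
	conds, err := unhealthy([]byte(sample))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range conds {
		fmt.Printf("%s=%s reason=%s: %s\n", c.Type, c.Status, c.Reason, c.Message)
	}
}
```

Conditions that are absent from the node are not reported as unhealthy here, which matches the behavior of disabled monitors: a missing condition means "not monitored", not "failed".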

Project Layout

.
├── api/                    # API definitions and CRDs
├── charts/                 # Helm chart for deployment
├── cmd/                    # Application entry point
├── examples/               # Integration examples
├── hack/                   # Build and utility scripts
├── monitors/               # Health monitoring plugins
├── pkg/                    # Core packages
└── test/                   # Integration tests

Installation

We recommend installing the EKS Node Monitoring Agent as an EKS add-on. For Helm installation instructions, see charts/eks-node-monitoring-agent/README.md.

Configuring Monitors

By default all monitors are enabled. Individual monitors can be disabled via the Helm chart's nodeAgent.monitors configuration or by providing a config file at /etc/nma/config.yaml.

Helm Values

nodeAgent:
  monitors:
    networking:
      enabled: false
    neuron:
      enabled: false

Config File Format

The agent reads a YAML config file mounted at /etc/nma/config.yaml. Omitted monitors default to enabled.

monitors:
  kernel-monitor:
    enabled: true
  networking:
    enabled: false
  storage-monitor:
    enabled: true
  nvidia:
    enabled: true
  neuron:
    enabled: false
  runtime:
    enabled: true

Valid monitor names: kernel-monitor, networking, storage-monitor, nvidia, neuron, runtime.

When a monitor is disabled:

  • Its health checks are not executed.
  • The corresponding NodeCondition (e.g., NetworkingReady) is not set on the node, avoiding false-positive healthy status for unmonitored subsystems.

Building

# Build the binary
make build

# Run tests
make test

# Build container image
make docker-build

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines on:

  • Reporting bugs and feature requests
  • Submitting pull requests
  • Code of conduct
  • Security issue notifications

Security

If you discover a potential security issue, please report it via the AWS vulnerability reporting page. Do not create a public GitHub issue for security vulnerabilities.

See CONTRIBUTING.md for more information.

License

This project is licensed under the Apache-2.0 License. See LICENSE for the full license text.
