The EKS Node Monitoring Agent detects health issues on Amazon EKS worker nodes by parsing system logs and surfacing status information through Kubernetes NodeConditions. When paired with Amazon EKS node auto repair, detected issues can trigger automatic node replacement or reboot.
For detailed configuration options and usage documentation, refer to the Amazon EKS Node Health documentation.
The agent runs as a DaemonSet on each node and monitors for issues across several categories:
- Kernel - Process limits, kernel bugs, soft lockups
- Networking - VPC CNI (IPAMD) issues, interface problems, connectivity
- Storage - EBS throughput/IOPS limits, I/O delays
- Container Runtime - Pod termination issues, probe failures
- Accelerated Hardware - NVIDIA GPU errors (XID codes), AWS Neuron issues, DCGM diagnostics
For each category, the agent applies a dedicated NodeCondition to worker nodes (e.g., KernelReady, NetworkingReady, StorageReady, AcceleratedHardwareReady). These conditions integrate with Amazon EKS node auto repair to automatically remediate unhealthy nodes.
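The conditions described above can be inspected from the command line. The following is a minimal sketch, not part of the agent itself: the function names are hypothetical, and it assumes that for conditions named `...Ready` a status of `True` means healthy, so anything else is worth a look.

```shell
#!/bin/sh
# Print "type<TAB>status" for every condition on a node.
# Requires kubectl access to the cluster; $1 is the node name.
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'
}

# Read "type<TAB>status" lines on stdin and keep only the conditions whose
# type ends in "Ready" and whose status is not "True".
# ASSUMPTION: for "...Ready" conditions, "True" means healthy.
not_ready_conditions() {
  awk -F '\t' '$1 ~ /Ready$/ && $2 != "True" { print $1 "\t" $2 }'
}

# Usage (the node name is a placeholder):
#   node_conditions ip-10-0-0-1.ec2.internal | not_ready_conditions
```

Filtering on the `Ready` suffix skips built-in conditions such as MemoryPressure, where `False` is the healthy state.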
.
├── api/ # API definitions and CRDs
├── charts/ # Helm chart for deployment
├── cmd/ # Application entry point
├── examples/ # Integration examples
├── hack/ # Build and utility scripts
├── monitors/ # Health monitoring plugins
├── pkg/ # Core packages
└── test/ # Integration tests
We recommend installing the EKS Node Monitoring Agent as an EKS add-on. For Helm installation instructions, see charts/eks-node-monitoring-agent/README.md.
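As a rough sketch of the two installation paths, the helpers below wrap the relevant `aws` and `helm` commands. The function names, the `DRY_RUN` switch, the Helm release name, and the `kube-system` namespace are illustrative assumptions, as is the add-on name; check the EKS documentation and the chart README for the authoritative values.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN is set (useful for review).
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Install as an EKS add-on. $1 is the cluster name.
# ASSUMPTION: the add-on is published under the name used below.
install_as_addon() {
  run aws eks create-addon \
    --cluster-name "$1" \
    --addon-name eks-node-monitoring-agent
}

# Install from the chart in this repository.
# ASSUMPTION: the release name and namespace are illustrative choices.
install_with_helm() {
  run helm upgrade --install eks-node-monitoring-agent \
    ./charts/eks-node-monitoring-agent \
    --namespace kube-system
}

# Usage: DRY_RUN=1 install_as_addon my-cluster
```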
By default all monitors are enabled. Individual monitors can be disabled via the Helm chart's nodeAgent.monitors configuration or by providing a config file at /etc/nma/config.yaml.
nodeAgent:
  monitors:
    networking:
      enabled: false
    neuron:
      enabled: false

The agent reads a YAML config file mounted at /etc/nma/config.yaml. Omitted monitors default to enabled.
monitors:
  kernel-monitor:
    enabled: true
  networking:
    enabled: false
  storage-monitor:
    enabled: true
  nvidia:
    enabled: true
  neuron:
    enabled: false
  runtime:
    enabled: true

Valid plugin names: kernel-monitor, networking, storage-monitor, nvidia, neuron, runtime.
When a monitor is disabled:
- Its health checks are not executed.
- The corresponding NodeCondition (e.g., NetworkingReady) is not set on the node, avoiding false-positive healthy status for unmonitored subsystems.
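Because omitted monitors default to enabled, a config file only needs to name the monitors being turned off. The helper below is a minimal sketch of generating such a file; the function name is hypothetical, and how the result gets mounted at /etc/nma/config.yaml is outside its scope.

```shell
#!/bin/sh
# Plugin names accepted in the config file, as listed in this document.
VALID_MONITORS="kernel-monitor networking storage-monitor nvidia neuron runtime"

# Print a config document that disables the monitors given as arguments.
# Monitors that are not named are left out, so they stay enabled by default.
# Returns non-zero and prints nothing to stdout on an unknown name.
disable_monitors_config() {
  for name in "$@"; do
    case " $VALID_MONITORS " in
      *" $name "*) ;;
      *) echo "unknown monitor: $name" >&2; return 1 ;;
    esac
  done
  if [ "$#" -eq 0 ]; then
    # Nothing to disable: emit an empty mapping.
    echo "monitors: {}"
    return 0
  fi
  echo "monitors:"
  for name in "$@"; do
    printf '  %s:\n    enabled: false\n' "$name"
  done
}

# Usage: disable_monitors_config networking neuron > config.yaml
```

Validating names before writing anything means a typo fails loudly instead of leaving a monitor silently enabled.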
# Build the binary
make build
# Run tests
make test
# Build container image
make docker-build

We welcome contributions! Please see CONTRIBUTING.md for guidelines on:
- Reporting bugs and feature requests
- Submitting pull requests
- Code of conduct
- Security issue notifications
If you discover a potential security issue, please report it via the AWS vulnerability reporting page. Do not create a public GitHub issue for security vulnerabilities.
See CONTRIBUTING.md for more information.
This project is licensed under the Apache-2.0 License. See LICENSE for the full license text.