Introduce replica_count support for training and deprecate node_count (#383) by FarhanTejani · Pull Request #401 · aws/sagemaker-hyperpod-cli

FarhanTejani · 2026-03-31T03:22:16Z

What's changing and why?

Closes #383

node_count is overloaded: it sets both the number of job replicas and triggers full-node resource allocation. This means users who want multiple replicas with granular resource requests (e.g. 4 replicas each with 2 GPUs) have no way to express that.

This PR introduces replica_count to decouple replica specification from resource allocation, and deprecates node_count. The naming aligns with the inference CLI, which already uses replica_count / max_replica_count / min_replica_count.

New parameters

replica_count - number of job replicas (pods). Can be combined with resource fields.
max_replica_count - maximum replicas for elastic training.

Deprecated parameters (kept as aliases)

node_count - alias for replica_count that also triggers full-node resource allocation from instance constants (preserves existing behavior).
max_node_count - alias for max_replica_count.

Deprecated fields are marked with "deprecated": true in the schema and [DEPRECATED] in descriptions. hyp init hides deprecated fields from generated config templates. CLI --help shows the deprecation tag.

Additional cleanup

Moved the node_count + resources mutual exclusivity check from _is_valid() to the model validator, with an actionable error message guiding users to replica_count.
Standardized error messages: field names in snake_case, user values quoted, specific capacity numbers included.

Before/After UX

Before (impossible):

node_count: 4
accelerators: 2
instance_type: ml.p4d.24xlarge
# Error: Either node-count OR a combination of accelerators, vcpu, memory-in-gib must be specified

After:

replica_count: 4
accelerators: 2
instance_type: ml.p4d.24xlarge
# Works: 4 replicas, each with 2 GPUs

Existing behavior preserved:

node_count: 4
instance_type: ml.p4d.24xlarge
# Still works: 4 replicas, each with all resources of a p4d

Improved error message:

node_count: 4
accelerators: 2
instance_type: ml.p4d.24xlarge
# Error: node_count cannot be combined with resource fields (accelerators, vcpu, memory).
#        Use replica_count instead to specify replicas with granular resources.

How was this change tested?

All unit tests pass
Verified hyp init hides deprecated fields
Verified --replica-count and --max-replica-count CLI flags are auto-generated

Are unit tests added?

Yes. TestReplicaCount class with 10 tests:

replica_count + resources (full pipeline)
replica_count without resources (auto-calculates)
node_count + resources rejected
node_count + limit fields rejected
replica_count + node_count mutually exclusive
max_replica_count + max_node_count mutually exclusive
node_count backward compatibility
max_replica_count in elastic policy
replica_count + max_node_count (mixed old/new)
replica_count template rendering

Are integration tests added?

Updated existing integ test for new error message format.

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

All automated PR checks pass
Failed tests include local run results/screenshots proving they work
Changes are documentation-only

…aws#383)

Introduce replica_count support for training and deprecate node_count (…

45fa1e6

…aws#383)

FarhanTejani temporarily deployed to manual-approval March 31, 2026 03:22 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce replica_count support for training and deprecate node_count (#383)#401

Introduce replica_count support for training and deprecate node_count (#383)#401
FarhanTejani wants to merge 1 commit intoaws:mainfrom
FarhanTejani:feat/replica-count

FarhanTejani commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

FarhanTejani commented Mar 31, 2026

What's changing and why?

New parameters

Deprecated parameters (kept as aliases)

Additional cleanup

Before/After UX

How was this change tested?

Are unit tests added?

Are integration tests added?

Reviewer Guidelines

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant