Skip to content

Introduce replica_count support for training and deprecate node_count (#383)#401

Draft
FarhanTejani wants to merge 1 commit intoaws:mainfrom
FarhanTejani:feat/replica-count
Draft

Introduce replica_count support for training and deprecate node_count (#383)#401
FarhanTejani wants to merge 1 commit intoaws:mainfrom
FarhanTejani:feat/replica-count

Conversation

@FarhanTejani
Copy link
Copy Markdown
Member

What's changing and why?

Closes #383

node_count is overloaded: it sets both the number of job replicas and triggers full-node resource allocation. This means users who want multiple replicas with granular resource requests (e.g. 4 replicas each with 2 GPUs) have no way to express that.

This PR introduces replica_count to decouple replica specification from resource allocation, and deprecates node_count. The naming aligns with the inference CLI, which already uses replica_count / max_replica_count / min_replica_count.

New parameters

  • replica_count - number of job replicas (pods). Can be combined with resource fields.
  • max_replica_count - maximum replicas for elastic training.

Deprecated parameters (kept as aliases)

  • node_count - alias for replica_count that also triggers full-node resource allocation from instance constants (preserves existing behavior).
  • max_node_count - alias for max_replica_count.

Deprecated fields are marked with "deprecated": true in the schema and [DEPRECATED] in descriptions. hyp init hides deprecated fields from generated config templates. CLI --help shows the deprecation tag.

Additional cleanup

  • Moved the node_count + resources mutual exclusivity check from _is_valid() to the model validator, with an actionable error message guiding users to replica_count.
  • Standardized error messages: field names in snake_case, user values quoted, specific capacity numbers included.

Before/After UX

Before (impossible):

node_count: 4
accelerators: 2
instance_type: ml.p4d.24xlarge
# Error: Either node-count OR a combination of accelerators, vcpu, memory-in-gib must be specified

After:

replica_count: 4
accelerators: 2
instance_type: ml.p4d.24xlarge
# Works: 4 replicas, each with 2 GPUs

Existing behavior preserved:

node_count: 4
instance_type: ml.p4d.24xlarge
# Still works: 4 replicas, each with all resources of a p4d

Improved error message:

node_count: 4
accelerators: 2
instance_type: ml.p4d.24xlarge
# Error: node_count cannot be combined with resource fields (accelerators, vcpu, memory).
#        Use replica_count instead to specify replicas with granular resources.

How was this change tested?

  • All unit tests pass
  • Verified hyp init hides deprecated fields
  • Verified --replica-count and --max-replica-count CLI flags are auto-generated

Are unit tests added?

Yes. TestReplicaCount class with 10 tests:

  • replica_count + resources (full pipeline)
  • replica_count without resources (auto-calculates)
  • node_count + resources rejected
  • node_count + limit fields rejected
  • replica_count + node_count mutually exclusive
  • max_replica_count + max_node_count mutually exclusive
  • node_count backward compatibility
  • max_replica_count in elastic policy
  • replica_count + max_node_count (mixed old/new)
  • replica_count template rendering

Are integration tests added?

Updated existing integ test for new error message format.

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

  • All automated PR checks pass
  • Failed tests include local run results/screenshots proving they work
  • Changes are documentation-only

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] node_count and explicit resource fields (accelerators, vcpu, memory) are incorrectly mutually exclusive

1 participant