
[Fix]: Handle div_scale when using Gemini plugin with CPUAdam #6408

Open
Truong5724 wants to merge 1 commit into hpcaitech:main from Truong5724:Truong_colossalai

Conversation

@Truong5724

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs
  • I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

Fixed #6386

📝 What does this PR do?

Problem

  • I ran colossalai run --nproc_per_node 4 main.py on Google Colab using the Gemini plugin with CPUAdam (the relevant files and their contents are provided in the issue). The training runs successfully for the first few steps but later fails with an AssertionError.

The execution log is shown below:

[Epoch 0] step 0, loss = 2.2969
[Epoch 0] step 1, loss = 2.2988
[rank0]: Traceback (most recent call last):
[rank0]:   File "/content/ColossalAI/main.py", line 86, in <module>
[rank0]:     main()
[rank0]:   File "/content/ColossalAI/main.py", line 76, in main
[rank0]:     optimizer.step()
[rank0]:   File "/content/ColossalAI/colossalai/zero/gemini/gemini_optimizer.py", line 288, in step
[rank0]:     ret = self.optim.step(div_scale=combined_scale, *args, **kwargs)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/optim/optimizer.py", line 487, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/ColossalAI/colossalai/nn/optimizer/cpu_adam.py", line 193, in step
[rank0]:     assert div_scale == -1, "div_scale should remain default"
[rank0]:            ^^^^^^^^^^^^^^^
[rank0]: AssertionError: div_scale should remain default
  • I also tested with HybridAdam and FusedAdam, and both worked correctly. This suggests that the issue is specific to the CPUAdam implementation.

  • This assertion exists only in cpu_adam.py, which explains why only CPUAdam fails while the other optimizers work.

  • This check enforces that div_scale must remain at its default value (-1), which is too restrictive in practice.

  • In PyTorch, gradient scaling is not automatically handled at the optimizer level, and scaling factors such as div_scale may be controlled externally (e.g., by mixed precision or distributed training strategies like Gemini).

  • As a result, enforcing div_scale == -1 prevents valid use cases where a non-default scaling factor is passed, leading to unnecessary failures.
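To make the mechanism above concrete, here is a minimal, self-contained sketch (plain Python, not ColossalAI code; the helper names `backward_with_loss_scaling` and `unscale_grads` are hypothetical) of why an optimizer may receive a non-default div_scale: with mixed-precision loss scaling, gradients are computed on `loss * scale`, so the optimizer (or a wrapper such as Gemini's) must divide that scale back out.

```python
# Hypothetical standalone sketch of loss scaling and gradient unscaling.
# Not ColossalAI code; it only illustrates where div_scale comes from.

def backward_with_loss_scaling(raw_grads, loss_scale):
    # A real framework computes d(loss * scale)/dw during backward;
    # here we emulate the resulting scaled gradients by multiplying.
    return [g * loss_scale for g in raw_grads]

def unscale_grads(scaled_grads, div_scale):
    # div_scale == -1 conventionally means "no scaling was applied".
    if div_scale == -1:
        return list(scaled_grads)
    return [g / div_scale for g in scaled_grads]

raw = [0.1, -0.25, 0.5]
scaled = backward_with_loss_scaling(raw, loss_scale=1024.0)
recovered = unscale_grads(scaled, div_scale=1024.0)
print(recovered)  # recovers the original, unscaled gradients
```

An optimizer that hard-asserts `div_scale == -1` rejects exactly this unscaling step, which is what the traceback above shows.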


Solution

  • Removed the restrictive assertion in cpu_adam.py:
assert div_scale == -1, "div_scale should remain default"
  • Applied gradient scaling:
# scale gradient if div_scale is provided
grad = p.grad.data

if div_scale != -1:
    grad = grad / div_scale

# adam on cuda
self.torch_adam_update(
    p.data,
    grad,
    state["exp_avg"],
    state["exp_avg_sq"],
    group["lr"],
    beta1,
    beta2,
    group["eps"],
    group["weight_decay"],
    bias_correction1,
    bias_correction2,
    self.adamw_mode,
)
  • This change allows div_scale to be passed without triggering errors, making CPUAdam compatible with external gradient scaling mechanisms.

  • I have verified that the training now runs successfully after applying this fix.
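As a sanity check on the fix's logic, the following self-contained sketch (plain Python with a hypothetical single-parameter `adam_step`, not the actual CPUAdam kernel) verifies that dividing the gradient by div_scale before the Adam update is equivalent to running the update on an already-unscaled gradient, mirroring the `if div_scale != -1: grad = grad / div_scale` branch added above.

```python
# Hypothetical scalar Adam step used only to check the unscaling branch.
import math

def adam_step(p, grad, exp_avg, exp_avg_sq, lr, beta1, beta2, eps,
              step, div_scale=-1.0):
    if div_scale != -1.0:
        grad = grad / div_scale  # the unscaling branch added by this PR
    exp_avg = beta1 * exp_avg + (1 - beta1) * grad
    exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad * grad
    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step
    denom = math.sqrt(exp_avg_sq / bias_correction2) + eps
    p = p - lr * (exp_avg / bias_correction1) / denom
    return p, exp_avg, exp_avg_sq

# A gradient scaled by 1024 plus div_scale=1024 must match
# an update on the plain, unscaled gradient.
scaled = adam_step(1.0, 2048.0, 0.0, 0.0, 1e-3, 0.9, 0.999, 1e-8,
                   step=1, div_scale=1024.0)
plain = adam_step(1.0, 2.0, 0.0, 0.0, 1e-3, 0.9, 0.999, 1e-8, step=1)
print(scaled[0], plain[0])  # identical updated parameter values
```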

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

@Truong5724 Truong5724 requested a review from a team as a code owner April 2, 2026 10:23
@Truong5724 Truong5724 changed the title from "[Fix]: Handle div_scale when using Gemini plugin with CPUAdam #6386" to "[Fix]: Handle div_scale when using Gemini plugin with CPUAdam" Apr 2, 2026


Development

Successfully merging this pull request may close these issues.

[BUG]: AssertionError: div_scale should remain default when using Gemini plugin with CPUAdam
