[Fix]: Handle div_scale when using Gemini plugin with CPUAdam by Truong5724 · Pull Request #6408 · hpcaitech/ColossalAI

Truong5724 · 2026-04-02T10:23:52Z

📌 Checklist before creating the PR

I have created an issue for this PR for traceability
The title follows the standard format: [doc/gemini/tensor/...]: A concise description
I have added relevant tags if possible for us to better distinguish different PRs
I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

Fixed #6386

📝 What does this PR do?

Problem

I ran colossalai run --nproc_per_node 4 main.py on Google Colab using the Gemini plugin with CPUAdam (the relevant files and their contents are provided in the issue). Training ran successfully for the first two logged steps, then failed with an AssertionError.

The execution log is shown below:

[Epoch 0] step 0, loss = 2.2969
[Epoch 0] step 1, loss = 2.2988
[rank0]: Traceback (most recent call last):
[rank0]:   File "/content/ColossalAI/main.py", line 86, in <module>
[rank0]:     main()
[rank0]:   File "/content/ColossalAI/main.py", line 76, in main
[rank0]:     optimizer.step()
[rank0]:   File "/content/ColossalAI/colossalai/zero/gemini/gemini_optimizer.py", line 288, in step
[rank0]:     ret = self.optim.step(div_scale=combined_scale, *args, **kwargs)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/optim/optimizer.py", line 487, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/ColossalAI/colossalai/nn/optimizer/cpu_adam.py", line 193, in step
[rank0]:     assert div_scale == -1, "div_scale should remain default"
[rank0]:            ^^^^^^^^^^^^^^^
[rank0]: AssertionError: div_scale should remain default

I also tested HybridAdam and FusedAdam, and both worked correctly, which suggests that the issue is specific to the CPUAdam implementation.
In fact, the failing assertion exists only in cpu_adam.py.
The check requires div_scale to stay at its default value (-1), which is too restrictive in practice.
PyTorch optimizers do not unscale gradients on their own, so a factor such as div_scale is supplied by the caller, for example by mixed-precision logic or by a distributed training strategy such as Gemini. The traceback shows exactly this: gemini_optimizer.py calls self.optim.step(div_scale=combined_scale, ...).
As a result, enforcing div_scale == -1 rejects valid calls that pass a non-default scaling factor and causes unnecessary failures.
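
The failure mode can be reproduced in isolation with a toy sketch. The names `StrictInnerOptimizer`, `ScalingWrapper`, and `try_step` are hypothetical stand-ins, not ColossalAI classes. The only details taken from the traceback are that the wrapper forwards `div_scale=combined_scale` and that the inner `step` asserts `div_scale == -1`.

```python
class StrictInnerOptimizer:
    """Hypothetical stand-in for an optimizer that rejects external scaling."""

    def step(self, div_scale=-1):
        # Same check as the one reported in the traceback.
        assert div_scale == -1, "div_scale should remain default"
        return "stepped"


class ScalingWrapper:
    """Hypothetical wrapper that forwards a scaling factor to the inner optimizer."""

    def __init__(self, optim, combined_scale):
        self.optim = optim
        self.combined_scale = combined_scale  # assumption: chosen by the caller

    def step(self):
        return self.optim.step(div_scale=self.combined_scale)


def try_step(combined_scale):
    """Return the step result, or the assertion message if the check fires."""
    try:
        return ScalingWrapper(StrictInnerOptimizer(), combined_scale).step()
    except AssertionError as err:
        return f"AssertionError: {err}"


default_result = try_step(-1)      # sentinel value: the check passes
scaled_result = try_step(1024.0)   # any other value: the check fires
print(default_result)
print(scaled_result)
```

In this sketch the wrapper works only while the forwarded value equals the sentinel -1. As soon as a real scaling factor is forwarded, the inner check fires.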

Solution

Removed the restrictive assertion from cpu_adam.py:

assert div_scale == -1, "div_scale should remain default"

Applied gradient scaling before the update, dividing the gradient by div_scale whenever it is not the default value:

# scale gradient if div_scale is provided
grad = p.grad.data

if div_scale != -1:
    grad = grad / div_scale

# adam on cuda
self.torch_adam_update(
    p.data,
    grad,
    state["exp_avg"],
    state["exp_avg_sq"],
    group["lr"],
    beta1,
    beta2,
    group["eps"],
    group["weight_decay"],
    bias_correction1,
    bias_correction2,
    self.adamw_mode,
)

This change allows div_scale to be passed without triggering errors, making CPUAdam compatible with external gradient scaling mechanisms.
I have verified that the training now runs successfully after applying this fix.
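
The intended effect of the division can be checked with a scalar sketch that does not depend on PyTorch or ColossalAI. The function `adam_step` and the helper `fresh_state` are hypothetical. They implement a plain Adam update on a single float, omit weight decay, and treat -1 as the "no scaling" sentinel in the same way as the change above. This is an illustration under those assumptions, not the code in cpu_adam.py.

```python
import math


def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, div_scale=-1):
    """One scalar Adam update; div_scale == -1 means 'no extra scaling'."""
    if div_scale != -1:
        grad = grad / div_scale  # undo the external scaling first

    state["step"] += 1
    state["exp_avg"] = beta1 * state["exp_avg"] + (1 - beta1) * grad
    state["exp_avg_sq"] = beta2 * state["exp_avg_sq"] + (1 - beta2) * grad * grad

    bias_correction1 = 1 - beta1 ** state["step"]
    bias_correction2 = 1 - beta2 ** state["step"]
    denom = math.sqrt(state["exp_avg_sq"] / bias_correction2) + eps
    return param - lr * (state["exp_avg"] / bias_correction1) / denom


def fresh_state():
    """Zero-initialised optimizer state for a single parameter."""
    return {"step": 0, "exp_avg": 0.0, "exp_avg_sq": 0.0}


true_grad = 0.25
loss_scale = 1024.0  # assumption: an arbitrary power-of-two scaling factor

# Reference: unscaled gradient, default div_scale.
plain = adam_step(1.0, true_grad, fresh_state())

# Gradient arrives pre-multiplied by loss_scale; div_scale undoes it.
unscaled = adam_step(1.0, true_grad * loss_scale, fresh_state(),
                     div_scale=loss_scale)

print(math.isclose(plain, unscaled))
```

With the division in place, a gradient that arrives multiplied by `loss_scale` produces the same update as the unscaled gradient, which is the behaviour an external scaling mechanism relies on.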

💥 Checklist before requesting a review

I have linked my PR to an issue (instruction)
My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
I have performed a self-review of my code
I have added thorough tests.
I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

🌝 Yes, I do.
🌚 No, I don't.

Commit 969885e (Truong): Handle div_scale in CPUAdam Optimizer

Truong5724 requested a review from a team as a code owner April 2, 2026 10:23

Truong5724 changed the title from "[Fix]: Handle div_scale when using Gemini plugin with CPUAdam #6386" to "[Fix]: Handle div_scale when using Gemini plugin with CPUAdam" on Apr 2, 2026

Truong5724 wants to merge 1 commit into hpcaitech:main from Truong5724:Truong_colossalai
