Bump vllm from 0.10.1.1 to 0.23.0 in /sdks/python/container/ml/py310 #39002

Open
dependabot[bot] wants to merge 1 commit into master from dependabot/pip/sdks/python/container/ml/py310/vllm-0.23.0

dependabot[bot] commented on behalf of github on Jun 17, 2026

Bumps vllm from 0.10.1.1 to 0.23.0.
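
For reviewers who want to confirm which vllm version an image actually resolves to after this bump, here is a minimal sketch. It assumes purely numeric dotted version strings like the two in this PR; `parse_version`, `is_at_least`, and `installed_vllm_version` are hypothetical helper names, not part of vllm or of this repository. Only the standard-library `importlib.metadata` is used.

```python
from importlib import metadata

TARGET = "0.23.0"  # version this PR bumps to


def parse_version(text):
    """Turn a numeric dotted string such as '0.10.1.1' into a tuple of ints.

    Assumption: only plain numeric segments are handled; a suffix such as
    'rc1' or '.post1' raises ValueError.
    """
    return tuple(int(part) for part in text.split("."))


def is_at_least(installed, required):
    """Compare versions numerically, segment by segment."""
    return parse_version(installed) >= parse_version(required)


def installed_vllm_version():
    """Return the installed vllm version string, or None if it is absent."""
    try:
        return metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None


current = installed_vllm_version()
if current is None:
    print("vllm is not installed in this environment")
else:
    try:
        print(current, "meets target:", is_at_least(current, TARGET))
    except ValueError:
        print(current, "has a non-numeric segment; compare it manually")
```

Comparing tuples of integers rather than raw strings matters for versions such as 0.9.0 versus 0.10.1.1, where a plain string comparison would order them incorrectly.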

Release notes

Sourced from vllm's releases.

v0.23.0

vLLM v0.23.0 Release Notes

Please note that Minimax M3 is not yet supported in this version. Please follow vLLM recipe for usage guides for M3.

Highlights

This release features 408 commits from 200 contributors (63 new)!

  • DeepSeek-V4 matures across backends: Following its introduction in v0.22.0, DeepSeek-V4 received another large hardening and optimization pass. Its sparse MLA metadata is now decoupled from DeepSeek-V3.2 (#44699), it gained a TRTLLM-gen attention kernel (#43827), EPLB support for the Mega-MoE (#43339), selective prefix-cache retention for sliding-window KV cache (#43447), and an index-share feature for DSA MTP (#44420). The model was also detached from torch.compile (#43746, #43891), its attention and RoPE paths were refactored (#44569, #44262, #43926), and an XPU attention decode path was added (#42953).
  • Model Runner V2 expands to more dense models: MRv2 is now selected by default for Llama and Mistral dense models (#43458) in addition to Qwen3. It gained a FlashInfer sampler (#42472), breakable CUDA graphs (#44050), pipeline-parallel bubble elimination (#42187), kernel block-size support for hybrid models (#38831), and Gemma 4 MTP (#43241).
  • Rust frontend grows up: The experimental Rust frontend added a streaming generate endpoint (#43779), dynamic LoRA endpoints (#43778), /version (#43854) and /server_info (#43942) endpoints, a server-router extension hook (#43774), request-ID headers (#43883), and many new tool parsers (InternLM2 #43481, hy_v3 #43872, Phi-4-mini #44213, Gemma4 #43850).
  • Gemma 4: Added encoder-free Gemma 4 Unified support (#44429) and Gemma 4 MTP (#43241), plus numerous accuracy and startup fixes.
  • Transformers v5 compatibility: vLLM now targets Transformers v5, with vendored MiniCPM-V/O processors (#44282) and compatibility fixes for Sarvam (#38804) and Voxtral (#44559).
  • Multi-tier KV cache offloading: The offloading framework gained an object-store secondary tier (#41968), HMA enabled by default for capable connectors (#41847), tiering support for HMA models (#44287), and a per-request offloading policy via the on_new_request lifecycle hook (#43205).
  • Unified parser: Reasoning and tool-call parsing are now unified behind a single Parser.parse() interface (#44267), with the Responses parser migrated to it (#42977).

Model Support

  • New models: Step-3.7-Flash (#43859), Cosmos3 Reasoner (#43356), Gemma 4 Unified encoder-free (#44429), JetBrains Mellum v2 (#43992), Granite Speech Plus (#43519), Cohere Mini Code (#44707).
  • Gemma 4: Encoder-free Unified support (#44429), MTP (#43241), native ViT linear layers (#43798), vision-embedder excluded from quantization (#44571), and fixes for MTP under TP>1 (#43909), block-table mismatch under concurrency (#43982), transformers-processor startup crash (#44232), and CPU init (#44615).
  • Transformers v5: Vendor MiniCPM-V/O processors (#44282), Sarvam compat (#38804), Voxtral fetch_audio for transformers≥5.10 (#44559).
  • Model fixes & enhancements: Qwen3-VL/Qwen3-omni-thinker deepstack accuracy under torch.compile (#43617), EVS for Qwen3-VL (#44205), GLM-5.1 PP loading (#42944), GLM-4.1V processor logits (#43575), GLM-4.6V video loader (#44417), OlmoHybrid init (#43846), HyperCLOVAX remote-code removal (#43860), Bailing-MoE rotary factor (#43770), Step3 PP residual KeyError (#37622), MiniCPM-V-4.6 video (#44509), MiniCPM-O audio unpadding (#38053), MiniCPM-V batched preprocessing (#44609), FunASR-Nano init (#44215), Cohere routing method (#44021), Kimi-K2.5 FlashInfer ViT metadata (#44493).
  • Multimodal: Auto-select registered video loader for VLMs (#44126), O(log n) multimodal item handling per step (#44212), local image encoding in benchmarks (#43843), interleaved custom image benchmark datasets (#43636).
  • Pooling/Classification: Proper exceptions for pooling UX (#44593), extra_repr() for pooler classes (#44805), LoRA-adapter-name pooling fix (#44410), resettled generative scoring entrypoint (#44153), expanded pooler unit tests (#43818, #44471).
  • Refactor: AutoWeightsLoader for InternLM2 (#38278).

Engine Core

  • Model Runner V2: Default for Llama and Mistral dense models (#43458), FlashInfer sampler (#42472), breakable CUDA graphs (#44050), removed Eagle's dedicated CUDA graph pool (#44078), pipeline-parallel bubble elimination (#42187), kernel block size for hybrid models (#38831), zeroing of freshly allocated KV blocks for hybrid + FP8 KV cache (#43990), actual batch max_seq_len for attention metadata (#43991), rejection-sampling acceptance-rate fix (#40651), KVConnector + PP cleanup (#43732), speculator-prefill warmup/capture (#44253).
  • Speculative decoding (DFlash): Causal DFlash (#43445), proper lookahead-slot allocation (#43733), prefix-cache corruption fix (#42971); independent drafter attention-backend selection (#39930), attention-group split by num_heads_q for drafts (#43543), EAGLE/MTP lookahead caching in the SWA prefix-cache mask (#44082).
  • Attention & hybrid/Mamba: FlexAttention/FlashAttention num-blocks-first layouts (#42095), OOT MLA prefill backend registration (#43325), FlashAttention upstream sync (#44065), Mamba LINEAR attention-module refactor (#43556), corrupted MLA + linear attention fix (#43961), KDA conv-state unification (#44539) and gate/cumsum fusion (#43667), Mamba SSD do_not_specialize (#43803), Qwen3.5 mixed prefill+decode split routing (#44700), MiniMax-M2 gate kernel (#38445).
  • KV cache & scheduler: Pluggable KVCacheSpec (#37505), scheduler_block_size threaded into KVCacheManager/Coordinator (#44165), max_concurrent_batches moved to VllmConfig (#44274), config validation rejecting 0/negative knobs (#43794, #44057, #44207), KV-cache scale boilerplate removed from weight loading (#43167).
  • Core: Freeze the garbage collector in workers after model init (#44363), sparse NCCL weight transfer for in-place updates (#40096), graceful spinloop ext-load failure handling (#43659), scheduled-function deprecations (#43358).
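
The Core item on freezing the garbage collector after model init (#44363) can be illustrated with Python's standard `gc` module. This is only a sketch of the general technique, not vLLM's implementation: `load_model_stub` and `freeze_after_init` are hypothetical names, and the stub stands in for real model initialisation.

```python
import gc


def load_model_stub():
    """Hypothetical stand-in for model init: builds many long-lived objects."""
    return [{"layer": i, "weights": [0.0] * 8} for i in range(1000)]


def freeze_after_init(init_fn):
    """Run init_fn, then move all surviving objects out of the GC's view."""
    state = init_fn()
    gc.collect()  # drop garbage first so it is not frozen along with live data
    gc.freeze()   # move every tracked object into the permanent generation
    return state


model = freeze_after_init(load_model_stub)
frozen = gc.get_freeze_count()  # number of objects in the permanent generation
print("objects frozen:", frozen)

gc.unfreeze()  # undo the freeze so this demo leaves the interpreter unchanged
```

Objects in the permanent generation are skipped by later collections, so long-lived state created during initialisation no longer adds to the cost of each GC pass.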

Large Scale Serving & Distributed

  • KV cache offloading: Object-store secondary tier (#41968), HMA on by default for capable connectors (#41847) and tiering (#44287), per-request offloading policy (on_new_request) (#43205) and on_schedule_end() hook (#44206), token-offset selective offload (#39983), skip decode-phase blocks in CPU offload (#43797), page-size block alignment (#43689), Triton fast-path for small CPU→GPU swap_blocks_batch (#42212), stale sliding-window block fix (#42959).
  • KV connectors / disaggregated serving: PP-aware handshake aggregation and intermediate-PP output plumbing (#43720), multiple-async-KV-load deadlock fix (#44560), Nixl Mamba prefix-caching mode (#42554), NixlConnector kv_both role deprecation cycle (#43874), Mooncake fixes (#43742, #44103, #42694), LMCache LMCacheMPConnector (#42865), EC connector shutdown API (#42423) and non-blocking lookup (#41627), KV-transfer tokens excluded from iteration_tokens_total (#43346).
  • EPLB: Async EPLB by default (#43219), EPLB for DeepSeek-V4 Mega-MoE (#43339), Nixl zero-copy EPLB transfers (#41633).
  • Data parallel: DP Ray placement groups on specific nodes (#44669) and grouped-node allocation fix (#43998), SSL for the DP supervisor (#43688), DP-coordinator startup timeout raised to 120s (#42343), per-GPU-worker RDMA NIC selection (#42083).

Hardware & Performance

  • NVIDIA / kernels: FP8 FlashInfer attention for ViT (#38065), Triton MoE backend on Hopper by default (#44220), CUTLASS FP8 scaled-mm padding bypass (+20%) (#43706), MoE-permute buffer pre-allocation (+9–14%) (#43014), Fp8BlockScaledMM new_empty() optimization (#43677), TurboQuant shared dequant buffers (#40941), tuned selective_state_update for H200/RTX PRO (#44251), Inductor fast-path fallback for vLLM/AITER custom ops (#42129), Gemma RMS all-reduce fusion (#42646), NUMA auto-binding on DGX B300 (#43270).
  • AMD ROCm: ROCm 7.2.3 (#43136), AITER v0.1.13.post1 (#44265), native W4A16 (#41394) and fused-MoE W4A16 HIP (#44075) kernels for RDNA3 (gfx1100), AITER top-k/top-p sampler by default (#43331), attention-sink support in AITER FA (#43817), AITER hipBLASLt GEMM online tuning (#40426), permute_cols for ROCm (#44674), blocks-first KV layout for AMD (#43660), N=5 wvSplitK for spec decode (#40687), MoRI connector improvements (#43303, #41751, #40344).
  • Intel XPU: vllm-xpu-kernel v0.1.7 (#41019), block_fp8_moe (#42139), block-scaled W8A8 FP8 path (#39968), WNA16 oracle for GPTQ sym-int4 (#41426), rms_norm/act quant fusions (#43963), GDN-attention MTP (#43565), Triton selective-scan op (#43421), transparent sleep mode (#37149), CPU/tiering offloading on XPU (#36423), DeepSeek-V4 attention decode path (#42953).
  • CPU & other architectures: zentorch-accelerated W8A8/W4A16 on AMD Zen CPUs (#41813), CPU top-k/top-p Triton sampling (#43633), non-divisible GQA decode in mixed batches (#43032), cpu_awq folded into awq_marlin (#43841), RISC-V RVV WNA16 helpers (#42730), fused GDN gated-delta-rule kernels (#43534), PowerPC SHM communicator (#43754), arm64 CI image (#41303).
  • TPU: tpu-inference upgraded to v0.20.0 (#43394) then v0.21.0 (#44621).
  • torch stable ABI: Continued migration of kernels to the libtorch stable ABI — merge_attn_states/mamba/sampler [8/n] (#43361), attention/cache kernels [9/n] (#43717), header files (#44013), cuda_view/silu_and_mul [10/n] (#44334), custom all-reduce/DeepSeek-V4 fused MLA/MXFP8 MoE [10b/n] (#44365); ROCm fallback to regular ABI (#44648), _has_module trial-import verification (#44035).

Quantization

  • ModelOpt: LM-head quantization (#42124), MXFP8 non-gated MoE (#42958).
  • compressed-tensors: WNA8O8Int linears and WNInt embeddings (#44340), asymmetric MoE WNA16 Marlin (#44025), single-class NVFP4 linear refactor (#42443).

... (truncated)

Commits
  • 0fc695f [Bugfix][Frontend] Cap fastapi < 0.137 to avoid prometheus-fastapi-instrument...
  • 91df0fa [Bugfix][CPU] Don't build triton-cpu on arm64 release image (#45401)
  • 78743ab [Docker] Fix CUTLASS DSL cu13 install order in Dockerfile (#45204)
  • b2d7294 [ROCm][Bugfix] Make intermediate_pad TP-aware in rocm_aiter_fused_experts (#4...
  • 741ba42 [Bugfix] [DSV4] [ROCm] Pin apache-tvm-ffi version to 0.1.10 (#45169)
  • ac94893 [ROCm][MLA][Bugfix] Reserve FP8 prefill workspace before lock for Kimi-K2.5 (...
  • 967c5c3 [ROCm][CI] Stage C mirrors (#42793)
  • 54c660c [XPU][Minor] format moe kernel name and add in kernel list (#44771)
  • 8fb0274 [MM][CG] Simplify ViT CUDA graph interfaces (#44484)
  • eebce65 [XPU]feat: add DeepSeek-V4 XPU attention decode path (#42953)
  • Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

You can disable automated security fix PRs for this repo from the Security Alerts page.

Bumps [vllm](https://github.com/vllm-project/vllm) from 0.10.1.1 to 0.23.0.
- [Release notes](https://github.com/vllm-project/vllm/releases)
- [Changelog](https://github.com/vllm-project/vllm/blob/main/RELEASE.md)
- [Commits](vllm-project/vllm@v0.10.1.1...v0.23.0)

---
updated-dependencies:
- dependency-name: vllm
  dependency-version: 0.23.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
dependabot[bot] added the dependencies and python labels on Jun 17, 2026

Labels

dependencies (Pull requests that update a dependency file), docker, python
