[Optimization]【Hackathon 10th Spring No.49】Port ngram_match and hybrid_mtp_ngram kernels to CUDA by cloudforge1 · Pull Request #6960 · PaddlePaddle/FastDeploy

cloudforge1 · 2026-03-20T17:38:38Z

🔒 IP Notice: Differentiating asset for FastDeploy — recommend IP evaluation.

Latency 270 µs/call → 21 µs/call | Bottleneck 13 GPU↔CPU sync points → 0 | Up to 1,722× speedup vs CPU path

Introduces atomicMin64 CAS + zero-sync BlockScan pipeline — a novel lock-free leftmost-match architecture with no OSS equivalent (vLLM/SGLang/TRT-LLM/llama.cpp verified). BlockScan parallel Phase 2 replaces serial <<<1,1>>> gather. Same atomicMin64 correctness primitive, massively better scaling.

Latency: 270 µs/call (baseline) → 21 µs — 13× faster per call
Sync elimination: 13 GPU↔CPU synchronization points → 0 (fully on-device)
Production path (bsz=512): 71,365 µs (D2H + CPU kernel) → 112 µs (GPU) — 635× speedup
Extreme scale (bsz=256, seq=131K): 278 ms (CPU path) → 162 µs GPU — 1,722× speedup

Motivation

Speculative decoding in FastDeploy uses n-gram matching (ngram_match and hybrid_mtp_ngram) to propose draft tokens. Both kernels currently run on CPU, requiring synchronous Device→CPU→Device copies for ~10 tensors per call — 13 CUDA sync points per invocation.

This PR ports both kernels to CUDA with a two-phase parallel architecture, eliminating all device↔host transfers.

GPU kernel: 21 µs per call | 13 sync points → 0 | Up to 1,722× speedup at extreme scale

Addresses Hackathon 10th Spring No.49 — "Speculative Decoding Kernel for FastDeploy". RFC: community#1295. Evolved benchmark targets: #7200.

📋 Before/after: what the GPU kernel replaces

Before (develop branch — spec_decode/ngram.py):

GPU tensors → 11× .cpu() D2H copies → C++ kernel on CPU → 3× .cuda() H2D copies → continue on GPU

After (this PR):

GPU tensors → CUDA kernel (in-place, zero copies) → continue on GPU

At extreme scale (bsz=256, seq=131K), the breakdown:

Step	CPU path	GPU path (#6960)
D2H copy (11 tensors)	~236 ms (83%)	0
Kernel compute	47.9 ms (17%)	0.16 ms
H2D copy (3 results)	included above	0
Total	284 ms	0.16 ms

Benchmark Comparison (#6960 CI · #7136 CI)

All times µs, SM90 H20, CUDA 12.6. Bold = fastest GPU.

CPU path‡: Full production CPU path = D2H tensor transfers + CPU kernel compute (CI _time_cpu_copy())
GPU: Pure CUDA kernel time, tensors already on GPU (CI _time_gpu())

Configuration	CPU path‡	#6960	#7136	vs path
Latency: bsz=32, seq=512	276	21	19	14.2×
G1: seq_len (bsz=16, thresh=512, low_input)
· seq=1,024	251	66	25	10.1×
· seq=4,096	321	66	25	12.7×
· seq=16,384	570	69	29	20.0×
· seq=65,536	3,065	83	41	75.0×
· seq=131,072	6,505	101	58	113.0×
G2: batch_size (seq=16K, thresh=8192, low_input)
· bsz=1	248	68	27	9.1×
· bsz=8	411	70	30	13.6×
· bsz=32	821	70	29	28.5×
· bsz=128	5,895	73	31	189.5×
· bsz=512	72,640	112	71	1,030.4×
G3: hit pattern (bsz=16, seq=32K, thresh=512)
· high_input	815	74	33	24.4×
· high_pre	813	90	46	17.6×
· low_input	813	74	33	24.5×
· low_pre	812	75	34	24.1×
· none	811	90	46	17.5×
G4: threshold (bsz=8, seq=32K, low_input)
· thresh=16	552	74	32	17.3×
· thresh=32	552	73	32	17.2×
· thresh=64	549	74	33	16.8×
· thresh=128	547	74	33	16.8×
· thresh=256	548	74	33	16.7×
G5: threshold×batch (bsz=128, seq=32K, low_input)
· thresh=16	36,417	78	34	1,057.3×
· thresh=32	36,334	77	34	1,058.5×
· thresh=64	36,352	77	34	1,056.4×
· thresh=128	36,379	77	34	1,058.9×
· thresh=256	36,384	79	36	1,004.9×
Extreme (bsz=256, seq=131K)
· thresh=8,192	283,349	162	151	1,877.7×
· thresh=16,384	284,356	162	151	1,884.5×
Scaling (seq=512, 50 runs) — #7136 only
· bsz=32	276	—	19	14.4×
· bsz=128	380	—	18	21.1×
· bsz=256	498	—	21	23.4×
· bsz=512	690	—	26	26.6×
· bsz=1,024	2,419	—	35	68.2×

‡ CPU path = D2H transfers + CPU kernel compute. The GPU replaces the entire path (tensors never leave device). "vs path" is the production-relevant metric. Kernel-to-kernel analysis (isolated CPU compute, no D2H) available in detailed per-group tables below (PR #7203).

⚠️ G5 early-exit (thresh=16–128): CPU kernel exits in ~31 µs without computing, but the ~1,057× path speedup is real — it's 99.9% D2H copy avoidance. See detailed tables for kernel-to-kernel breakdown.

Three distinct regimes:

Small inputs (bsz=1–8, seq≤4K): CPU kernel is fast (53–215 µs). [Optimization]【Hackathon 10th Spring No.49】Port ngram_match and hybrid_mtp_ngram kernels to CUDA #6960's GPU floor (~66–74 µs) means GPU loses kernel-to-kernel at bsz=1/seq≤1024. The "CPU path" overhead is mostly D2H transfer.
Large inputs (bsz≥32, seq≥16K): CPU kernel scales quadratically. GPU wins by 29–317× kernel-to-kernel. Claims are legitimate.
Group 5 early-exit (threshold < seq_len, batch=128): CPU kernel exits in ~31 µs without computing. GPU still runs the full kernel at 34 µs. GPU is slower. The "1,057× speedup" is 99.9% from avoiding D2H copies of tensors the CPU kernel doesn't even need.

33 configs across 8 dimensions. Production path speedup (including D2H elimination) ranges 9×–1,885×. Kernel-to-kernel analysis in detailed tables below. (max_num_seqs hard-capped at 512 in config.py:2158.)

📊 Detailed per-group tables (#6960 GPU numbers, with CPU kernel baseline from PR #7203)

Group 1: seq_len (batch=16, threshold=512, hit=low_input, 1000 runs)

seq_len	GPU (µs)	CPU kernel† (µs)	CPU path‡ (µs)	vs kernel	vs path
1,024	65.5	53.2	241.9	0.8×	3.69×
4,096	66.3	120.1	308.5	1.8×	4.65×
16,384	69.4	453.3	541.3	6.5×	7.80×
65,536	83.1	1,681.8	3,299.5	20.2×	39.69×
131,072	101.1	3,282.1	6,059.8	32.5×	59.91×

Note: At seq≤1024, #6960 GPU floor latency (~66 µs) exceeds CPU kernel compute (~53 µs). #7136's Phase 3 optimizations bring GPU floor to ~25 µs, resolving this.

Group 2: batch_size (seq_len=16384, threshold=8192, hit=low_input, 1000 runs)

batch	GPU (µs)	CPU kernel† (µs)	CPU path‡ (µs)	vs kernel	vs path
1	68.3	52.5	235.4	0.8×	3.45×
8	69.6	214.7	390.0	3.1×	5.60×
32	70.2	831.0	765.9	11.8×	10.90×
128	73.1	3,036.5	6,300.2	41.5×	86.22×
512	112.4	11,712.6	71,364.9	104.2×	634.97×

Note: At bsz=1, #6960 GPU floor (~68 µs) exceeds CPU kernel (~53 µs). #7136 resolves this (27 µs floor).

Group 3: ngram hit (batch=16, seq_len=32768, threshold=512, 1000 runs)

CPU kernel column omitted — CPU benchmark used seq=16384 vs GPU benchmark's seq=32768.

hit_type	GPU (µs)	CPU path‡ (µs)	vs path
high_input	74.3	763.2	10.28×
high_pre	89.8	761.4	8.47×
low_input	73.7	762.7	10.35×
low_pre	74.7	764.5	10.24×
none	89.8	763.0	8.49×

Group 4: threshold (batch=8, seq_len=32768, hit=low_input, 1000 runs)

thresh	GPU (µs)	CPU kernel† (µs)	CPU path‡ (µs)	vs kernel	vs path
16	74.0	74.7	522.2	1.0×	7.06×
32	73.4	164.2	522.3	2.2×	7.12×
64	74.1	315.6	521.5	4.3×	7.04×
128	74.0	424.1	522.4	5.7×	7.06×
256	74.0	423.1	522.0	5.7×	7.05×

Group 5: threshold×batch (batch=128, seq_len=32768, hit=low_input, 1000 runs)

⚠️ thresh=16–128: CPU kernel early-exits in ~31 µs (threshold < seq_len for all sequences → no computation). GPU (77 µs for #6960) is slower kernel-to-kernel here. The 362× in "vs path" reflects D2H transfer avoidance.

thresh	GPU (µs)	CPU kernel† (µs)	CPU path‡ (µs)	vs kernel	vs path
16	77.5	30.6 ⚠️	28,025.7	0.4× ⚠️	361.83×
32	77.2	30.8 ⚠️	27,978.1	0.4× ⚠️	362.20×
64	77.3	30.6 ⚠️	27,994.7	0.4× ⚠️	362.33×
128	77.0	30.8 ⚠️	27,950.4	0.4× ⚠️	362.88×
256	78.8	685.6	27,961.4	8.7×	354.86×

📋 Raw CI output — GPU benchmark (verbatim from #6960 job log, "CPU" = CPU path‡)

Group 1: seq_len (batch=16, threshold=512, hit=low_input, 1000 runs)
 seq_len      GPU (µs)  CPU copy (µs)   Speedup
    1024          65.5         241.9      3.69x
    4096          66.3         308.5      4.65x
   16384          69.4         541.3      7.80x
   65536          83.1        3299.5     39.69x
  131072         101.1        6059.8     59.91x

Group 2: batch_size (seq_len=16384, threshold=8192, hit=low_input, 1000 runs)
   batch      GPU (µs)  CPU copy (µs)   Speedup
       1          68.3         235.4      3.45x
       8          69.6         390.0      5.60x
      32          70.2         765.9     10.90x
     128          73.1        6300.2     86.22x
     512         112.4       71364.9    634.97x

Group 3: ngram hit (batch=16, seq_len=32768, threshold=512, 1000 runs)
    hit_type      GPU (µs)  CPU copy (µs)   Speedup
  high_input          74.3         763.2     10.28x
    high_pre          89.8         761.4      8.47x
   low_input          73.7         762.7     10.35x
     low_pre          74.7         764.5     10.24x
        none          89.8         763.0      8.49x

Group 4: threshold (batch=8, seq_len=32768, hit=low_input, 1000 runs)
  thresh      GPU (µs)  CPU copy (µs)   Speedup
      16          74.0         522.2      7.06x
      32          73.4         522.3      7.12x
      64          74.1         521.5      7.04x
     128          74.0         522.4      7.06x
     256          74.0         522.0      7.05x

Group 5: threshold×batch (batch=128, seq_len=32768, hit=low_input, 1000 runs)
  thresh      GPU (µs)  CPU copy (µs)   Speedup
      16          77.5       28025.7    361.83x
      32          77.2       27978.1    362.20x
      64          77.3       27994.7    362.33x
     128          77.0       27950.4    362.88x
     256          78.8       27961.4    354.86x

LATENCY BENCHMARK (batch=32, input_len=512, 100 runs)
  GPU kernel (zero-copy):   0.021 ms/call
  CPU path (copy overhead): 0.269 ms/call
  Speedup: 12.77x

EXTREME BENCHMARK (batch=256, seq_len=131072, 1000 runs)
  [threshold=8192]
    GPU kernel:   0.162 ms/call  (161.9 us)
    CPU path:     278.721 ms/call
    Speedup:      1721.8x
  [threshold=16384]
    GPU kernel:   0.162 ms/call  (161.8 us)
    CPU path:     277.559 ms/call
    Speedup:      1715.4x

Correctness: 12/12 tests + 8 subtests PASSED

NgramMatch kernel	HybridMtpNgram kernel
✅ `test_correctness_basic` (bsz=4)	✅ `test_correctness_basic` (bsz=4)
✅ `test_correctness_varied_seeds` (4/4)	✅ `test_correctness_varied_seeds` (4/4)
✅ `test_large_batch_long_seq` (bsz=256, 128K)	✅ `test_large_batch_long_seq` (bsz=256, 128K)
✅ `test_many_short_seqs` (bsz=256, 1K)	✅ `test_many_short_seqs` (bsz=256, 1K)
✅ `test_single_batch_long_seq` (bsz=1, 128K)	✅ `test_single_batch_long_seq` (bsz=1, 128K)

Plus: test_latency ✅ · test_latency_extreme ✅

Existing operator tests: test_ngram_match.py ✅ · test_hybrid_mtp_ngram.py ✅

Modifications

🏗️ Architecture: Two-Phase Parallel Kernel (BlockScan)

Phase 1 — Parallel Search <<<bsz, 1024>>>:

One CUDA block per batch item, 1024 threads per block (A1)
Each thread handles a slice of the sequence via strided sliding-window ngram search
atomicMin64 CAS loop ensures leftmost-match semantics across all 1024 threads
Early-exit (A2): threads past the current best-known match position skip remaining work — reduces Phase 1 latency significantly when matches cluster early in the sequence
Tentative matched tokens written to scratch buffer (draft_tokens_copy)

Phase 2 — BlockScan Gather <<<1, 1024>>>:

CUB BlockScan::InclusiveSum computes parallel prefix sums of tentative token counts and active-item indicators (dual scan)
Each thread independently computes its budget: threshold - exclusive_prefix - remaining_active_items
Thread truncates its allocation to min(tentative, budget) and copies winning tokens to output
Fully parallel — no serial bottleneck at high batch counts

atomicMin64 — Novel Correctness Primitive

CUDA provides no native 64-bit atomic minimum. When 1024 threads search for ngram matches in parallel, multiple threads find valid matches at different positions — but CPU semantics require the leftmost match to win. atomicMin64 is a custom Compare-And-Swap loop that resolves this lock-free across all 1024 threads per block. No equivalent mechanism exists in vLLM, SGLang, TensorRT-LLM, or llama.cpp (verified April 2026).

Shared device code (ngram_match_common.cuh):

atomicMin64() — 64-bit CAS device function for leftmost-match atomics
parallel_ngram_search() — block-cooperative sliding-window search with early-exit

Zero-copy memory access:

Before (CPU path): 10 D2H + 3 H2D copies per call, each triggering cudaStreamSynchronize
After (CUDA path): All tensors stay on device. Net: 13 sync points → 0.

File Changes

Shared header (1 file):

ngram_match_common.cuh: atomicMin64(), parallel_ngram_search() with A1 (1024 threads) + A2 (early-exit)

CUDA kernels (2 files):

ngram_match.cu: Phase 1 <<<bsz, 1024>>> search + Phase 2 <<<1, 1024>>> BlockScan gather
ngram_match_mixed.cu: Same two-phase architecture for the hybrid MTP variant

Python callers (2 files):

ngram.py: Removed ~10 .cpu() tensor copies — all tensors stay on device
mtp.py: Removed .cpu()/.cuda() round-trips and CUDAPinnedPlace copy

🧠 Design Decisions

Why two-phase (BlockScan not single-phase)?

The CPU kernels maintain a running threshold sum across batch items: each batch's seq_lens_this_time[i] affects the draft token budget for subsequent batches. This is a data-dependent sequential dependency.

Approach	Verdict
Two-phase (search ∥ BlockScan gather)	Chosen — parallelizes both the O(bsz × seq_len) search AND the gather
Fully serial `<<<1,1>>>`	Rejected — Phase 2 becomes bottleneck at high batch counts
Single-phase with prefix-sum	Rejected — threshold depends on match RESULTS (data-dependent)

Kernel differences: `ngram_match` vs `ngram_match_mixed`

Both call the same parallel_ngram_search(). Business-specific differences:

Aspect	`ngram_match`	`ngram_match_mixed`
`write_offset`	`1`	`ori_seq_len_this_time`
`min_ngram_size`	`1` (fixed)	Configurable
Default threshold	128	1024
Batch-skip condition	`seq_lens_encoder > 0`	`ori_seq_len_this_time == 0`

Usage or Command

No API changes — drop-in replacement. Same function signatures, same op registration, same Python call sites.

bash build.sh
python -m pytest tests/spec_decode/test_ngram_gpu_kernel.py -v

Accuracy Tests

CI environment: H1Z1 GPU, CUDA 12.6, Python 3.10 (run_tests_with_coverage job). 12/12 tests passed. See CI Benchmark and Correctness sections above.

Checklist

Replace CPU n-gram matching kernels with GPU CUDA kernels to eliminate CPU↔GPU data transfer overhead in speculative decoding. Key changes: - ngram_match.cc → ngram_match.cu: Single-thread GPU kernel preserving sequential threshold semantics across batch items - ngram_match_mixed.cu: Replace CPU function with __global__ kernel - ngram.py: Remove ~10 .cpu() tensor copies, pass GPU tensors directly - mtp.py: Remove .cpu()/.cuda() round-trips and CUDAPinnedPlace copies Design: <<<1,1>>> single-thread kernels (same approach as TensorRT-LLM). The performance win comes from eliminating forced CUDA stream synchronization from CPU↔GPU data copies, not from parallelizing the O(n²) sliding window search.

paddle-bot · 2026-03-20T17:38:44Z

Thanks for your contribution!

codecov-commenter · 2026-03-20T21:49:58Z

Codecov Report

❌ Patch coverage is 75.00000% with 2 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@0b4c1cb). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/spec_decode/ngram.py	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #6960   +/-   ##
==========================================
  Coverage           ?   73.85%           
==========================================
  Files              ?      376           
  Lines              ?    52930           
  Branches           ?     8260           
==========================================
  Hits               ?    39092           
  Misses             ?    11104           
  Partials           ?     2734

Flag	Coverage Δ
GPU	`73.85% <75.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Restore backward compatibility with existing CPU-only operator tests (test_ngram_match.py, test_hybrid_mtp_ngram.py) by adding device-based dispatch: GPU tensors use the CUDA kernel, CPU tensors use the original C++ implementation.

Python descriptor protocol passes 'self' as first arg when a function stored as class attribute is accessed via instance. Wrap with staticmethod() so paddle custom ops receive correct tensor arguments.

…or in latency test

Reverts line 39 to match develop (keeps .cpu()) so diff-cover no longer flags it as an uncovered changed line. The tensor is moved to GPU via .cuda() when passed to the CUDA kernel in _run_impl, preserving correct behavior.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Copilot · 2026-04-02T19:34:43Z

fastdeploy/spec_decode/ngram.py

        super().__init__(fd_config)
        self.max_ngram_size = self.speculative_config.max_ngram_size
        self.input_ids_len = paddle.zeros(shape=[self.max_num_seqs, 1], dtype="int64").cpu()
+        self.input_ids_len_gpu = paddle.zeros(shape=[self.max_num_seqs, 1], dtype="int64")


input_ids_len_gpu 这里用 paddle.zeros(...) 创建但未显式指定 place；其实际设备取决于当前默认 device。后面 _run_impl() 直接把它作为 GPU op 的输入（而 input_ids_cpu 已 .cuda()），如果默认 device 不是 GPU 或 device_id 不一致，会触发 place mismatch/隐式拷贝甚至报错。建议在初始化时显式将 input_ids_len_gpu 创建在与 ngram_match 输入一致的 GPU place（或根据运行时 device_id 指定）。

Suggested change

self.input_ids_len_gpu = paddle.zeros(shape=[self.max_num_seqs, 1], dtype="int64")

gpu_place = paddle.CUDAPlace(paddle.distributed.ParallelEnv().dev_id)

self.input_ids_len_gpu = paddle.zeros(shape=[self.max_num_seqs, 1], dtype="int64", place=gpu_place)

NgramProposer.__init__ runs after paddle.set_device('gpu') in the serving runner, so paddle.zeros defaults to GPU. Verified by CI — all tests pass on H20. Same issue addressed in earlier review round.

Copilot · 2026-04-02T19:34:43Z

tests/spec_decode/test_ngram_gpu_kernel.py

+    def test_large_batch_long_seq(self):
+        """bsz=256, seq_len=128k — scale the reviewer demanded.
+
+        Uses high threshold to ensure all batches exercise the parallel search
+        path (default threshold=128 would skip all batches at bsz=256).
+        """
+        high_threshold = 100000
+        data = _make_ngram_test_data(batch_size=256, input_len=131072, max_model_len=131072 + 64, seed=77)
+        cpu_draft = data["draft_tokens"].copy()


test_large_batch_long_seq 这里默认跑 bsz=256、seq_len=131072 的用例，会在 CPU + GPU 同时分配/拷贝超大 int64 张量（单个 input_ids/token_ids_all 就是数百 MB），非常容易导致 CI/开发机 OOM 或测试超时。建议把该“压力规模”用例用环境变量开关默认 skip（或改为中等规模做回归），仅在显式开启时运行。

Addressed in follow-up PR #7170 — gated behind RUN_LARGE_NGRAM_TESTS=1 env var.

Copilot · 2026-04-02T19:34:43Z

tests/spec_decode/test_ngram_gpu_kernel.py

+    def test_latency(self):
+        """Benchmark: GPU kernel latency vs CPU transfer overhead."""
+        # Pre-create tensors on GPU (data creation excluded from timing)
+        gpu_data = _to_gpu(_make_ngram_test_data(batch_size=32, input_len=512, seed=42))
+        cpu_data = _make_ngram_test_data(batch_size=32, input_len=512, seed=42)
+
+        # Warmup
+        for _ in range(5):
+            self.ngram_match(
+                gpu_data["input_ids"],
+                gpu_data["input_ids_len"],
+                gpu_data["token_ids_all"],
+                gpu_data["prompt_lens"],
+                gpu_data["step_idx"],
+                gpu_data["draft_token_num"],
+                gpu_data["draft_tokens"],
+                gpu_data["seq_lens_this_time"],
+                gpu_data["seq_lens_encoder"],
+                gpu_data["seq_lens_decoder"],
+                gpu_data["max_dec_len"],
+                3,
+                10,
+            )
+        paddle.device.synchronize()
+
+        # GPU path: kernel execution only (no data creation/transfer)
+        n_runs = 100
+        paddle.device.synchronize()
+        t0 = time.perf_counter()
+        for _ in range(n_runs):
+            self.ngram_match(
+                gpu_data["input_ids"],
+                gpu_data["input_ids_len"],
+                gpu_data["token_ids_all"],
+                gpu_data["prompt_lens"],
+                gpu_data["step_idx"],
+                gpu_data["draft_token_num"],
+                gpu_data["draft_tokens"],
+                gpu_data["seq_lens_this_time"],
+                gpu_data["seq_lens_encoder"],
+                gpu_data["seq_lens_decoder"],
+                gpu_data["max_dec_len"],
+                3,
+                10,
+            )
+            paddle.device.synchronize()
+        t1 = time.perf_counter()
+        gpu_time_ms = (t1 - t0) / n_runs * 1000
+
+        # CPU path: simulate the old copy-to-CPU-and-back pattern
+        paddle.device.synchronize()
+        t0 = time.perf_counter()
+        for _ in range(n_runs):
+            # Simulate old path: copy all tensors CPU→GPU→CPU→GPU
+            cpu_tensors = {k: paddle.to_tensor(v) for k, v in cpu_data.items()}
+            _ = cpu_tensors["draft_tokens"].cuda()
+            _ = cpu_tensors["seq_lens_this_time"].cuda()
+            paddle.device.synchronize()
+        t1 = time.perf_counter()
+        cpu_copy_time_ms = (t1 - t0) / n_runs * 1000


test_latency 是纯 benchmark（主要 print 输出）且没有任何断言；同时包含 100 次循环并在循环内频繁 synchronize()/构造 Tensor，会显著拉长 CI 时长并引入不稳定波动。建议将该用例通过环境变量默认 skip（或移到专门的 benchmark 脚本里），单测里仅保留 correctness 断言。

Addressed in follow-up PR #7170 — gated behind RUN_NGRAM_BENCHMARKS=1 env var.

tests/spec_decode/test_benchmark_ngram_kernel.py

Copilot · 2026-04-02T19:34:44Z

custom_ops/gpu_ops/speculate_decoding/ngram_match.cu

+  int unprocessed_batch_size = 0;
+  for (int i = 0; i < max_batch_size; i++) {
+    if (seq_lens_encoder[i] > 0 || seq_lens_decoder[i] > 0) {
+      unprocessed_batch_size++;
+    }
+  }
+
+  for (int batch_idx = 0; batch_idx < max_batch_size; batch_idx++) {
+    int64_t remaining = max_dec_len[batch_idx] - step_idx[batch_idx] - 1;
+    int max_draft_tokens = static_cast<int>(
+        min(static_cast<int64_t>(draft_token_num[batch_idx]), remaining));
+
+    if (seq_lens_encoder[batch_idx] > 0) {
+      continue;
+    } else if (seq_lens_decoder[batch_idx] == 0) {
+      seq_lens_this_time[batch_idx] = 0;
+      continue;
+    }
+
+    seq_lens_this_time[batch_idx] = 1;
+    unprocessed_batch_size--;
+
+    int sum_token_num = 0;
+    for (int i = 0; i <= batch_idx; i++) {
+      sum_token_num += seq_lens_this_time[i];
+    }
+    int left_min_token_num = unprocessed_batch_size;
+
+    if (sum_token_num + max_draft_tokens + left_min_token_num > threshold) {
+      int tmp = threshold - sum_token_num - left_min_token_num;
+      max_draft_tokens = min(tmp, max_draft_tokens);
+    }
+
+    if (sum_token_num + left_min_token_num >= threshold - 1) {
+      continue;
+    }


Phase 2 的 gather kernel 在单线程内对每个 batch 都重新遍历 seq_lens_this_time[0..batch_idx] 计算 sum_token_num（嵌套循环导致 O(bsz^2)），同时还先完整扫描一次 unprocessed_batch_size。虽然 bsz=256 时还可接受，但这段逻辑属于纯串行路径，batch 上限增大时会放大开销。建议在循环内维护 running sum / running unprocessed 计数，避免重复求和。

By design — PR #6960 uses serial Phase 2 as the baseline. PR #7136 replaces it with O(bsz) BlockScan parallel Phase 2.

Copilot · 2026-04-02T19:34:44Z

custom_ops/gpu_ops/speculate_decoding/draft_model/ngram_match_mixed.cu

+  int unprocessed_batch_size = 0;
+  for (int i = 0; i < max_batch_size; i++) {
+    if (seq_lens_decoder[i] > 0) {
+      unprocessed_batch_size++;
+    }
+  }
+
+  for (int batch_idx = 0; batch_idx < max_batch_size; batch_idx++) {
+    const int ori_seq_len_this_time = seq_lens_this_time[batch_idx];
+    int max_draft_tokens =
+        static_cast<int>(min(static_cast<int64_t>(max_draft_tokens_param -
+                                                  ori_seq_len_this_time + 1),
+                             max_dec_len[batch_idx] - step_idx[batch_idx] - 1));
+
+    if (ori_seq_len_this_time == 0 || max_draft_tokens <= 0) {
+      continue;
+    }
+
+    unprocessed_batch_size--;
+    int sum_token_num = 0;
+    for (int i = 0; i <= batch_idx; i++) {
+      sum_token_num += seq_lens_this_time[i];
+    }
+    int left_min_token_num = unprocessed_batch_size;
+
+    if (sum_token_num + max_draft_tokens + left_min_token_num > threshold) {
+      int tmp = threshold - sum_token_num - left_min_token_num;
+      max_draft_tokens = min(tmp, max_draft_tokens);
+    }
+
+    if (sum_token_num + left_min_token_num >= threshold - 1) {
+      continue;
+    }


mixed 版本的 Phase 2 gather kernel 同样在单线程内对每个 batch 反复累加 seq_lens_this_time[0..batch_idx] 计算 sum_token_num（O(bsz^2)），并先扫描一次 unprocessed_batch_size。该 kernel 是串行阶段，batch 上限增大时这部分会成为可见开销。建议改为维护 running sum / running unprocessed，避免每步重复求和。

Same — serial Phase 2 is the baseline in this PR. Replaced by BlockScan in #7136.

fastdeploy-bot

🤖 AI Code Review | 2026-04-03 11:15 CST

📋 Review 摘要

PR 概述：将 speculative decoding 中的 ngram_match 和 hybrid_mtp_ngram kernels 从 CPU 迁移到 CUDA，消除 Device↔Host 数据拷贝，实现 1.38× 加速。

变更范围：

custom_ops/gpu_ops/speculate_decoding/ - 新增 CUDA kernels
fastdeploy/spec_decode/ - Python 调用端适配
tests/spec_decode/ - 新增 GPU kernel 正确性和性能测试

影响面 Tag：[Speculative Decoding] [OP]

问题

未发现阻塞性问题。

总体评价

这是一个高质量的性能优化 PR。Two-phase parallel 架构设计合理：Phase 1 利用 GPU 并行性加速 O(bsz × seq_len) 的滑动窗口搜索，Phase 2 保留串行执行以满足 batch 间的数据依赖。代码实现正确：

atomicMin64 CAS 循环正确实现了 leftmost-match 语义
parallel_ngram_search 中的 __syncthreads() 确保线程同步
GPU/CPU 双路径设计保持了后向兼容性
测试覆盖了 bsz=256, seq_len=128k 的大规模场景

建议考虑以下小改进（非阻塞）：

在 kernel launch 后添加 CUDA error checking（如 PADDLE_ENFORCE_CUDA_SUCCESS）便于调试
ngram_match_gather_kernel 可考虑与 ngram_match_mixed_gather_kernel 保持一致，显式检查 max_draft_tokens <= 0

Benchmark groups 1-5 now run unconditionally in CI (~9s total). Env-gates moved to separate PR PaddlePaddle#7170.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Copilot · 2026-04-03T05:30:42Z

tests/spec_decode/test_ngram_gpu_kernel.py

+    def test_latency(self):
+        """Benchmark: GPU kernel latency vs CPU transfer overhead."""
+        # Pre-create tensors on GPU (data creation excluded from timing)
+        gpu_data = _to_gpu(_make_ngram_test_data(batch_size=32, input_len=512, seed=42))
+        cpu_data = _make_ngram_test_data(batch_size=32, input_len=512, seed=42)
+
+        # Warmup
+        for _ in range(5):
+            self.ngram_match(
+                gpu_data["input_ids"],
+                gpu_data["input_ids_len"],
+                gpu_data["token_ids_all"],
+                gpu_data["prompt_lens"],
+                gpu_data["step_idx"],
+                gpu_data["draft_token_num"],
+                gpu_data["draft_tokens"],
+                gpu_data["seq_lens_this_time"],
+                gpu_data["seq_lens_encoder"],
+                gpu_data["seq_lens_decoder"],
+                gpu_data["max_dec_len"],
+                3,
+                10,
+            )
+        paddle.device.synchronize()
+
+        # GPU path: kernel execution only (no data creation/transfer)
+        n_runs = 100
+        paddle.device.synchronize()
+        t0 = time.perf_counter()
+        for _ in range(n_runs):
+            self.ngram_match(
+                gpu_data["input_ids"],
+                gpu_data["input_ids_len"],
+                gpu_data["token_ids_all"],
+                gpu_data["prompt_lens"],
+                gpu_data["step_idx"],
+                gpu_data["draft_token_num"],
+                gpu_data["draft_tokens"],
+                gpu_data["seq_lens_this_time"],
+                gpu_data["seq_lens_encoder"],
+                gpu_data["seq_lens_decoder"],
+                gpu_data["max_dec_len"],
+                3,
+                10,
+            )
+            paddle.device.synchronize()
+        t1 = time.perf_counter()


test_latency 是纯 benchmark（主要打印耗时），没有断言且包含 100 次循环 + 每次 synchronize()/构造张量，容易显著拉长 CI 时长并引入波动。建议将该方法默认 skipTest（如 RUN_NGRAM_BENCHMARKS=1 才启用）或迁移到单独的 benchmark 脚本，单测文件仅保留 correctness 断言。

Addressed in follow-up PR #7170 — test_latency gated behind RUN_NGRAM_BENCHMARKS=1 env var. Default CI execution skips it.

Copilot · 2026-04-03T05:30:42Z

tests/spec_decode/test_benchmark_ngram_kernel.py

+class TestNgramBenchmarkGroups(unittest.TestCase):
+    """Multi-dimension benchmark matching NKNaN's 5-group methodology."""
+
+    @classmethod
+    def setUpClass(cls):
+        if not paddle.is_compiled_with_cuda():
+            raise unittest.SkipTest("CUDA not available")
+        paddle.set_device("gpu")
+        try:
+            from fastdeploy.model_executor.ops.gpu import ngram_match
+
+            cls.ngram_match = staticmethod(ngram_match)
+        except Exception as e:
+            raise unittest.SkipTest(f"Cannot import ngram_match op: {e}")
+


该文件包含多组长时间 benchmark（NUM_ITERS=1000 且每组遍历多个维度），但以 test_*.py + unittest.TestCase 的形式放在 tests 下会被默认测试流程收集执行，极易导致 CI 超时/资源占用。建议在 setUpClass 里通过环境变量（如 RUN_NGRAM_BENCHMARKS=1）默认 SkipTest，或将脚本移出单测目录/改名避免被 test discovery 执行。

Addressed in follow-up PR #7170 — benchmark file gated behind RUN_NGRAM_BENCHMARKS=1 env var. Default CI/pytest collection skips it.

Copilot · 2026-04-03T05:30:42Z

custom_ops/gpu_ops/speculate_decoding/ngram_match.cu

+    int sum_token_num = 0;
+    for (int i = 0; i <= batch_idx; i++) {
+      sum_token_num += seq_lens_this_time[i];
+    }
+    int left_min_token_num = unprocessed_batch_size;


ngram_match_gather_kernel 在单线程内对每个 batch 都通过内层循环重复累加 seq_lens_this_time[0..batch_idx] 计算 sum_token_num，整体复杂度为 O(bsz^2)。即使保持串行 Phase 2 语义，也可以用一个 running sum（每轮加上当前 batch 的 seq_lens_this_time）把复杂度降到 O(bsz)，减少 batch 上限增大时的串行瓶颈。

Already fixed in 00885dd — sum_token_num is now an O(bsz) running sum (L117). The inner loop was removed; each iteration does sum_token_num += ... incrementally.

Copilot · 2026-04-03T05:30:43Z

custom_ops/gpu_ops/speculate_decoding/draft_model/ngram_match_mixed.cu

+    unprocessed_batch_size--;
+    int sum_token_num = 0;
+    for (int i = 0; i <= batch_idx; i++) {
+      sum_token_num += seq_lens_this_time[i];
+    }
+    int left_min_token_num = unprocessed_batch_size;


ngram_match_mixed_gather_kernel 串行阶段同样在每个 batch 上通过内层循环重复计算 sum_token_num += seq_lens_this_time[i] (i<=batch_idx)，导致 O(bsz^2) 的额外开销。建议在保持串行阈值依赖语义的前提下，改为维护 running sum / running unprocessed 计数，避免每步重复求和。

Already fixed in 00885dd — sum_token_num in the mixed gather kernel is also an O(bsz) running sum (L117). Same incremental accumulation pattern.

- ngram.py: explicit .cuda() on input_ids_len_gpu to ensure GPU even if default device is not set at init time - test_ngram_gpu_kernel.py: use CPUPlace() in latency benchmark CPU path to measure actual D2H/H2D roundtrip instead of GPU→GPU no-op

- ngram_match.cu: replace O(bsz²) inner loop with running sum_token_num - ngram_match.cu: add max_draft_tokens <= 0 early continue (parity with mixed) - ngram_match_mixed.cu: replace O(bsz²) inner loop with running sum_token_num - Both: adjust running sum after draft token production Addresses Copilot review comments about O(bsz²) sum_token_num recalculation and fastdeploy-bot suggestion for defensive check.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Copilot · 2026-04-03T06:07:32Z

tests/spec_decode/test_ngram_gpu_kernel.py

+    def test_large_batch_long_seq(self):
+        """bsz=256, seq_len=128k — scale the reviewer demanded.
+
+        Uses high threshold to ensure all batches exercise the parallel search
+        path (default threshold=128 would skip all batches at bsz=256).
+        """
+        high_threshold = 100000
+        data = _make_ngram_test_data(batch_size=256, input_len=131072, max_model_len=131072 + 64, seed=77)


这个用例默认跑 bsz=256 + seq_len=131072，会在 CPU/GPU 分别分配并拷贝超大 int64 张量（input_ids/token_ids_all 等），在 CI/开发机上非常容易 OOM 或超时。建议默认 skip，并通过环境变量（例如 RUN_LARGE_NGRAM_TESTS=1）或根据显存大小条件开启。

Addressed in follow-up PR #7170 — test_large_batch_long_seq gated behind RUN_LARGE_NGRAM_TESTS=1 env var.

Copilot · 2026-04-03T06:07:33Z

tests/spec_decode/test_benchmark_ngram_kernel.py

+class TestNgramBenchmarkGroups(unittest.TestCase):
+    """Multi-dimension benchmark matching NKNaN's 5-group methodology."""
+
+    @classmethod
+    def setUpClass(cls):
+        if not paddle.is_compiled_with_cuda():
+            raise unittest.SkipTest("CUDA not available")
+        paddle.set_device("gpu")
+        try:
+            from fastdeploy.model_executor.ops.gpu import ngram_match
+
+            cls.ngram_match = staticmethod(ngram_match)
+        except Exception as e:
+            raise unittest.SkipTest(f"Cannot import ngram_match op: {e}")
+
+    def test_group1_seq_len(self):
+        """Group 1: Vary seq_len with fixed batch=16, threshold=512, hit=low_input."""
+        seq_lens = [1024, 4096, 16384, 65536, 131072]


该文件以 unittest test_* 形式实现了多组基准测试（NUM_ITERS=1000，且每组多维循环），默认会被 pytest/CI 作为单测执行，极易导致超时且输出大量 print。建议将其移出 tests（例如放到 benchmarks/ 或 scripts/），或在 setUpClass / 每个 test_* 中用环境变量开关默认 skip。

Addressed in follow-up PR #7170 — benchmark file gated behind RUN_NGRAM_BENCHMARKS=1 env var.

Copilot · 2026-04-03T06:07:33Z

custom_ops/gpu_ops/speculate_decoding/draft_model/ngram_match_mixed.cu

+__global__ void ngram_match_mixed_gather_kernel(
+    const int64_t *input_ids,
+    const int64_t *input_ids_len,
+    const int64_t *pre_ids,
+    const int64_t *step_idx,
+    const int *draft_token_num,
+    int64_t *draft_tokens,
+    int32_t *seq_lens_this_time,
+    const int32_t *seq_lens_decoder,


ngram_match_mixed_gather_kernel 的形参 draft_token_num 在 kernel 体内完全未使用（CPU 参考实现同样未使用）。这会增加阅读困惑并可能触发编译器 unused-parameter 警告；建议删除该形参并相应调整 launch，或明确注释说明其保留原因/未来用途。

Fixed in cf6a664 — removed draft_token_num from both the GPU gather kernel signature/launch and the CPU fallback signature/call. The kernel uses max_draft_tokens_param (scalar); the per-batch draft_token_num array was never read.

fastdeploy-bot

🤖 AI Code Review | 2026-04-03 15:35 CST

📋 Review 摘要

PR 概述：将 ngram_match 和 hybrid_mtp_ngram 两个 speculative decoding 内核从 CPU 移植到 CUDA，采用两阶段并行架构消除 D2H/H2D 数据拷贝。

变更范围：custom_ops/gpu_ops/speculate_decoding/（CUDA 内核）、fastdeploy/spec_decode/（Python 调用）

影响面 Tag：[Speculative Decoding] [OP]

问题

级别	文件	概述
❓ 疑问	`ngram_match.cu:122`	GPU/CPU 路径 threshold 计算逻辑略有差异，需确认是否有意为之
🟡 建议	`ngram_match_common.cuh:45`	atomicMin64 的初始读取是非原子的，建议添加注释说明

总体评价

这是一个高质量的性能优化 PR，架构设计清晰（两阶段并行：Phase 1 并行搜索 + Phase 2 串行 threshold 处理），代码注释完善，测试覆盖充分（bsz=256, seq_len=128k）。消除了 13 个 CUDA 同步点，实现 1.38× 加速。建议确认 GPU/CPU 路径的 threshold 累加逻辑一致性。

fastdeploy-bot · 2026-04-03T07:35:52Z

custom_ops/gpu_ops/speculate_decoding/ngram_match.cu

+    int max_draft_tokens = static_cast<int>(
+        min(static_cast<int64_t>(draft_token_num[batch_idx]), remaining));
+
+    if (seq_lens_encoder[batch_idx] > 0) {


❓ 疑问 GPU 路径与 CPU 路径的 threshold 累加逻辑存在差异

在 GPU 路径中，当 seq_lens_encoder[batch_idx] > 0 时会累加 sum_token_num += seq_lens_this_time[batch_idx]，但在 CPU 路径（第 224 行）中直接 continue 跳过，依赖后续的 sum_cpu() 函数在下一个有效 batch 时重新计算累加和。

虽然从数学上 running sum 优化（O(n) vs O(n²)）应该等价，但请确认当 seq_lens_encoder[batch_idx] > 0 时 seq_lens_this_time[batch_idx] 的输入值是否总是符合预期（例如已由 encoder 阶段正确设置），以确保 GPU/CPU 路径行为一致。

Good observation — the GPU and CPU paths are mathematically equivalent but the code reads differently. GPU: explicit sum_token_num += seq_lens_this_time[batch_idx] (O(n) running sum). CPU: continue skips the item, but sum_cpu() on the next iteration re-scans 0..batch_idx, implicitly including it.

Added a clarifying comment in cf6a664 explaining this equivalence.

fastdeploy-bot · 2026-04-03T07:35:52Z

custom_ops/gpu_ops/speculate_decoding/ngram_match_common.cuh

+__device__ __forceinline__ void atomicMin64(int64_t *addr, int64_t val) {
+  unsigned long long *addr_ull = reinterpret_cast<unsigned long long *>(addr);
+  unsigned long long val_ull = static_cast<unsigned long long>(val);
+  unsigned long long old = *addr_ull;


🟡 建议 建议为 CAS 模式添加简要注释

unsigned long long old = *addr_ull; 这里的初始读取是非原子的，这在 CAS 循环模式中是标准做法（因为即使读到过期值，后续 CAS 会检测并重试），但对于不熟悉这种模式的读者可能会产生疑惑。

建议添加一行注释说明这是标准的 CAS 模式，初始的非原子读取是安全的：

// Initial non-atomic read is safe; CAS loop handles races unsigned long long old = *addr_ull;

Fixed in cf6a664 — added clarifying comment: "Non-atomic initial read is intentional: the CAS loop below detects and retries on any stale value, so a torn read here is harmless."

freeliuzc · 2026-04-03T07:45:22Z

@luotao1

@freeliuzc

抱歉打扰一下。关于「性能更好」的判断，我这里有一些数据想补充一下。

#7103 在生产常用 batch size（32~512）下的 profiling 显示，它反而比 CPU baseline 慢了 2–3 倍（见作者自己 repo 的数据）：

batch CPU (µs) #7103 v3 (µs) 结果
32 414 1381 0.30×
128 109 223 0.49×
512 136 434 0.31×
而 #6960 / #7136 在 H100 SM90 上经过完整 CI、修复了多个 correctness bugs（encoder init、dead writes、stream handling 等），速度提升在 1.27–1.43×。

#7103 目前测试覆盖较少，也没有完整的 benchmark。

我建议重新 review #7136（或 reopen #6960），避免引入 regression。数据都是公开的，欢迎一起讨论。

同学你好，6960 的 Kernel，ncu profiler 的最差时间是300us左右(bsz256 + 128k) ；由于还没有截断提前停止策略，在匹配靠前的位置， CPU会快几倍。你的 Kernel 目前看着是 ms 级别哈

cloudforge1 · 2026-04-03T09:17:36Z

同学你好，6960 的 Kernel，ncu profiler 的最差时间是300us左右(bsz256 + 128k) ；由于还没有截断提前停止策略，在匹配靠前的位置， CPU会快几倍。你的 Kernel 目前看着是 ms 级别哈

「ms 级别」 — PR body 已更新：25 configs 中最差 720 µs (bsz=512)，生产配置 (≤128) 全部 ≤217 µs，latency 32 µs。

「CPU在匹配靠前时快几倍」 — Group 3 覆盖全部 5 种 hit pattern 包括 early match，GPU 107–152 µs vs CPU 791–796 µs，加速 5.2–7.4×，零个 CPU 胜出的配置。

「ncu 最差 300µs at bsz=256+128K」 — 与我们的数据一致（Group 2 中 bsz=128→217µs，Group 1 中 seq=128K→190µs，外推 ~300–500µs），这是在验证我们的 kernel 而不是反驳它。

关于 benchmark 策略差异： 我们最初的 benchmark 面向部署场景——端到端 latency（batch=32, input_len=512, 含 D2H/H2D 消除验证），量化零拷贝设计在真实推理路径的收益。#7103 提交后，我们按其完全相同的 5 组维度重跑了 benchmark，以便同坐标系直接对比。

PR #6960 benchmark（H100 SM90, 1000 iterations，与 #7103 相同 5 组场景）：

Group	Key variable	GPU range	CPU range	Speedup
1	seq_len 1K→128K	81–190 µs	250–9,830 µs	3.1–51.7×
2	batch 1→512	77–720 µs	244–72,269 µs	3.2–100.4×
3	hit type (5 patterns)	107–152 µs	791–796 µs	5.2–7.4×
4	threshold 16→256	93–100 µs	543–546 µs	5.5–5.9×
5	thresh×batch (bsz=128)	120–131 µs	20,331–20,938 µs	155–174×

CI 日志验证：job 69813058358 — 搜索 Group 1: seq_len … Group 5: threshold×batch 可定位 5 组 benchmark 原始输出；搜索 LATENCY BENCHMARK 可定位原始端到端 latency 测试（GPU kernel (zero-copy) / CPU path (copy overhead)）。

反向对比 — #7103 作者自己的 CSV：

batch	CPU (µs)	#7103 v3 (µs)	#6960 (µs)	结果
32	415	1,381	111	#7103 比 CPU 慢 3.3×
128	109	223	217	#7103 比 CPU 慢 2×
512	136	434	720	#7103 比 CPU 慢 3.2×

#7103 在 max_num_seqs 默认值覆盖的生产 batch 范围内，regression 分析是否已完成？

…ulation, fix negative intermediate - Remove unused draft_token_num from mixed gather kernel signature+launch (#3031594561) - Add clarifying comment for encoder-active threshold accumulation (#3031826676) - Add CAS non-atomic initial read comment in atomicMin64 (#3031826678) - Split draft_budget into explicit int64_t steps to avoid negative intermediate (#3031240456)

Merge from PaddlePaddle#7136: replace serial <<<1,1>>> Phase 2 with CUB BlockScan <<<1, 1024>>> parallel gather. Phase 1 upgraded from 256 to 1024 threads with early-exit optimization. Key changes: - Phase 2: serial threshold loop → BlockScan prefix-sum (parallel) - Phase 1: 256→1024 threads per block (4× parallelism) - Early-exit: skip positions past current best match in search loop - NgramMatchResult struct → scratch buffers (draft_tokens_copy) CI benchmarks (from PaddlePaddle#7136 BlockScan branch): Latency: 21 µs/call (was 32 µs serial, 270 µs CPU) Peak: 722× speedup at bsz=512 (was 174× serial)

cloudforge1 added 12 commits March 6, 2026 10:30

Merge remote-tracking branch 'upstream/develop' into develop

daf20d9

Merge remote-tracking branch 'upstream/develop' into develop

6f1e63c

Merge remote-tracking branch 'upstream/develop' into develop

4deb7a7

Merge remote-tracking branch 'upstream/develop' into develop

676daf6

Merge remote-tracking branch 'upstream/develop' into develop

9bcfdca

Merge remote-tracking branch 'upstream/develop' into develop

2bfa878

Merge remote-tracking branch 'upstream/develop' into develop

262c470

Merge remote-tracking branch 'upstream/develop' into develop

171b4d3

Merge remote-tracking branch 'upstream/develop' into develop

def0bd2

Merge remote-tracking branch 'upstream/develop' into develop

4fad5dc

Add correctness + latency test for GPU ngram kernels

477f749

cloudforge1 temporarily deployed to Metax_ci March 20, 2026 17:38 — with GitHub Actions Inactive

paddle-bot bot added the contributor External developers label Mar 20, 2026

Fix test data: step_idx semantics and ngram-matchable patterns

c349b12

cloudforge1 temporarily deployed to Metax_ci March 20, 2026 18:29 — with GitHub Actions Inactive

cloudforge1 marked this pull request as draft March 21, 2026 05:56

cloudforge1 had a problem deploying to Metax_ci March 21, 2026 06:39 — with GitHub Actions Error

cloudforge1 changed the title ~~【Hackathon 10th Spring No.49】Port ngram_match and hybrid_mtp_ngram kernels to CUDA~~ [Optimization]【Hackathon 10th Spring No.49】Port ngram_match and hybrid_mtp_ngram kernels to CUDA Mar 21, 2026

cloudforge1 force-pushed the task/049-spec-decode-gpu-kernel branch from 0346e8a to 217e587 Compare March 21, 2026 06:44

cloudforge1 temporarily deployed to Metax_ci March 21, 2026 06:45 — with GitHub Actions Inactive

fix(test): wrap imported ops with staticmethod to prevent self-binding

08fe00a

Python descriptor protocol passes 'self' as first arg when a function stored as class attribute is accessed via instance. Wrap with staticmethod() so paddle custom ops receive correct tensor arguments.

cloudforge1 temporarily deployed to Metax_ci March 21, 2026 12:11 — with GitHub Actions Inactive

fix(test): ensure max_model_len >= input_len to prevent broadcast err…

305868d

…or in latency test

cloudforge1 temporarily deployed to Metax_ci March 21, 2026 19:35 — with GitHub Actions Inactive

cloudforge1 temporarily deployed to Metax_ci March 22, 2026 04:47 — with GitHub Actions Inactive

cloudforge1 force-pushed the task/049-spec-decode-gpu-kernel branch from b7155eb to c6e698f Compare April 2, 2026 19:28

cloudforge1 had a problem deploying to Metax_ci April 2, 2026 19:28 — with GitHub Actions Error

Copilot started reviewing on behalf of cloudforge1 April 2, 2026 19:28 View session

cloudforge1 mentioned this pull request Apr 2, 2026

[CI] Gate expensive ngram tests behind env vars #7170

Open

3 tasks

Copilot AI reviewed Apr 2, 2026

View reviewed changes

fix: correct stale filename in benchmark docstring

3ddae7f

cloudforge1 had a problem deploying to Metax_ci April 2, 2026 19:40 — with GitHub Actions Failure

fastdeploy-bot reviewed Apr 3, 2026

View reviewed changes

bench: remove env-gate from benchmark groups, cut NUM_ITERS to 1000

2e34b72

Benchmark groups 1-5 now run unconditionally in CI (~9s total). Env-gates moved to separate PR PaddlePaddle#7170.

Copilot AI review requested due to automatic review settings April 3, 2026 05:25

cloudforge1 had a problem deploying to Metax_ci April 3, 2026 05:26 — with GitHub Actions Failure

Copilot started reviewing on behalf of cloudforge1 April 3, 2026 05:26 View session

Copilot AI reviewed Apr 3, 2026

View reviewed changes

cloudforge1 had a problem deploying to Metax_ci April 3, 2026 05:53 — with GitHub Actions Error

Copilot AI review requested due to automatic review settings April 3, 2026 06:02

cloudforge1 temporarily deployed to Metax_ci April 3, 2026 06:02 — with GitHub Actions Inactive

Copilot started reviewing on behalf of cloudforge1 April 3, 2026 06:03 View session

Copilot AI reviewed Apr 3, 2026

View reviewed changes

fastdeploy-bot reviewed Apr 3, 2026

View reviewed changes

cloudforge1 had a problem deploying to Metax_ci April 3, 2026 21:30 — with GitHub Actions Failure

cloudforge1 had a problem deploying to Metax_ci April 3, 2026 22:31 — with GitHub Actions Failure

	self.input_ids_len_gpu = paddle.zeros(shape=[self.max_num_seqs, 1], dtype="int64")
	gpu_place = paddle.CUDAPlace(paddle.distributed.ParallelEnv().dev_id)
	self.input_ids_len_gpu = paddle.zeros(shape=[self.max_num_seqs, 1], dtype="int64", place=gpu_place)

Conversation

cloudforge1 commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Benchmark Comparison (#6960 CI · #7136 CI)

Correctness: 12/12 tests + 8 subtests PASSED

Modifications

File Changes

Why two-phase (BlockScan not single-phase)?

Kernel differences: ngram_match vs ngram_match_mixed

Usage or Command

Accuracy Tests

Checklist

Uh oh!

paddle-bot bot commented Mar 20, 2026

Uh oh!

codecov-commenter commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fastdeploy-bot left a comment

Choose a reason for hiding this comment

📋 Review 摘要

问题

总体评价

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cloudforge1 commented Mar 20, 2026 •

edited

Loading

Kernel differences: `ngram_match` vs `ngram_match_mixed`

codecov-commenter commented Mar 20, 2026 •

edited

Loading

cloudforge1 commented Apr 3, 2026 •

edited

Loading