629.26.759.978 I slot print_timing: id 0 | task 2045 | n_decoded = 177, tg = 28.53 t/s
629.26.863.997 I slot update_slots: id 0 | task 2045 | accepted 0/ 4 draft tokens
629.26.968.266 I slot update_slots: id 0 | task 2045 | accepted 3/ 4 draft tokens
629.27.073.847 I slot update_slots: id 0 | task 2045 | accepted 3/ 4 draft tokens
629.27.178.694 I slot update_slots: id 0 | task 2045 | accepted 3/ 4 draft tokens
629.27.296.536 I slot update_slots: id 0 | task 2045 | accepted 0/ 4 draft tokens
629.27.413.468 I reasoning-budget: deactivated (natural end)
629.27.414.263 I slot update_slots: id 0 | task 2045 | accepted 4/ 4 draft tokens
629.27.516.391 I slot update_slots: id 0 | task 2045 | accepted 0/ 4 draft tokens
629.27.516.497 W slot process_toke: id 0 | task 2045 | raw tool marker observed while lazy grammar is enabled; keeping DFlash governed by active grammar boundary in_reasoning=0 n_decoded=197 reasoning_tokens=195 visible_tokens=2
629.27.618.565 I slot update_slots: id 0 | task 2045 | accepted 0/ 4 draft tokens
629.29.788.390 I slot print_timing: id 0 | task 2045 | n_decoded = 231, tg = 25.02 t/s
629.32.847.982 I slot print_timing: id 0 | task 2045 | n_decoded = 278, tg = 22.62 t/s
629.35.850.253 I slot print_timing: id 0 | task 2045 | n_decoded = 324, tg = 21.18 t/s
629.38.858.690 I slot print_timing: id 0 | task 2045 | n_decoded = 370, tg = 20.22 t/s
629.41.886.633 I slot print_timing: id 0 | task 2045 | n_decoded = 416, tg = 19.50 t/s
629.44.909.799 I slot print_timing: id 0 | task 2045 | n_decoded = 462, tg = 18.97 t/s
629.47.920.540 I slot print_timing: id 0 | task 2045 | n_decoded = 508, tg = 18.56 t/s
629.50.953.961 I slot print_timing: id 0 | task 2045 | n_decoded = 554, tg = 18.22 t/s
629.54.010.802 I slot print_timing: id 0 | task 2045 | n_decoded = 600, tg = 17.93 t/s
629.57.042.537 I slot print_timing: id 0 | task 2045 | n_decoded = 646, tg = 17.71 t/s
630.00.064.647 I slot print_timing: id 0 | task 2045 | n_decoded = 690, tg = 17.46 t/s
630.03.096.848 I slot print_timing: id 0 | task 2045 | n_decoded = 730, tg = 17.16 t/s
630.06.110.195 I slot print_timing: id 0 | task 2045 | n_decoded = 770, tg = 16.90 t/s
630.09.120.615 I slot print_timing: id 0 | task 2045 | n_decoded = 810, tg = 16.68 t/s
630.12.146.633 I slot print_timing: id 0 | task 2045 | n_decoded = 850, tg = 16.48 t/s
630.15.165.119 I slot print_timing: id 0 | task 2045 | n_decoded = 890, tg = 16.30 t/s
630.18.188.916 I slot print_timing: id 0 | task 2045 | n_decoded = 930, tg = 16.14 t/s
630.21.232.674 I slot print_timing: id 0 | task 2045 | n_decoded = 970, tg = 15.99 t/s
630.24.273.563 I slot print_timing: id 0 | task 2045 | n_decoded = 1010, tg = 15.85 t/s
630.27.286.564 I slot print_timing: id 0 | task 2045 | n_decoded = 1050, tg = 15.73 t/s
630.30.316.985 I slot print_timing: id 0 | task 2045 | n_decoded = 1090, tg = 15.62 t/s
630.33.352.460 I slot print_timing: id 0 | task 2045 | n_decoded = 1130, tg = 15.52 t/s
630.36.387.337 I slot print_timing: id 0 | task 2045 | n_decoded = 1170, tg = 15.43 t/s
630.39.416.057 I slot print_timing: id 0 | task 2045 | n_decoded = 1210, tg = 15.34 t/s
630.42.458.631 I slot print_timing: id 0 | task 2045 | n_decoded = 1250, tg = 15.26 t/s
630.45.498.988 I slot print_timing: id 0 | task 2045 | n_decoded = 1290, tg = 15.19 t/s
630.48.547.724 I slot print_timing: id 0 | task 2045 | n_decoded = 1330, tg = 15.12 t/s
630.51.580.296 I slot print_timing: id 0 | task 2045 | n_decoded = 1370, tg = 15.05 t/s
630.54.631.960 I slot print_timing: id 0 | task 2045 | n_decoded = 1410, tg = 14.99 t/s
630.57.666.954 I slot print_timing: id 0 | task 2045 | n_decoded = 1450, tg = 14.93 t/s
631.00.723.064 I slot print_timing: id 0 | task 2045 | n_decoded = 1490, tg = 14.88 t/s
631.03.773.104 I slot print_timing: id 0 | task 2045 | n_decoded = 1530, tg = 14.82 t/s
631.06.829.912 I slot print_timing: id 0 | task 2045 | n_decoded = 1570, tg = 14.77 t/s
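The `tg` value in the `print_timing` lines looks like a running average over the whole task, so it understates how slow generation is at any given moment. A minimal sketch for recovering the per-interval rate from consecutive `print_timing` lines is below. It assumes the timestamp prefix is `minutes.seconds.milliseconds.microseconds`, which is inferred from the rollover from `629.57` to `630.00` and is not confirmed by any documentation. The helper names `parse_timing` and `interval_rates` are made up for illustration.

```python
import re

# Assumed timestamp layout: MMM.SS.mmm.uuu (minutes, seconds, millis, micros).
TIMING_RE = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)\.(\d+) I slot print_timing: .*n_decoded = (\d+)"
)

def parse_timing(line):
    """Return (time_in_seconds, n_decoded) for a print_timing line, else None."""
    m = TIMING_RE.match(line)
    if m is None:
        return None
    minutes, seconds, millis, micros, n_decoded = (int(g) for g in m.groups())
    t = minutes * 60 + seconds + millis / 1e3 + micros / 1e6
    return t, n_decoded

def interval_rates(lines):
    """Tokens per second between each pair of consecutive print_timing lines."""
    samples = [s for s in map(parse_timing, lines) if s is not None]
    return [
        (n1, (n1 - n0) / (t1 - t0))
        for (t0, n0), (t1, n1) in zip(samples, samples[1:])
        if t1 > t0
    ]

# The last three print_timing lines from the log above.
log = [
    "631.00.723.064 I slot print_timing: id 0 | task 2045 | n_decoded = 1490, tg = 14.88 t/s",
    "631.03.773.104 I slot print_timing: id 0 | task 2045 | n_decoded = 1530, tg = 14.82 t/s",
    "631.06.829.912 I slot print_timing: id 0 | task 2045 | n_decoded = 1570, tg = 14.77 t/s",
]

for n_decoded, rate in interval_rates(log):
    print(f"n_decoded={n_decoded}: {rate:.2f} t/s in this interval")
```

Under that timestamp assumption, the last intervals work out to roughly 13 t/s (40 tokens in about 3.05 s), while the reported running average is still 14.8 t/s.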
Name and Version
beellama v0.3.2 (preview)
Operating systems
Windows
GGML backends
CUDA
Hardware
RTX 5060 Ti 16 GB + Ryzen 7 7500x
Models
MTP Qwen3.6-27B-UD-Q2_K_XL.gguf
Problem description & steps to reproduce
When running an MTP model with
--spec-type draft-mtp --spec-draft-n-max 4
the DFlash settings take control once a raw tool marker is observed, and generation speed drops from 47 t/s (code generation) or 32 t/s (reasoning) to 25 t/s, then keeps descending to 13 t/s and lower. The model fits in VRAM with extra space available.
Prompt:
Write a single-file HTML/JS/CSS Windows-style desktop application with six apps, a start menu, a task manager, and a console. No external dependencies. The whole desktop and each application must be ui friendly ,beautiful, fully functional, meaningful and useful, including all tabls and settings. Output to desktop.html
Run the prompt in Kilo Code, which forces the model to call the "Write" tool rather than output the HTML directly.
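To see how much of each 4-token draft is kept, the `accepted k/ 4 draft tokens` lines in the log can be summed. This is only a sketch: the regular expression is written against the line format shown in the log, and the helper name `acceptance_rate` is hypothetical.

```python
import re

# Matches e.g. "accepted 3/ 4 draft tokens" as printed by update_slots.
ACCEPT_RE = re.compile(r"accepted\s+(\d+)/\s*(\d+) draft tokens")

def acceptance_rate(lines):
    """Return (accepted, drafted, ratio) summed over all matching log lines."""
    accepted = drafted = 0
    for line in lines:
        m = ACCEPT_RE.search(line)
        if m:
            accepted += int(m.group(1))
            drafted += int(m.group(2))
    ratio = accepted / drafted if drafted else 0.0
    return accepted, drafted, ratio

# The update_slots lines from the log excerpt, shortened to the relevant part.
log = [
    "I slot update_slots: id 0 | task 2045 | accepted 0/ 4 draft tokens",
    "I slot update_slots: id 0 | task 2045 | accepted 3/ 4 draft tokens",
    "I slot update_slots: id 0 | task 2045 | accepted 3/ 4 draft tokens",
    "I slot update_slots: id 0 | task 2045 | accepted 3/ 4 draft tokens",
    "I slot update_slots: id 0 | task 2045 | accepted 0/ 4 draft tokens",
    "I slot update_slots: id 0 | task 2045 | accepted 4/ 4 draft tokens",
    "I slot update_slots: id 0 | task 2045 | accepted 0/ 4 draft tokens",
    "I slot update_slots: id 0 | task 2045 | accepted 0/ 4 draft tokens",
]

accepted, drafted, ratio = acceptance_rate(log)
print(f"accepted {accepted} of {drafted} draft tokens ({ratio:.1%})")
```

For the eight `update_slots` lines in the excerpt this gives 13 of 32 draft tokens accepted. No `update_slots` lines appear in the excerpt after the raw tool marker warning, so it does not show the acceptance rate during the slow phase.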
First Bad Commit
No response
Relevant log output
Logs