Eval bug: DFlash kills inference speed on non-DFlash inference (MTP, tool calls) #58

@ethernidee

Description

Name and Version

beellama v0.3.2 (preview)

Operating systems

Windows

GGML backends

CUDA

Hardware

RTX 5060 Ti 16 GB + Ryzen 7 7500x

Models

MTP Qwen3.6-27B-UD-Q2_K_XL.gguf

Problem description & steps to reproduce

When running an MTP model with --spec-type draft-mtp --spec-draft-n-max 4, DFlash settings take over once a raw tool marker is observed, and inference speed drops from 47 t/s (code generation) or 32 t/s (reasoning) to 25 t/s, then keeps falling to 13 t/s and lower. The model fits in VRAM with room to spare.

Prompt:

Write a single-file HTML/JS/CSS Windows-style desktop application with six apps, a start menu, a task manager, and a console. No external dependencies. The whole desktop and each application must be ui friendly ,beautiful, fully functional, meaningful and useful, including all tabls and settings. Output to desktop.html

Run the prompt in Kilo Code, which forces the model to call the "Write" tool rather than output the HTML directly.

First Bad Commit

No response

Relevant log output

629.26.759.978 I slot print_timing: id  0 | task 2045 | n_decoded =    177, tg =  28.53 t/s
629.26.863.997 I slot update_slots: id  0 | task 2045 | accepted  0/ 4 draft tokens
629.26.968.266 I slot update_slots: id  0 | task 2045 | accepted  3/ 4 draft tokens
629.27.073.847 I slot update_slots: id  0 | task 2045 | accepted  3/ 4 draft tokens
629.27.178.694 I slot update_slots: id  0 | task 2045 | accepted  3/ 4 draft tokens
629.27.296.536 I slot update_slots: id  0 | task 2045 | accepted  0/ 4 draft tokens
629.27.413.468 I reasoning-budget: deactivated (natural end)
629.27.414.263 I slot update_slots: id  0 | task 2045 | accepted  4/ 4 draft tokens
629.27.516.391 I slot update_slots: id  0 | task 2045 | accepted  0/ 4 draft tokens
629.27.516.497 W slot process_toke: id  0 | task 2045 | raw tool marker observed while lazy grammar is enabled; keeping DFlash governed by active grammar boundary in_reasoning=0 n_decoded=197 reasoning_tokens=195 visible_tokens=2
629.27.618.565 I slot update_slots: id  0 | task 2045 | accepted  0/ 4 draft tokens
629.29.788.390 I slot print_timing: id  0 | task 2045 | n_decoded =    231, tg =  25.02 t/s
629.32.847.982 I slot print_timing: id  0 | task 2045 | n_decoded =    278, tg =  22.62 t/s
629.35.850.253 I slot print_timing: id  0 | task 2045 | n_decoded =    324, tg =  21.18 t/s
629.38.858.690 I slot print_timing: id  0 | task 2045 | n_decoded =    370, tg =  20.22 t/s
629.41.886.633 I slot print_timing: id  0 | task 2045 | n_decoded =    416, tg =  19.50 t/s
629.44.909.799 I slot print_timing: id  0 | task 2045 | n_decoded =    462, tg =  18.97 t/s
629.47.920.540 I slot print_timing: id  0 | task 2045 | n_decoded =    508, tg =  18.56 t/s
629.50.953.961 I slot print_timing: id  0 | task 2045 | n_decoded =    554, tg =  18.22 t/s
629.54.010.802 I slot print_timing: id  0 | task 2045 | n_decoded =    600, tg =  17.93 t/s
629.57.042.537 I slot print_timing: id  0 | task 2045 | n_decoded =    646, tg =  17.71 t/s
630.00.064.647 I slot print_timing: id  0 | task 2045 | n_decoded =    690, tg =  17.46 t/s
630.03.096.848 I slot print_timing: id  0 | task 2045 | n_decoded =    730, tg =  17.16 t/s
630.06.110.195 I slot print_timing: id  0 | task 2045 | n_decoded =    770, tg =  16.90 t/s
630.09.120.615 I slot print_timing: id  0 | task 2045 | n_decoded =    810, tg =  16.68 t/s
630.12.146.633 I slot print_timing: id  0 | task 2045 | n_decoded =    850, tg =  16.48 t/s
630.15.165.119 I slot print_timing: id  0 | task 2045 | n_decoded =    890, tg =  16.30 t/s
630.18.188.916 I slot print_timing: id  0 | task 2045 | n_decoded =    930, tg =  16.14 t/s
630.21.232.674 I slot print_timing: id  0 | task 2045 | n_decoded =    970, tg =  15.99 t/s
630.24.273.563 I slot print_timing: id  0 | task 2045 | n_decoded =   1010, tg =  15.85 t/s
630.27.286.564 I slot print_timing: id  0 | task 2045 | n_decoded =   1050, tg =  15.73 t/s
630.30.316.985 I slot print_timing: id  0 | task 2045 | n_decoded =   1090, tg =  15.62 t/s
630.33.352.460 I slot print_timing: id  0 | task 2045 | n_decoded =   1130, tg =  15.52 t/s
630.36.387.337 I slot print_timing: id  0 | task 2045 | n_decoded =   1170, tg =  15.43 t/s
630.39.416.057 I slot print_timing: id  0 | task 2045 | n_decoded =   1210, tg =  15.34 t/s
630.42.458.631 I slot print_timing: id  0 | task 2045 | n_decoded =   1250, tg =  15.26 t/s
630.45.498.988 I slot print_timing: id  0 | task 2045 | n_decoded =   1290, tg =  15.19 t/s
630.48.547.724 I slot print_timing: id  0 | task 2045 | n_decoded =   1330, tg =  15.12 t/s
630.51.580.296 I slot print_timing: id  0 | task 2045 | n_decoded =   1370, tg =  15.05 t/s
630.54.631.960 I slot print_timing: id  0 | task 2045 | n_decoded =   1410, tg =  14.99 t/s
630.57.666.954 I slot print_timing: id  0 | task 2045 | n_decoded =   1450, tg =  14.93 t/s
631.00.723.064 I slot print_timing: id  0 | task 2045 | n_decoded =   1490, tg =  14.88 t/s
631.03.773.104 I slot print_timing: id  0 | task 2045 | n_decoded =   1530, tg =  14.82 t/s
631.06.829.912 I slot print_timing: id  0 | task 2045 | n_decoded =   1570, tg =  14.77 t/s
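The tg value in the print_timing lines appears to be a running average over the whole task, so it understates how slow generation is after the tool marker. A minimal Python sketch for estimating the per-interval rate and the draft acceptance ratio from the log above follows. It assumes the timestamp layout is minutes.seconds.milliseconds.microseconds (inferred from how the fields roll over, not confirmed by the project), and the helper names are hypothetical.

```python
import re

# Assumption: timestamps are minutes.seconds.milliseconds.microseconds.
TS = re.compile(r"^(\d+)\.(\d+)\.(\d+)\.(\d+)\s")
TIMING = re.compile(r"n_decoded\s*=\s*(\d+),\s*tg\s*=\s*([\d.]+)\s*t/s")
ACCEPT = re.compile(r"accepted\s+(\d+)/\s*(\d+)\s+draft tokens")


def to_seconds(line):
    """Convert the leading timestamp of a log line to seconds, or None."""
    m = TS.match(line)
    if m is None:
        return None
    mins, secs, ms, us = (int(g) for g in m.groups())
    return mins * 60 + secs + ms / 1e3 + us / 1e6


def interval_rates(lines):
    """Tokens per second between consecutive print_timing lines."""
    points = []
    for line in lines:
        t = to_seconds(line)
        m = TIMING.search(line)
        if t is not None and m:
            points.append((t, int(m.group(1))))
    return [
        (n1 - n0) / (t1 - t0)
        for (t0, n0), (t1, n1) in zip(points, points[1:])
        if t1 > t0
    ]


def acceptance(lines):
    """Sum accepted and drafted token counts over all update_slots lines."""
    accepted = drafted = 0
    for line in lines:
        m = ACCEPT.search(line)
        if m:
            accepted += int(m.group(1))
            drafted += int(m.group(2))
    return accepted, drafted


# Lines copied from the log in this issue.
sample = [
    "629.26.863.997 I slot update_slots: id  0 | task 2045 | accepted  0/ 4 draft tokens",
    "629.26.968.266 I slot update_slots: id  0 | task 2045 | accepted  3/ 4 draft tokens",
    "629.27.073.847 I slot update_slots: id  0 | task 2045 | accepted  3/ 4 draft tokens",
    "629.29.788.390 I slot print_timing: id  0 | task 2045 | n_decoded =    231, tg =  25.02 t/s",
    "629.32.847.982 I slot print_timing: id  0 | task 2045 | n_decoded =    278, tg =  22.62 t/s",
    "629.35.850.253 I slot print_timing: id  0 | task 2045 | n_decoded =    324, tg =  21.18 t/s",
]

rates = interval_rates(sample)
acc, total = acceptance(sample)
print("interval t/s:", [round(r, 2) for r in rates])
print("draft acceptance:", acc, "/", total)
```

Under that timestamp assumption, the first intervals after the warning work out to roughly 15 t/s, and the later ones (40 tokens per roughly 3 s) to about 13 t/s, which is consistent with the figures in the problem description. Feeding the full server log into interval_rates would show whether the rate keeps degrading or settles.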
