Eval bug: DFlash kills inference speed on non-DFlash inference (MTP, tool calls) #58

### Name and Version

beellama v0.3.2 (preview)

### Operating systems

Windows

### GGML backends

CUDA

### Hardware

RTX 5060 Ti 16 GB + Ryzen 7 7500x

### Models

MTP Qwen3.6-27B-UD-Q2_K_XL.gguf

### Problem description & steps to reproduce

When running an MTP model with `--spec-type draft-mtp --spec-draft-n-max 4`, the DFlash settings take control once a raw tool marker is observed (see the warning in the log below). Generation speed then drops from 47 t/s (code generation) or 32 t/s (reasoning) to 25 t/s and keeps falling, to 13 t/s and lower. The model fits in VRAM with room to spare.

Prompt:

`Write a single-file HTML/JS/CSS Windows-style desktop application with six apps, a start menu, a task manager, and a console. No external dependencies. The whole desktop and each application must be ui friendly ,beautiful, fully functional, meaningful and useful, including all tabls and settings. Output to desktop.html`

Run the prompt in Kilo Code, which forces the model to call the "Write" tool instead of outputting the HTML directly.

### First Bad Commit

_No response_

### Relevant log output

<details>
<summary>Logs</summary>

```console
629.26.759.978 I slot print_timing: id  0 | task 2045 | n_decoded =    177, tg =  28.53 t/s
629.26.863.997 I slot update_slots: id  0 | task 2045 | accepted  0/ 4 draft tokens
629.26.968.266 I slot update_slots: id  0 | task 2045 | accepted  3/ 4 draft tokens
629.27.073.847 I slot update_slots: id  0 | task 2045 | accepted  3/ 4 draft tokens
629.27.178.694 I slot update_slots: id  0 | task 2045 | accepted  3/ 4 draft tokens
629.27.296.536 I slot update_slots: id  0 | task 2045 | accepted  0/ 4 draft tokens
629.27.413.468 I reasoning-budget: deactivated (natural end)
629.27.414.263 I slot update_slots: id  0 | task 2045 | accepted  4/ 4 draft tokens
629.27.516.391 I slot update_slots: id  0 | task 2045 | accepted  0/ 4 draft tokens
629.27.516.497 W slot process_toke: id  0 | task 2045 | raw tool marker observed while lazy grammar is enabled; keeping DFlash governed by active grammar boundary in_reasoning=0 n_decoded=197 reasoning_tokens=195 visible_tokens=2
629.27.618.565 I slot update_slots: id  0 | task 2045 | accepted  0/ 4 draft tokens
629.29.788.390 I slot print_timing: id  0 | task 2045 | n_decoded =    231, tg =  25.02 t/s
629.32.847.982 I slot print_timing: id  0 | task 2045 | n_decoded =    278, tg =  22.62 t/s
629.35.850.253 I slot print_timing: id  0 | task 2045 | n_decoded =    324, tg =  21.18 t/s
629.38.858.690 I slot print_timing: id  0 | task 2045 | n_decoded =    370, tg =  20.22 t/s
629.41.886.633 I slot print_timing: id  0 | task 2045 | n_decoded =    416, tg =  19.50 t/s
629.44.909.799 I slot print_timing: id  0 | task 2045 | n_decoded =    462, tg =  18.97 t/s
629.47.920.540 I slot print_timing: id  0 | task 2045 | n_decoded =    508, tg =  18.56 t/s
629.50.953.961 I slot print_timing: id  0 | task 2045 | n_decoded =    554, tg =  18.22 t/s
629.54.010.802 I slot print_timing: id  0 | task 2045 | n_decoded =    600, tg =  17.93 t/s
629.57.042.537 I slot print_timing: id  0 | task 2045 | n_decoded =    646, tg =  17.71 t/s
630.00.064.647 I slot print_timing: id  0 | task 2045 | n_decoded =    690, tg =  17.46 t/s
630.03.096.848 I slot print_timing: id  0 | task 2045 | n_decoded =    730, tg =  17.16 t/s
630.06.110.195 I slot print_timing: id  0 | task 2045 | n_decoded =    770, tg =  16.90 t/s
630.09.120.615 I slot print_timing: id  0 | task 2045 | n_decoded =    810, tg =  16.68 t/s
630.12.146.633 I slot print_timing: id  0 | task 2045 | n_decoded =    850, tg =  16.48 t/s
630.15.165.119 I slot print_timing: id  0 | task 2045 | n_decoded =    890, tg =  16.30 t/s
630.18.188.916 I slot print_timing: id  0 | task 2045 | n_decoded =    930, tg =  16.14 t/s
630.21.232.674 I slot print_timing: id  0 | task 2045 | n_decoded =    970, tg =  15.99 t/s
630.24.273.563 I slot print_timing: id  0 | task 2045 | n_decoded =   1010, tg =  15.85 t/s
630.27.286.564 I slot print_timing: id  0 | task 2045 | n_decoded =   1050, tg =  15.73 t/s
630.30.316.985 I slot print_timing: id  0 | task 2045 | n_decoded =   1090, tg =  15.62 t/s
630.33.352.460 I slot print_timing: id  0 | task 2045 | n_decoded =   1130, tg =  15.52 t/s
630.36.387.337 I slot print_timing: id  0 | task 2045 | n_decoded =   1170, tg =  15.43 t/s
630.39.416.057 I slot print_timing: id  0 | task 2045 | n_decoded =   1210, tg =  15.34 t/s
630.42.458.631 I slot print_timing: id  0 | task 2045 | n_decoded =   1250, tg =  15.26 t/s
630.45.498.988 I slot print_timing: id  0 | task 2045 | n_decoded =   1290, tg =  15.19 t/s
630.48.547.724 I slot print_timing: id  0 | task 2045 | n_decoded =   1330, tg =  15.12 t/s
630.51.580.296 I slot print_timing: id  0 | task 2045 | n_decoded =   1370, tg =  15.05 t/s
630.54.631.960 I slot print_timing: id  0 | task 2045 | n_decoded =   1410, tg =  14.99 t/s
630.57.666.954 I slot print_timing: id  0 | task 2045 | n_decoded =   1450, tg =  14.93 t/s
631.00.723.064 I slot print_timing: id  0 | task 2045 | n_decoded =   1490, tg =  14.88 t/s
631.03.773.104 I slot print_timing: id  0 | task 2045 | n_decoded =   1530, tg =  14.82 t/s
631.06.829.912 I slot print_timing: id  0 | task 2045 | n_decoded =   1570, tg =  14.77 t/s
```
</details>
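
The `tg` value in the `print_timing` lines appears to be a running average over the whole task, so it understates how slow generation actually is after the tool marker. Below is a minimal sketch for computing the per-interval rate from consecutive `print_timing` lines. It assumes the timestamp format is `minutes.seconds.millis.micros` (inferred from the `629.57` to `630.00` rollover in the log, not confirmed from the source); `parse_timing` and `interval_rates` are hypothetical helper names, not part of beellama.

```python
import re

# Assumed timestamp layout: minutes.seconds.millis.micros (inferred from the log).
LINE_RE = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)\.(\d+) I slot print_timing: .*?n_decoded =\s*(\d+)"
)


def parse_timing(line):
    """Return (time_in_seconds, n_decoded) for a print_timing line, else None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    minutes, seconds, millis, micros, n_decoded = map(int, m.groups())
    t = minutes * 60 + seconds + millis / 1e3 + micros / 1e6
    return t, n_decoded


def interval_rates(lines):
    """Tokens per second between each pair of consecutive print_timing lines."""
    samples = [s for s in map(parse_timing, lines) if s is not None]
    rates = []
    for (t0, n0), (t1, n1) in zip(samples, samples[1:]):
        if t1 > t0:
            rates.append((n1, (n1 - n0) / (t1 - t0)))
    return rates


# Four consecutive lines copied from the log above.
SAMPLE = """\
629.57.042.537 I slot print_timing: id  0 | task 2045 | n_decoded =    646, tg =  17.71 t/s
630.00.064.647 I slot print_timing: id  0 | task 2045 | n_decoded =    690, tg =  17.46 t/s
630.03.096.848 I slot print_timing: id  0 | task 2045 | n_decoded =    730, tg =  17.16 t/s
630.06.110.195 I slot print_timing: id  0 | task 2045 | n_decoded =    770, tg =  16.90 t/s
""".splitlines()

for n_decoded, rate in interval_rates(SAMPLE):
    print(f"n_decoded={n_decoded:5d}  interval rate = {rate:.2f} t/s")
```

On these four lines the interval rate comes out at roughly 13 to 15 t/s while the logged `tg` still reads about 17 t/s, which is consistent with the 13 t/s figure in the description.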
