CodeFuse-DeBench is an automated benchmark framework for evaluating decompiled binaries across three stages:
- Step 1: Readability
- Step 2: Syntactic Correctness / Recompilation
- Step 3: Semantic Fidelity
This repository retains the core implementation, benchmark sources, build artifacts, decompiler outputs, and three main results trees for reproducibility and secondary analysis. Paper sources, internal maintenance tools, handoff notes, and non-essential analysis scripts have been removed.
Operational identifiers such as `bindebench/`, `binbench-*.yaml`, and `BINBENCH_*` are intentionally retained in paths, commands, and environment variables for backward compatibility.
The table below provides a snapshot of the five evaluated decompilers across the three dimensions.
| Decompiler | Readability | Recompilability | Functionality |
|---|---|---|---|
| IDA | 5.73 (#1) | 64.8% (#2) | 29.7% (#1) |
| Ghidra | 5.50 (#2) | 65.5% (#1) | 22.8% (#2) |
| BinaryAI | 4.99 (#3) | 47.2% (#4) | 14.8% (#3) |
| RetDec | 4.51 (#4) | 50.2% (#3) | 1.5% (#5) |
| Angr | 4.36 (#5) | 38.0% (#5) | 9.2% (#4) |
Readability = mean of the L1-L5 overview scores; Recompilability = Full Success (FS) rate; Functionality = program-level Exact Stdout + Partial rate.
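As an illustration, the three aggregate metrics can be recomputed from per-task records along these lines. The record fields (`readability`, `full_success`, `stdout_match`) are illustrative assumptions, not the framework's actual result schema:

```python
# Sketch: recompute the table's three aggregate metrics from hypothetical
# per-task records. Field names are assumptions for illustration only.

def aggregate(records):
    n = len(records)
    # Readability: mean of the per-task L1-L5 overview scores.
    readability = sum(r["readability"] for r in records) / n
    # Recompilability: fraction of tasks with a Full Success (FS) recompile.
    recompilability = sum(r["full_success"] for r in records) / n
    # Functionality: fraction with an Exact Stdout or Partial match.
    functionality = sum(r["stdout_match"] in ("exact", "partial") for r in records) / n
    return readability, recompilability, functionality

records = [
    {"readability": 5.8, "full_success": True,  "stdout_match": "exact"},
    {"readability": 5.2, "full_success": False, "stdout_match": "partial"},
    {"readability": 4.6, "full_success": True,  "stdout_match": "none"},
]
print(aggregate(records))
```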
```
bindebench/
├── src/                    # benchmark source corpus
├── build/                  # original binaries and successful_builds.json
├── decompiled/             # outputs from each decompiler
├── evaluator/              # Step1 / Step2 / Step3 implementations
├── scripts/                # build, single-task, batch, and support scripts
├── config/                 # LLM configuration templates with env-based keys
├── prompt/                 # Step1 prompt assets
├── results_glm_v4_full/    # GLM results tree
├── results_qwen_v4_full/   # Qwen results tree
├── results_minimax_v4_full/ # MiniMax results tree
├── docs/                   # core documentation
├── binbench-*.yaml         # Lima/VM configuration
└── README.md
```
For a more detailed structure overview, see docs/PROJECT_STRUCTURE.md. For script-specific guidance, see scripts/README.md.
The repository does not contain real API keys. Export the required environment variables, then adjust config/llm_config.json and config/llm_key_inventory.json as needed.
Example:
```bash
export BINBENCH_GLM_API_KEY=...
export BINBENCH_DASHSCOPE_API_KEY=...
export BINBENCH_MINIMAX_API_KEY=...
```

See docs/LLM_CONFIGURATION_GUIDE.md for details.
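Before launching a run it can help to fail fast on absent keys. This preflight check is a sketch (not a repository script) that only reuses the variable names above:

```python
import os

# Keys the three profiles above expect; names mirror the exports.
REQUIRED = [
    "BINBENCH_GLM_API_KEY",
    "BINBENCH_DASHSCOPE_API_KEY",
    "BINBENCH_MINIMAX_API_KEY",
]

def missing_keys(required, env=os.environ):
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]

# Report anything still unset before kicking off a run.
print("missing:", missing_keys(REQUIRED))
```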
```bash
podman build -t cross-compiler -f scripts/Dockerfile .
podman run --platform linux/amd64 --rm -v "$(pwd):/work" cross-compiler \
  python3 scripts/build_in_docker.py
```

```bash
python3 scripts/run_single_task.py \
  src/7.c \
  decompiled/retdec_out/arm32/7/7_gcc_O2_no_g.c \
  --arch arm32 \
  --original-bin build/arm32/7/7_gcc_O2_no_g \
  --llm-profile qwen3.5-plus \
  --results-dir runs/qwen_demo
```

This command runs Step1 on the host, then enters the matching Lima instance for Step2 and Step3.
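To sweep several binaries through the same pipeline, the single-task CLI can be driven from a small wrapper. The flags mirror the example invocation; the wrapper itself is a sketch, not part of the repository:

```python
import subprocess

def build_cmd(src, decompiled, arch, original_bin, profile, results_dir):
    """Assemble a run_single_task.py invocation (flags as documented above)."""
    return [
        "python3", "scripts/run_single_task.py",
        src, decompiled,
        "--arch", arch,
        "--original-bin", original_bin,
        "--llm-profile", profile,
        "--results-dir", results_dir,
    ]

def run_task(*args):
    """Run one task and return its exit code; Step2/3 happen inside the guest."""
    return subprocess.run(build_cmd(*args), check=False).returncode

cmd = build_cmd(
    "src/7.c",
    "decompiled/retdec_out/arm32/7/7_gcc_O2_no_g.c",
    "arm32",
    "build/arm32/7/7_gcc_O2_no_g",
    "qwen3.5-plus",
    "runs/qwen_demo",
)
print(" ".join(cmd))
```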
The recommended batch entrypoint is the launcher:
```bash
python3 scripts/launch_auto_eval.py \
  --llm-profile glm_official \
  --arch arm64 \
  --results-dir results_glm_v4_full \
  --retry
```

Call chain:

```
launch_auto_eval.py
  -> auto_eval.py
     -> host orchestration helpers (scripts/pipeline_host.py)
     -> host Step1 (evaluator/readability/eval_readability.py)
     -> guest Step2/3 (scripts/run_pipeline_in_docker.py)
```
```bash
python3 scripts/auto_eval.py \
  --arch arm32 \
  --src 7 \
  --bin-name 7_gcc_O2_no_g \
  --decompiler retdec \
  --llm-profile qwen3.5-plus \
  --results-dir runs/qwen_batch
```

```bash
python3 scripts/auto_eval.py \
  --arch arm32 \
  --src 7 \
  --bin-name 7_gcc_O2_no_g \
  --decompiler retdec \
  --llm-profile minimax \
  --results-dir runs/minimax_batch
```

Notes:
- `scripts/pipeline_host.py` is an internal shared helper rather than a user-facing CLI. Both `run_single_task.py` and `auto_eval.py` use it for host Step1 execution, guest preflight, and host-to-guest environment forwarding.
- Filtered `auto_eval.py` invocations are useful for validating a single task before widening the scope. For larger runs, prefer `launch_auto_eval.py` or `auto_eval.py` with `--retry`.
- `results_{llm}_v4_full/` is the main results tree. Step1, Step2, and Step3 share the same per-task directory.
- Historical Step1-only outputs have already been merged into the `readability/` subdirectories of the three main results trees.
- The three results trees are large, and `decompiled/` also contains full decompiler outputs. This repository is intended for reproducibility and auditability rather than as a lightweight demo.
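Given the size of the results trees, a quick audit can count per-task artifacts without loading everything. The layout assumed here (one `summary.json` per completed task) is illustrative, not the framework's documented schema:

```python
import json
from pathlib import Path

def count_tasks(results_dir):
    """Count parseable per-task summary files under a results tree.

    Assumes each completed task leaves a summary.json below the tree --
    an illustrative layout, not the documented one.
    """
    done = 0
    for path in Path(results_dir).rglob("summary.json"):
        try:
            json.loads(path.read_text())
            done += 1
        except (OSError, json.JSONDecodeError):
            pass  # skip unreadable or partially written artifacts
    return done

# Example: compare completion counts across the three main trees.
for tree in ("results_glm_v4_full", "results_qwen_v4_full", "results_minimax_v4_full"):
    if Path(tree).exists():
        print(tree, count_tasks(tree))
```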
- docs/PROJECT_STRUCTURE.md: repository layout and results tree structure
- docs/PIPELINE_USAGE.md: single-task pipeline usage
- docs/AUTO_EVAL_IMPLEMENTATION.md: batch orchestration entrypoint
- docs/LLM_CONFIGURATION_GUIDE.md: profile templates and key injection
- docs/READABILITY_EVALUATION.md: Step1 metrics and outputs
- docs/STEP2_METRICS.md: Step2 metrics and outputs
- docs/SEMANTIC_EVALUATION_DETAILS.md: Step3 implementation details and artifacts
CodeFuse-DeBench is licensed under the Apache License 2.0.