CodeFuse-DeBench is an automated benchmark framework for evaluating decompiled binaries across three stages:
- Step 1: Readability
- Step 2: Syntactic Correctness / Recompilation
- Step 3: Semantic Fidelity
This repository retains the core implementation, benchmark sources, build artifacts, decompiler outputs, and three main results trees for reproducibility and secondary analysis. Paper sources, internal maintenance tools, handoff notes, and non-essential analysis scripts have been removed.
Operational identifiers such as `bindebench/`, `binbench-*.yaml`, and `BINBENCH_*` are intentionally retained in paths, commands, and environment variables for backward compatibility.
The table below provides a snapshot of the five evaluated decompilers across the three dimensions.
| Decompiler | Readability | Recompilability | Functionality |
|---|---|---|---|
| IDA | 5.73 (#1) | 64.8% (#2) | 29.7% (#1) |
| Ghidra | 5.50 (#2) | 65.5% (#1) | 22.8% (#2) |
| BinaryAI | 4.99 (#3) | 47.2% (#4) | 14.8% (#3) |
| RetDec | 4.51 (#4) | 50.2% (#3) | 1.5% (#5) |
| Angr | 4.36 (#5) | 38.0% (#5) | 9.2% (#4) |
Readability = mean of the L1-L5 overview scores; Recompilability = Full Success (FS) rate; Functionality = program-level Exact Stdout + Partial rate.
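As an illustration, the three aggregate metrics can be recomputed from per-task records along these lines. The record fields (`readability`, `full_success`, `stdout_match`) are illustrative assumptions, not the framework's actual result schema:

```python
# Sketch: recompute the table's three aggregate metrics from hypothetical
# per-task records. Field names are assumptions for illustration only.

def aggregate(records):
    n = len(records)
    # Readability: mean of the per-task L1-L5 overview scores.
    readability = sum(r["readability"] for r in records) / n
    # Recompilability: fraction of tasks with a Full Success (FS) recompile.
    recompilability = sum(r["full_success"] for r in records) / n
    # Functionality: fraction with an Exact Stdout or Partial match.
    functionality = sum(r["stdout_match"] in ("exact", "partial") for r in records) / n
    return readability, recompilability, functionality

records = [
    {"readability": 5.8, "full_success": True,  "stdout_match": "exact"},
    {"readability": 5.2, "full_success": False, "stdout_match": "partial"},
    {"readability": 4.6, "full_success": True,  "stdout_match": "none"},
]
print(aggregate(records))
```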
```
bindebench/
├── src/                    # benchmark source corpus
├── build/                  # original binaries and successful_builds.json
├── decompiled/             # outputs from each decompiler
├── evaluator/              # Step1 / Step2 / Step3 implementations
├── scripts/                # build, single-task, batch, and support scripts
├── config/                 # LLM configuration templates with env-based keys
├── prompt/                 # Step1 prompt assets
├── results_glm_v4_full/    # GLM results tree
├── results_qwen_v4_full/   # Qwen results tree
├── results_minimax_v4_full/ # MiniMax results tree
├── docs/                   # core documentation
├── binbench-*.yaml         # Lima/VM configuration
└── README.md
```
For a more detailed structure overview, see docs/PROJECT_STRUCTURE.md. For script-specific guidance, see scripts/README.md.
The repository does not contain real API keys. Export the required environment variables, then adjust config/llm_config.json and config/llm_key_inventory.json as needed.
Example:
```bash
export BINBENCH_GLM_API_KEY=...
export BINBENCH_DASHSCOPE_API_KEY=...
export BINBENCH_MINIMAX_API_KEY=...
```

See docs/LLM_CONFIGURATION_GUIDE.md for details.
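Before launching a run it can help to fail fast on absent keys. This preflight check is a sketch (not a repository script) that only reuses the variable names above:

```python
import os

# Keys the three profiles above expect; names mirror the exports.
REQUIRED = [
    "BINBENCH_GLM_API_KEY",
    "BINBENCH_DASHSCOPE_API_KEY",
    "BINBENCH_MINIMAX_API_KEY",
]

def missing_keys(required, env=os.environ):
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]

# Report anything still unset before kicking off a run.
print("missing:", missing_keys(REQUIRED))
```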
```bash
podman build -t cross-compiler -f scripts/Dockerfile .
podman run --platform linux/amd64 --rm -v "$(pwd):/work" cross-compiler \
  python3 scripts/build_in_docker.py
```

```bash
python3 scripts/run_single_task.py \
  src/7.c \
  decompiled/retdec_out/arm32/7/7_gcc_O2_no_g.c \
  --arch arm32 \
  --original-bin build/arm32/7/7_gcc_O2_no_g \
  --llm-profile qwen3.5-plus \
  --results-dir runs/qwen_demo
```

This command runs Step1 on the host, then enters the matching Lima instance for Step2 and Step3.
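To sweep several binaries through the same pipeline, the single-task CLI can be driven from a small wrapper. The flags mirror the example invocation; the wrapper itself is a sketch, not part of the repository:

```python
import subprocess

def build_cmd(src, decompiled, arch, original_bin, profile, results_dir):
    """Assemble a run_single_task.py invocation (flags as documented above)."""
    return [
        "python3", "scripts/run_single_task.py",
        src, decompiled,
        "--arch", arch,
        "--original-bin", original_bin,
        "--llm-profile", profile,
        "--results-dir", results_dir,
    ]

def run_task(*args):
    """Run one task and return its exit code; Step2/3 happen inside the guest."""
    return subprocess.run(build_cmd(*args), check=False).returncode

cmd = build_cmd(
    "src/7.c",
    "decompiled/retdec_out/arm32/7/7_gcc_O2_no_g.c",
    "arm32",
    "build/arm32/7/7_gcc_O2_no_g",
    "qwen3.5-plus",
    "runs/qwen_demo",
)
print(" ".join(cmd))
```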
The recommended batch entrypoint is the launcher:
```bash
python3 scripts/launch_auto_eval.py \
  --llm-profile glm_official \
  --arch arm64 \
  --results-dir results_glm_v4_full \
  --retry
```

Call chain:

```
launch_auto_eval.py
  -> auto_eval.py
     -> host orchestration helpers (scripts/pipeline_host.py)
     -> host Step1 (evaluator/readability/eval_readability.py)
     -> guest Step2/3 (scripts/run_pipeline_in_docker.py)
```
```bash
python3 scripts/auto_eval.py \
  --arch arm32 \
  --src 7 \
  --bin-name 7_gcc_O2_no_g \
  --decompiler retdec \
  --llm-profile qwen3.5-plus \
  --results-dir runs/qwen_batch
```

```bash
python3 scripts/auto_eval.py \
  --arch arm32 \
  --src 7 \
  --bin-name 7_gcc_O2_no_g \
  --decompiler retdec \
  --llm-profile minimax \
  --results-dir runs/minimax_batch
```

Notes:
- `scripts/pipeline_host.py` is an internal shared helper rather than a user-facing CLI. Both `run_single_task.py` and `auto_eval.py` use it for host Step1 execution, guest preflight, and host-to-guest environment forwarding.
- Filtered `auto_eval.py` invocations are useful for validating a single task before widening the scope. For larger runs, prefer `launch_auto_eval.py` or `auto_eval.py` with `--retry`.
- `results_{llm}_v4_full/` is the main results tree. Step1, Step2, and Step3 share the same per-task directory.
- Historical Step1-only outputs have already been merged into the `readability/` subdirectories of the three main results trees.
- The three results trees are large, and `decompiled/` also contains full decompiler outputs. This repository is intended for reproducibility and auditability rather than as a lightweight demo.
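Given the size of the results trees, a quick audit can count per-task artifacts without loading everything. The layout assumed here (one `summary.json` per completed task) is illustrative, not the framework's documented schema:

```python
import json
from pathlib import Path

def count_tasks(results_dir):
    """Count parseable per-task summary files under a results tree.

    Assumes each completed task leaves a summary.json below the tree --
    an illustrative layout, not the documented one.
    """
    done = 0
    for path in Path(results_dir).rglob("summary.json"):
        try:
            json.loads(path.read_text())
            done += 1
        except (OSError, json.JSONDecodeError):
            pass  # skip unreadable or partially written artifacts
    return done

# Example: compare completion counts across the three main trees.
for tree in ("results_glm_v4_full", "results_qwen_v4_full", "results_minimax_v4_full"):
    if Path(tree).exists():
        print(tree, count_tasks(tree))
```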
- docs/PROJECT_STRUCTURE.md: repository layout and results tree structure
- docs/PIPELINE_USAGE.md: single-task pipeline usage
- docs/AUTO_EVAL_IMPLEMENTATION.md: batch orchestration entrypoint
- docs/LLM_CONFIGURATION_GUIDE.md: profile templates and key injection
- docs/READABILITY_EVALUATION.md: Step1 metrics and outputs
- docs/STEP2_METRICS.md: Step2 metrics and outputs
- docs/SEMANTIC_EVALUATION_DETAILS.md: Step3 implementation details and artifacts
CodeFuse-DeBench is licensed under the Apache License 2.0.