CodeFuse-DeBench

CodeFuse-DeBench is an automated benchmark framework for evaluating decompiler outputs across three stages:

  1. Step 1: Readability
  2. Step 2: Syntactic Correctness / Recompilation
  3. Step 3: Semantic Fidelity

This repository retains the core implementation, benchmark sources, build artifacts, decompiler outputs, and the three main results trees for reproducibility and secondary analysis. Paper sources, internal maintenance tools, handoff notes, and non-essential analysis scripts have been removed.

Operational identifiers such as bindebench/, binbench-*.yaml, and BINBENCH_* are intentionally retained in paths, commands, and environment variables for backward compatibility.

Benchmark Snapshot

The table below provides a snapshot of the five evaluated decompilers across the three dimensions.

Decompiler  Readability   Recompilability   Functionality
IDA         5.73 (#1)     64.8% (#2)        29.7% (#1)
Ghidra      5.50 (#2)     65.5% (#1)        22.8% (#2)
BinaryAI    4.99 (#3)     47.2% (#4)        14.8% (#3)
RetDec      4.51 (#4)     50.2% (#3)        1.5% (#5)
Angr        4.36 (#5)     38.0% (#5)        9.2% (#4)

Readability = mean of the L1-L5 overview scores; Recompilability = Full Success (FS) rate; Functionality = program-level Exact Stdout + Partial rate.
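
As a rough illustration of how the three columns are aggregated, the sketch below computes them from hypothetical per-task records; the field names overview_scores, step2_status, and step3_outcome are assumptions for illustration, not the repository's actual schema.

# Hypothetical aggregation sketch; field names are illustrative only.
def summarize(tasks):
    """Return (mean readability, recompilability %, functionality %)."""
    n = len(tasks)
    # Readability: mean of each task's L1-L5 overview scores, averaged over tasks.
    readability = sum(sum(t["overview_scores"]) / len(t["overview_scores"]) for t in tasks) / n
    # Recompilability: share of tasks whose Step2 ended in Full Success (FS).
    recompilability = 100.0 * sum(t["step2_status"] == "FS" for t in tasks) / n
    # Functionality: share of tasks with Exact Stdout or Partial Step3 outcomes.
    functionality = 100.0 * sum(t["step3_outcome"] in ("exact_stdout", "partial") for t in tasks) / n
    return readability, recompilability, functionality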

Repository Overview

bindebench/
├── src/                     # benchmark source corpus
├── build/                   # original binaries and successful_builds.json
├── decompiled/              # outputs from each decompiler
├── evaluator/               # Step1 / Step2 / Step3 implementations
├── scripts/                 # build, single-task, batch, and support scripts
├── config/                  # LLM configuration templates with env-based keys
├── prompt/                  # Step1 prompt assets
├── results_glm_v4_full/     # GLM results tree
├── results_qwen_v4_full/    # Qwen results tree
├── results_minimax_v4_full/ # MiniMax results tree
├── docs/                    # core documentation
├── binbench-*.yaml          # Lima/VM configuration
└── README.md

For a more detailed structure overview, see docs/PROJECT_STRUCTURE.md. For script-specific guidance, see scripts/README.md.

Quick Start

1. Configure LLM Credentials

The repository does not contain real API keys. Export the required environment variables, then adjust config/llm_config.json and config/llm_key_inventory.json as needed.

Example:

export BINBENCH_GLM_API_KEY=...
export BINBENCH_DASHSCOPE_API_KEY=...
export BINBENCH_MINIMAX_API_KEY=...

See docs/LLM_CONFIGURATION_GUIDE.md for details.
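
Before launching a run, it can be worth verifying that the keys are actually exported; this minimal check only assumes the three variable names from the example above.

import os
import sys

# Variable names taken from the export example above.
REQUIRED = ["BINBENCH_GLM_API_KEY", "BINBENCH_DASHSCOPE_API_KEY", "BINBENCH_MINIMAX_API_KEY"]
missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    sys.exit("Missing LLM credentials: " + ", ".join(missing))
print("All BINBENCH_* API keys are set.")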

2. Build the Original Binaries

podman build -t cross-compiler -f scripts/Dockerfile .
podman run --platform linux/amd64 --rm -v "$(pwd):/work" cross-compiler \
  python3 scripts/build_in_docker.py
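
The build step writes build/successful_builds.json alongside the binaries. Its exact schema is not documented here, so the sketch below only counts entries as a quick sanity check after the container run.

import json
from pathlib import Path

# successful_builds.json is produced by the build step; its structure is assumed
# to be either a list or a mapping of built binaries, so only the size is reported.
manifest = json.loads(Path("build/successful_builds.json").read_text())
count = len(manifest) if isinstance(manifest, (list, dict)) else 1
print(f"{count} entries recorded in build/successful_builds.json")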

3. Run the Full Pipeline for a Single Task

python3 scripts/run_single_task.py \
  src/7.c \
  decompiled/retdec_out/arm32/7/7_gcc_O2_no_g.c \
  --arch arm32 \
  --original-bin build/arm32/7/7_gcc_O2_no_g \
  --llm-profile qwen3.5-plus \
  --results-dir runs/qwen_demo

This command runs Step1 on the host, then enters the matching Lima instance for Step2 and Step3.
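
The same flags can be scripted to cover several tasks in sequence. The sketch below reuses every flag from the example and assumes the decompiled/retdec_out/arm32/ layout shown above; the list of task ids is illustrative.

import subprocess

# Illustrative loop over task ids, reusing the retdec/arm32 example verbatim.
for task_id in ["7"]:  # extend with further source ids as needed
    subprocess.run([
        "python3", "scripts/run_single_task.py",
        f"src/{task_id}.c",
        f"decompiled/retdec_out/arm32/{task_id}/{task_id}_gcc_O2_no_g.c",
        "--arch", "arm32",
        "--original-bin", f"build/arm32/{task_id}/{task_id}_gcc_O2_no_g",
        "--llm-profile", "qwen3.5-plus",
        "--results-dir", "runs/qwen_demo",
    ], check=True)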

4. Batch Evaluation

The recommended batch entrypoint is the launcher:

python3 scripts/launch_auto_eval.py \
  --llm-profile glm_official \
  --arch arm64 \
  --results-dir results_glm_v4_full \
  --retry

Call chain:

launch_auto_eval.py
  -> auto_eval.py
    -> host orchestration helpers (scripts/pipeline_host.py)
      -> host Step1 (evaluator/readability/eval_readability.py)
    -> guest Step2/3 (scripts/run_pipeline_in_docker.py)
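
When several architectures need the same profile, the launcher can be driven in a short loop. The sketch below only reuses the flags shown above; arm64 comes from the example and arm32 from the corpus layout, any other architecture names would be assumptions.

import subprocess

# Illustrative sweep over architectures with the batch launcher.
for arch in ["arm64", "arm32"]:
    subprocess.run([
        "python3", "scripts/launch_auto_eval.py",
        "--llm-profile", "glm_official",
        "--arch", arch,
        "--results-dir", "results_glm_v4_full",
        "--retry",
    ], check=True)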

5. Example Batch Commands

python3 scripts/auto_eval.py \
  --arch arm32 \
  --src 7 \
  --bin-name 7_gcc_O2_no_g \
  --decompiler retdec \
  --llm-profile qwen3.5-plus \
  --results-dir runs/qwen_batch

python3 scripts/auto_eval.py \
  --arch arm32 \
  --src 7 \
  --bin-name 7_gcc_O2_no_g \
  --decompiler retdec \
  --llm-profile minimax \
  --results-dir runs/minimax_batch
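
The two invocations above differ only in --llm-profile and --results-dir, so the same filtered task can be swept across profiles with a small wrapper; the profile names and results directories below are exactly the ones from the example commands.

import subprocess

# Sweep one filtered task over the two LLM profiles from the example commands.
for profile, results_dir in [("qwen3.5-plus", "runs/qwen_batch"), ("minimax", "runs/minimax_batch")]:
    subprocess.run([
        "python3", "scripts/auto_eval.py",
        "--arch", "arm32",
        "--src", "7",
        "--bin-name", "7_gcc_O2_no_g",
        "--decompiler", "retdec",
        "--llm-profile", profile,
        "--results-dir", results_dir,
    ], check=True)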

Notes:

  • scripts/pipeline_host.py is an internal shared helper rather than a user-facing CLI. Both run_single_task.py and auto_eval.py use it for host Step1 execution, guest preflight, and host-to-guest environment forwarding.
  • Filtered auto_eval.py invocations are useful for validating a single task before widening the scope. For larger runs, prefer launch_auto_eval.py or auto_eval.py with --retry.

Results Layout

  • results_{llm}_v4_full/ is the main results tree. Step1, Step2, and Step3 share the same per-task directory.
  • Historical Step1-only outputs have already been merged into the readability/ subdirectories of the three main results trees.
  • The three results trees are large, and decompiled/ also contains full decompiler outputs. This repository is intended for reproducibility and auditability rather than as a lightweight demo.
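
As a rough audit of a results tree, the sketch below counts per-task directories that contain a readability/ subdirectory. Only readability/ is named in this README; the rest of the per-step layout is not assumed.

import sys
from pathlib import Path

# Count task directories that carry Step1 (readability/) output. The tree layout
# beyond the readability/ subdirectory is intentionally not assumed here.
root = Path(sys.argv[1] if len(sys.argv) > 1 else "results_glm_v4_full")
with_readability = [d for d in root.rglob("readability") if d.is_dir()]
print(f"{len(with_readability)} task directories with readability/ output under {root}")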

Documentation Index

  • docs/PROJECT_STRUCTURE.md: detailed repository structure
  • docs/LLM_CONFIGURATION_GUIDE.md: LLM credential and profile configuration
  • scripts/README.md: script-specific guidance

License

CodeFuse-DeBench is licensed under the Apache License 2.0.
