CES Leaderboard

This leaderboard evaluates the ability of LLMs to simulate program execution.

🙏 Please cite our paper if you use CES in your work 🙏

@article{liu2025assessing,
  title={Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models},
  author={Liu, Changshu and Chen, Yang and Jabbarvand, Reyhaneh},
  journal={arXiv preprint arXiv:2510.15079},
  year={2025}
}

Reasoning Coherency

# Ranking	LLM	Coherent Reasoning & Correct Output (%)	Coherent Reasoning & Incorrect Output (%)	Incoherent Reasoning & Correct Output (%)	Incoherent Reasoning & Incorrect Output (%)

Reasoning Consistency

# Ranking	LLM	Strong Reasoning	Weak Reasoning	Random Reasoning