CES Leaderboard
This leaderboard evaluates the ability of LLMs to simulate program execution.

πŸ™ Please cite our paper if you are using CES in your work πŸ™

@article{liu2025assessing,
  title={Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models},
  author={Liu, Changshu and Chen, Yang and Jabbarvand, Reyhaneh},
  journal={arXiv preprint arXiv:2510.15079},
  year={2025}
}

Reasoning Coherency

# Ranking | LLM | Coherent Reasoning & Correct Output (%) | Coherent Reasoning & Incorrect Output (%) | Incoherent Reasoning & Correct Output (%) | Incoherent Reasoning & Incorrect Output (%)

Reasoning Consistency

# Ranking | LLM | Strong Reasoning | Weak Reasoning | Random Reasoning