Reasoning in LLMs

Alessandro Gentili
Abstract

This project extends prior work by our colleagues Merola, Sigh, and Dardouri on evaluating LLM performance on planning tasks and on developing a meta-network for reliability prediction. The proposed extensions focus on two directions: (1) exploring emergent reasoning capabilities across a broader range of open-source large language models, including Mistral, LLaMA, and DeepSeek, to assess whether planning abilities generalize across architectures; and (2) enriching the benchmark dataset with new, challenging planning problems of graded complexity. Together, these extensions aim to deepen our understanding of LLM reasoning, improve the reliability of evaluation, and foster reproducible benchmarks for the research community.
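Extension (1) amounts to running the same benchmark of planning prompts through several open-weight models and scoring the returned plans. The following is a minimal sketch of such a harness, assuming the Hugging Face transformers library; the model identifiers, the toy problem, and the check_plan validator are illustrative placeholders, not artifacts of the original project.

```python
# Minimal sketch of the cross-model planning evaluation (extension 1).
# Model IDs, prompts, and the validator are illustrative assumptions.
from transformers import pipeline

MODELS = [
    "mistralai/Mistral-7B-Instruct-v0.2",   # hypothetical model choices;
    "meta-llama/Llama-3.1-8B-Instruct",     # gated models require HF auth
    "deepseek-ai/deepseek-llm-7b-chat",
]

PROBLEMS = [
    # (planning prompt, goal condition) -- toy stand-ins for the
    # graded-complexity benchmark items proposed in extension (2)
    ("Stack block A on block B, then B on C. List the moves in order.",
     "A on B"),
]

def check_plan(answer: str, goal: str) -> bool:
    """Placeholder validator: a real benchmark would parse the plan and
    simulate it against the domain; here we only check for the goal string."""
    return goal.lower() in answer.lower()

for model_id in MODELS:
    generator = pipeline("text-generation", model=model_id)
    correct = 0
    for prompt, goal in PROBLEMS:
        # Generate a candidate plan and score it against the goal condition.
        output = generator(prompt, max_new_tokens=256)[0]["generated_text"]
        correct += check_plan(output, goal)
    print(f"{model_id}: {correct}/{len(PROBLEMS)} plans validated")
```

In a full evaluation, the per-model accuracies produced by a harness like this would feed the meta-network as reliability-prediction targets.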

Outcomes