Evaluating rStar: Robustness in Small Language Model Reasoning 🧠

April 2025

Monte Carlo Tree Search
Small Language Models
NLP
DTU

This project focused on testing how well the rStar framework (a reasoning system for small language models built on Monte Carlo Tree Search) holds up when its input is slightly changed. We used a subset of the GSM8K dataset and introduced variations such as swapping names, changing numbers, or adding irrelevant information to see whether the model could still solve the problems accurately.
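To give a feel for what these perturbations look like, here is a minimal sketch in Python. The names, distractor sentence, and sample question are illustrative only; the project's actual perturbation code may differ.

```python
import random
import re

NAMES = ["Alice", "Bob", "Priya", "Mateo"]               # hypothetical replacement names
DISTRACTOR = " Her neighbor also owns three bicycles."   # hypothetical irrelevant detail


def swap_name(question: str, original: str) -> str:
    """Replace a character's name with a randomly chosen one."""
    return question.replace(original, random.choice(NAMES))


def perturb_numbers(question: str) -> str:
    """Shift each integer in the problem by a small random amount.

    Note: changing numbers also requires recomputing the gold answer,
    which is not shown here.
    """
    return re.sub(r"\d+", lambda m: str(int(m.group()) + random.randint(1, 5)), question)


def add_irrelevant_info(question: str) -> str:
    """Append a sentence that has no bearing on the answer."""
    return question + DISTRACTOR


q = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May."
print(add_irrelevant_info(perturb_numbers(swap_name(q, "Natalia"))))
```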

While rStar handled simple changes like name swaps fairly well, its performance dropped significantly under more disruptive variations, especially when numerical values were altered or distracting details were added. We also tested a new custom action aimed at removing irrelevant information before reasoning, with mixed results.
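Conceptually, the custom action boils down to asking the model to rewrite the problem without the distracting details before the search begins. The snippet below is a hypothetical illustration of that idea; the prompt wording and the `generate` callable are assumptions, not the project's actual implementation.

```python
# Hypothetical "filter irrelevant information" step applied before reasoning.
FILTER_PROMPT = (
    "Rewrite the following math problem, keeping only the facts needed to solve it. "
    "Remove any irrelevant details.\n\nProblem: {question}\n\nCleaned problem:"
)


def clean_question(question: str, generate) -> str:
    """Ask the small language model to strip distracting details before MCTS reasoning.

    `generate` is a placeholder for whatever text-generation callable the framework uses.
    """
    return generate(FILTER_PROMPT.format(question=question)).strip()
```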

The goal wasn't just to see where it fails, but to understand the limits of small language models and how frameworks like rStar might be improved to handle more realistic, messy inputs.

Made together with Jone Steinhoff, Panagiota Emmanouilidi, Petr B. Nylander, and Robert Spralja

Check out the code on GitHub.
