Does an LLM-Based Feedback Agent Move the Needle? An Empirical Study of Student Correctness and Hint Dependency on the ASSISTments 2017 Interaction Logs
DOI:
https://doi.org/10.66372/JGER.v4i1.11
Keywords:
LLM feedback agent; automated formative feedback; knowledge tracing; ASSISTments
Abstract
Large language model (LLM) feedback agents are increasingly proposed as substitutes for system-authored hints in intelligent tutoring environments, yet most existing evaluations rely on rubric-based offline judgments or small-scale human ratings rather than on learner-behaviour evidence drawn from real interaction logs. This study presents an empirical analysis of three production-grade LLM feedback agents (GPT-4o-mini, Llama-3-70B-Instruct, and Gemini-1.5-Flash) evaluated on the ASSISTments 2017 click-stream corpus, which contains 942,816 records covering 1,709 students, 3,162 problems, and 102 knowledge components. Three research questions are examined: whether LLM-generated feedback raises subsequent correctness, whether it reduces hint dependency, and whether automatic feedback-quality scores correlate with observed behavioural change. The evaluation pipeline pairs two offline judges, Prometheus and G-Eval, with a context-aware attentive knowledge-tracing counterfactual predictor, so that each intervention is scored both for textual quality and for predicted learning trajectory. Aggregate results show a moderate average correctness gain of 4.8 percentage points for GPT-4o-mini over the system-authored hint baseline, concentrated in conceptual knowledge components. Hint-request rates drop by 11.4 percent under GPT-4o-mini, while Llama-3-70B-Instruct and Gemini-1.5-Flash produce smaller and less stable shifts. The correlation between feedback-quality score and behavioural gain is moderate (Spearman ρ = 0.41, 95% CI [0.34, 0.47]), indicating that high rubric scores partially but not fully explain learner impact. The findings provide cautious empirical support for LLM feedback agents as supplements to template hints in middle-school mathematics tutoring and highlight open questions around procedural-knowledge components and ability-stratified responsiveness.
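As an illustration of the reported correlation statistic, the following minimal Python sketch computes a Spearman rank correlation together with a percentile bootstrap confidence interval. The abstract does not state how the 95% interval was obtained, so the bootstrap procedure here is an assumption, and the function names (`average_ranks`, `spearman_rho`, `bootstrap_ci`) and the toy data are hypothetical rather than taken from the study.

```python
import random
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman rho is the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_ci(x, y, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for rho (an assumed procedure)."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman_rho([x[i] for i in idx], [y[i] for i in idx])
        if rho == rho:  # skip degenerate resamples that yield NaN
            stats.append(rho)
    stats.sort()
    lower = stats[int((alpha / 2) * (len(stats) - 1))]
    upper = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lower, upper


if __name__ == "__main__":
    # Synthetic toy values only: quality scores vs. behavioural gains.
    quality = [3.1, 4.0, 2.5, 4.6, 3.8, 2.9, 4.2, 3.3]
    gain = [0.02, 0.05, 0.01, 0.04, 0.06, 0.03, 0.03, 0.02]
    print(spearman_rho(quality, gain), bootstrap_ci(quality, gain))
```

With real data, `quality` would hold one feedback-quality score per intervention and `gain` the matched behavioural change; resampling pairs jointly keeps each score attached to its own outcome.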

