Evaluating the Quality of Large Language Model-Generated Explanations in Recommendation Tasks: A Multi-Dimensional Comparative Analysis
DOI:
https://doi.org/10.66372/JGER.v3i1.4
Keywords:
large language models, recommendation explanation, evaluation framework, prompt strategy
Abstract
The integration of large language models (LLMs) into recommendation systems has introduced new possibilities for generating natural language explanations that accompany recommended items. While prior research has explored methods of leveraging LLMs for explanation generation, limited attention has been given to systematically evaluating the quality of these explanations across different models, domains, and prompting strategies. This paper presents a multi-dimensional comparative analysis of LLM-generated recommendation explanations, examining four LLMs, a mix of commercially available and open-source models, across three public recommendation datasets. A structured evaluation framework is proposed that encompasses four quality dimensions: faithfulness, informativeness, persuasiveness, and personalization. The evaluation employs both automatic metrics (BLEU-4, ROUGE-L, BERTScore) and human annotation protocols involving 12 trained evaluators. The experimental results indicate that models with more parameters produce more informative and faithful explanations, though the gap narrows substantially when context-enhanced prompting strategies are applied. Automatic metrics show moderate correlation with human judgments on informativeness and faithfulness but limited alignment on persuasiveness and personalization. These findings offer practical guidance for selecting appropriate LLMs and prompting strategies in explanation-augmented recommendation applications.

