A tri-axis reliability evaluation framework for large language models in high-stakes decision making: an empirical study on credit approval, clinical decision support, and compliance review

Authors

  • Daniel R. Whitfield, Department of Computer Science, University of Texas at Austin, TX, USA

DOI:

https://doi.org/10.66372/

Keywords:

large language models, reliability evaluation, hallucination, algorithmic fairness, robustness, credit risk, clinical decision support, regulatory compliance

Abstract

The deployment of large language models (LLMs) in high-stakes decision settings such as consumer credit approval, clinical decision support, and regulatory compliance review has outpaced the development of methodologically grounded reliability assessment tools. Existing benchmarks tend to focus on a single dimension (accuracy on closed-form tasks, membership-inference resistance, or demographic parity on tabular data), leaving a structural gap between the intuitions of practitioners in regulated industries and the metrics that model developers actually report. This paper proposes a tri-axis reliability evaluation framework that jointly measures hallucination, fairness, and robustness for LLM-driven decisions, and applies it empirically to three scenarios built on MIMIC-III, LendingClub, and SEC EDGAR. We formalise twelve operational metrics across the three axes, define an aggregation rule with axis-level guard floors, and instantiate the framework with three open-weight models and one frontier model. The results show that no model dominates on all three axes: the model with the lowest hallucination rate exhibits the largest disparity gap on protected attributes, while the model with the strongest robustness profile under prompt perturbations is the most cautious in clinical answer generation. We discuss what this pattern of gaps implies for deployment policy and outline how the framework can be extended to multi-agent settings and to retrieval-augmented architectures.

Published

2026-04-08

How to Cite

A tri-axis reliability evaluation framework for large language models in high-stakes decision making: an empirical study on credit approval, clinical decision support, and compliance review. (2026). Journal of Global Engineering Review, 4(1). https://doi.org/10.66372/