Comparing AI and Human judges: a pilot study of large language models in criminal sentencing prediction – International Journal of Comparative and Applied Criminal Justice

‘This pilot study empirically investigates the ability of Generative Artificial Intelligence (GenAI) to mimic human judges’ sentencing decisions in criminal cases by comparing sentences generated by Large Language Models (LLMs) with those handed down by human judges. Using real-world datasets, our research revealed a strong correlation between LLM-generated predictions and actual court sentences. We observed significant consistency both across multiple runs of the same model and between different models, thereby demonstrating high internal reliability and inter-model reliability. Moreover, the severity of sentences proposed by the models closely mirrored those handed down by human judges. These findings underscore the potential of LLMs to assist sentencing judges, identify inconsistencies in sentencing, and help legal actors predict case outcomes. Our findings contribute to the ongoing discussion on the influence of LLMs in the legal domain, highlighting both the potential benefits and challenges, including significant ethical considerations associated with their application. Possible applications and ethical considerations are briefly discussed.’

Link: https://doi.org/10.1080/01924036.2026.2644207