Who wrote this? Evaluating the reliability of AI detection tools in higher education – International Journal for Educational Integrity

‘The widespread and increasing adoption of generative artificial intelligence (GenAI) in higher education has raised urgent questions about academic integrity and the reliability of AI detection tools used in high-stakes assessments. This study compares the accuracy of four popular detection tools: GPTZero, Pangram, Copyleaks, and Turnitin on four kinds of academic papers: fully human-written, fully AI-written, hybrid (human with GenAI-inserted passages), and humanised GenAI (AI-generated passages were humanised using a prompt designed to resemble possible student behaviour). Using a synthetic dataset of 160 documents with known ground truth values, we assessed each tool’s detection accuracy. Results show that Pangram consistently performed better than the other tools, achieving high accuracy in detecting fully AI-generated, hybrid, and humanised texts. In contrast, the other tools significantly underestimated GenAI content, particularly for texts generated with the most advanced model. All tools correctly identified fully human texts. To illustrate how the best-performing tool performs in an authentic academic context, Pangram was applied to 1,163 master’s theses submitted in academic year 2024–2025, without known ground truth. The analysis describes the distribution of Pangram’s flagging scores, with flagged cases (45.5%) typically indicating low to moderate levels of AI-associated text. False positives were rare across all tools, suggesting improvement compared to earlier studies. Findings show that while detection tools can provide useful initial flags, they should not be used as sole evidence in high-stakes decision-making but should be implemented in a broader evaluation strategy.’

Link: https://link.springer.com/article/10.1007/s40979-026-00226-w