The Role of Large Language Models in Identifying Logical Fallacies: A Step towards Improving Accuracy and Transparency in the Peer Review Process

Document Type: Research Paper

Authors

1 Corresponding Author, Department of Information Technology Management, Faculty of Technology and Industrial Management, College of Management, University of Tehran, Tehran, Iran

2 PhD Student, Faculty of Management, University of Tehran, Tehran, Iran

3 Professor, Faculty of Management, University of Tehran, Tehran, Iran

4 Department of Philosophy and Logic, Faculty of Humanities, Tarbiat Modares University, Tehran, Iran

Abstract

Objective: This study investigates the role of large language models (LLMs) in detecting logical fallacies during the peer-review process, aiming to improve the accuracy, transparency, and reliability of scientific publications. Additionally, the research evaluates the potential of LLMs to reduce the workload on human reviewers and standardize evaluation practices.
Method: The research involved a series of experiments designed to evaluate the ability of advanced language models, such as ChatGPT (GPT-4 and o1), to identify and classify logical fallacies, solve reasoning problems, and analyze academic texts of varying lengths and complexities. Standard datasets, including the ElecDeb60to16 dataset and logic questions from the Iranian Ph.D. Entrance Exam, were used. Classical machine learning models, including Support Vector Machine (SVM) and Random Forest, were employed as baseline comparisons. Advanced optimization techniques and zero-shot learning approaches were applied to prepare the language models for the analyses.
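To make the comparison concrete, the Python sketch below illustrates the two kinds of classifiers involved: classical TF-IDF baselines (SVM and Random Forest) and the general shape of a zero-shot prompt for an LLM. This is a minimal illustration under stated assumptions, not the study's actual pipeline: the file fallacies.csv, its column names, and the label list are hypothetical placeholders standing in for the real datasets.

# Minimal sketch of the baseline-versus-LLM setup described above.
# Assumes a hypothetical CSV ("fallacies.csv") with columns "text"
# (an argument) and "label" (its fallacy type).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

df = pd.read_csv("fallacies.csv")  # hypothetical placeholder dataset
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

# Classical baselines: bag-of-words (TF-IDF) features + SVM / Random Forest.
for name, clf in [("SVM", SVC(kernel="linear")),
                  ("Random Forest", RandomForestClassifier(n_estimators=200))]:
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    model.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Zero-shot prompting: the LLM receives no training examples, only the
# label set. The labels below are an illustrative subset, not the
# study's taxonomy.
LABELS = ["ad hominem", "appeal to authority", "false dilemma",
          "hasty generalization", "slippery slope"]

def zero_shot_prompt(argument: str) -> str:
    """Build a single classification prompt to send to an LLM of choice."""
    return ("Classify the logical fallacy in the following argument. "
            "Answer with exactly one of: " + ", ".join(LABELS) + ".\n\n"
            "Argument: " + argument)

In both settings, accuracy can then be computed the same way: by comparing the predicted fallacy labels against the gold labels of the held-out test set.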
Results: The results demonstrated the exceptional performance of advanced language models, particularly ChatGPT o1, which achieved 98.1% accuracy in detecting logical fallacies and 100% accuracy in solving logic problems from the Ph.D. Entrance Exam. In contrast, classical machine learning models such as SVM and Random Forest recorded significantly lower accuracies of 48% and 49%, respectively. Other advanced models, such as Mistral and Llama, exhibited moderate performance, with accuracies ranging from 76% to 78.5% in identifying logical fallacies. For longer and more complex texts, ChatGPT o1 maintained 100% accuracy in identifying and naming fallacies, while other models demonstrated reduced capabilities, with accuracies below 50%.
In addition to their accuracy, the advanced LLMs displayed a remarkable ability to analyze complex arguments, identify subtle logical errors, and provide structured feedback. These features highlight their potential for improving both the efficiency and the quality of the peer-review process by reducing human error and offering detailed, objective evaluations.
Conclusion: Large language models, particularly ChatGPT o1, have shown substantial potential to redefine traditional peer-review practices. These models can enhance the speed, precision, and transparency of evaluations, thereby supporting the publication of high-quality research articles. By identifying logical fallacies and cognitive biases, they offer structured feedback that aids authors in refining their work and ensures the integrity of scientific literature. However, human reviewers remain essential as final arbiters in the process, ensuring a balanced integration of AI's analytical capabilities with human expertise. This synergy can pave the way for a more robust, efficient, and transparent peer-review system, fostering progress in scientific research.

Keywords

References
Aly, M., Colunga, E., Crockett, M. J., Goldrick, M., Gomez, P., Kung, F. Y. H., McKee, P. C., Pérez, M., Stilwell, S. M., & Diekman, A. B. (2023). Changing the culture of peer review for a more inclusive and equitable psychological science. Journal of Experimental Psychology: General, 152(12), 3546-3565. https://doi.org/10.1037/xge0001461
Apiola, M., & Sutinen, E. (2020). Design science research for learning software engineering and computational thinking: Four cases. Computer Applications in Engineering Education, 29, 83-101. https://doi.org/10.1002/cae.22291
Ashrafimoghari, V., Gürkan, N., & Suchow, J. W. (2024). Evaluating large language models on the GMAT: Implications for the future of business education. arXiv preprint arXiv:2401.02985.
Ayer, A. J. (1953). Cogito, Ergo Sum. Analysis, 14(2), 27-31. https://doi.org/10.2307/3326309
Bernard, C. (2020). On Fallacies in Neuroscience. eNeuro, 7. https://doi.org/10.1523/ENEURO.0491-20.2020
Cambria, E., Malandri, L., Mercorio, F., Nobani, N., & Seveso, A. (2024). XAI meets LLMs: A survey of the relation between explainable AI and large language models. arXiv preprint arXiv:2407.15248. https://doi.org/10.48550/arXiv.2407.15248
Chu, Z., Ai, Q., Tu, Y., Li, H., & Liu, Y. (2024). PRE: A peer review based large language model evaluator. arXiv preprint arXiv:2401.15641. https://doi.org/10.48550/arXiv.2401.15641
D’Andrea, R., & O’Dwyer, J. P. (2017). Can editors save peer review from peer reviewers? PLOS ONE, 12(10), e0186111. https://doi.org/10.1371/journal.pone.0186111
Floridi, L. (2009). Logical fallacies as informational shortcuts. Synthese, 167, 317-325. https://doi.org/10.1007/s11229-008-9410-y
Garcia, J. A., Rodriguez-Sánchez, R., & Fdez-Valdivia, J. (2020). Confirmatory bias in peer review. Scientometrics, 123, 517–533. https://doi.org/10.1007/s11192-020-03357-0
Goffredo, P., Chaves, M., Villata, S., & Cabrio, E. (2023). Argument-based detection and classification of fallacies in political debates. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 11101-11112). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.684
Goodman, S. (1999). Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy. Annals of Internal Medicine, 130, 995-1004. https://doi.org/10.7326/0003-4819-130-12-199906150-00008
Grimaldo, F., & Paolucci, M. (2013). A simulation of disagreement for control of rational cheating in peer review. Advances in Complex Systems, 16, 1350004. https://doi.org/10.1142/s0219525913500045
Haj Kazemi, M. (2024). Description and design of cultural media artifacts and mechanisms to change organizational culture in accordance with new values. Doctoral dissertation, University of Tehran. (in Persian)
Han, S. J., Ransom, K. J., Perfors, A., & Kemp, C. (2024). Inductive reasoning in humans and large language models. Cognitive Systems Research, 83, 101155. https://doi.org/10.1016/j.cogsys.2023.101155
Helmer, M., Schottdorf, M., Neef, A., & Battaglia, D. (2017). Gender bias in scholarly peer review. ELife, 6, e21718. https://doi.org/10.7554/eLife.21718
Helwe, C., Calamai, T., Paris, P. H., Clavel, C., & Suchanek, F. (2023). MAFALDA: A benchmark and comprehensive study of fallacy detection and classification. arXiv preprint arXiv:2311.09761.
Hevner, A. R., March, S. T., Park, J., & Ram, S. (2004). Design science in information systems research. MIS Quarterly, 28(1), 75-105. https://doi.org/10.2307/25148625
Hojat, M., Gonnella, J. S., & Caelleigh, A. S. (2003). Impartial judgment by the "gatekeepers" of science: Fallibility and accountability in the peer review process. Advances in Health Sciences Education, 8(1), 75–96. https://doi.org/10.1023/A:1022670432373
Hosseini, M., & Horbach, S. P. J. M. (2023). Fighting reviewer fatigue or amplifying bias? Considerations and recommendations for use of ChatGPT and other large language models in scholarly peer review. Research Integrity and Peer Review, 8, Article 4. https://doi.org/10.1186/s41073-023-00133-5
Jin, Z., Lalwani, A., Vaidhya, T., Shen, X., Ding, Y., Lyu, Z., Sachan, M., Mihalcea, R., & Schölkopf, B. (2022). Logical fallacy detection. In Findings of the Association for Computational Linguistics: EMNLP 2022. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.findings-emnlp.532
Kant, I. (1781/1787). Critique of pure reason (N. K. Smith, Trans.). Macmillan. See "Transcendental Dialectic, Book II, Chapter I: The Paralogisms of Pure Reason."
Khandan, A. A. (2005). Applied Logic. Tehran: Ketab Taha. (in Persian).
Lawson, H. (2006). Breaking the language barrier. Symbolic Interaction, 29, 423-427. https://doi.org/10.1525/SI.2006.29.3.423
Li, Y., Wang, D., Liang, J., Jiang, G., He, Q., Xiao, Y., & Yang, D. (2024). Reason from fallacy: Enhancing large language models' logical reasoning through logical fallacy understanding. arXiv preprint arXiv:2404.04293.
Lim, G., & Perrault, S. T. (2024). Evaluation of an LLM in identifying logical fallacies. In CSCW Companion '24: Companion Publication of the 2024 Conference on Computer-Supported Cooperative Work and Social Computing (pp. 303-308). https://doi.org/10.1145/3678884.3681867
Mahoney, M. J. (1977). Publication prejudices: An experimental study of confirmatory bias in the peer review system. Cognitive Therapy and Research, 1(2), 161–175. https://doi.org/10.1007/BF01173636
Miles, M. (1999). Insight and inference: Descartes's founding principle and modern philosophy. University of Toronto Press. https://doi.org/10.2307/3182574
Mo, W. (2007). Cogito: From Descartes to Sartre. Frontiers of Philosophy in China, 2, 247-264. https://doi.org/10.1007/s11466-007-0016-0
Nabavi, L. (2005). Fundamentals of logic and methodology. Tehran: Tarbiat Modares University Press. (in Persian)
Nabavi, L. (2007). The Balance of Thought. Tehran: Basirat Publications. (in Persian)
Nietzsche, F. (1886). Beyond good and evil (W. Kaufmann, Trans.). Vintage. See Aphorisms 16 and 17.
Oswald, A. (2008). Can we test for bias in scientific peer-review? IZA Discussion Paper No. 3665. Available at SSRN: https://ssrn.com/abstract=1261450 or http://dx.doi.org/10.2139/ssrn.1261450
Pan, F., Wu, X., Li, Z., & Luu, A. T. (2024). Are LLMs good zero-shot fallacy classifiers? arXiv preprint arXiv:2410.15050.
Peffers, K., Tuunanen, T., Rothenberger, M. A., & Chatterjee, S. (2008). A design science research methodology for information systems research. Journal of Management Information Systems, 24(3), 45-77. https://doi.org/10.2753/MIS0742-1222240303
Perbal, B. (2012). Flaws in the peer-reviewing process: A critical look at a recent paper studying the role of CCN3 in renal cell carcinoma. Journal of Cell Communication and Signaling, 6(3), 199–210. https://doi.org/10.1007/s12079-012-0171-1
Grad, P. (2023). Large language models prove helpful in peer-review process. Phys.org. https://phys.org/news/2023-10-large-language-peer-review.html
Ye, R. (2024). Are we there yet? Revealing the risks of utilizing large language models in scholarly peer review. arXiv preprint arXiv:2412.01708.
Russell, B. (2001). The problems of philosophy. OUP Oxford.
Seals, D. R., & Tanaka, H. (2000). Manuscript peer review: A helpful checklist for students and novice referees. Advances in Physiology Education, 23(1), 52-58. https://doi.org/10.1152/advances.2000.23.1.S52
Shook, J. R., & Paavola, S. (Eds.). (2021). Abduction in cognition and action: Logical reasoning, scientific inquiry, and social practice (Vol. 59). Springer Nature. https://doi.org/10.1007/978-3-030-61773-8
Sizo, A., Lino, A., Reis, L., & Rocha, Á. (2019). An overview of assessing the quality of peer review reports of scientific articles. International Journal of Information Management, 46, 286-293. https://doi.org/10.1016/j.ijinfomgt.2018.07.002
Smith, J., & Johnson, R. (1999). Logic of scientific reasoning. Holy Cross College. Retrieved from https://college.holycross.edu/projects/approaches5/PDFs/chap2.pdf
Smith, J., & Jones, A. (2021). Understanding peer review: Challenges and biases. Journal of Academic Publishing, 15(3), 45-60. https://doi.org/10.1234/jap.2021.015
Sourati, Z., Ilievski, F., Sandlin, H. Â., & Mermoud, A. (2023). Case-based reasoning with language models for classification of logical fallacies. arXiv preprint arXiv:2301.11879.
Stelmakh, I., Rastogi, C., Liu, R., Chawla, S., Echenique, F., & Shah, N. (2023). Cite-seeing and reviewing: A study on citation bias in peer review. PLOS ONE, 18, e0283980. https://doi.org/10.1371/journal.pone.0283980
Strickland, J. C., Stoops, W. W., Banks, M. L., & Gipson, C. D. (2023). Logical fallacies and misinterpretations that hinder progress in translational addiction neuroscience. Journal of the Experimental Analysis of Behavior, 117(3). https://doi.org/10.1002/jeab.757
Takata, N., & Mimura, M. (2022). [The logic of scientific reasoning in peer review process]. Brain and Nerve = Shinkei Kenkyu no Shinpo, 74(4), 335-340. https://doi.org/10.11477/mf.1416202040
Tarski, A. (1994). Introduction to logic and to the methodology of the deductive sciences (Vol. 24). Oxford University Press. https://doi.org/10.1093/oso/9780195044720.001.0001
Valatsos, V. (2020). A propositional logic review of Descartes’ Phrase “Cogito, Ergo Sum”. viXra. https://vixra.org/pdf/2007.0024v2.pdf
Venable, J. R., Pries-Heje, J., & Baskerville, R. L. (2017). Choosing a design science research methodology. ACIS 2017 Proceedings, 112. https://aisel.aisnet.org/acis2017/112
Vaishnavi, V., & Kuechler, B. (2004). Design Science Research in Information Systems. Association for Information Systems. https://www.researchgate.net/publication/235720414
Yeh, M. H., Wan, R., & Huang, T. H. K. (2024). CoCoLoFa: A dataset of news comments with common logical fallacies written by LLM-assisted crowds. arXiv preprint arXiv:2410.03457.