⚡ Quick Summary
This study evaluated the use of compact large language models (LLMs) for automating the title and abstract screening process in systematic reviews. The models demonstrated high sensitivity (up to 100%) and significant workload reduction, although they exhibited low precision, indicating a need for careful threshold selection.
📌 Key Details
- 🔍 Models Assessed: GPT-4o mini, Llama 3.1 8B, Gemma 2 9B
- 🧩 Data Source: Records from three previously published systematic reviews
- ⚖️ Rating System: Inclusion rating from 0 to 100 using a structured prompt
- 📊 Performance Metrics: Balanced accuracy, sensitivity, specificity, positive and negative predictive value, and workload saving
- ⏱️ Processing Time: GPT-4o mini (~40 minutes); Llama 3.1 8B and Gemma 2 9B, run locally (~4 hours; see the sketch after this list)
- 💰 Cost: GPT-4o mini ($0.14-$1.93 per review); the locally run models were free to use
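For concreteness, here is a minimal sketch of rating a single record with one of the locally run models. It assumes the Ollama runtime and its Python client; the study does not specify how Llama 3.1 8B and Gemma 2 9B were hosted, so this tooling choice, the prompt wording, and the `rate_record_local` helper are illustrative assumptions, not the authors' setup.

```python
# Minimal sketch: ask a locally hosted model for a 0-100 inclusion rating.
# Assumes the Ollama runtime and its Python client; the paper does not name
# the local inference stack, and this prompt is illustrative only.
import ollama

def rate_record_local(title: str, abstract: str, model: str = "llama3.1:8b") -> str:
    prompt = (
        "Rate the following record for inclusion in a systematic review, "
        "from 0 (clearly exclude) to 100 (clearly include). "
        "Reply with the number only.\n\n"
        f"Title: {title}\nAbstract: {abstract}"
    )
    response = ollama.chat(
        model=model,  # e.g. "llama3.1:8b" or "gemma2:9b"
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0},  # keep ratings stable across runs
    )
    return response["message"]["content"]
```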
🔑 Key Takeaways
- 🤖 LLMs can automate much of the title and abstract screening process in systematic reviews.
- 📈 High sensitivity (up to 100%) was achieved, making the models effective at identifying relevant records.
- ⚖️ The 50- and 75-rating thresholds provided the best balance between sensitivity and specificity.
- 💼 Workload savings were notable, freeing researchers to focus on more complex tasks.
- 💡 Low precision (below 10%) means model output still requires careful human review and threshold selection.
- 🌍 Potential for broader applications in systematic reviews and other research areas.
- 🔭 Future research is needed to refine these models and improve their precision.

📚 Background
Systematic reviews are essential for evidence-based research, yet they are often labor-intensive, particularly during the title and abstract screening phase. The introduction of large language models (LLMs) presents an opportunity to streamline this process, potentially enhancing efficiency while maintaining accuracy.
🏛️ Study
This study aimed to assess the feasibility, accuracy, and workload reduction of three compact LLMs in screening titles and abstracts. Researchers sourced records from three systematic reviews and employed a structured prompt to rate each record for inclusion, allowing for a comprehensive evaluation of model performance.
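As an illustration of that workflow, the sketch below rates one record with GPT-4o mini through the OpenAI API. The prompt text and the `ask_rating` helper are assumptions for illustration; the authors' exact structured prompt is not reproduced here.

```python
# Sketch of the API-based screening call, assuming the OpenAI Python SDK.
# The prompt wording is illustrative; it is not the study's exact prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You are screening records for a systematic review.\n"
    "Review question: {question}\n\n"
    "Title: {title}\nAbstract: {abstract}\n\n"
    "Rate this record for inclusion from 0 (clearly exclude) to "
    "100 (clearly include). Reply with the number only."
)

def ask_rating(question: str, title: str, abstract: str) -> int:
    """Return GPT-4o mini's 0-100 inclusion rating for one record."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # favor reproducible ratings
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(
                question=question, title=title, abstract=abstract
            ),
        }],
    )
    return int(response.choices[0].message.content.strip())
```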
📊 Results
The results indicated that the LLMs achieved impressive sensitivity levels, with the highest reaching 100%. However, the models struggled with precision, which remained below 10%. Specificity and workload savings improved at higher rating thresholds, with the 50- and 75-rating thresholds providing the best trade-offs for practical use.
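To make the threshold trade-off concrete, here is a small sketch of how such metrics can be computed from model ratings and full-text inclusion labels at the predefined 25-, 50-, and 75-rating cut-offs. The `evaluate` helper and sample data are hypothetical, and defining workload saving as the share of records the model screens out is an assumption about the metric, not the paper's exact formula.

```python
# Sketch: confusion-matrix metrics at the predefined rating thresholds.
# Sample data and the workload-saving definition (share of records the
# model excludes automatically) are assumptions for illustration.

def evaluate(ratings: list[int], included: list[bool], threshold: int) -> dict:
    preds = [r >= threshold for r in ratings]  # include if rating >= threshold
    tp = sum(p and y for p, y in zip(preds, included))
    fp = sum(p and not y for p, y in zip(preds, included))
    fn = sum(not p and y for p, y in zip(preds, included))
    tn = sum(not p and not y for p, y in zip(preds, included))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp) if tp + fp else 0.0,  # precision
        "npv": tn / (tn + fn) if tn + fn else 0.0,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "workload_saving": (tn + fn) / len(ratings),  # records screened out
    }

# Hypothetical ratings and full-text inclusion labels for six records.
ratings = [12, 88, 55, 3, 97, 60]
included = [False, True, True, False, True, False]
for threshold in (25, 50, 75):
    print(threshold, evaluate(ratings, included, threshold))
```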
🌟 Impact and Implications
The findings suggest that compact LLMs can play a transformative role in the systematic review process. Automating title and abstract screening saves researchers substantial time and resources, which could speed the dissemination of research findings and improve evidence-based practice across fields.
🔮 Conclusion
This study highlights the potential of compact LLMs in enhancing the efficiency of systematic reviews. While the high sensitivity and significant workload reduction are promising, the low precision indicates that further refinement is necessary. Continued exploration of these technologies could lead to more effective tools for researchers, ultimately improving the quality of evidence-based research.
💬 Your comments
What are your thoughts on the use of LLMs in systematic reviews? Do you see potential challenges or benefits? 💬 Join the conversation in the comments below or connect with us on social media.
Compact large language models for title and abstract screening in systematic reviews: An assessment of feasibility, accuracy, and workload reduction.
Abstract
Systematic reviews play a critical role in evidence-based research but are labor-intensive, especially during title and abstract screening. Compact large language models (LLMs) offer potential to automate this process, balancing time/cost requirements and accuracy. The aim of this study is to assess the feasibility, accuracy, and workload reduction by three compact LLMs (GPT-4o mini, Llama 3.1 8B, and Gemma 2 9B) in screening titles and abstracts. Records were sourced from three previously published systematic reviews and LLMs were requested to rate each record from 0 to 100 for inclusion, using a structured prompt. Predefined 25-, 50-, 75-rating thresholds were used to compute performance metrics (balanced accuracy, sensitivity, specificity, positive and negative predictive value, and workload-saving). Processing time and costs were registered. Across the systematic reviews, LLMs achieved high sensitivity (up to 100%) and low precision (below 10%) for records included by full text. Specificity and workload savings improved at higher thresholds, with the 50- and 75-rating thresholds offering optimal trade-offs. GPT-4o-mini, accessed via application programming interface, was the fastest model (~40 minutes max.) and had usage costs ($0.14-$1.93 per review). Llama 3.1-8B and Gemma 2-9B were run locally in longer times (~4 hours max.) and were free to use. LLMs were highly sensitive tools for the title/abstract screening process. High specificity values were reached, allowing for significant workload savings, at reasonable costs and processing time. Conversely, we found them to be imprecise. However, high sensitivity and workload reduction are key factors for their usage in the title/abstract screening phase of systematic reviews.
Authors: Sciurti A, Migliara G, Siena LM, Isonne C, De Blasiis MR, Sinopoli A, Iera J, Marzuillo C, De Vito C, Villari P, Baccolini V
Journal: Res Synth Methods
Citation: Sciurti A, et al. Compact large language models for title and abstract screening in systematic reviews: An assessment of feasibility, accuracy, and workload reduction. Res Synth Methods. 2026;17:332-347. doi: 10.1017/rsm.2025.10044