🧑🏼‍💻 Research - November 27, 2024

Evaluating large language models for selection of statistical test for research: A pilot study.


⚡ Quick Summary

This pilot study evaluated the effectiveness of large language models (LLMs) in recommending appropriate statistical tests for research, comparing their suggestions to those of human experts. The results indicated that LLMs like ChatGPT and Microsoft Bing Chat achieved over 85% concordance with expert recommendations, highlighting their potential as decision support tools in research.

🔍 Key Details

  • 📊 Case Vignettes: 27 scenarios based on published literature
  • 🤖 LLMs Evaluated: ChatGPT3.5, Google Bard, Microsoft Bing Chat, Perplexity
  • 🔍 Evaluation Metrics: Concordance and acceptance rates
  • 📈 Statistical Analysis: Intra-class correlation coefficient and test-retest reliability

🔑 Key Takeaways

  • 🤖 LLMs showed >75% concordance in suggesting statistical tests.
  • 💡 Acceptance rates were >95% across all models evaluated.
  • 📊 ChatGPT3.5 and Perplexity each reached 85.19% concordance with 100% acceptance.
  • 🏆 Microsoft Bing Chat achieved 96.3% concordance with 100% acceptance.
  • 🔄 Test-retest reliability varied widely among models, from r = 0.71 (ChatGPT) to near-zero or negative values (Bard, Bing).
  • 🌐 LLMs can enhance efficiency in statistical test selection.
  • 🧠 They are not a replacement for human expertise but serve as valuable support tools.
  • 📅 Study published in the journal Perspect Clin Res.

📚 Background

Selecting the appropriate statistical test is a crucial yet often challenging aspect of research methodology. Traditional methods rely heavily on human expertise, which can be time-consuming and prone to error. The advent of large language models (LLMs) presents a promising opportunity to automate this process, potentially improving both the efficiency and accuracy of statistical test selection.

🗒️ Study

This study aimed to assess the capabilities of freely available LLMs, including OpenAI’s ChatGPT3.5, Google Bard, Microsoft Bing Chat, and Perplexity, in recommending suitable statistical tests for various research scenarios. A total of 27 case vignettes were prepared, and the models’ recommendations were compared to those made by human experts to evaluate their effectiveness.
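To make the scoring concrete, the sketch below shows one way a model's suggested test could be compared against an expert answer key, separating concordance (an exact match) from acceptance (a non-matching but suitable alternative). The function names, normalization step, and example vignette are illustrative assumptions, not the authors' actual protocol.

```python
# Minimal sketch of concordance/acceptance scoring; all names and data are illustrative.

def normalize(test_name: str) -> str:
    """Lowercase and collapse punctuation/whitespace so trivially different
    spellings of the same test compare equal."""
    return " ".join(test_name.lower().replace("-", " ").split())

def score_recommendation(recommended: str, answer_key: str,
                         acceptable_alternatives: set[str]) -> str:
    """Return 'concordant' for an exact match with the expert answer key,
    'accepted' for a non-matching but expert-approved alternative,
    and 'rejected' otherwise."""
    rec = normalize(recommended)
    if rec == normalize(answer_key):
        return "concordant"
    if rec in {normalize(t) for t in acceptable_alternatives}:
        return "accepted"
    return "rejected"

# Example: comparing a non-normal continuous outcome between two independent groups
print(score_recommendation(
    recommended="Mann-Whitney U test",
    answer_key="Mann-Whitney U test",
    acceptable_alternatives={"Wilcoxon rank-sum test"},
))  # -> "concordant"
```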

📈 Results

The findings revealed that, across the 27 case vignettes, ChatGPT3.5 achieved 85.19% concordance with 100% acceptance, Google Bard 77.78% concordance with 96.3% acceptance, Microsoft Bing Chat 96.3% concordance with 100% acceptance, and Perplexity 85.19% concordance with 100% acceptance. The average-measures intra-class correlation coefficient among the LLM responses was 0.728, indicating a moderate level of agreement among the models. Test-retest reliability was strongest for ChatGPT (r = 0.71) and Perplexity (r = 0.52), while Bard and Bing Chat showed no meaningful reliability across repeated runs.
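For a sense of where these figures come from, the sketch below back-calculates the reported percentages from per-vignette counts (out of 27) and shows how test-retest reliability could be computed as a correlation between an original and a paraphrased run. The score vectors are placeholders, not the study's data.

```python
# Illustrative arithmetic behind the reported rates (27 vignettes in total).
# Counts are back-calculated from the percentages reported in the abstract.
N_VIGNETTES = 27

def rate(count: int, total: int = N_VIGNETTES) -> float:
    """Express a count of vignettes as a percentage, rounded to two decimals."""
    return round(100 * count / total, 2)

print(rate(23))  # 85.19 -> ChatGPT3.5 / Perplexity concordance
print(rate(26))  # 96.3  -> Bing Chat concordance; Bard acceptance
print(rate(21))  # 77.78 -> Bard concordance

# Test-retest reliability: correlate per-vignette scores from the original and
# paraphrased runs of the same model. These vectors are placeholders only.
from scipy.stats import pearsonr

original_run = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
paraphrased  = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
r, p = pearsonr(original_run, paraphrased)
print(f"r = {r:.2f}, p = {p:.4f}")
```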

🌍 Impact and Implications

The results of this study suggest that LLMs can significantly enhance the process of selecting statistical tests in research settings. By providing rapid and reliable recommendations, these models can serve as effective decision support systems, particularly in situations where timely test selection is critical. This advancement could lead to improved research outcomes and more efficient use of resources in various scientific fields.

🔮 Conclusion

This pilot study highlights the potential of large language models in assisting researchers with statistical test selection. While they do not replace human expertise, LLMs like ChatGPT and Microsoft Bing Chat demonstrate a high level of concordance and acceptance, making them valuable tools in the research process. Continued exploration of LLM capabilities could further enhance their utility in academic and clinical research.

💬 Your comments

What are your thoughts on the use of large language models in research? Do you see them as a valuable tool or a potential challenge to traditional expertise? 💬 Share your insights in the comments below.

Evaluating large language models for selection of statistical test for research: A pilot study.

Abstract

BACKGROUND: In contemporary research, selecting the appropriate statistical test is a critical and often challenging step. The emergence of large language models (LLMs) has offered a promising avenue for automating this process, potentially enhancing the efficiency and accuracy of statistical test selection.
AIM: This study aimed to assess the capability of freely available LLMs – OpenAI’s ChatGPT3.5, Google Bard, Microsoft Bing Chat, and Perplexity – in recommending suitable statistical tests for research, comparing their recommendations with those made by human experts.
MATERIALS AND METHODS: A total of 27 case vignettes were prepared for common research models, each with a question asking for a suitable statistical test. The cases were formulated from previously published literature and reviewed by a human expert for accuracy. The LLMs were asked the question with the case vignettes, and the process was repeated with paraphrased cases. Concordance (an exact match with the answer key) and acceptance (not an exact match, but still considered suitable) were evaluated between the LLMs’ recommendations and those of human experts.
RESULTS: Among the 27 case vignettes, the statistical tests suggested by ChatGPT3.5 had 85.19% concordance and 100% acceptance; Google Bard had 77.78% concordance and 96.3% acceptance; Microsoft Bing Chat had 96.3% concordance and 100% acceptance; and Perplexity had 85.19% concordance and 100% acceptance. The average-measures intra-class correlation coefficient among the responses of the LLMs was 0.728 (95% confidence interval [CI]: 0.51 to 0.86), P < 0.0001. The test-retest reliability was r = 0.71 (95% CI: 0.44 to 0.86), P < 0.0001 for ChatGPT; r = -0.22 (95% CI: -0.56 to 0.18), P = 0.26 for Bard; r = -0.06 (95% CI: -0.44 to 0.33), P = 0.73 for Bing; and r = 0.52 (95% CI: 0.16 to 0.75), P = 0.0059 for Perplexity.
CONCLUSION: The LLMs, namely ChatGPT, Google Bard, Microsoft Bing, and Perplexity, all showed >75% concordance in suggesting statistical tests for research case vignettes, with acceptance >95% for all. The LLMs had a moderate level of agreement among themselves. While not a complete replacement for human expertise, these models can serve as effective decision support systems, especially in scenarios where rapid test selection is essential.

Authors: Mondal H, Mondal S, Mittal P

Journal: Perspect Clin Res

Citation: Mondal H, et al. Evaluating large language models for selection of statistical test for research: A pilot study. Perspect Clin Res. 2024; 15:178-182. doi: 10.4103/picr.picr_275_23
