
Research reveals potential bias in Large Language Models’ text relevance assessments

Author ADM+S Centre
Date 14 March 2025

A recent study has uncovered significant concerns surrounding the use of Large Language Models (LLMs) to assess the relevance of information, particularly in passage labelling tasks.

This research investigates how LLMs label passages of text as “relevant” or “non-relevant,” raising new questions about the accuracy and reliability of these models in real-world applications, especially when they are used to train ranking systems or to replace human assessors.
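In studies of this kind, each query–passage pair is typically embedded in a prompt and the model is asked to return a relevance label. The sketch below is illustrative only; the wording and the build_relevance_prompt function are assumptions, not the prompt used in this study.

```python
def build_relevance_prompt(query: str, passage: str) -> str:
    """Build an illustrative relevance-judging prompt (not the study's exact wording)."""
    return (
        "You are a relevance assessor for a search engine.\n"
        f"Query: {query}\n"
        f"Passage: {passage}\n"
        "Answer with a single word, 'relevant' or 'non-relevant', "
        "indicating whether the passage answers the query."
    )

# Example query-passage pair; the resulting prompt would be sent to an LLM for labelling.
print(build_relevance_prompt(
    "symptoms of vitamin d deficiency",
    "Fatigue, bone pain and muscle weakness are common signs of low vitamin D.",
))
```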

The study, which received the “Best Paper Honorable Mention” at the SIGIR-AP Conference on Information Retrieval in Tokyo in December 2024, compares the relevance labels produced by various open-source and proprietary LLMs with human judgments.

It finds that, while some LLMs agree with human assessors at levels similar to the human-to-human agreement reported in past research, they are more likely than humans to label passages as relevant. This suggests that while LLMs’ “non-relevant” labels are generally reliable, their “relevant” labels may be less dependable.
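Agreement between LLM and human labels is commonly summarised with chance-corrected statistics such as Cohen’s kappa. The snippet below is a minimal sketch of that comparison using made-up labels rather than the study’s data; it also shows the label-distribution check that would expose a skew toward “relevant”.

```python
from collections import Counter
from sklearn.metrics import cohen_kappa_score

# Made-up binary labels for illustration (1 = relevant, 0 = non-relevant);
# these are not the study's data.
human_labels = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
llm_labels   = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]

# Chance-corrected agreement between the two sets of labels.
kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"Cohen's kappa: {kappa:.2f}")

# Comparing label distributions reveals whether one assessor over-predicts "relevant".
print("Human label counts:", Counter(human_labels))
print("LLM label counts:  ", Counter(llm_labels))
```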

Marwah Alaofi, a PhD student at the ARC Centre of Excellence for Automated Decision-Making and Society, supervised by Prof Mark Sanderson, Prof Falk Scholer, and Paul Thomas, conducted the study as part of her research into measuring the reliability of LLMs for creating relevance labels.

“Our study highlights a critical blind spot in how Large Language Models (LLMs) assess document relevance to user queries,” said Marwah.

The research finds that this discrepancy often arises because LLMs are fooled by the presence of the user’s query terms within the labelled passages, even when the passage is unrelated to the query or consists of random text.

“We found that LLMs are likely to overestimate relevance, influenced by the mere presence of query words in documents, and can be easily misled into labelling irrelevant or even random passages as relevant.”
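One way to probe this behaviour, in the spirit of the study’s experiments, is to inject query words into text that has nothing to do with the query and check whether the model’s label flips to “relevant”. The sketch below only constructs such an adversarial passage; the query, the unrelated text, and the injection strategy are illustrative assumptions, and the result would still need to be sent to an LLM for labelling.

```python
import random

def inject_query_terms(query: str, passage: str, seed: int = 0) -> str:
    """Insert the query's words at random positions in an unrelated passage.

    Illustrative only: the study's actual perturbation procedure may differ.
    """
    rng = random.Random(seed)
    words = passage.split()
    for term in query.split():
        words.insert(rng.randrange(len(words) + 1), term)
    return " ".join(words)

query = "treatment for migraine headaches"
unrelated = (
    "The annual flower festival attracted thousands of visitors, "
    "with stalls selling tulips, daffodils and handmade crafts."
)

# A passage that is off-topic yet contains every query word.
print(inject_query_terms(query, unrelated))
```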

The research suggests that in production environments, LLMs might be vulnerable to keyword stuffing and other SEO strategies, which are often used to artificially inflate the apparent relevance of web pages.

“This raises concerns about their use in replacing human assessors for evaluating and training search engines. These limitations could be exploited through keyword stuffing and other Search Engine Optimization (SEO) strategies to manipulate rankings.”

This study underscores the critical need to go beyond traditional evaluation metrics to better assess the reliability of LLMs in relevance assessment.
