Friday 17th May 2024
Ho Chi Minh, Vietnam

1. Introduction

Full-text search (FTS) is a technique for searching a collection of documents for relevant results based on the presence of certain words or phrases. FTS is a branch of information retrieval related to text mining and natural language processing.

FTS queries can search for multiple words at once and return ranked results that match those words, which makes them more powerful than a plain SQL query. The most popular full-text search engines are Apache Lucene, Apache Solr, Elasticsearch, and Sphinx.

How can we know whether our full-text search system returns accurate results? And what should we aim for when building one? Evaluating a full-text search system involves assessing its performance, relevance, and usability. Many metrics can be used, such as query speed, confusion-matrix-based measures, and query-specific evaluation, but the most important are precision and recall.

Precision and recall both take values in the range 0 to 1, and the goal of an FTS system is to get both as close to 1 as possible. Before digging into precision and recall, we need to understand the two types of relevance in FTS.

2. Two types of relevance

FTS is about returning relevant results to end users from a large collection of documents and ranking them. There are two types of relevance: user relevance and system relevance.

User relevance is determined by how well the search results align with the user’s information needs and expectations. For example, if a user is searching for information on “artificial intelligence”, user-relevant results would include documents that are truly about artificial intelligence from the user’s perspective.

System relevance refers to the relevance of search results as determined by the search algorithm or system itself. It is an objective measure of how well the system’s algorithms match documents to the user’s query. For example, in a full-text search system, if a user queries for “machine learning”, system-relevant results are the documents that the search algorithm determines to be most relevant based on its indexing and ranking mechanisms.

We can picture these two types of relevance as two overlapping sets of documents.

Let S = the set of documents returned by the system, and U = the set of documents the user needs. Their intersection, S ∩ U, contains the returned documents that are actually relevant.
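As a concrete illustration in Python (the document IDs here are invented for the example):

    # S: documents the system returned; U: documents the user needs
    S = {"d1", "d2", "d3", "d4", "d5"}
    U = {"d2", "d4", "d6"}
    overlap = S & U  # {"d2", "d4"}: returned documents that are truly relevant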

3. Precision

Precision measures the proportion of relevant documents retrieved among all the documents retrieved for a specific query. In simpler terms, precision indicates how many of the retrieved documents are actually relevant to the user’s query.

For instance, if a search engine retrieves 20 documents in response to a query and 12 of those documents are relevant to the user’s needs, the precision would be 12/20 = 0.6, or 60%.

A precision of 100% means every document returned by the system is among the documents the user needs.

Average Precision (AP) for a single query:

  • For each query, calculate the precision at each relevant document’s rank position.
  • Average these precision values over all the relevant documents for the query. The formula for AP is:

AP = (1 / R) × Σ P(k) × rel(k), summed over k = 1 … n

where R is the total number of relevant documents for the query, n is the number of retrieved documents, P(k) is the precision of the top k results, and rel(k) is 1 if the document at rank k is relevant and 0 otherwise.
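Below is a minimal Python sketch of this computation; the function name average_precision and its list/set inputs are illustrative choices, not the API of any particular library.

    def average_precision(ranked_ids, relevant_ids):
        # ranked_ids: document IDs in the order the system returned them
        # relevant_ids: the documents the user actually needs (the set U)
        relevant = set(relevant_ids)
        hits = 0            # relevant documents seen so far
        precision_sum = 0.0
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant:
                hits += 1
                precision_sum += hits / rank  # P(k) at this relevant rank
        return precision_sum / len(relevant) if relevant else 0.0

For example, average_precision(["d3", "d1", "d7"], {"d3", "d7"}) gives (1/1 + 2/3) / 2 ≈ 0.83.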

4. Recall

Recall measures the proportion of relevant documents retrieved out of all the documents that were actually relevant to the query. In simpler terms, recall tells us how many of the relevant documents the system was able to retrieve among all the relevant documents.

For instance, if there are 50 relevant documents for a specific query and the search system retrieves 30 of them, the recall would be 30/50 = 0.6, or 60%.

A recall of 100% means the system returns all the documents the user needs.
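Both metrics are straightforward to compute from the sets S (retrieved) and U (relevant) introduced in section 2. The sketch below is illustrative, not the API of any particular search library:

    def precision(retrieved, relevant):
        # |S ∩ U| / |S|: fraction of returned documents that are relevant
        retrieved, relevant = set(retrieved), set(relevant)
        return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

    def recall(retrieved, relevant):
        # |S ∩ U| / |U|: fraction of relevant documents that were returned
        retrieved, relevant = set(retrieved), set(relevant)
        return len(retrieved & relevant) / len(relevant) if relevant else 0.0

With 20 retrieved documents of which 12 are relevant, precision is 0.6; with 30 of 50 relevant documents retrieved, recall is 0.6, matching the worked examples above.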

5. Mean average precision

Precision and recall evaluate only a single query; to evaluate an FTS system as a whole, we have to look at multiple queries. MAP calculates the average precision for each query and then computes the mean of these average precision scores.

Once the Average Precision (AP) is calculated for each query, take the mean of these AP values across all queries. If there are Q queries, then MAP is calculated as:

MAP = (1 / Q) × Σ AP(q), summed over q = 1 … Q
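Reusing the illustrative average_precision function from section 3, MAP is simply the mean of the per-query AP values:

    def mean_average_precision(runs):
        # runs: one (ranked_ids, relevant_ids) pair per query
        if not runs:
            return 0.0
        return sum(average_precision(r, rel) for r, rel in runs) / len(runs)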

MAP is particularly useful in scenarios where ranking of search results matters. It accounts for both precision (how many retrieved documents are relevant) and the ranking of these relevant documents. A higher MAP indicates a better-performing system in providing relevant and well-ranked results across multiple queries.

It’s a widely used metric in evaluating the effectiveness of search engines and information retrieval systems, especially when dealing with datasets where ranking plays a critical role in user satisfaction.

6. TREC dataset

Creating a dataset to evaluate an FTS system takes great effort, because each query must be paired with its set of user-relevant documents. Instead, we can use one of the several public datasets that are commonly used to evaluate FTS systems. The choice of dataset depends on the specific goals of the evaluation, the type of search system being tested, and the desired characteristics of the dataset.

One of the most popular sources of evaluation data is TREC (the Text REtrieval Conference). TREC provides various collections of documents and queries specifically designed for information-retrieval evaluation. Datasets like TREC Robust, TREC-9 and TREC-10, and TREC GOV2 contain large collections of documents with associated relevance judgments.
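TREC relevance judgments come as plain-text “qrels” files with one judgment per line: a query ID, an iteration field (usually 0), a document ID, and a relevance grade. A minimal loader might look like the sketch below; the file name qrels.txt is a hypothetical placeholder.

    from collections import defaultdict

    def load_qrels(path):
        # Map each query ID to the set of document IDs judged relevant (grade > 0)
        relevant = defaultdict(set)
        with open(path) as f:
            for line in f:
                query_id, _iteration, doc_id, grade = line.split()
                if int(grade) > 0:
                    relevant[query_id].add(doc_id)
        return relevant

    qrels = load_qrels("qrels.txt")  # hypothetical file name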

7. Conclusion

In conclusion, a successful full-text search system should strike a balance between recall and precision, ensuring a comprehensive retrieval of relevant information while minimizing irrelevant results. Furthermore, a higher MAP underscores the system’s proficiency in consistently delivering accurately ranked and relevant documents across various queries, ultimately enhancing user experience and system reliability. Evaluating these metrics collectively provides a robust framework for assessing and optimizing full-text search systems to meet user needs effectively.
