Introduction
Relona has developed an intent-based search algorithm. By "intent-based" we mean that the algorithm delivers results that are usually better aligned with the user's search-intent. However, this algorithm does not try to analyze meaning. It uses purely statistical methods to achieve better relevance.
From a historical perspective, improvements in relevance are very difficult to achieve. Google's PageRank algorithm was a major milestone because it delivered a significant improvement in relevance through the use of link-analysis for ranking results. Though link-analysis is still useful, the web today is vastly different from the web in 1997-1998 when link-based-ranking was developed. We will explain in the next few sections why link-analysis alone is not sufficient for web-search. We will also discuss natural language querying and some reasons why we believe that it may not solve as many problems as its proponents hope.
The foundation of Relona's algorithm is a principle that is orthogonal to link-analysis. This algorithm can be used as an additional component that improves existing link-based search-engines without requiring a rewrite of their internal algorithms.
Query Space and Document Space
In 1997-1998, the web was smaller than it is today. If we think in terms of a document-space, then the population of the space was limited to a few hundred million documents. For the purpose of illustration, we will use a square box filled with tiny dots to represent the document space.
When the document space was relatively small, the size of the query-space that could be addressed using two or three keyword queries was of the same magnitude. So the coverage of the document-space by the query-space was dense. That is, almost any document could be retrieved by drafting an appropriate query.
As the web grew in size, the situation changed. Link-analysis could still retrieve the best pages on any topic, but the number of topics exploded. Queries of two or three keywords were no longer expressive enough to precisely target individual items in the document-space. In other words, the coverage of the document-space by query-space became sparse.
The next picture illustrates the sparse coverage of document-space (a box containing tiny dots) by query-space. Here query-space is represented as red blobs that cover parts of the document space.
As long as users enter short queries of two or three keywords, the query-space is not getting any bigger. But the document-space that these queries address continues to increase in size.
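The arithmetic behind this argument can be sketched roughly as follows. The vocabulary size of 1,000 effective search terms and the document counts are illustrative assumptions, not measurements; the only point is that the number of distinct short queries stays fixed while the number of documents grows.

```python
from math import comb

def query_space_size(vocabulary_size, max_keywords):
    """Count the distinct unordered queries of 1..max_keywords terms."""
    return sum(comb(vocabulary_size, k) for k in range(1, max_keywords + 1))

def coverage_ratio(vocabulary_size, max_keywords, document_count):
    """Distinct queries available per document in the document-space."""
    return query_space_size(vocabulary_size, max_keywords) / document_count

# Hypothetical numbers: 1,000 effective terms, queries of up to 3 keywords.
small_web = coverage_ratio(1000, 3, 200_000_000)     # a few hundred million documents
large_web = coverage_ratio(1000, 3, 20_000_000_000)  # an assumed much larger web

print(f"queries per document, small web: {small_web:.4f}")
print(f"queries per document, large web: {large_web:.4f}")
```

Under these assumptions the query-space is of the same magnitude as the small document-space, but it falls to under one query per hundred documents once the document count grows a hundredfold.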
Why is this a problem? When the query-space does not keep up with document-space, there is a vast amount of data that cannot be retrieved by any query. What the user really wants to retrieve might be a document that is between the red-blobs, but what he/she will end up getting is a document that is directly covered by a red-blob. In other words, queries can no longer completely express a user's search-intent.
To compensate, users have started entering longer queries. Longer queries increase coverage, but the precision delivered by link-analysis begins to drop off. Yahoo reported in a 2006 analyst call that the average number of keywords in each query had steadily increased from 1.2 in 1998 to 3.3 in 2006. While queries have become longer, precision has become worse. Try it for yourself: enter a long query (6 or more keywords) into Yahoo, then a short query on the same topic, and compare the two result sets. You will find that the shorter query almost always delivers better results than the longer query.
Relona's algorithm does not suffer from the problem of reduced relevance for longer queries. As queries get longer, the quality of its results does not drop as rapidly.
Shortcomings of Natural Language Queries
It has been proposed that natural language technologies could be used to solve the problem of inadequate query-document coverage. Unfortunately, natural language processing suffers from a number of serious problems.
The first problem is that of difficulty. A search-engine based on natural language will be expected to understand all of human knowledge (aka the document-space) as well as any query that a user might enter (aka the query space). AI researchers have tried for over 40 years to achieve natural language querying - without much success.
Even if the problem of natural language processing were to be solved, it may not be enough. When we try to precisely say something using natural language, our language becomes extremely complex. So complex that we have a term for it - "legalese". In normal life, we try to be very precise only when we draft legal documents. But the precise language used in legal documents is not easy for humans to read and understand. So even if natural language querying through advances in AI were possible, the end-result might not be a better search-engine!
Query Interpolation
In everyday language, when we wish to be very precise, we don't speak legalese. Instead, we express the same idea repeatedly, in different words, until it is conveyed in an unambiguous manner. Relona's technology follows the same approach. Instead of asking the user to enter a single perfectly precise query, we ask users to enter multiple queries which have the same meaning, but are expressed in different words.
The picture here shows how we can specify a previously unreachable part of document space by using two queries. The line shows the interpolation and the circle is the newly accessible part of the document-space.
The idea of entering multiple queries is not new to users. Research indicates that already about 20-30% of users rephrase their queries and try again when they are not satisfied with the first response they get from a search-engine. Relona's algorithm simply uses these queries more intelligently - we analyze the queries together to align the results better with the user's intent.
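One way to picture interpolation between two queries is sketched below. This is not Relona's algorithm, which is not described here; it is a minimal illustration that assumes a simple bag-of-words model, in which each query is turned into a term vector, the normalized vectors are added, and documents are ranked by cosine similarity to the combined vector. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def vectorize(text):
    """Bag-of-words term counts for a query or document."""
    return Counter(text.lower().split())

def norm(vec):
    return sqrt(sum(v * v for v in vec.values()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    denom = norm(a) * norm(b)
    return dot / denom if denom else 0.0

def interpolate(queries):
    """Add the length-normalized query vectors so each query counts equally."""
    combined = Counter()
    for query in queries:
        vec = vectorize(query)
        length = norm(vec)
        if length == 0:
            continue  # skip empty queries
        for term, count in vec.items():
            combined[term] += count / length
    return combined

def rank(queries, documents):
    """Order documents by similarity to the interpolated query."""
    target = interpolate(queries)
    return sorted(documents, key=lambda d: cosine(target, vectorize(d)), reverse=True)

documents = [
    "jaguar car top speed",
    "jaguar big cat running speed",
    "cat food",
]
for doc in rank(["jaguar speed", "big cat speed"], documents):
    print(doc)
```

In this toy example, neither query names the document about the animal precisely, but the two taken together rank it ahead of the document about the car.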
Query Extrapolation
Relona's algorithm is not only for situations where users enter multiple queries. When users enter a single query on any topic, it is possible to guess their intent in a manner that is usually accurate.
The document-space is not a uniform distribution of documents. If we place documents of similar concepts together, then we get clusters in the document space. Larger clusters occur for more popular topics. Sparse document densities occur for less popular topics.
The problem of guessing a user's intent can be expressed on the space-diagram as the problem of finding the closest cluster (bright spots in the picture here) to any query. The yellow arrow represents the move from the query to the closest cluster.
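The closest-cluster idea can be sketched on the two-dimensional space-diagram as follows. The coordinates, cluster names, and the move_toward helper are illustrative assumptions; a real document-space would have many more dimensions, and how Relona actually locates clusters is not specified here.

```python
from math import dist

def closest_cluster(query_point, clusters):
    """Return the name of the cluster whose centroid is nearest to the query."""
    return min(clusters, key=lambda name: dist(query_point, clusters[name]))

def move_toward(query_point, centroid, fraction):
    """Shift the query part of the way toward a centroid (the arrow in the picture)."""
    return tuple(q + fraction * (c - q) for q, c in zip(query_point, centroid))

# Hypothetical cluster centroids on the unit square.
clusters = {
    "travel": (0.2, 0.8),
    "cars": (0.9, 0.1),
}

query = (0.3, 0.6)
nearest = closest_cluster(query, clusters)
print(nearest, move_toward(query, clusters[nearest], 0.5))
```

The fraction parameter controls how far the guess is allowed to move away from what the user literally typed: 0 leaves the query unchanged and 1 replaces it with the cluster centroid.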
Demo Implementation
Relona's analysis of a search-query produces a map of the relative importance of the different topics in the query and how those topics relate to each other. Ideally, it should be possible to pass this map on to an existing search-engine to get better results.
Unfortunately, the APIs exposed by the major search-engines such as Yahoo and MSN-Live do not accept any inputs about the relative importance of topics and their interconnectedness. To illustrate the use of our intent-analysis on existing search-engines, we have to do it in another way.
The demo we have prepared works around the limitations of search-engine APIs as follows: We convert the topic-data that is produced by Relona's algorithm into a set of alternative queries. These alternative-queries are different ways of expressing the query that the user has entered. When this set of alternatives is considered together, it reflects the relative importance of the different topics in the query as measured by Relona's intent-analysis algorithm.
Running these alternative queries on different search-engines illustrates the benefits of our algorithm. For example, comparison studies indicate that Yahoo is usually less relevant than Google. However, when Yahoo is augmented with Relona's intent-analysis system, it becomes more accurate than Google.
Relona has conducted detailed studies using AOL's search-logs to determine the exact benefits that this technology brings to each of Yahoo, Ask.com and MSN-Live. View the presentations below to learn more:
Relona's benefits for Yahoo
Relona's benefits for Ask.com
Relona's benefits for MSN-Live
Relona's benefits for Google
Methodology of Comparison Studies - This presentation describes how the studies comparing the relevance of Relona with Yahoo, Ask, Microsoft, and Google were conducted. View this to gain a deeper understanding of the numbers discussed in the above presentations.
Our demo emphasizes topics by using the intitle: search operator and de-emphasizes topics by dropping them from the query entirely. There are better ways to emphasize and de-emphasize topics.
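A minimal sketch of how a topic map might be turned into alternative queries using intitle: and topic dropping is shown below. The shape of the topic map (a dictionary of single-word topics with weights between 0 and 1), the two thresholds, and the function name are assumptions made for illustration; the demo's actual rules are not described here.

```python
def alternative_queries(topic_weights, emphasize_above=0.5, drop_below=0.2):
    """Build one alternative query per important topic.

    Topics weighted below drop_below are dropped from every alternative.
    Each topic weighted at or above emphasize_above gets one alternative
    in which it is prefixed with intitle:.
    """
    kept = [t for t, w in topic_weights.items() if w >= drop_below]
    alternatives = []
    for topic, weight in topic_weights.items():
        if weight < emphasize_above:
            continue
        terms = [f"intitle:{t}" if t == topic else t for t in kept]
        alternatives.append(" ".join(terms))
    # With no topic important enough to emphasize, fall back to the kept terms.
    return alternatives or [" ".join(kept)]

# Hypothetical topic map for the query "best cheap paris hotels".
weights = {"paris": 0.7, "hotels": 0.6, "cheap": 0.3, "best": 0.1}
for query in alternative_queries(weights):
    print(query)
```

Each alternative can then be sent to a search-engine as an ordinary query, and the result sets considered together.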
One better way to emphasize topics is to look at the link-text. Link-text is the text on the links that point to any particular page. To emphasize a topic using link-text, we elevate the rank of pages whose incoming links contain that topic in the link-text. Another good way to emphasize a topic is to consider where that topic appears on a web page - whether it is in a headline, located near other keywords, or in large-fonts. Such sophisticated techniques for emphasizing and de-emphasizing topics can be used only after integrating the analysis-map within the internal data-structures of a search-engine.
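The link-text idea can be sketched roughly as follows, assuming each page has a base rank score and a list of the link-texts of its incoming links. The scoring formula and the boost parameter are hypothetical; they are meant only to show the kind of adjustment that becomes possible inside a search-engine.

```python
def boosted_score(base_score, link_texts, topic, boost=0.5):
    """Raise a page's score by the share of incoming links that mention the topic."""
    if not link_texts:
        return base_score
    topic = topic.lower()
    matches = sum(1 for text in link_texts if topic in text.lower().split())
    return base_score * (1 + boost * matches / len(link_texts))

# Hypothetical page with two incoming links, one of which mentions the topic.
print(boosted_score(1.0, ["paris hotels", "click here"], "hotels"))
```

De-emphasis could use the same formula with a negative boost, which avoids dropping the topic from the query altogether.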
Since even the simple mechanisms used in this demo produce very good results, there is reason to believe that direct integration of Relona's analysis within a search-engine will produce much better results.
View Demo
A public demo of Relona's technology is available.