2016-10-16T18:23:47

Relevant Search: With applications for Solr and Elasticsearch

Here are my notes from the book Relevant Search: With applications for Solr and Elasticsearch by Doug Turnbull and John Berryman.

These search-time ranking factors that measure what users care about are called signals.
These annotated lists of search results that are relevant with respect to a set of queries are known as judgment lists.
Relevance is the practice of improving search results for users by satisfying their information needs in the context of a particular user experience, while balancing how ranking impacts our business’s needs.
Doc frequency
Term frequency
Term positions
Term offsets
Payloads
These steps are extraction, enrichment, analysis, and indexing.
Analysis is composed of three steps: character filtering, tokenization, and token filtering.
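To make the three analysis steps concrete, here is a minimal Python sketch of such a pipeline. It is not how Solr or Elasticsearch implement analysis; the function names, the HTML-stripping character filter, and the tiny stopword list are all assumptions chosen for illustration.

```python
import re

# Hypothetical stopword list, only for this illustration.
STOPWORDS = {"the", "a", "of"}

def char_filter(text):
    # Character filtering: replace HTML-like tags with a space before tokenizing.
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    # Tokenization: split the character stream into word tokens.
    return re.findall(r"\w+", text)

def token_filter(tokens):
    # Token filtering: lowercase every token, then drop stopwords.
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOPWORDS]

def analyze(text):
    # Run the three steps in order.
    return token_filter(tokenize(char_filter(text)))

print(analyze("<b>The</b> Return of the Jedi"))
```

The tokens that come out of `analyze` are what would end up in the index, so every choice made in these three steps decides what a query can match.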
map & sort & map -> transformation
Two primary internal layers are key to relevance: matching and ranking.
In many ways, aside from foundational concepts, relevance scoring is just as much an art as a science.
"analyzer": { "path_hierarchy": { "tokenizer": "path_hierarchy"}}}},
You can tokenize locations, melodies, and many other kinds of data, turning the search engine into a general-purpose similarity system across many kinds of data.
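As one example of tokenizing non-prose data, the path_hierarchy tokenizer mentioned above emits one token per level of a path. The sketch below imitates that behavior in plain Python; the function name is hypothetical and it assumes paths that start with the delimiter.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    # Split into non-empty segments, then emit every prefix of the path.
    parts = [p for p in path.split(delimiter) if p]
    tokens = []
    for i in range(1, len(parts) + 1):
        tokens.append(delimiter + delimiter.join(parts[:i]))
    return tokens

print(path_hierarchy_tokens("/fruit/apples/fuji"))
```

Because every ancestor path becomes its own token, a search for `/fruit/apples` matches any document filed beneath it.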
We’re easily tricked into seeing search as a single problem. In reality, search applications differ greatly from one another. It’s true that a typical search application lets the user enter text, filter through documents, and interact with a list of ranked results. But don’t be fooled by superficial appearances. Each application has dramatically different relevance expectations.
If your boss is breathing down your neck with a tough relevance problem, you know there’s far more to tuning search than optimizing analysis!
Although this is an OK place to start, the source data model isn’t optimized for search.
If you can get past the idea that fields exist simply to store properties of data, and embrace the idea that you can manipulate data so it can be found as users expect it, then you can begin to effectively program relevance rules into the search engine.
You’re likely to reach the heat death of the universe before achieving a perfect search solution in every direction.
most_fields: Treat each match score as a clause in a Boolean query. Recall that a Boolean query is a summation, with a coordinating factor, or coord. Coord is the number of matching clauses / number of total clauses. Thus coord rewards Boolean queries with more matches:
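The arithmetic described here can be sketched in a few lines of Python. This is a simplified model of the summation-with-coord idea, not Lucene's scoring code; the function name and the convention of using `None` for a clause that did not match are assumptions.

```python
def bool_score(clause_scores):
    # clause_scores holds one entry per clause; None means the clause did not match.
    if not clause_scores:
        return 0.0
    matched = [s for s in clause_scores if s is not None]
    # coord = number of matching clauses / number of total clauses
    coord = len(matched) / len(clause_scores)
    return sum(matched) * coord

# Two weak matches versus one match of the same total strength.
print(bool_score([1.0, 1.0, None]))
print(bool_score([2.0, None, None]))
```

Both calls sum to 2.0 before coord is applied, but the document matching two of three clauses ends up with twice the score of the one matching a single clause.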
The shingle token filter can generate tokens from two-word subphrases. This can help you build a field to match two-word names.
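A rough Python equivalent of generating two-word shingles looks like this. It is only a sketch: the function name is made up, and unlike the real filter, which can also emit the single-word tokens, it returns only the shingles.

```python
def shingles(tokens, size=2):
    # Slide a window of `size` tokens across the stream and join each window.
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

print(shingles(["sir", "patrick", "stewart"]))
```

Indexing these shingles into their own field means a query for a two-word name matches as a single term instead of two independent ones.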
Thinking of most_fields as a set of Boolean SHOULD clauses helps you see how you ought to use it; these SHOULD clauses list all the criteria of the most relevant doc in terms of the signals that correspond to each field:
The takeaway is that you need to carefully tune boosts to make most_fields live up to its promise. Otherwise, with an arbitrarily strong field score, you’ll end up with unexpectedly lopsided results. Whereas boosting in best_fields declares priority on which field matches should come first in expected lopsidedness, boosting in most_fields brings balance to the summed terms to restore a more blended score of weighted fields.
Given your ignorance of field synchronicity, your earlier work was a bit of a false start.
Continuing the mapping, you copy the cast.name field over to the people.name field. You do the same for directors. The following listing adds copy_to to the cast mapping.
In contrast, a cross_fields search is dynamic, addressing signal discordance at query time. It does this by becoming a dismax-style query parser on steroids. The ranking function of cross_fields remains identical to the query parser approach, with one important modification: the cross_fields query temporarily modifies the search term’s document frequency, field by field, before searching.
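The effect of adjusting document frequency field by field can be illustrated with a toy calculation. The sketch below assumes the blend uses the highest document frequency found in any of the searched fields, and it uses the classic Lucene-style IDF formula; both are assumptions for illustration, as are the counts and the helper names.

```python
import math

def classic_idf(doc_freq, num_docs):
    # Classic TF-IDF style inverse document frequency.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def blended_doc_freq(term, doc_freq_by_field):
    # Assumption: blend by taking the highest doc frequency across fields.
    return max(freqs.get(term, 0) for freqs in doc_freq_by_field.values())

# Made-up counts: the name is common among cast, rare among directors.
doc_freq_by_field = {
    "cast.name": {"shatner": 40},
    "directors.name": {"shatner": 2},
}
num_docs = 1000

per_field_idf = {
    field: classic_idf(freqs["shatner"], num_docs)
    for field, freqs in doc_freq_by_field.items()
}
blended_idf = classic_idf(blended_doc_freq("shatner", doc_freq_by_field), num_docs)

print(per_field_idf)
print(blended_idf)
```

Without blending, the rare directors.name match gets a much higher IDF and dominates the ranking. With a shared document frequency, both fields treat the term as equally rare.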
You can modify the strategy to account for reality.
At the heart of the function_score query is the base query, given in the query parameter: another full-fledged Query DSL query. Mathematical functions are then combined with that query's relevance score.
Instead of disabling tokenization, you’ll use a technique referred to as sentinel tokens. Sentinel tokens represent important boundary features in the text (perhaps a sentence, paragraph, or begin/end point).
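Here is a small sketch of how sentinel tokens turn an ordinary phrase match into an exact match. The sentinel strings and helper names are hypothetical, and whitespace splitting stands in for real analysis.

```python
# Hypothetical sentinel strings marking the begin and end of a field value.
START, END = "SENTINEL_BEGIN", "SENTINEL_END"

def with_sentinels(tokens):
    # Wrap the token stream in the boundary markers.
    return [START] + tokens + [END]

def contains_phrase(doc_tokens, phrase_tokens):
    # True when phrase_tokens appears as a contiguous run in doc_tokens.
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))

def exact_match(doc_text, query_text):
    # A phrase match that includes both sentinels must cover the whole field.
    doc = with_sentinels(doc_text.lower().split())
    query = with_sentinels(query_text.lower().split())
    return contains_phrase(doc, query)

print(exact_match("Star Trek", "star trek"))
print(exact_match("Star Trek Generations", "star trek"))
```

The field is still tokenized, so the usual analysis applies, yet the second title no longer matches because the end sentinel does not follow "trek".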
This section introduces one of our favorite shaping techniques: building scoring tiers. It’s often useful to express ranking in terms of tiers based on your confidence in the information provided in the boosting signals.
In some cases, you need a way to ignore the TF × IDF scoring. To do this, you can wrap your query in a constant_score query.
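As a sketch, a constant_score request body could be built like this in Python. The `title` field and the search term are placeholders, and the exact options accepted depend on the Elasticsearch version, so treat the shape as an assumption to check against your cluster.

```python
import json

# Every document matching the inner filter receives the same score (the boost),
# regardless of term frequency or how rare the term is.
constant_score_query = {
    "query": {
        "constant_score": {
            "filter": {"term": {"title": "trek"}},  # hypothetical field and term
            "boost": 1.0,
        }
    }
}

print(json.dumps(constant_score_query, indent=2))
```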
The equation for this Gaussian decay is complex, but to give you an idea of how it behaves, we’ve included the graph in figure 7.11. This graph demonstrates the user’s perceived value for an actor’s or director’s film as a function of days into the past, showing that films 900 days into the past are half as valuable. As movies move five to six years into the past, the influence of this function begins to approach 0—near worthless!
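For readers who want to see the curve without the figure, the sketch below follows the Gaussian decay formula as documented for Elasticsearch's function_score decay functions. The scale of 900 days and decay of 0.5 are read off the passage ("films 900 days into the past are half as valuable"); the function name and the default offset of 0 are assumptions.

```python
import math

def gauss_decay(distance, scale, decay=0.5, offset=0.0):
    # sigma^2 is chosen so that the score equals `decay` at distance == scale.
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    d = max(0.0, abs(distance) - offset)
    return math.exp(-d ** 2 / (2.0 * sigma_sq))

# Perceived value of a film by age in days, with scale=900 and decay=0.5.
for days in (0, 900, 1825, 2190):
    print(days, round(gauss_decay(days, scale=900), 3))
```

The value is 1.0 for a brand-new film, 0.5 at 900 days, and only a few percent by five to six years, which matches the shape the passage describes.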
Yet we’ll spare you from seeing the full thing (you can view the examples on GitHub to see the full query in all its glory).
Elasticsearch provides a match_phrase_prefix query that implements this approach.
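A request body for this query might look like the following sketch. The `title` field and the partial input are placeholders, not taken from the book's listing.

```python
import json

# The last term ("tr") is treated as a prefix, while the earlier terms
# must match as a phrase in order.
completion_query = {
    "query": {
        "match_phrase_prefix": {
            "title": "star tr"  # hypothetical field and partial user input
        }
    }
}

print(json.dumps(completion_query, indent=2))
```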
As an additional step, to support phrase suggestions, you use a shingle filter to generate two-word phrases.
Another unfortunate consequence of using a finite state transducer to back completion is that the data structure is immutable. If a document is deleted from the index, the completions for that document will still exist. Currently, the only remedy for this situation is a full-index optimization, which effectively rebuilds the completion index from scratch.
For each term in genres.name’s global term dictionary, the aggregation returns the number of documents with that term.
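What the aggregation computes can be mimicked with a counter. This is a plain-Python sketch of the counting, not the Elasticsearch implementation; the sample documents and the function name are invented.

```python
from collections import Counter

def terms_aggregation(docs, field):
    # Count, for each term, how many documents contain it at least once.
    counts = Counter()
    for doc in docs:
        for term in set(doc.get(field, [])):
            counts[term] += 1
    return counts

# Invented sample documents.
movies = [
    {"title": "A", "genres.name": ["drama", "comedy"]},
    {"title": "B", "genres.name": ["drama"]},
    {"title": "C", "genres.name": ["action", "drama"]},
]

print(terms_aggregation(movies, "genres.name").most_common())
```

These per-term document counts are what a faceted navigation sidebar typically displays next to each genre.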
But to soften the best_fields behavior a bit, you’ll also introduce a tie_breaker with a value of 0.3.
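The way tie_breaker softens winner-takes-all scoring can be sketched as follows. This models the dismax combination (best score plus a fraction of the other scores); the function name and the sample field scores are made up.

```python
def dis_max_score(field_scores, tie_breaker=0.0):
    # Take the best field score, then add a fraction of every other field score.
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)

scores = [2.0, 1.0, 0.5]  # hypothetical per-field scores for one document

print(dis_max_score(scores))                   # pure best_fields: only the winner counts
print(dis_max_score(scores, tie_breaker=0.3))  # other fields contribute 30% of their score
print(dis_max_score(scores, tie_breaker=1.0))  # plain sum of all fields
```

A tie_breaker of 0 is pure best_fields and a value of 1 is a straight sum, so 0.3 keeps the best field dominant while letting the other matches break ties.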
For the sake of simplicity, you encode the scoring logic in a script score to give promoted restaurants the highest boost, followed by restaurants that have available discounts, followed by the nonpaying but highly engaged restaurants.
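The tiering logic described here could look roughly like the function below. It is written in Python rather than as an Elasticsearch script, and the field names (`promoted`, `has_discount`, `engaged`) and boost values are hypothetical; only the ordering of the tiers comes from the passage.

```python
def business_boost(restaurant):
    # Tiers in priority order: promoted, then discounted, then highly engaged.
    if restaurant.get("promoted"):
        return 3.0
    if restaurant.get("has_discount"):
        return 2.0
    if restaurant.get("engaged"):
        return 1.5
    return 1.0

# Invented restaurants, each with the same base relevance score.
restaurants = [
    {"name": "Plain Diner", "base_score": 1.0},
    {"name": "Engaged Eatery", "base_score": 1.0, "engaged": True},
    {"name": "Promo Palace", "base_score": 1.0, "promoted": True, "has_discount": True},
    {"name": "Discount Deli", "base_score": 1.0, "has_discount": True},
]

ranked = sorted(restaurants,
                key=lambda r: r["base_score"] * business_boost(r),
                reverse=True)
print([r["name"] for r in ranked])
```

Checking the tiers from the top down means a restaurant that qualifies for several tiers receives only the highest one.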
Listing 9.6. Composite query that incorporates all of the major signals
With a well-structured query in hand, you have one final task: tuning the weights so that content, user, and business signals are properly balanced. But as simple as this task may sound, it’s often the most frustrating and time-consuming job in all of relevance work.
The important part is to have a small set of sample requests that exercise every typical use case.
There are many potential sources of behavioral information, but here we list just a few to get you thinking about the possibilities:
We refer to this style of work as iterative and fail fast.
A user-focused culture recognizes several sources of feedback. First, it recognizes that domain experts in your organization can help you correct the direction of relevance work. Yet corrective feedback goes beyond these experts and becomes a full-time job for relevance engineers. You’ll see that one role, which we call the content curator, becomes the high priest of search feedback. The content curator takes command, examining user behavioral data and the broader business, to understand search correctness. Ultimately, you’ll see how you can speed up feedback in the form of test-driven relevancy. As discussed in the previous chapter, this form of feedback continuously evaluates search changes, highlighting where search is taking a slide for the worse.
Looking at the big picture, you’ve gone from seeking feedback by pestering others for help to now advocating for not only a full-time role, but nearly full-time pairing. Our feedback loops are telescoping down to increasingly tighter and more immediate forms of feedback. This feedback reflects the iterative nature of relevance work.
Use wishlists for personalized search.
You can incentivize profile building with functionality; for example, by letting users bookmark items that they like, or share items with friends.
The following algorithm is a bit naïve; we intend it to be introductory and don’t recommend that you implement it in a production system.
As alluded to earlier, finding affinities in this way is a fairly naïve approach. For example, you haven’t normalized products that are extremely popular.
Notice that this filter doesn’t filter out any documents from the result set. Instead, documents matching this filter are given a multiplicative boost of 1.1, as indicated by the weight parameter.
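A function_score request with this kind of filtered weight might be structured as in the sketch below. The base query, the `title` and `genres.name` values, and the explicit `boost_mode` are assumptions for illustration, not the book's listing.

```python
import json

boosted_query = {
    "query": {
        "function_score": {
            # Base query: decides which documents match at all.
            "query": {"match": {"title": "space adventure"}},  # hypothetical
            "functions": [
                {
                    # This filter removes nothing; it only selects which
                    # documents receive the weight.
                    "filter": {"term": {"genres.name": "science fiction"}},  # hypothetical
                    "weight": 1.1,
                }
            ],
            # Multiply the base score by the function value.
            "boost_mode": "multiply",
        }
    }
}

print(json.dumps(boosted_query, indent=2))
```

Documents outside the filter keep their base score unchanged, while those inside it have their score multiplied by 1.1.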
Under the hood, geo search is implemented by ORing together many, possibly hundreds, of terms that represent a geographic area.
Collocation extraction is a common technique for identifying statistically significant phrases. The text of the documents is split into n-grams (commonly bigrams). Then statistical analysis is used to determine which n-grams occur frequently enough to be considered statistically significant. As is often the case, this analysis is a glorified counting algorithm. For instance, if your document set contains 1,000 occurrences of the term developer, and if 25% of the time the term developer is preceded by the term software, the bigram software developer should probably be marked as a significant phrase.
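The counting described above can be sketched directly. This is the "glorified counting" version only: the 25% threshold comes from the passage, while the minimum count, the function name, and the sample text are assumptions. Real collocation extraction usually relies on proper statistical tests.

```python
from collections import Counter

def significant_bigrams(tokens, min_ratio=0.25, min_count=2):
    # Count single terms and adjacent pairs.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    result = []
    for (first, second), n in bigrams.items():
        # Fraction of the second term's occurrences that are preceded by the first.
        if n >= min_count and n / unigrams[second] >= min_ratio:
            result.append((first, second))
    return result

text = ("software developer meets software developer "
        "then web developer joins game developer")
print(significant_bigrams(text.split()))
```

In this toy text, "developer" occurs four times and is preceded by "software" in half of them, so the pair clears the threshold. The other pairs occur only once and are dropped.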
Maybe search is not the application you should be building. Maybe you should be building recommendations.