TODO PPjoin, PPjoin+, MPjoin
general
- worry about search index sizes; they get very big very easily
IR models
- vector space model (VSM): cosine similarity
- documents are Euclidian-normalized vectors of TF(-IDF)
- idf most commonly uses idf_t = log(N / df_t)
- variants
- tf
- sublinear (log) scaling
- max tf normalization: normalize tf of all doc terms by doc's max tf
- ntf(t,d) = a + (1 - a) * tf(t,d) / tf_max(d), where a is usually 0.4
- the smoothing term a avoids large swings in ntf from modest changes in tf(t,d) (eg 1 to 2)
- unstable: a change in the stop word list dramatically alters weightings
- sensitive to outlying # occurrences
- doesn't distinguish whether ntf=1 means high skew or ~same as other terms
- SMART notation identifies doc/query weighting/normalization schemes
- pivoted normalized doc len: replace denominator |V(d)| with a * |V(d)| + (1 - a) * piv, or with a * u_d + (1 - a) * piv, where u_d is the # of unique terms
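- a minimal Python sketch of the cosine/VSM scoring above (sublinear tf, idf = log(N/df_t), Euclidean normalization); the toy corpus and function names are illustrative, not any particular library's API:

    import math
    from collections import Counter

    def tfidf_vector(tokens, df, n_docs):
        # sublinear tf (1 + log tf) times idf = log(N / df_t), then Euclidean-normalize
        tf = Counter(tokens)
        vec = {t: (1 + math.log(c)) * math.log(n_docs / df[t]) for t, c in tf.items() if t in df}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {t: w / norm for t, w in vec.items()}

    def cosine(u, v):
        # both vectors are unit length, so cosine similarity is just the dot product
        return sum(w * v.get(t, 0.0) for t, w in u.items())

    docs = [["cheap", "flights", "to", "rome"], ["cheap", "hotels", "in", "rome"], ["learn", "python"]]
    df = Counter(t for d in docs for t in set(d))            # document frequencies
    vecs = [tfidf_vector(d, df, len(docs)) for d in docs]
    query = tfidf_vector(["cheap", "rome"], df, len(docs))
    print(sorted(((cosine(query, v), i) for i, v in enumerate(vecs)), reverse=True))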
- probabilistic: order by prob of relevance
- loss functions: 0/1 or with costs; rank by expected loss
- binary independence model: assume boolean term vectors
- many simplifying assumptions; same as multivariate naive bayes
- query time boils down to computing retrieval status value (RSV)
- simple MLE like NB; optional backoff (MAP uniform prior)
- approximation yields the log IDF measure
- okapi BM25
- slightly more complex, but not that many params
- incorporates tf
- first prob model that performed very well
- http://nlp.stanford.edu/IR-book/pdf/11prob.pdf
- only impl diff among all these is scoring fn
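- a minimal Python sketch of Okapi BM25 to make the "scoring fn" point concrete; k1=1.2 and b=0.75 are conventional defaults, and the helper names are mine:

    import math
    from collections import Counter

    def bm25(query_terms, doc_tokens, df, n_docs, avg_dl, k1=1.2, b=0.75):
        # idf-weighted sum over query terms, with tf saturation and doc-length normalization
        tf = Counter(doc_tokens)
        dl = len(doc_tokens)
        score = 0.0
        for t in query_terms:
            if t not in tf or t not in df:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avg_dl))
        return score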
google mustang
- multi-stage ranking: first filter on each child node, send to root, rank
- a few hundred results from each child
query understanding at bing
- http://www.searchenginecaffe.com/2010/07/sigir-2010-industry-day-query.html
- two modes
- Best effort retrieval: Find the most relevant results for the user query
- Query segmentation
- Stemming and synonym expansion
- Term deletion
- Automated Query Reformulation: Modify the user query to produce more relevant
results for the inferred intent
- spell correction
- term deletion
- This takes more liberty with the user's intent
- Best effort retrieval: Find the most relevant results for the user query
- stemming: restaurant -> resturant, sf -> san francisco
- abbrevs: un -> united nations
- utilize co-click patterns to find un/united nations
- term relaxation
- [what is a diploid chromosome] -> "what is a" is not important from the SE matching, it introduces noise
- [where can I get an iPhone 4] -> where is an important part of the query. Removing "where" misses the whole point of the query
- [bowker's test of change tutorial] -> test of symmetry is the correct terminology. How do you know that it is the incorrect terminology? If you relax it to "bowker's test" you get better results
- key concepts
- win/loss ratio
- wins are queries whose results improve
- losses are queries whose results degrade
- Related to precision, but not all valid reformulations change results
- Pre vs. Post result analysis
- Query alternatives generated pre-results
- Blending decisions are post results
- query evaluation
- "Matching" level 0/l1/l2 -> inverted index, matching and ranking. Reduce billions to hundreds of thousands of pages. Much of the loss can occur here because it never made it into the candidate set. Assume that the other layers that use ML, etc... will bubble the correct results to the top.
- "Merge" layer L3 -> the blending with multiple queries will be brought together
- Federation layer L4 -> top layer coordinating activity
- An important component is the query planner in L3 that performs annotation and rewriting.
- matching & ranking
- L0/L1 - 10^10 docs. L0 - boolean set operations, L1 - IR score (a linear function over simple features like BM25; simple and fast, but not very accurate)
- L2 reranking - 10^5 docs - ML heavy lifting: 1500 features, proximity
- L3 reranking - 10^3 - federation and blending
- L4 -> 10^1
- Learning to rank Using Gradient Descent (for L2 layer)
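- a rough Python sketch of the L0-L3 funnel shape above (boolean match -> cheap linear score -> ML rerank -> blend); cutoffs and scoring callables are placeholders, not Bing's actual components:

    def rank_pipeline(query, corpus, boolean_match, cheap_score, ml_score, blend):
        # L0: boolean set operations over the inverted index (start from ~10^10 docs)
        candidates = [d for d in corpus if boolean_match(query, d)]
        # L1: fast linear IR score (BM25-like), keep ~10^5 candidates
        candidates = sorted(candidates, key=lambda d: cheap_score(query, d), reverse=True)[:100000]
        # L2: expensive ML reranker over rich features (proximity, etc.), keep ~10^3
        candidates = sorted(candidates, key=lambda d: ml_score(query, d), reverse=True)[:1000]
        # L3/L4: federation and blending down to the final ~10 results
        return blend(query, candidates)[:10]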
- query annotation
- NLP Query annotation: offline analysis; similar to a 'parse tree'
- Ambiguity preserving: multiple interpretations
- Backend independent: shared
- Structure & Attributes: Syntax & semantics (how to handle tree leaf nodes)
- query planning
- [{un | "united nations"} jobs] -> l3-merge(l2-rank([un jobs]), l2-rank(["united nations" jobs])
- OR
- [{un| "united nations"} jobs] -> l3-cascade(threshold, l2-rank([un jobs]), l2-rank(["united nations" jobs])
- the cascade plan is for when the rewrite is less certain: the rewritten query is only run conditionally, when the original query's results fall below the threshold
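- a rough Python sketch of the two plan shapes (merge everything vs. threshold cascade); l2_rank, l3_merge, and the threshold value are stand-ins, not Bing internals:

    def l3_merge(result_lists):
        # blend results from all rewrites; here a naive max-score union
        merged = {}
        for results in result_lists:
            for doc, score in results:
                merged[doc] = max(score, merged.get(doc, float("-inf")))
        return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

    def l3_cascade(threshold, primary_results, fallback_fn):
        # only pay for the rewritten query if the original query's top result looks weak
        if primary_results and primary_results[0][1] >= threshold:
            return primary_results
        return l3_merge([primary_results, fallback_fn()])

    # usage, assuming l2_rank(query) -> [(doc_id, score), ...] sorted by score:
    # l3_merge([l2_rank('un jobs'), l2_rank('"united nations" jobs')])
    # l3_cascade(0.8, l2_rank('un jobs'), lambda: l2_rank('"united nations" jobs'))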
- design considerations
- Efficiency
- one user query may generate multiple backend queries that are merged in L3
- Some queries are cheaper than others
- query reduction can improve performance
- Relevance: L3 merging has maximal information, but is costly
- Multiple query plan strategies: Depending on query analysis confidence
- query analysis models
- Noisy Channel Model:
argmax_rewrite P(rewrite | query) = argmax_rewrite P(rewrite) * P(query | rewrite)
- bayes inversion
- example: spelling
- language model: likelihood of the correction
- translation model: likelihood of the error occurring
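- a toy Python sketch of the noisy-channel decision rule (rank rewrites by P(rewrite) * P(query | rewrite)); the candidate list and probability tables are made-up stand-ins for real language/translation models:

    def noisy_channel_correct(query, candidates, lm_prob, channel_prob):
        # argmax over rewrites of P(rewrite) * P(query | rewrite)
        return max(candidates, key=lambda rewrite: lm_prob(rewrite) * channel_prob(query, rewrite))

    lm = {"britney spears": 0.9, "brittany spears": 0.1}                # language model
    err = {("brittny spears", "britney spears"): 0.05,
           ("brittny spears", "brittany spears"): 0.02}                 # translation model
    best = noisy_channel_correct(
        "brittny spears",
        ["britney spears", "brittany spears"],
        lm_prob=lambda r: lm.get(r, 1e-9),
        channel_prob=lambda q, r: err.get((q, r), 1e-9),
    )
    print(best)  # "britney spears": the language model term dominates here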
- language models
- Based on Large-scale text mining
- unigrams and N-grams (favor common, previously seen sequences since they tend to make sense)
- Probability of query term sequence
- favor queries seen before
- avoid nonsensical combinations
- 1T n-gram resource: see MS ngram work here in SIGIR
- Translation Models: training sets of aligned pairs (misspelling/correction; surface/stemmed)
- Query log analysis
- session reformulation
- coclick-> associated queries
- manual annotation
- (missed the references, but see: Wei et al, Pang et al., Craswell)
- summary
- 60-70% of queries are reformulated
- Can radically improve results
- Trade-off between relevance and efficiency
- rewrites can be costly
- win/loss ratio is the key
- Especially important for tail queries
- no metadata to guide matching and ranking
implementations
- sphinx: C++; mysql integration; BM25F; vik says it's nice and "tight"
- lucene: the most widely used index; java
- solr: search service built atop lucene
- solrcloud: distributed solr
- elasticsearch: solrcloud alternative (dunno if it changes lucene)
- indextank: reuses parts of lucene 3; customizable scoring; pretty simple; not sure what ranking it uses
- vespa: yahoo's kick-ass index
- terrier: platform from U of glasgow; engineered to TREC
- lemur: platform from CMU; well-regarded; also engineered to TREC but then made to do a bunch of other things
lucene
- java library
- solr: server around lucene
- incremental updates only; removed batch updates early on
- indexing speed (from homepage): ~20MB/min with a 1MB heap; incremental indexing as fast as batch; index size ~20-30% of the indexed text
- uses VSM model for scoring/ranking; not "academically approved"
- filters w standard boolean model then ranks all matches
- IndexReaders and IndexWriters
- exposes delete(id, document) and index(id, document)
- separate commit() flushes to disk
- instantiating IndexReader gets a snapshot of what's on disk
- scoring TODO
- GSOC11 changes lucene to accommodate modern ranking like BM25 and DFR, and to separate matching from ranking
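- not Lucene's actual API, just a toy Python sketch of the writer/commit vs. reader-snapshot split described above (pending changes are invisible until commit(); a reader sees a point-in-time snapshot):

    import copy

    class ToyIndexWriter:
        def __init__(self, store):
            self.store = store      # dict standing in for the on-disk index
            self.pending = {}       # doc_id -> document, or None for a delete

        def index(self, doc_id, document):
            self.pending[doc_id] = document

        def delete(self, doc_id):
            self.pending[doc_id] = None

        def commit(self):
            # flush buffered adds/deletes to "disk"
            for doc_id, doc in self.pending.items():
                if doc is None:
                    self.store.pop(doc_id, None)
                else:
                    self.store[doc_id] = doc
            self.pending.clear()

    class ToyIndexReader:
        def __init__(self, store):
            self.snapshot = copy.deepcopy(store)   # later commits are not visible

        def get(self, doc_id):
            return self.snapshot.get(doc_id)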
sphinx
- c++ library and server
- supports distributed index
- uses BM25 (not BM25F?)
- in-memory attribute map
- supports live updates
- on-disk index and document store
- main selling point: mysql integration
vector space model
- represent docs as vectors: a dim is non-0 if corresponding term present (eg tf-idf weight)
- similarity is angle btwn vectors; cosine is sufficient and easier to calc
- rarity: log(1 / product of all letter probs)
- get letter probs from http://www.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html
- useful when don't have corpus for IDF
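- a small Python sketch of the rarity measure; the letter frequencies are rough approximations of the values on that page:

    import math

    LETTER_FREQ = {  # approximate English letter probabilities
        "e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "i": 0.070, "n": 0.067,
        "s": 0.063, "h": 0.061, "r": 0.060, "d": 0.043, "l": 0.040, "c": 0.028,
        "u": 0.028, "m": 0.024, "w": 0.024, "f": 0.022, "g": 0.020, "y": 0.020,
        "p": 0.019, "b": 0.015, "v": 0.010, "k": 0.008, "j": 0.002, "x": 0.002,
        "q": 0.001, "z": 0.001,
    }

    def rarity(term, default=0.001):
        # log(1 / product of letter probs) == -sum of log letter probs
        return -sum(math.log(LETTER_FREQ.get(ch, default)) for ch in term.lower())

    print(rarity("xylophone") > rarity("house"))  # rarer letters (and longer terms) score higher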
standard boolean model
- simplest, exact-match
evaluation measures
- set-based measures: don't consider rankings
- precision, recall, F-measure are single-value metrics of entire result set
- precision: # relevant and retrieved / # retrieved
- recall: # relevant and retrieved / # relevant
- f-measure aka F1 score aka F-score: F1 = 2 * P * R / (P + R), where P is precision and R is recall
- precision at some cutoff (precision@k aka PC): precision in the top k results
- R-precision: precision@R where R is # known relevant docs; addresses instability of precision@k
- average precision (AP): considers ranking
- mean precision over the precision-recall curve (recall is the variable)
- AP = sum over ranks r of P(r) * rel(r), divided by the # of relevant docs, where
- P(r) is the precision at cut-off rank r (i.e. # relevant & retrieved with rank <= r, divided by r)
- rel(r) is 1 if the result at rank r is relevant, 0 otherwise (i.e. skip/discount terms for irrelevant results)
- http://www.slideshare.net/lindabeekeeper/calculating-precision-presentation
- may use interpolated precision value to smooth out curve
- mean avg precision (MAP): mean AP over multiple queries
- discounted cumulative gain (DCG): uses graded relevance scale instead of bool
- insight: penalize highly relevant documents of low rank
- similar spirit to precision@k and AP but for continuous relevance
- DCG accumulated at rank p: DCG_p = rel_1 + sum_{i=2..p} rel_i / log2(i)
- intuitively: relevance at rank 1 + the other relevances, discounted logarithmically
- normalized discounted cumulative gain (NDCG): DCG_p divided by the ideal DCG_p (best possible ordering)
- mean reciprocal rank (MRR): mean over queries of 1 / rank of the first relevant result
- these are all non-smooth cost functions
- used to approximate, eg cross-entropy of prob that order of a result pair is correct: http://research.microsoft.com/~cburges/papers/ICML_ranking.pdf
- recent advances directly optimize these: http://research.microsoft.com/en-us/um/people/cburges/papers/LambdaRank.pdf
- SIGIR10 paper from MSR comparing standard IR metrics (NDCG, MAP, PC) with clicks from A/B tests of rankings; findings:
- you need 10x more (cheap) click data than (expensive) manual data
- NDCG has best correlation w click data
- http://research.microsoft.com/apps/pubs/default.aspx?id=130616
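- small Python implementations of the ranked metrics above (precision@k, AP, DCG/NDCG, reciprocal rank); function names are mine, formulas follow the definitions given:

    import math

    def precision_at_k(relevant, ranked, k):
        return sum(1 for d in ranked[:k] if d in relevant) / k

    def average_precision(relevant, ranked):
        # mean of precision@r over the ranks r that hold relevant docs
        hits, total = 0, 0.0
        for r, doc in enumerate(ranked, start=1):
            if doc in relevant:
                hits += 1
                total += hits / r
        return total / len(relevant) if relevant else 0.0

    def dcg(gains):
        # rel_1 + sum over i >= 2 of rel_i / log2(i)
        return sum(g if i == 1 else g / math.log2(i) for i, g in enumerate(gains, start=1))

    def ndcg(gains):
        ideal = dcg(sorted(gains, reverse=True))
        return dcg(gains) / ideal if ideal > 0 else 0.0

    def reciprocal_rank(relevant, ranked):
        return next((1.0 / r for r, doc in enumerate(ranked, start=1) if doc in relevant), 0.0)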
learning to rank (LETOR)
- features
- query-dependent: tf-idf, bm25, # matches in linking-anchors/title/body/url,
- query-independent: PageRank, # clicks/visits, spam score
latent semantic indexing (LSI)
- uncover the underlying concepts/themes in a document by analyzing all docs in the collection together, seeing which terms frequently co-occur
- allows us to retrieve documents even if the exact terms don't appear
- construct term-document matrix (eg TF-IDFs)
- then find low-rank approximation to the matrix (since orig too large); result is that some dimensions are combined and depend on more than one term, eg {(car),(truck),(flower)} -> {(1.3452*car + 0.2828*truck),(flower)}; mitigates synonymy problem
- uses SVD
- http://www.knowledgesearch.org/lsi/lsa_definition.htm
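- a minimal numpy sketch of the low-rank SVD step on a toy term-document matrix; the weights and the query fold-in are illustrative:

    import numpy as np

    # toy term-document matrix (rows = terms car/truck/flower, cols = docs), eg tf-idf weights
    A = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                                          # keep k latent "concepts"
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation of A
    doc_concepts = np.diag(s[:k]) @ Vt[:k, :]      # documents in concept space
    query = np.array([0.0, 1.0, 0.0])              # query containing only "truck"
    q_concepts = U[:, :k].T @ query                # fold the query into concept space (up to scaling)
    print(A_k, doc_concepts, q_concepts)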
Structured Querying of Web Text (Mike Cafarella)
- ExDB: probabilistic DB for SQL-like queries over web text
- data model: unary/binary predicates over objects
- information extraction
- facts: uses an existing NLP system called TextRunner for extracting triples
- types: uses existing KnowItAll system, which tags PSO and constructs predicates from phrases in well-formed sentences
- synonyms
- uses existing DIRT algo; compares degrees to which args of two preds coincide
- eg: arg-pairs of preds invented and has-invented overlap strongly
- inclusion dependencies or troponyms: invented(x,y) \subset discovered(x,y)
- no concrete idea, but an enhanced DIRT will probably work
- functional dependencies: queries with negation
- best left for data mining research
- query processing
- straightforward: series of joins, use top-k algos
- projections: harder TODO
- TODO
- rest of paper
- probabilistic query constraints
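- a rough Python sketch of the DIRT-style synonym check above, using plain Jaccard overlap of argument pairs (the real DIRT similarity is mutual-information weighted; the 0.5 threshold is arbitrary):

    def arg_pair_overlap(pairs_a, pairs_b):
        # Jaccard overlap of the (x, y) argument pairs seen with each predicate
        a, b = set(pairs_a), set(pairs_b)
        return len(a & b) / len(a | b) if (a or b) else 0.0

    invented = {("edison", "light bulb"), ("bell", "telephone"), ("wright brothers", "airplane")}
    has_invented = {("edison", "light bulb"), ("bell", "telephone")}
    print(arg_pair_overlap(invented, has_invented) > 0.5)  # strong overlap -> treat as synonyms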
yahoo labs: contextual and display ads team
- key challenges
- scale
- data sparseness
- user privacy
- imprecisely defined objectives/metrics
- different train/test set distributions
- applied research areas in display ads
- targeting: inferring user interest from historical behavior
- contextual ads: serving contextually relevant (text) ads
- GD stone: next gen display ad serving platform
- categorization: ML approaches for categorizing queries, ads, pages
- news direct display (NDD) optimization
- techniques
- linear: log reg, NB, SVM
- non-linear: SVM (RBF), treenet, C5.0
google adwords bidding
- as an advertiser: how to figure out what CPC bids to place
- conversion rate = # visitors who buy / # visitors
- max profitable cost per action (CPA) = $ profit from a sale
- eg your selling price minus your cost
- value per click = value per action * conversion rate
- this is what you should initially bid
- cost per click (CPC): actual cost from an auction; always < your max bid
- incremental cost per click (ICC) = cost of incremental clicks / number of
incremental clicks
- "incremental" means "additional," over the next-lower bid
- ICC = how much you're paying for each additional click over the next-lower bid
- if value per click < ICC, lower bid; else, raise bid
- http://googleblog.blogspot.com/2009/09/new-adwords-bidding-tutorial.html
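- the bidding arithmetic above as a tiny Python sketch; the dollar figures are made up:

    def value_per_click(value_per_action, conversion_rate):
        # what one click is worth on average; use as the initial max CPC bid
        return value_per_action * conversion_rate

    def should_raise_bid(vpc, incremental_cost, incremental_clicks):
        # compare value per click with the incremental cost per click (ICC)
        icc = incremental_cost / incremental_clicks
        return vpc > icc

    bid = value_per_click(30.0, 0.02)     # $30 profit per sale, 2% conversion rate -> $0.60 bid
    print(bid, should_raise_bid(bid, incremental_cost=4.0, incremental_clicks=10))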
HITS (jon kleinberg, cornell)
- some pages are hubs/portals with many outlinks
- others are topic authorities with many inlinks
- good hubs point to good auths
- each page has a hub score and an auth score
- hub score is sum of outbound auth scores
- auth score is sum of inbound hub scores
- equiv to solving max eigenval problem
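- a minimal Python sketch of the HITS iteration (power iteration with normalization); the toy graph and iteration count are my own choices:

    import math

    def hits(pages, outlinks, iters=50):
        # pages: iterable of ids; outlinks: dict page -> set of pages it links to
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iters):
            # auth score = sum of inbound hub scores
            auth = {p: sum(hub[q] for q in pages if p in outlinks.get(q, ())) for p in pages}
            # hub score = sum of outbound auth scores
            hub = {p: sum(auth[q] for q in outlinks.get(p, ())) for p in pages}
            # normalize so the scores converge (toward the principal eigenvectors)
            for d in (auth, hub):
                norm = math.sqrt(sum(v * v for v in d.values())) or 1.0
                for p in d:
                    d[p] /= norm
        return hub, auth

    hub, auth = hits(["portal", "a", "b"], {"portal": {"a", "b"}, "a": {"b"}})
    print(hub, auth)  # "portal" gets the top hub score; "b" the top auth score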
SPEAR (michael noll)
- find expert users and quality documents
- based on HITS: mutual reinforcement of users and documents
- time-based: discoverers vs followers
- evaluated on delicious.com