TODO PPjoin, PPjoin+, MPjoin
general
- worry about search index sizes; they get very big very easily
IR models
- vector space model (VSM): cosine similarity
- documents are Euclidian-normalized vectors of TF(-IDF)
- idf most commonly uses idf_t = log(N / df_t)
- variants
- tf
- sublinear (log) scaling
- max tf normalization: normalize tf of all doc terms by doc's max tf
- ntf(t,d) = a + (1 - a) * tf(t,d) / tf_max(d), where a is usually 0.4
- the smoothing term a avoids large swings in ntf from modest changes in tf(t,d) (eg 1 to 2)
- unstable: a change in the stop word list dramatically alters weightings
- sensitive to outlying # occurrences
- doesn't distinguish whether ntf=1 means high skew or ~same as other terms
- SMART notation identifies doc/query weighting/normalization schemes
- pivoted normalized doc len: replace denominator |V(d)| with a * |V(d)| + (1 - a) * piv, or with a * u_d + (1 - a) * piv, where u_d is the # of unique terms
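- a minimal Python sketch of the cosine/VSM scoring above (sublinear tf, idf = log(N/df_t), Euclidean normalization); the toy corpus and function names are illustrative, not any particular library's API:

    import math
    from collections import Counter

    def tfidf_vector(tokens, df, n_docs):
        # sublinear tf (1 + log tf) times idf = log(N / df_t), then Euclidean-normalize
        tf = Counter(tokens)
        vec = {t: (1 + math.log(c)) * math.log(n_docs / df[t]) for t, c in tf.items() if t in df}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {t: w / norm for t, w in vec.items()}

    def cosine(u, v):
        # both vectors are unit length, so cosine similarity is just the dot product
        return sum(w * v.get(t, 0.0) for t, w in u.items())

    docs = [["cheap", "flights", "to", "rome"], ["cheap", "hotels", "in", "rome"], ["learn", "python"]]
    df = Counter(t for d in docs for t in set(d))            # document frequencies
    vecs = [tfidf_vector(d, df, len(docs)) for d in docs]
    query = tfidf_vector(["cheap", "rome"], df, len(docs))
    print(sorted(((cosine(query, v), i) for i, v in enumerate(vecs)), reverse=True))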
- probabilistic: order by prob of relevance
- loss functions: 0/1 or with costs; rank by expected loss
- binary independence model: assume boolean term vectors
- many simplifying assumptions; same as multivariate naive bayes
- query time boils down to computing retrieval status value (RSV)
- simple MLE like NB; optional backoff (MAP uniform prior)
- approximation yields the log IDF measure
- okapi BM25
- slightly more complex, but not that many params
- incorporates tf
- first prob model that performed very well
- http://nlp.stanford.edu/IR-book/pdf/11prob.pdf
- only impl diff among all these is scoring fn
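- a minimal Python sketch of Okapi BM25 to make the "scoring fn" point concrete; k1=1.2 and b=0.75 are conventional defaults, and the helper names are mine:

    import math
    from collections import Counter

    def bm25(query_terms, doc_tokens, df, n_docs, avg_dl, k1=1.2, b=0.75):
        # idf-weighted sum over query terms, with tf saturation and doc-length normalization
        tf = Counter(doc_tokens)
        dl = len(doc_tokens)
        score = 0.0
        for t in query_terms:
            if t not in tf or t not in df:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avg_dl))
        return score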
google mustang
- multi-stage ranking: first filter on each child node, send to root, rank
- a few hundred results from each child
query understanding at bing
- http://www.searchenginecaffe.com/2010/07/sigir-2010-industry-day-query.html
- two modes
- Best effort retrieval: Find the most relevant results for the user query
- Query segmentation
- Stemming and synonym expansion
- Term deletion
- Automated Query Reformulation: Modify the user query to produce more relevant
results for the inferred intent
- spell correction
- term deletion
- This takes more liberty with the user's intent
- Best effort retrieval: Find the most relevant results for the user query
- stemming: restaurant -> resturant, sf -> san francisco
- abbrevs: un -> united nations
- utilize co-click patterns to find un/united nations
- term relaxation
- [what is a diploid chromosome] -> "what is a" is not important from the SE matching, it introduces noise
- [where can I get an iPhone 4] -> where is an important part of the query. Removing "where" misses the whole point of the query
- [bowker's test of change tutorial] -> test of symmetry is the correct terminology. How do you know that it is the incorrect terminology? If you relax it to "bowker's test" you get better results
- key concepts
- win/loss ratio
- wins are queries whose results improve
- losses are queries whose results degrade
- Related to precision, but not all valid reformulations change results
- Pre vs. Post result analysis
- Query alternatives generated pre-results
- Blending decisions are post results
- query evaluation
- "Matching" level 0/l1/l2 -> inverted index, matching and ranking. Reduce billions to hundreds of thousands of pages. Much of the loss can occur here because it never made it into the candidate set. Assume that the other layers that use ML, etc... will bubble the correct results to the top.
- "Merge" layer L3 -> the blending with multiple queries will be brought together
- Federation layer L4 -> top layer coordinating activity
- An important component is the query planner in L3 that performs annotation and rewriting.
- matching & ranking
- L0/L1 - 10^10 docs. L0 - boolean set operations, L1 - IR score (a linear function over simple features like BM25; simple and fast, but not very accurate)
- L2 reranking - 10^5 docs - ML heavy lifting: 1500 features, proximity
- L3 reranking - 10^3 - federation and blending
- L4 -> 10^1
- Learning to rank Using Gradient Descent (for L2 layer)
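- a rough Python sketch of the L0-L3 funnel shape above (boolean match -> cheap linear score -> ML rerank -> blend); cutoffs and scoring callables are placeholders, not Bing's actual components:

    def rank_pipeline(query, corpus, boolean_match, cheap_score, ml_score, blend):
        # L0: boolean set operations over the inverted index (start from ~10^10 docs)
        candidates = [d for d in corpus if boolean_match(query, d)]
        # L1: fast linear IR score (BM25-like), keep ~10^5 candidates
        candidates = sorted(candidates, key=lambda d: cheap_score(query, d), reverse=True)[:100000]
        # L2: expensive ML reranker over rich features (proximity, etc.), keep ~10^3
        candidates = sorted(candidates, key=lambda d: ml_score(query, d), reverse=True)[:1000]
        # L3/L4: federation and blending down to the final ~10 results
        return blend(query, candidates)[:10]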
- query annotation
- NLP Query annotation: offline analysis; similar to a 'parse tree'
- Ambiguity preserving: multiple interpretations
- Backend independent: shared
- Structure & Attributes: Syntax & semantics (how to handle tree leaf nodes)
- query planning
- [{un | "united nations"} jobs] -> l3-merge(l2-rank([un jobs]), l2-rank(["united nations" jobs])
- OR
- [{un| "united nations"} jobs] -> l3-cascade(threshold, l2-rank([un jobs]), l2-rank(["united nations" jobs])
- the cascade plan is for when the rewrite is less certain: the rewritten query is only run conditionally, when the original query's results fall below the threshold
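- a rough Python sketch of the two plan shapes (merge everything vs. threshold cascade); l2_rank, l3_merge, and the threshold value are stand-ins, not Bing internals:

    def l3_merge(result_lists):
        # blend results from all rewrites; here a naive max-score union
        merged = {}
        for results in result_lists:
            for doc, score in results:
                merged[doc] = max(score, merged.get(doc, float("-inf")))
        return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

    def l3_cascade(threshold, primary_results, fallback_fn):
        # only pay for the rewritten query if the original query's top result looks weak
        if primary_results and primary_results[0][1] >= threshold:
            return primary_results
        return l3_merge([primary_results, fallback_fn()])

    # usage, assuming l2_rank(query) -> [(doc_id, score), ...] sorted by score:
    # l3_merge([l2_rank('un jobs'), l2_rank('"united nations" jobs')])
    # l3_cascade(0.8, l2_rank('un jobs'), lambda: l2_rank('"united nations" jobs'))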
- design considerations
- Efficiency
- one user query may generate multiple backend queries that are merged in L3
- Some queries are cheaper than others
- query reduction can improve performance
- Relevance: L3 merging has maximal information, but is costly
- Multiple query plan strategies: Depending on query analysis confidence
- query analysis models
- Noisy Channel Model:
argmax_rewrite P(rewrite | query) = argmax_rewrite P(rewrite) * P(query | rewrite)
- bayes inversion
- example: spelling
- language model: likelihood of the correction
- translation model: likelihood of the error occurring
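- a toy Python sketch of the noisy-channel decision rule (rank rewrites by P(rewrite) * P(query | rewrite)); the candidate list and probability tables are made-up stand-ins for real language/translation models:

    def noisy_channel_correct(query, candidates, lm_prob, channel_prob):
        # argmax over rewrites of P(rewrite) * P(query | rewrite)
        return max(candidates, key=lambda rewrite: lm_prob(rewrite) * channel_prob(query, rewrite))

    lm = {"britney spears": 0.9, "brittany spears": 0.1}                # language model
    err = {("brittny spears", "britney spears"): 0.05,
           ("brittny spears", "brittany spears"): 0.02}                 # translation model
    best = noisy_channel_correct(
        "brittny spears",
        ["britney spears", "brittany spears"],
        lm_prob=lambda r: lm.get(r, 1e-9),
        channel_prob=lambda q, r: err.get((q, r), 1e-9),
    )
    print(best)  # "britney spears": the language model term dominates here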
- language models
- Based on Large-scale text mining
- unigrams and N-grams (favor common, previously seen sequences since they tend to make sense)
- Probability of query term sequence
- favor queries seen before
- avoid nonsensical combinations
- 1T n-gram resource: see MS ngram work here in SIGIR
- Translation Models: training sets of aligned pairs (misspelling/correction; surface/stemmed)
- Query log analysis
- session reformulation
- coclick-> associated queries
- manual annotation
- (missed the references, but see: Wei et al, Pang et al., Craswell)
- summary
- 60-70% of queries are reformulated
- Can radically improve results
- Trade-off between relevance and efficiency
- rewrites can be costly
- win/loss ratio is the key
- Especially important for tail queries
- no metadata to guide matching and ranking
implementations
- sphinx: C++; mysql integration; BM25F; vik says it's nice and "tight"
- lucene: the most widely used index; java
- solr: search service built atop lucene
- solrcloud: distributed solr
- elasticsearch: solrcloud alternative (dunno if it changes lucene)
- indextank: reuses parts of lucene 3; customizable scoring; pretty simple; not sure what ranking it uses
- vespa: yahoo's kick-ass index
- terrier: platform from U of glasgow; engineered to TREC
- lemur: platform from CMU; well-regarded; also engineered to TREC but then made to do a bunch of other things
lucene
- java library
- solr: server around lucene
- incremental updates only; removed batch updates early on
- indexing speed (from homepage): ~20MB/min with a 1MB heap; incremental indexing as fast as batch; index size ~20-30% of the indexed text
- uses VSM model for scoring/ranking; not "academically approved"
- filters w standard boolean model then ranks all matches
- IndexReaders and IndexWriters
- exposes delete(id, document) and index(id, document)
- separate commit() flushes to disk
- instantiating IndexReader gets a snapshot of what's on disk
- scoring TODO
- GSOC11 changes lucene to accommodate modern ranking like BM25 and DFR, and to separate matching from ranking
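- not Lucene's actual API, just a toy Python sketch of the writer/commit vs. reader-snapshot split described above (pending changes are invisible until commit(); a reader sees a point-in-time snapshot):

    import copy

    class ToyIndexWriter:
        def __init__(self, store):
            self.store = store      # dict standing in for the on-disk index
            self.pending = {}       # doc_id -> document, or None for a delete

        def index(self, doc_id, document):
            self.pending[doc_id] = document

        def delete(self, doc_id):
            self.pending[doc_id] = None

        def commit(self):
            # flush buffered adds/deletes to "disk"
            for doc_id, doc in self.pending.items():
                if doc is None:
                    self.store.pop(doc_id, None)
                else:
                    self.store[doc_id] = doc
            self.pending.clear()

    class ToyIndexReader:
        def __init__(self, store):
            self.snapshot = copy.deepcopy(store)   # later commits are not visible

        def get(self, doc_id):
            return self.snapshot.get(doc_id)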
sphinx
- c++ library and server
- supports distributed index
- uses BM25 (not BM25F?)
- in-memory attribute map
- supports live updates
- on-disk index and document store
- main selling point: mysql integration
vector space model
- represent docs as vectors: a dim is non-0 if corresponding term present (eg tf-idf weight)
- similarity is angle btwn vectors; cosine is sufficient and easier to calc
- rarity: log(1 / product of all letter probs)
- get letter probs from http://www.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html
- useful when don't have corpus for IDF
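- a small Python sketch of the rarity measure; the letter frequencies are rough approximations of the values on that page:

    import math

    LETTER_FREQ = {  # approximate English letter probabilities
        "e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "i": 0.070, "n": 0.067,
        "s": 0.063, "h": 0.061, "r": 0.060, "d": 0.043, "l": 0.040, "c": 0.028,
        "u": 0.028, "m": 0.024, "w": 0.024, "f": 0.022, "g": 0.020, "y": 0.020,
        "p": 0.019, "b": 0.015, "v": 0.010, "k": 0.008, "j": 0.002, "x": 0.002,
        "q": 0.001, "z": 0.001,
    }

    def rarity(term, default=0.001):
        # log(1 / product of letter probs) == -sum of log letter probs
        return -sum(math.log(LETTER_FREQ.get(ch, default)) for ch in term.lower())

    print(rarity("xylophone") > rarity("house"))  # rarer letters (and longer terms) score higher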
standard boolean model
- simplest, exact-match
evaluation measures
- set-based measures: don't consider rankings
- precision, recall, F-measure are single-value metrics of entire result set
- precision: # relevant and retrieved / # retrieved
- recall: # relevant and retrieved / # relevant
- f-measure aka F1 score aka F-score: F1 = 2 * P * R / (P + R), where P is precision and R is recall
- precision at some cutoff (precision@k aka PC): precision in the top k results
- R-precision: precision@R where R is # known relevant docs; addresses instability of precision@k
- average precision (AP): considers ranking
- mean precision over the precision-recall curve (recall is the variable)
- AP = sum over ranks r of P(r) * rel(r), divided by the # of relevant docs, where
- P(r) is the precision at cut-off rank r (i.e. # relevant & retrieved with rank <= r, divided by r)
- rel(r) is 1 if the result at rank r is relevant, 0 otherwise (i.e. skip/discount terms for irrelevant results)
- http://www.slideshare.net/lindabeekeeper/calculating-precision-presentation
- may use interpolated precision value to smooth out curve
- mean avg precision (MAP): mean AP over multiple queries
- discounted cumulative gain (DCG): uses graded relevance scale instead of bool
- insight: penalize highly relevant documents of low rank
- similar spirit to precision@k and AP but for continuous relevance
- DCG accumulated at rank p: DCG_p = rel_1 + sum_{i=2..p} rel_i / log2(i)
- intuitively: relevance at rank 1 + the other relevances, discounted logarithmically
- normalized discounted cumulative gain (NDCG): DCG_p divided by the ideal DCG_p (best possible ordering)
- mean reciprocal rank (MRR): mean over queries of 1 / rank of the first relevant result
- these are all non-smooth cost functions
- used to approximate, eg cross-entropy of prob that order of a result pair is correct: http://research.microsoft.com/~cburges/papers/ICML_ranking.pdf
- recent advances directly optimize these: http://research.microsoft.com/en-us/um/people/cburges/papers/LambdaRank.pdf
- SIGIR10 paper from MSR comparing standard IR metrics (NDCG, MAP, PC) with clicks from A/B tests of rankings; findings:
- you need 10x more (cheap) click data than (expensive) manual data
- NDCG has best correlation w click data
- http://research.microsoft.com/apps/pubs/default.aspx?id=130616
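- small Python implementations of the ranked metrics above (precision@k, AP, DCG/NDCG, reciprocal rank); function names are mine, formulas follow the definitions given:

    import math

    def precision_at_k(relevant, ranked, k):
        return sum(1 for d in ranked[:k] if d in relevant) / k

    def average_precision(relevant, ranked):
        # mean of precision@r over the ranks r that hold relevant docs
        hits, total = 0, 0.0
        for r, doc in enumerate(ranked, start=1):
            if doc in relevant:
                hits += 1
                total += hits / r
        return total / len(relevant) if relevant else 0.0

    def dcg(gains):
        # rel_1 + sum over i >= 2 of rel_i / log2(i)
        return sum(g if i == 1 else g / math.log2(i) for i, g in enumerate(gains, start=1))

    def ndcg(gains):
        ideal = dcg(sorted(gains, reverse=True))
        return dcg(gains) / ideal if ideal > 0 else 0.0

    def reciprocal_rank(relevant, ranked):
        return next((1.0 / r for r, doc in enumerate(ranked, start=1) if doc in relevant), 0.0)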
learning to rank (LETOR)
- features
- query-dependent: tf-idf, bm25, # matches in linking-anchors/title/body/url,
- query-independent: PageRank, # clicks/visits, spam score
latent semantic indexing (LSI)
- uncover the underlying concepts/themes in a document by analyzing all docs in the collection together, seeing which terms frequently co-occur
- allows us to retrieve documents even if the exact terms don't appear
- construct term-document matrix (eg TF-IDFs)
- then find low-rank approximation to the matrix (since orig too large); result is that some dimensions are combined and depend on more than one term, eg {(car),(truck),(flower)} -> {(1.3452*car + 0.2828*truck),(flower)}; mitigates synonymy problem
- uses SVD
- http://www.knowledgesearch.org/lsi/lsa_definition.htm
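- a minimal numpy sketch of the low-rank SVD step on a toy term-document matrix; the weights and the query fold-in are illustrative:

    import numpy as np

    # toy term-document matrix (rows = terms car/truck/flower, cols = docs), eg tf-idf weights
    A = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                                          # keep k latent "concepts"
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation of A
    doc_concepts = np.diag(s[:k]) @ Vt[:k, :]      # documents in concept space
    query = np.array([0.0, 1.0, 0.0])              # query containing only "truck"
    q_concepts = U[:, :k].T @ query                # fold the query into concept space (up to scaling)
    print(A_k, doc_concepts, q_concepts)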
Structured Querying of Web Text (Mike Cafarella)
- ExDB: probabilistic DB for SQL-like queries over web text
- data model: unary/binary predicates over objects
- information extraction
- facts: uses an existing NLP system called TextRunner for extracting triples
- types: uses existing KnowItAll system, which tags PSO and constructs predicates from phrases in well-formed sentences
- synonyms
- uses existing DIRT algo; compares degrees to which args of two preds coincide
- eg: arg-pairs of preds invented and has-invented overlap strongly
- inclusion dependencies or troponyms: invented(x,y) \subset discovered(x,y)
- no concrete idea, but an enhanced DIRT will probably work
- functional dependencies: queries with negation
- best left for data mining research
- query processing
- straightforward: series of joins, use top-k algos
- projections: harder TODO
- TODO
- rest of paper
- probabilistic query constraints
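- a rough Python sketch of the DIRT-style synonym check above, using plain Jaccard overlap of argument pairs (the real DIRT similarity is mutual-information weighted; the 0.5 threshold is arbitrary):

    def arg_pair_overlap(pairs_a, pairs_b):
        # Jaccard overlap of the (x, y) argument pairs seen with each predicate
        a, b = set(pairs_a), set(pairs_b)
        return len(a & b) / len(a | b) if (a or b) else 0.0

    invented = {("edison", "light bulb"), ("bell", "telephone"), ("wright brothers", "airplane")}
    has_invented = {("edison", "light bulb"), ("bell", "telephone")}
    print(arg_pair_overlap(invented, has_invented) > 0.5)  # strong overlap -> treat as synonyms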
yahoo labs: contextual and display ads team
- key challenges
- scale
- data sparseness
- user privacy
- imprecisely defined objectives/metrics
- different train/test set distributions
- applied research areas in display ads
- targeting: inferring user interest from historical behavior
- contextual ads: serving contextually relevant (text) ads
- GD stone: next gen display ad serving platform
- categorization: ML approaches for categorizing queries, ads, pages
- news direct display (NDD) optimization
- techniques
- linear: log reg, NB, SVM
- non-linear: SVM (RBF), treenet, C5.0
google adwords bidding
- as an advertiser: how to figure out what CPC bids to place
- conversion rate = # visitors who buy / # visitors
- max profitable cost per action (CPA) = $ profit from a sale
- eg your selling price minus your cost
- value per click = value per action * conversion rate
- this is what you should initially bid
- cost per click (CPC): actual cost from an auction; always < your max bid
- incremental cost per click (ICC) = cost of incremental clicks / number of
incremental clicks
- "incremental" means "additional," over the next-lower bid
- ICC = how much you're paying for each additional click over the next-lower bid
- if value per click < ICC, lower bid; else, raise bid
- http://googleblog.blogspot.com/2009/09/new-adwords-bidding-tutorial.html
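- the bidding arithmetic above as a tiny Python sketch; the dollar figures are made up:

    def value_per_click(value_per_action, conversion_rate):
        # what one click is worth on average; use as the initial max CPC bid
        return value_per_action * conversion_rate

    def should_raise_bid(vpc, incremental_cost, incremental_clicks):
        # compare value per click with the incremental cost per click (ICC)
        icc = incremental_cost / incremental_clicks
        return vpc > icc

    bid = value_per_click(30.0, 0.02)     # $30 profit per sale, 2% conversion rate -> $0.60 bid
    print(bid, should_raise_bid(bid, incremental_cost=4.0, incremental_clicks=10))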
HITS (jon kleinberg, cornell)
- some pages are hubs/portals with many outlinks
- others are topic authorities with many inlinks
- good hubs point to good auths
- each page has a hub score and an auth score
- hub score is sum of outbound auth scores
- auth score is sum of inbound hub scores
- equiv to solving max eigenval problem
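- a minimal Python sketch of the HITS iteration (power iteration with normalization); the toy graph and iteration count are my own choices:

    import math

    def hits(pages, outlinks, iters=50):
        # pages: iterable of ids; outlinks: dict page -> set of pages it links to
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iters):
            # auth score = sum of inbound hub scores
            auth = {p: sum(hub[q] for q in pages if p in outlinks.get(q, ())) for p in pages}
            # hub score = sum of outbound auth scores
            hub = {p: sum(auth[q] for q in outlinks.get(p, ())) for p in pages}
            # normalize so the scores converge (toward the principal eigenvectors)
            for d in (auth, hub):
                norm = math.sqrt(sum(v * v for v in d.values())) or 1.0
                for p in d:
                    d[p] /= norm
        return hub, auth

    hub, auth = hits(["portal", "a", "b"], {"portal": {"a", "b"}, "a": {"b"}})
    print(hub, auth)  # "portal" gets the top hub score; "b" the top auth score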
SPEAR (michael noll)
- find expert users and quality documents
- based on HITS: mutual reinforcement of users and documents
- time-based: discoverers vs followers
- evaluated on delicious.com