文档详情

lecture1-intro.ppt

发布:2017-08-14约1.51万字共47页下载文档
文本预览下载声明
Indexer steps: Dictionary Postings Multiple term entries in a single document are merged. Split into Dictionary and Postings Doc. frequency information is added. Why frequency? Will discuss later. Sec. 1.2 Where do we pay in storage? * Pointers Terms and counts Later in the course: How do we index efficiently? How much storage do we need? Sec. 1.2 Lists of docIDs The index we just built How do we process a query? Later - what kinds of queries can we process? * Today’s focus Sec. 1.3 Query processing: AND Consider processing the query: Brutus AND Caesar Locate Brutus in the Dictionary; Retrieve its postings. Locate Caesar in the Dictionary; Retrieve its postings. “Merge” the two postings: * 128 34 2 4 8 16 32 64 1 2 3 5 8 13 21 Brutus Caesar Sec. 1.3 The merge Walk through the two postings simultaneously, in time linear in the total number of postings entries * 34 128 2 4 8 16 32 64 1 2 3 5 8 13 21 128 34 2 4 8 16 32 64 1 2 3 5 8 13 21 Brutus Caesar 2 8 If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID. Sec. 1.3 Intersecting two postings lists (a “merge” algorithm) * Boolean queries: Exact match The Boolean retrieval model is being able to ask a query that is a Boolean expression: Boolean Queries are queries using AND, OR and NOT to join query terms Views each document as a set of words Is precise: document matches condition or not. Perhaps the simplest model to build an IR system on Primary commercial retrieval tool for 3 decades. Many search systems you still use are Boolean: Email, library catalog, Mac OS X Spotlight * Sec. 1.3 Example: WestLaw / Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992) Tens of terabytes of data; 700,000 users Majority of users still use boolean queries Example query: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company “trade secret” /s disclos! /s
显示全部
相似文档