
"MapReduce: Parallel Processing of Massive Data", classic training slides (.ppt)

Published: 2016-09-16, approx. 29,600 characters, 96 pages


Introduction to the Inverted Index Algorithm

An inverted index is a data structure that almost every search engine supporting full-text retrieval depends on. Given a term, the index returns the list of documents that contain that term.

Web search breaks down into three main parts:

    crawling (gathering web content)
    indexing (construction of the inverted index)
    retrieval (ranking documents given a query)

Crawling and indexing run offline; retrieval runs online, in real time.

6. Inverted Index Algorithm

A Simple Inverted Index Algorithm

Documents:

    doc1: one fish two fish
    doc2: red fish blue fish
    doc3: one red bird

Inverted index:

    one:  doc1, doc3
    fish: doc1, doc2
    two:  doc1
    red:  doc2, doc3
    blue: doc2
    bird: doc3

Search results based on the index above:

    fish     → doc1, doc2
    red      → doc2, doc3
    red fish → doc2

Mapper:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Default RecordReader: LineRecordReader; key: line offset; value: line string.
            // The offset key is therefore a LongWritable, not a Text.
            FileSplit fileSplit = (FileSplit) context.getInputSplit();
            String fileName = fileSplit.getPath().getName();
            Text word = new Text();
            // Posting emitted for every word on this line: filename#offset
            Text fileName_lineOffset = new Text(fileName + "#" + key.toString());
            StringTokenizer itr = new StringTokenizer(value.toString());
            for (; itr.hasMoreTokens(); ) {
                word.set(itr.nextToken());
                context.write(word, fileName_lineOffset);
            }
        }
    }

Improvement: besides the file name, the value emitted by map also carries the offset of the line in which the word occurs. Format: filename#offset

Reducer:

    import java.io.IOException;
    import java.util.Collections;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Iterator<Text> it = values.iterator();
            StringBuilder all = new StringBuilder();
            // Start with the first posting, then append the rest separated by ";"
            if (it.hasNext()) all.append(it.next().toString());
            for (; it.hasNext(); ) {
                all.append(";");
                // … (the text preview is cut off at this point)
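The doc1/doc2/doc3 example from this section can be reproduced without a Hadoop cluster. The following is a minimal plain-Java sketch, not the slides' code: the class name `InvertedIndexSketch` and the methods `map`, `reduce`, `build`, `search`, and `sampleDocs` are hypothetical names chosen for illustration. It assumes postings hold document names only (no line offsets), that duplicate postings are dropped, and that a multi-word query means "documents containing all terms".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-in for the Hadoop job: an in-memory imitation of map, shuffle, and reduce.
class InvertedIndexSketch {

    // "Map" step: emit one (term, docName) pair per whitespace-separated token.
    static List<String[]> map(String docName, String text) {
        List<String[]> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            pairs.add(new String[] {itr.nextToken(), docName});
        }
        return pairs;
    }

    // "Shuffle" and "reduce" steps: group pairs by term and drop duplicate doc names.
    static Map<String, List<String>> reduce(List<String[]> pairs) {
        Map<String, List<String>> index = new TreeMap<>();
        for (String[] pair : pairs) {
            List<String> postings = index.computeIfAbsent(pair[0], k -> new ArrayList<>());
            if (!postings.contains(pair[1])) {
                postings.add(pair[1]);
            }
        }
        return index;
    }

    // Run the map step over every document, then reduce all emitted pairs into one index.
    static Map<String, List<String>> build(Map<String, String> docs) {
        List<String[]> pairs = new ArrayList<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            pairs.addAll(map(doc.getKey(), doc.getValue()));
        }
        return reduce(pairs);
    }

    // Assumed query semantics: intersect the postings lists of all query terms.
    static List<String> search(Map<String, List<String>> index, String query) {
        List<String> result = null;
        StringTokenizer itr = new StringTokenizer(query);
        while (itr.hasMoreTokens()) {
            List<String> postings = index.getOrDefault(itr.nextToken(), new ArrayList<>());
            if (result == null) {
                result = new ArrayList<>(postings);
            } else {
                result.retainAll(postings);
            }
        }
        return result == null ? new ArrayList<>() : result;
    }

    // The three example documents from the slides.
    static Map<String, String> sampleDocs() {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1", "one fish two fish");
        docs.put("doc2", "red fish blue fish");
        docs.put("doc3", "one red bird");
        return docs;
    }

    public static void main(String[] args) {
        Map<String, List<String>> index = build(sampleDocs());
        System.out.println(index.get("fish"));          // prints [doc1, doc2]
        System.out.println(search(index, "red fish"));  // prints [doc2]
    }
}
```

In a real job, Hadoop performs the grouping between the map and reduce phases and does not guarantee the order of values within a group, so the ordered postings here are a property of this in-memory sketch only.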

