
Large-Scale Data Processing and Cloud Computing: MapReduce Algorithm Design (大规模数据处理与云计算-mapreduce算法设计.ppt)

Pairs: Pseudo-Code
(The pseudo-code itself is not preserved in this text preview; a minimal Python sketch of the pairs mapper and reducer is given after the slide outline below.)

“Pairs” Analysis
Advantages:
- Easy to implement, easy to understand
Disadvantages:
- Lots of pairs to sort and shuffle around (upper bound?)
- Not many opportunities for combiners to work

Another Try: “Stripes”
Idea: group pairs together into an associative array.
Each mapper takes a sentence:
- Generate all co-occurring term pairs
- For each term, emit a → { b: count_b, c: count_c, d: count_d, … }
Reducers perform an element-wise sum of the associative arrays:

  (a, b) → 1
  (a, c) → 2
  (a, d) → 5
  (a, e) → 3
  (a, f) → 2
  aggregate to:  a → { b: 1, c: 2, d: 5, e: 3, f: 2 }

    a → { b: 1, d: 5, e: 3 }
  + a → { b: 1, c: 2, d: 2, f: 2 }
  = a → { b: 2, c: 2, d: 7, e: 3, f: 2 }

Key: a cleverly constructed data structure brings together partial results.

Stripes: Pseudo-Code
(Also not preserved in the preview; see the stripes sketch after the slide outline below.)

“Stripes” Analysis
Advantages:
- Far less sorting and shuffling of key-value pairs
- Can make better use of combiners
Disadvantages:
- More difficult to implement
- Underlying object is more heavyweight
- Fundamental limitation in terms of the size of the event space

Experimental setup:
- Cluster size: 38 cores
- Data source: Associated Press Worldstream (APW) portion of the English Gigaword Corpus (v3), which contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed)

Relative Frequencies
- How do we estimate relative frequencies from counts?
- Why do we want to do this?
- How do we do this with MapReduce?

  f(B|A) = N(A, B) / N(A) = N(A, B) / Σ_B' N(A, B')

The marginal N(A) is the sum of the counts of the conditioning variable A co-occurring with anything else.

f(B|A): “Stripes”
Easy!
- One pass to compute (a, *)
- Another pass to directly compute f(B|A)

  a → { b1: 3, b2: 12, b3: 7, b4: 1, … }

(A short Python sketch of this division step follows the slide outline below.)

f(B|A): “Pairs”
For this to work (see the order-inversion sketch at the end of this preview):
- Must emit an extra (a, *) for every b_n in the mapper
- Must make sure all a's get sent to the same reducer (use a partitioner)
- Must make sure (a, *) comes first (define the sort order)
- Must hold state in the reducer across different key-value pairs

  Counts arriving at the reducer:
  (a, b1) → 3
  (a, b2) → 12
  (a, b3) → 7
  (a, b4) → 1
  …
  (a, *) → 32

  Relative frequencies emitted:
  (a, b1) → 3 / 32
  (a, b2) → 12 / 32
  (a, b3) → 7 / 32
  (a, b4) → 1 / 32
  …
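The “Pairs: Pseudo-Code” slide above refers to mapper/reducer pseudo-code that does not survive in this text preview. As a stand-in, here is a minimal, framework-agnostic Python sketch of the pairs approach; the function names, the whitespace tokenization, and the in-memory shuffle in run_pairs are illustrative assumptions rather than the slides' own code.

from collections import defaultdict
from itertools import permutations

def pairs_mapper(sentence):
    """Emit ((a, b), 1) for every ordered pair of distinct co-occurring terms."""
    terms = sentence.split()
    for a, b in permutations(terms, 2):
        if a != b:
            yield (a, b), 1

def pairs_reducer(pair, counts):
    """Sum the partial counts collected for one (a, b) pair."""
    yield pair, sum(counts)

def run_pairs(sentences):
    """Tiny in-memory stand-in for the shuffle phase, to show the flow end to end."""
    grouped = defaultdict(list)
    for sentence in sentences:
        for key, value in pairs_mapper(sentence):
            grouped[key].append(value)
    return dict(kv for key, values in grouped.items()
                for kv in pairs_reducer(key, values))

if __name__ == "__main__":
    print(run_pairs(["a b c", "a b b"]))   # e.g. ('a', 'b') maps to 3

Every co-occurrence becomes its own key-value pair here, which is exactly why the analysis slide flags the sorting and shuffling volume as the main disadvantage.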
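Similarly, the “Stripes: Pseudo-Code” slide is not preserved in this preview. The sketch below shows the stripes idea described above in the same assumed Python setting: each mapper emits one associative array (a Counter) per term occurrence, and the reducer performs the element-wise sum.

from collections import Counter, defaultdict

def stripes_mapper(sentence):
    """For each term a, emit a -> { b: count_b, ... } over the rest of the sentence."""
    terms = sentence.split()
    for i, a in enumerate(terms):
        yield a, Counter(b for j, b in enumerate(terms) if j != i and b != a)

def stripes_reducer(term, stripes):
    """Element-wise sum of all associative arrays received for one term."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)            # Counter.update adds counts element-wise
    yield term, dict(total)

if __name__ == "__main__":
    grouped = defaultdict(list)
    for sentence in ["a b c", "a b b"]:
        for term, stripe in stripes_mapper(sentence):
            grouped[term].append(stripe)
    for term, stripes in grouped.items():
        print(next(stripes_reducer(term, stripes)))   # e.g. ('a', {'b': 3, 'c': 1})

Because a whole stripe travels as a single value, far fewer key-value pairs are shuffled, and a combiner can merge stripes locally, which is the advantage the analysis slide points to.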
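To make the f(B|A) “Stripes” step concrete: once the reducer has the complete stripe for a, the marginal is just the sum of the stripe's values, and one more pass divides through by it. A small sketch, under the same Python assumptions as above:

def relative_frequencies(term, stripe):
    """Turn a completed stripe a -> { b: N(a, b), ... } into a -> { b: f(b|a), ... }."""
    marginal = sum(stripe.values())          # N(a) = sum over all b of N(a, b)
    yield term, {b: count / marginal for b, count in stripe.items()}

if __name__ == "__main__":
    print(next(relative_frequencies("a", {"b1": 3, "b2": 12, "b3": 7, "b4": 1})))
    # the marginal is 23 here, so b2 comes out as 12/23, and so on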
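The f(B|A) “Pairs” version relies on the four requirements listed above (the extra (a, *) key, the partitioner, the sort order, and reducer-side state). The sketch below simulates that contract in a single process: a sorted dictionary stands in for the shuffle, so "all a's reach the same reducer" holds trivially, and the fact that '*' sorts before lowercase letters plays the role of the custom sort order; in a real Hadoop job a partitioner and key comparator would have to enforce both.

from collections import defaultdict

def freq_pairs_mapper(sentence):
    """Emit ((a, b), 1) plus an extra ((a, '*'), 1) toward the marginal N(a)."""
    terms = sentence.split()
    for i, a in enumerate(terms):
        for j, b in enumerate(terms):
            if i != j and a != b:
                yield (a, b), 1
                yield (a, "*"), 1

def freq_pairs_reducer(sorted_counts):
    """Consume ((a, b), count) items sorted so (a, '*') precedes every (a, b);
    hold the marginal across key-value pairs, as the slide requires."""
    marginal = None
    for (a, b), count in sorted_counts:
        if b == "*":
            marginal = count                 # state carried across keys for this a
        else:
            yield (a, b), count / marginal

def run_relative_frequencies(sentences):
    grouped = defaultdict(int)
    for sentence in sentences:
        for key, value in freq_pairs_mapper(sentence):
            grouped[key] += value            # combiner/shuffle stand-in
    return dict(freq_pairs_reducer(sorted(grouped.items())))

if __name__ == "__main__":
    print(run_relative_frequencies(["a b c", "a b b"]))
    # frequencies for each a sum to 1, e.g. ('a', 'b') -> 0.75 and ('a', 'c') -> 0.25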