
(Web Data Mining Materials) Text Based Information Retrieval - Document Mining.ppt

Published: 2017-07-07; approx. 3,680 words; 37 pages
Text Based Information Retrieval - Text Mining
Background
Terminology
What kind of data in Data Mining?
Knowledge Discovery
Required effort for each KDD Step
What Is Text Mining?
Text Mining (2)

Information Retrieval: Indexing and retrieval of textual documents
Information Extraction: Extraction of partial knowledge in the text
Web Mining: Indexing and retrieval of textual documents and extraction of partial knowledge using the web
Clustering: Generating collections of similar text documents

Text Mining Application
Information Retrieval (1)
Information Retrieval (2)
Classical IR System Process
Intelligent Information Retrieval
Why Mine the Web?
Mining the Web
What is Web Clustering?
Text characteristics
Text characteristics
Text mining process

Part Of Speech (pos) tagging: Find the corresponding pos for each word, e.g., John (noun) gave (verb) the (det) ball (noun)
Word sense disambiguation: Context based or proximity based; very accurate
Parsing: Generates a parse tree (graph) for each sentence; each sentence is a stand-alone graph

Feature Generation: Bag of words
Feature selection

Given: a collection of labeled records (training set). Each record contains a set of features (attributes), and the true class (label).
Find: a model for the class as a function of the values of the features.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.

Similarity Measures:
  Euclidean Distance if attributes are continuous
  Other Problem-specific Measures, e.g., how many words are common in these documents

Supervised learning (classification)
  Supervision: The training data (observations, measurements, etc.)
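
The slide text above names bag-of-words feature generation, Euclidean distance, the "how many words are common" measure, and classification from a labeled training set, but the preview gives no worked example. The following is a minimal Python sketch of how these pieces could fit together. The function names, the sample sentences, the tokenization rule (lowercase alphabetic tokens), and the choice of a 1-nearest-neighbour classifier are all assumptions made for illustration; none of them come from the slides.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphabetic tokens (an assumed, very simple tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def euclidean_distance(bow_a, bow_b):
    """Euclidean distance between two word-count vectors over their joint vocabulary."""
    vocabulary = set(bow_a) | set(bow_b)
    # Counter returns 0 for words that are absent from a document.
    return math.sqrt(sum((bow_a[w] - bow_b[w]) ** 2 for w in vocabulary))


def common_words(bow_a, bow_b):
    """Problem-specific measure: number of distinct words shared by both documents."""
    return len(set(bow_a) & set(bow_b))


def nearest_label(query, training_set):
    """Assign the label of the closest training document (1-nearest-neighbour).

    This classifier is an illustrative stand-in, not a method named in the slides.
    """
    query_bow = bag_of_words(query)
    _, label = min(
        training_set,
        key=lambda record: euclidean_distance(query_bow, bag_of_words(record[0])),
    )
    return label


# Hypothetical labeled records: (text, class label).
TRAINING_SET = [
    ("the striker kicked the ball", "sports"),
    ("the bank raised interest rates", "finance"),
]

if __name__ == "__main__":
    a = bag_of_words("John gave the ball")
    b = bag_of_words("John gave the bat")
    print(euclidean_distance(a, b))  # the documents differ only in "ball" vs "bat"
    print(common_words(a, b))        # "john", "gave", "the" are shared
    print(nearest_label("the ball hit the post", TRAINING_SET))
```

In a fuller treatment, the labeled records would be divided into a training set and a test set as the slide text describes, with accuracy measured only on the held-out test records.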