(Web data mining materials) Text Based Information Retrieval - Document Mining.ppt
Text Based Information Retrieval - Text Mining; Background; Terminology; What kind of data in Data Mining?; Knowledge Discovery; Required effort for each KDD Step; What Is Text Mining?; Text Mining (2)
Information Retrieval
Indexing and retrieval of textual documents
Information Extraction
Extraction of partial knowledge in the text
Web Mining
Indexing and retrieval of textual documents and extraction of partial knowledge using the web
Clustering
Generating collections of similar text documents
Text Mining Application; Information Retrieval (1); Information Retrieval (2); Classical IR System Process; Intelligent Information Retrieval; Why Mine the Web?; Mining the Web; What is Web Clustering?; Text characteristics; Text characteristics; Text mining process
Part Of Speech (pos) tagging
Find the corresponding pos for each word
e.g., John (noun) gave (verb) the (det) ball (noun)
Word sense disambiguation
Context based or proximity based
Very accurate
Parsing
Generates a parse tree (graph) for each sentence
Each sentence is a stand-alone graph
Feature Generation: Bag of words; Feature selection
Given: a collection of labeled records (training set)
Each record contains a set of features (attributes), and the true class (label)
Find: a model for the class as a function of the values of the features
Goal: previously unseen records should be assigned a class as accurately as possible
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Similarity Measures:
Euclidean Distance if attributes are continuous
Other Problem-specific Measures
e.g., how many words are common in these documents
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.)
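
The pieces listed above (bag-of-words features, labeled records split into training and test sets, and similarity measures) can be tied together in a small sketch. The slides do not name a specific classifier, so this example assumes a 1-nearest-neighbour rule over Euclidean distance as a stand-in; the toy documents, labels, and function names are all made up for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Feature generation: lowercase, split on whitespace, count each word."""
    return Counter(text.lower().split())


def common_words(a, b):
    """Problem-specific measure: number of distinct words shared by two documents."""
    return len(set(a) & set(b))


def euclidean(a, b):
    """Euclidean distance between two word-count vectors."""
    vocab = set(a) | set(b)
    # Counter returns 0 for words that are missing from a document.
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in vocab))


def predict(train, text):
    """Assumed classifier: label of the nearest training record (1-NN)."""
    bow = bag_of_words(text)
    _, label = min(train, key=lambda rec: euclidean(rec[0], bow))
    return label


def accuracy(train, test):
    """Fraction of held-out records whose predicted class matches the true class."""
    correct = sum(predict(train, text) == label for text, label in test)
    return correct / len(test)


# Hypothetical labeled records: (text, true class).
labeled = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the bank raised interest rates", "finance"),
    ("stock markets fell on rate fears", "finance"),
    ("the goalkeeper saved the match", "sports"),
    ("the bank cut interest rates", "finance"),
]

# Divide the data: the first four records build the model, the last two validate it.
train_records, test_records = labeled[:4], labeled[4:]
train = [(bag_of_words(text), label) for text, label in train_records]

print("accuracy on test set:", accuracy(train, test_records))
```

On this toy data both held-out records are assigned their true class. With real text, the features would normally pass through the feature selection step first, and a single test split this small says little about how accurate the model really is.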