10 May 2002 Roberto Innocente 1
Data mining:
rule mining algorithms
Roberto Innocente
rinnocente@
Introduction /1
Data mining, also known as Knowledge
Discovery in Databases or KDD
(Piatetsky-Shapiro 1991), is the process of extracting
useful hidden information from very large
databases in an unsupervised manner.
Introduction /2
Central themes of data mining are:
Classification
Cluster analysis
Association analysis
Outlier analysis
Evolution analysis
ARM /1
(association rules mining)
Formally introduced in 1993 by Agrawal,
Imielinski and Swami (AIS) in connection with
market basket analysis
Formalizes statements of the form:
What percentage of customers buy beer
together with cheese?
ARM /2
We have a set of items I={i1,i2,..} and a set of transactions T={t1,t2,..}. Each
transaction (like a supermarket bill) is a set of items (or, as it is better called, an
itemset).
If U and V are disjoint itemsets, we call the support of U ⇒ V the fraction of transactions
that contain U ∪ V, and we denote it by s(U ⇒ V).
We say that an itemset is frequent if its support is greater than a chosen threshold
called minsupp.
If A and B are disjoint itemsets, we call the confidence of A ⇒ B, denoted
c(A ⇒ B), the fraction of transactions containing A that also contain B. This is also
called the Bayesian or conditional probability p(B|A).
We say that a rule is strong if its confidence is greater than a threshold called
minconf.
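As an illustration of the two definitions above, here is a minimal Python sketch that computes the support and the confidence of a rule with antecedent A and consequent B. The function names and the toy transactions are assumptions made for this example; they are not taken from the slides.

```python
from fractions import Fraction

# Hypothetical toy transactions (NOT the table shown in the slides).
transactions = [
    {"bread", "cheese", "beer"},
    {"cheese", "beer"},
    {"bread", "milk"},
    {"cheese", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def rule_support(u, v, transactions):
    """Support of the rule U => V: fraction of transactions containing U ∪ V."""
    return support(set(u) | set(v), transactions)


def confidence(a, b, transactions):
    """Confidence of A => B: fraction of transactions with A that also contain B."""
    supp_a = support(a, transactions)
    if supp_a == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    return rule_support(a, b, transactions) / supp_a


s = rule_support({"cheese"}, {"beer"}, transactions)
c = confidence({"cheese"}, {"beer"}, transactions)
print(f"s = {s}, c = {c}")
```

With these toy data, cheese appears in 3 of 4 transactions and cheese together with beer in 2 of 4, so the rule cheese ⇒ beer has support 1/2 and confidence 2/3. Exact fractions are used only to keep the arithmetic easy to check by hand.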
ARM /3
ARM can then be formulated as:
Given a set I of items and a set T of transactions over I,
produce in an automated manner all association rules
whose support is greater than x% (frequent) and whose
confidence is greater than y% (strong).
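The problem statement above can be sketched as a brute-force enumeration: try every itemset, keep those whose support exceeds the threshold, and split each one into antecedent and consequent in every possible way. This is only a naive illustration under stated assumptions, exponential in the number of items, and not one of the rule mining algorithms the slides go on to discuss. The function names, the toy transactions and the threshold values are all hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy transactions (NOT the table shown in the slides).
transactions = [
    {"bread", "cheese", "beer"},
    {"cheese", "beer"},
    {"bread", "milk"},
    {"cheese", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def association_rules(transactions, minsupp, minconf):
    """Return (antecedent, consequent, support, confidence) for every rule
    whose support is > minsupp and whose confidence is > minconf."""
    items = sorted(set().union(*transactions))
    rules = []
    # A rule needs at least two items: one on each side.
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            s = support(itemset, transactions)
            if s <= minsupp:
                continue  # not frequent, so no rule from this itemset qualifies
            # Split the frequent itemset into disjoint antecedent / consequent.
            for k in range(1, size):
                for antecedent in combinations(itemset, k):
                    consequent = tuple(i for i in itemset if i not in antecedent)
                    c = s / support(antecedent, transactions)
                    if c > minconf:
                        rules.append((antecedent, consequent, s, c))
    return rules


rules = association_rules(transactions, Fraction(2, 5), Fraction(3, 5))
for antecedent, consequent, s, c in rules:
    print(set(antecedent), "=>", set(consequent), "support", s, "confidence", c)
```

With minsupp = 40% and minconf = 60% on these toy data, only the itemset {beer, cheese} is frequent, and both rules derived from it (beer ⇒ cheese and cheese ⇒ beer) pass the confidence threshold.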
ARM /4
On the right we have 6
transactions T={1,2,3,4,5,6}
on a set of 5 items
I={A,B,C,D,E}
The itemset BC is present