Principles of Data Mining (Undergraduate Topics in Computer Science)
Data Mining, the automatic extraction of implicit and potentially useful information from data, is increasingly used in commercial, scientific and other application areas.
Principles of Data Mining explains and explores the principal techniques of Data Mining: for classification, association rule mining and clustering. Each topic is clearly explained and illustrated by detailed worked examples, with a focus on algorithms rather than mathematical formalism. It is written for readers without a strong background in mathematics or statistics, and any formulae used are explained in detail.
This second edition has been expanded to include additional chapters on using frequent pattern trees for Association Rule Mining, comparing classifiers, ensemble classification and dealing with very large volumes of data.
Principles of Data Mining aims to help general readers develop the necessary understanding of what is inside the 'black box' so they can use commercial data mining packages discriminatingly, as well as enabling advanced readers or academic researchers to understand or contribute to future technical advances in the field.
It is suitable as a textbook to support courses at undergraduate or postgraduate level in a wide range of subjects including Computer Science, Business Studies, Marketing, Artificial Intelligence, Bioinformatics and Forensic Science.
0.5 and the difference between good and bad as 1. This still does not seem completely right, but may be the best we can do in practice.

3.4 Eager and Lazy Learning

The Naïve Bayes and Nearest Neighbour algorithms described in Sections 3.2 and 3.3 illustrate two alternative approaches to automatic classification, known by the slightly cryptic names of eager learning and lazy learning, respectively. In eager learning systems the training data is ‘eagerly’ generalised into some representation or
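The contrast between the two approaches can be sketched in a few lines of Python. The function names and the toy training set below are illustrative, not from the book: the lazy learner simply stores the instances and does all its work at classification time, while the eager learner generalises the data up front (here, crudely, into a mean vector per class) and can then discard the instances.

```python
# Lazy learning: 1-nearest-neighbour stores the training data as-is
# and defers all computation until a query arrives.
def nn_classify(training, query):
    """training: list of (feature_tuple, label); query: feature tuple."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(training, key=lambda inst: dist(inst[0], query))[1]

# Eager learning (simplified sketch): generalise the training data into
# a summary representation in advance - the mean feature vector of each
# class - so the instances themselves are no longer needed.
def eager_train(training):
    sums, counts = {}, {}
    for features, label in training:
        counts[label] = counts.get(label, 0) + 1
        s = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            s[i] += x
    return {lbl: [x / counts[lbl] for x in s] for lbl, s in sums.items()}

def eager_classify(model, query):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist(model[label], query))

training = [((1.0, 1.0), "yes"), ((1.2, 0.8), "yes"), ((5.0, 5.0), "no")]
print(nn_classify(training, (1.1, 0.9)))              # -> yes
model = eager_train(training)
print(eager_classify(model, (4.8, 5.1)))              # -> no
```

Note the trade-off: the lazy learner pays nothing at training time but must scan the data for every query, whereas the eager learner pays once to build its representation and then classifies cheaply.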
the classifier a ‘default strategy’, such as always allocating them to the largest class, and that will be the approach followed for the remainder of this chapter. It could be argued that it might be better to leave unclassified instances as they are, rather than risk introducing errors by assigning them to a specific class or classes. In practice the number of unclassified instances is generally small and how they are handled makes little difference to the overall predictive accuracy. Figure
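The ‘largest class’ default strategy described above amounts to a one-line rule. A minimal sketch, with an invented label distribution:

```python
from collections import Counter

# Default strategy for otherwise-unclassified instances: allocate them
# to the largest class seen in the training data.
def largest_class(training_labels):
    return Counter(training_labels).most_common(1)[0][0]

labels = ["yes"] * 7 + ["no"] * 3   # illustrative training labels
print(largest_class(labels))        # -> yes
```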
Overfitting Rules to Data

Let us consider a typical rule such as

IF a = 1 and b = yes and z = red THEN class = OK

Adding an additional term to this rule will specialise it; for example the augmented rule

IF a = 1 and b = yes and z = red and k = green THEN class = OK

will normally refer to fewer instances than the original form of the rule (possibly the same number, but certainly no more). In contrast, removing a term from the original rule will generalise it; for example the depleted rule IF
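The claim that specialising a rule can only shrink the set of instances it covers, and generalising it can only enlarge that set, is easy to check mechanically. In this sketch a rule is a dictionary of required attribute values; the three instances are invented for illustration.

```python
# A rule covers an instance iff every one of its terms is satisfied.
def covers(rule, instance):
    """rule: dict mapping attribute -> required value."""
    return all(instance.get(attr) == val for attr, val in rule.items())

instances = [
    {"a": 1, "b": "yes", "z": "red", "k": "green"},
    {"a": 1, "b": "yes", "z": "red", "k": "blue"},
    {"a": 2, "b": "no",  "z": "red", "k": "green"},
]

original  = {"a": 1, "b": "yes", "z": "red"}
augmented = dict(original, k="green")    # specialised: one extra term
depleted  = {"a": 1, "b": "yes"}         # generalised: one term removed

print(sum(covers(original,  i) for i in instances))   # -> 2
print(sum(covers(augmented, i) for i in instances))   # -> 1 (no more than 2)
print(sum(covers(depleted,  i) for i in instances))   # -> 2 (no fewer than 2)
```

Because the augmented rule's terms are a superset of the original's, any instance it covers is necessarily covered by the original rule too, which is exactly why adding terms can never increase coverage.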
error rate’ at a node, without using the word ‘estimated’ every time. However it is important to bear in mind that they are only estimates, not the accurate values, which we have no way of knowing.

10. More About Entropy
Max Bramer, School of Computing, University of Portsmouth, Portsmouth, UK
Principles of Data Mining, Undergraduate Topics in Computer Science, 2nd ed. 2013, DOI 10.1007/978-1-4471-4884-5_10, © Springer-Verlag London 2013

Abstract This chapter returns to the subject of the
number of possible subsets of I, the set of all items, which has cardinality m. There are 2^m such subsets. Of these, m have a single element and one has no elements (the empty set). Thus the number of itemsets L ∪ R with cardinality at least 2 is 2^m − m − 1. If m takes the unrealistically small value of 20, the number of itemsets L ∪ R is 2^20 − 20 − 1 = 1,048,555. If m takes the more realistic but still relatively small value of 100, the number of itemsets L ∪ R is 2^100 − 100 − 1, which is approximately 10^30.
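The arithmetic above is worth checking directly. A one-line sketch of the count (the function name is invented for illustration):

```python
# With m items there are 2**m subsets of I; removing the m singletons
# and the one empty set leaves the itemsets of cardinality at least 2.
def itemsets_of_size_at_least_2(m):
    return 2 ** m - m - 1

print(itemsets_of_size_at_least_2(20))            # -> 1048555
print(f"{itemsets_of_size_at_least_2(100):.2e}")  # -> 1.27e+30
```

Even for a modest m = 100 the count is astronomically large, which is why association rule mining algorithms must avoid enumerating candidate itemsets exhaustively.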