# Lecture outline

- Classification
- Decision-tree classification

## What is classification?

Classification is the task of learning a target function f that maps an attribute set x to one of the predefined class labels y.

## Why classification?

The target function f is known as a classification model.

- Descriptive modeling: an explanatory tool to distinguish between objects of different classes (e.g., a description of who can pay back a loan)
- Predictive modeling: predict the class of a previously unseen record

Typical applications:

- credit approval
- target marketing
- medical diagnosis
- treatment effectiveness analysis

## General approach to classification

- The training set consists of records with known class labels
- The training set is used to build a classification model
- The classification model is applied to the test set, which consists of records with unknown labels

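## Evaluation of classification models

Evaluation is based on the counts of test records that are correctly (or incorrectly) predicted by the classification model. These counts are tabulated in a confusion matrix:

| | Predicted Class = 1 | Predicted Class = 0 |
|---|---|---|
| Actual Class = 1 | $f_{11}$ | $f_{10}$ |
| Actual Class = 0 | $f_{01}$ | $f_{00}$ |

$$\text{Accuracy} = \frac{\#\ \text{correct predictions}}{\text{total}\ \#\ \text{of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}$$

$$\text{Error rate} = \frac{\#\ \text{wrong predictions}}{\text{total}\ \#\ \text{of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}$$

## Supervised vs. Unsupervised Learning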

- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

## Decision Trees

- Decision tree
  - A flow-chart-like tree structure
  - An internal node denotes a test on an attribute
  - A branch represents an outcome of the test
  - Leaf nodes represent class labels or class distributions
- Decision tree generation consists of two phases
  - Tree construction
    - At the start, all the training examples are at the root
    - Examples are partitioned recursively based on selected attributes
  - Tree pruning
    - Identify and remove branches that reflect noise or outliers
- Use of a decision tree: classifying an unknown sample
  - Test the attribute values of the sample against the decision tree
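As a minimal sketch of how an unknown sample is tested against a tree, the structure can be stored as nested dictionaries: an internal node names the attribute it tests, each branch is keyed by an outcome, and a leaf is a plain class label. The tree used here is the buys_computer tree from this lecture's example; the dictionary layout and the names `TREE` and `classify` are illustrative assumptions, not part of the lecture.

```python
# Internal node: {"attribute": name, "branches": {outcome: subtree}}
# Leaf node: a class label stored as a plain string.
TREE = {
    "attribute": "age",
    "branches": {
        "<=30": {
            "attribute": "student",
            "branches": {"no": "no", "yes": "yes"},
        },
        "31..40": "yes",
        ">40": {
            "attribute": "credit_rating",
            "branches": {"excellent": "no", "fair": "yes"},
        },
    },
}


def classify(node, sample):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(node, dict):
        outcome = sample[node["attribute"]]
        node = node["branches"][outcome]
    return node


# A young student is routed age -> student -> leaf "yes".
unknown = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(classify(TREE, unknown))
```

Only the attributes on the path from the root to the leaf are inspected, which is why classifying a record is fast once the tree has been built.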

## Training Dataset

| age | income | student | credit_rating | buys_computer |
|---|---|---|---|---|
| <=30 | high | no | fair | no |
| <=30 | high | no | excellent | no |
| 31..40 | high | no | fair | yes |
| >40 | medium | no | fair | yes |
| >40 | low | yes | fair | yes |
| >40 | low | yes | excellent | no |
| 31..40 | low | yes | excellent | yes |
| <=30 | medium | no | fair | no |
| <=30 | low | yes | fair | yes |
| >40 | medium | yes | fair | yes |
| <=30 | medium | yes | excellent | yes |
| 31..40 | medium | no | excellent | yes |
| 31..40 | high | yes | fair | yes |
| >40 | medium | no | excellent | no |

## Output: A Decision Tree for buys_computer

- age?
  - <=30: student?
    - no: no
    - yes: yes
  - 31..40: yes
  - >40: credit rating?
    - excellent: no
    - fair: yes

## Constructing decision trees

- Exponentially many decision trees can be constructed from a given set of attributes
- Finding the most accurate tree is NP-hard
- In practice: greedy algorithms
  - Grow a decision tree by making a series of locally optimal decisions on which attributes to use for partitioning the data

## Constructing decision trees: Hunt's algorithm

- $X_t$: the set of training records for node $t$
- $y = \{y_1, \ldots, y_c\}$: class labels
- Step 1: If all records in $X_t$ belong to the same class $y_t$, then $t$ is a leaf node labeled as $y_t$
- Step 2: If $X_t$ contains records that belong to more than one class:
  - Select an attribute test condition to partition the records into smaller subsets
  - Create a child node for each outcome of the test condition
  - Apply the algorithm recursively to each child

[Figure: decision-tree construction (example)]

## Design issues

- How should the training records be split?
- How should the splitting procedure stop?

## Splitting methods

- Binary attributes
- Nominal attributes
- Ordinal attributes
- Continuous attributes
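For continuous attributes, one common way to form a binary test condition (an assumption here, since the lecture text only names the case) is to compare the attribute with a threshold, taking candidate thresholds at the midpoints between consecutive distinct sorted values. The sketch below enumerates those candidates; the function names and the sample ages are hypothetical.

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct values of a continuous attribute."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def binary_partition(records, attribute, threshold):
    """Split records on the test condition: attribute < threshold."""
    left = [r for r in records if r[attribute] < threshold]
    right = [r for r in records if r[attribute] >= threshold]
    return left, right


# Hypothetical records with a numeric age attribute.
records = [{"age": a} for a in (25, 32, 32, 41, 47)]
thresholds = candidate_thresholds(r["age"] for r in records)
print(thresholds)  # [28.5, 36.5, 44.0]

left, right = binary_partition(records, "age", thresholds[1])
print(len(left), len(right))  # 3 2
```

Each candidate threshold yields one possible binary split; choosing among them requires a measure of how good a split is.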

## Selecting the best split

- $p(i|t)$: fraction of records belonging to class $i$
- The best split is selected based on the degree of impurity of the child nodes
  - Class distribution (0, 1) has high purity
  - Class distribution (0.5, 0.5) has the smallest purity (highest impurity)
- Intuition: high purity ⇒ small value of impurity measures ⇒ better split

## Selecting the best split: Impurity measures

$p(i|t)$: fraction of records associated with node $t$ belonging to class $i$

$$\text{Entropy}(t) = -\sum_{i=1}^{c} p(i|t) \log p(i|t)$$

$$\text{Gini}(t) = 1 - \sum_{i=1}^{c} \left[p(i|t)\right]^2$$

$$\text{Classification error}(t) = 1 - \max_i p(i|t)$$

[Figure: range of impurity measures]

## Impurity measures

- In general, the different impurity measures are consistent
- Gain of a test condition: compare the impurity of the parent node with the impurity of the child nodes

$$\Delta = I(\text{parent}) - \sum_{j=1}^{k} \frac{N(v_j)}{N} I(v_j)$$

where $N$ is the number of records at the parent node, $k$ is the number of outcomes of the test condition, and $N(v_j)$ is the number of records associated with child node $v_j$.

- Maximizing the gain is equivalent to minimizing the weighted average impurity measure of the child nodes
- If $I() = \text{Entropy}()$, then $\Delta_{info}$ is called the information gain

[Figure: computing gain (example)]
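The impurity measures and the gain of a test condition can be checked numerically with a short sketch. It assumes base-2 logarithms for the entropy and represents each node by its per-class record counts; the function names are illustrative. The counts used are those of the buys_computer training set (9 "yes" and 5 "no" records) split on age.

```python
import math


def class_fractions(counts):
    """Turn per-class record counts at a node into the fractions p(i|t)."""
    total = sum(counts)
    return [c / total for c in counts]


def entropy(p):
    # Terms with p(i|t) = 0 contribute nothing (0 * log 0 is taken as 0).
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def gini(p):
    return 1 - sum(pi ** 2 for pi in p)


def classification_error(p):
    return 1 - max(p)


def gain(parent_counts, children_counts, impurity):
    """Parent impurity minus the weighted average impurity of the children."""
    n = sum(parent_counts)
    weighted = sum(
        sum(child) / n * impurity(class_fractions(child))
        for child in children_counts
    )
    return impurity(class_fractions(parent_counts)) - weighted


# Counts are (yes, no). Splitting on age gives the children
# <=30: (2, 3), 31..40: (4, 0), >40: (3, 2).
parent = (9, 5)
children = [(2, 3), (4, 0), (3, 2)]
print(round(gain(parent, children, entropy), 3))  # information gain, about 0.247
print(gini([0.5, 0.5]), gini([0.0, 1.0]))         # 0.5 and 0.0
```

Passing `gini` or `classification_error` as the `impurity` argument evaluates the same split under the other two measures.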

## Is minimizing impurity / maximizing Δ enough?

- Impurity measures favor attributes with a large number of values
- A test condition with a large number of outcomes may not be desirable
  - The number of records in each partition is too small to make predictions

## Gain ratio

$$\text{Gain ratio} = \frac{\Delta_{info}}{\text{SplitInfo}}, \qquad \text{SplitInfo} = -\sum_{i=1}^{k} p(v_i) \log p(v_i)$$

- $k$: total number of splits
- If each attribute value has the same number of records, SplitInfo $= \log k$
- Large number of splits ⇒ large SplitInfo ⇒ small gain ratio

## Constructing decision trees (pseudocode)

GenDecTree(Sample S, Features F)

1. If stopping_condition(S, F) = true then
   a. leaf = createNode()
   b. leaf.label = Classify(S)
   c. return leaf
2. root = createNode()
3. root.test_condition = findBestSplit(S, F)
4. V = {v | v is a possible outcome of root.test_condition}
5. For each value v ∈ V:
   a. S_v = {s | root.test_condition(s) = v and s ∈ S}
   b. child = GenDecTree(S_v, F)
   c. Add child as a descendant of root and label the edge (root → child) as v
6. return root
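The pseudocode can be turned into a small runnable sketch. This is one possible reading, not the lecture's implementation: it assumes nominal attributes with multiway splits, uses information gain (base-2 entropy) inside `find_best_split`, stops when a node is pure or no attributes remain, and removes an attribute from F once it has been used. The data are the buys_computer training records from this lecture.

```python
import math
from collections import Counter

FEATURES = ["age", "income", "student", "credit_rating"]
ROWS = """
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no
"""
# Each record is a pair (attribute dict, class label).
SAMPLE = [(dict(zip(FEATURES, row[:4])), row[4])
          for row in (line.split() for line in ROWS.strip().splitlines())]


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def classify(records):
    """Majority class label among the records at a node."""
    return Counter(label for _, label in records).most_common(1)[0][0]


def stopping_condition(records, features):
    # Stop when the node is pure or there is nothing left to split on.
    return len({label for _, label in records}) == 1 or not features


def find_best_split(records, features):
    """Return the attribute with the largest information gain."""
    parent = entropy([label for _, label in records])

    def info_gain(feature):
        groups = {}
        for attrs, label in records:
            groups.setdefault(attrs[feature], []).append(label)
        children = sum(len(g) / len(records) * entropy(g) for g in groups.values())
        return parent - children

    return max(features, key=info_gain)


def gen_dec_tree(records, features):
    if stopping_condition(records, features):
        return {"label": classify(records)}
    attribute = find_best_split(records, features)
    root = {"test": attribute, "children": {}}
    remaining = [f for f in features if f != attribute]  # assumption: use each attribute once
    for value in sorted({attrs[attribute] for attrs, _ in records}):
        subset = [(attrs, label) for attrs, label in records if attrs[attribute] == value]
        root["children"][value] = gen_dec_tree(subset, remaining)
    return root


tree = gen_dec_tree(SAMPLE, FEATURES)
print(tree["test"])              # age
print(tree["children"]["<=30"])  # subtree that tests student
```

On these 14 records the sketch selects age at the root, student under age <=30, and credit_rating under age >40, which matches the tree shown for buys_computer.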

## Stopping criteria for tree induction

- Stop expanding a node when all the records belong to the same class
- Stop expanding a node when all the records have similar attribute values
- Early termination

## Advantages of decision trees

- Inexpensive to construct
- Extremely fast at classifying unknown records
- Easy to interpret for small-sized trees
- Accuracy is comparable to other classification techniques for many simple data sets

## Example: C4.5 algorithm

- Simple depth-first construction
- Uses information gain
- Sorts continuous attributes at each node
- Needs the entire data set to fit in memory
- Unsuitable for large datasets
- The software can be downloaded from: http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz

## Practical problems with classification

- Underfitting and overfitting
- Missing values
- Cost of classification

## Underfitting and overfitting

- 500 circular and 500 triangular data points
- Circular points: $0.5 \le \sqrt{x_1^2 + x_2^2} \le 1$
- Triangular points: $\sqrt{x_1^2 + x_2^2} > 1$ or $\sqrt{x_1^2 + x_2^2} < 0.5$
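A data set of this shape can be generated with a short sketch. The class conditions are the ones stated for the circular and triangular points; the sampling scheme (uniform draws from a square of half-width 1.5, kept until each class has 500 points), the seed, and the names are assumptions made for illustration.

```python
import math
import random


def point_class(x1, x2):
    """Circular points lie in the ring 0.5 <= r <= 1; all others are triangular."""
    r = math.sqrt(x1 ** 2 + x2 ** 2)
    return "circle" if 0.5 <= r <= 1 else "triangle"


def sample_points(n_per_class=500, half_width=1.5, seed=0):
    """Rejection-sample uniformly from a square until both classes are full."""
    rng = random.Random(seed)
    points = {"circle": [], "triangle": []}
    while any(len(v) < n_per_class for v in points.values()):
        x1 = rng.uniform(-half_width, half_width)
        x2 = rng.uniform(-half_width, half_width)
        cls = point_class(x1, x2)
        if len(points[cls]) < n_per_class:
            points[cls].append((x1, x2))
    return points


data = sample_points()
print(len(data["circle"]), len(data["triangle"]))  # 500 500
```

Because the class boundary is a pair of circles, a tree with axis-parallel tests needs many splits to approximate it, which makes this data set a convenient one for observing both underfitting and overfitting.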

Underfitting: when the model is too simple, both training and test errors are large.

## Overfitting due to noise

- The decision boundary is distorted by a noise point

## Overfitting due to insufficient samples

- The lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels of that region
- An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task

## Overfitting: course of action

- Overfitting results in decision trees that are more complex than necessary
- Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
- New ways of estimating errors are needed

## Methods for estimating the error

- Re-substitution errors: error on training ($\sum e(t)$)
- Generalization errors: error on testing ($\sum e'(t)$)
- Methods for estimating generalization errors:
  - Optimistic approach: $e'(t) = e(t)$
  - Pessimistic approach:
    - For each leaf node: $e'(t) = e(t) + 0.5$
    - Total errors: $e'(T) = e(T) + N \times 0.5$ ($N$: number of leaf nodes)
    - For a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances): training error = 10/1000 = 1%, generalization error = (10 + 30 × 0.5)/1000 = 2.5%
  - Reduced error pruning (REP): uses a validation data set to estimate the generalization error

## Addressing overfitting: Occam's razor

- Given two models with similar generalization errors, one should prefer the simpler model over the more complex model
- For complex models, there is a greater chance that the model was fitted accidentally by errors in the data
- Therefore, one should include model complexity when evaluating a model
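The pessimistic estimate is one simple way to include model complexity in the evaluation, since every leaf adds a fixed penalty of 0.5 errors. The sketch below reproduces the 30-leaf example and compares it with a second, smaller tree whose numbers are hypothetical and chosen only for illustration.

```python
def training_error(errors, n_records):
    """Re-substitution error rate: training errors over training records."""
    return errors / n_records


def pessimistic_error(errors, n_leaves, n_records, penalty=0.5):
    """Training errors plus a per-leaf penalty, as a rate over all records."""
    return (errors + n_leaves * penalty) / n_records


# Tree from the lecture: 30 leaves, 10 training errors, 1000 instances.
print(training_error(10, 1000))         # 0.01
print(pessimistic_error(10, 30, 1000))  # 0.025

# Hypothetical smaller tree: more training errors but far fewer leaves.
print(training_error(15, 1000))         # 0.015
print(pessimistic_error(15, 10, 1000))  # 0.02
```

Under the optimistic approach the larger tree looks better (1% against 1.5%), while the pessimistic estimate prefers the smaller tree (2% against 2.5%), which is the kind of trade-off Occam's razor asks for.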

## Addressing overfitting: post-pruning

- Grow the decision tree to its entirety
- Trim the nodes of the decision tree in a bottom-up fashion
- If the generalization error improves after trimming, replace the sub-tree by a leaf node
- The class label of the leaf node is determined from the majority class of instances in the sub-tree
- Minimum description length (MDL) can be used for post-pruning

## Addressing overfitting: pre-pruning

- Stop the algorithm before it becomes a fully grown tree
- Typical stopping conditions for a node:
  - Stop if all instances belong to the same class
  - Stop if all the attribute values are the same
- More restrictive conditions:
  - Stop if the number of instances is less than some user-specified threshold
  - Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)

## Decision boundary for decision trees

[Figure: a decision tree with the test conditions x < 0.43, y < 0.33, and y < 0.47, shown next to the partition of the unit square (x and y from 0 to 1) that it induces]

- The border line between two neighboring regions of different classes is known as the decision boundary
- The decision boundary in decision trees is parallel to the axes because each test condition involves a single attribute at a time

## Oblique Decision Trees

[Figure: a single test condition x + y < 1 separating the two classes]

- A test condition may involve multiple attributes
- More expressive representation
- Not all datasets can be partitioned optimally using test conditions involving single attributes
- Finding the optimal test condition is computationally expensive
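The difference between the two kinds of test condition can be illustrated with a small sketch. It labels a grid of points in the unit square with the oblique condition x + y < 1 and then searches every single-attribute threshold test for the best accuracy it can reach on its own. The grid, the function names, and the exhaustive search are illustrative assumptions.

```python
def oblique_test(x, y):
    """Test condition involving both attributes at once."""
    return x + y < 1


def best_axis_parallel_accuracy(points, labels):
    """Best accuracy of any single test of the form x < v or y < v."""
    best = 0.0
    for axis in (0, 1):
        for threshold in sorted({p[axis] for p in points}):
            predicted = [p[axis] < threshold for p in points]
            acc = sum(a == b for a, b in zip(predicted, labels)) / len(points)
            # 1 - acc covers the same test with the two leaf labels swapped.
            best = max(best, acc, 1 - acc)
    return best


# 10 x 10 grid inside the unit square, offset so no point lies on x + y = 1.
grid = [(i / 10 + 0.02, j / 10 + 0.02) for i in range(10) for j in range(10)]
labels = [oblique_test(x, y) for x, y in grid]

oblique_accuracy = sum(oblique_test(x, y) == l for (x, y), l in zip(grid, labels)) / len(grid)
single_split_accuracy = best_axis_parallel_accuracy(grid, labels)
print(oblique_accuracy, single_split_accuracy)
```

One oblique test separates the classes exactly, while any single axis-parallel test misclassifies part of the grid, so an axis-parallel tree needs a staircase of splits to approximate the diagonal boundary.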
