Some thoughts on evaluation Eamonn Keogh [email protected] Quick Reminder We saw the nearest neighbor algorithm. As we saw that we could use any distance function Quick Reminder We saw we could use the Euclidean distance. Mantled Howler Monkey Alouatta palliata Euclidean Distance Red Howler Monkey Alouatta seniculus

Quick Reminder We saw we could use the DTW distance Lowland Gorilla Gorilla gorilla graueri DTW Alignment Mountain Gorilla One-to-many mapping Gorilla gorilla beringei Now we can ask ourselves a few questions.. Considering only 1NN for simplicity Is the DTW distance better than Euclidean distance? Is anything better than DTW distance? If the XYZ measure better than DTW on some datasets, that is still useful, right? I can publish a paper The XYZ measure is good for classifying some datasets, maybe your dataset!

After all if you think of track and field events as datasets and people as algorithms. Then maybe Ashton is like DTW pretty good at almost everything. But, you do have people like Carl who are very good a just one or two things. There is a place in the world for Carl, may be there is a place for the XYZ algorithm. In fact, in the last ten years, a few hundred such algorithms have been published.. Ashton Eaton Carl Myerscough Can you beat 1NN-DTW? The Texas Sharpshooter Fallacy A paper in SIGMOD 2016 claims Our STS3 approach is more accurate than DTW in our suitable scenarios. They then note DTW outperforms STS3 in 79.5% cases. !?! (our emphasis) They then do a post-hoc explanation of why they think they won on 20.5% of the cases that suit them. The problem is the post-hoc analysis, this is a form of the Texas Sharpshooter

Fallacy. Below is a visual representation. This is what they show you, and you are impressed Can you beat 1NN-DTW? The Texas Sharpshooter Fallacy A paper in SIGMOD 2016 claims Our STS3 approach is more accurate than DTW in our suitable scenarios. They then note DTW outperforms STS3 in 79.5% cases. !?! (our emphasis) They then do a post-hoc explanation of why they think they won on 20.5% of the cases that suit them. The problem is the post-hoc analysis, this is a form of the Texas Sharpshooter Fallacy. Below is a visual representation. This is what they show you, and you are impressed until you realize that they shot the arrow first, and then painted the target around it! Can you beat 1NN-DTW? A good visual trick to compare algorithms on the 80 or so labeled time series in the public domain is the Texas Sharpshooter plot.

In this example, we predicted the expected improvement would be 10%, and the actual improvement obtained was 7%, pretty close! We need to do these for all 80 or so datasets. What are the possible outcomes? Actual Accuracy Gain For each dataset First you compute the baseline accuracy of the approach you hope to beat. Then you compute the expected improvement we would get using your proposed approach (at this stage, learning any parameters and settings) using only the training data. Note that the expected improvement could be negative. Then compute the actual improvement obtained (using these now hardcoded parameters and settings) by testing on the test dataset. You can plot the point {expected improvement , actual improvement } in a 2D grid, as below. 7% 10% Expected Accuracy Gain

Can you beat 1NN-DTW? 3 of 3 With a Texas Sharpshooter plot, each dataset falls into one of four possibilities. We expected an improvement and we got it! This is clearly the best case. We expected to do worse, and we did. This is still a good case, we know not to use our proposed algorithm for these datasets We expected to do worse, but we did better. This is the wasted opportunity case. We expected to do better, but actually did worse. This is the worst case. Texas Sharpshooter Plot Expected Improvement: We will search over different warping window constraints, from 0% to 100%, in 1% increments, looking for the warping window size that gives the highest 1NN training accuracy (if there are ties, we choose the smaller warping window size). Actual Improvement: Using the warping window size we learned in the last phase, we test the holdout test data on the training set with 1NN. Actual Accuracy Gain Now that we know how to read the plots, we will use it to

see if DTW is better than Euclidean Distance, We expected to do worse, but we did better We expected an improvement and we got it! We expected to do worse, and we did We expected to do better, but actually did worse Expected Accuracy Gain

We sometimes had difficultly in predicting when DTW would be better/worse, but many of the training sets are tiny, making such tests very difficult. For example, 51 is BeetleFy, with just 20 train and 20 test instances. Here we expected to do a little better, but we did a little worse. 2.2 78 2 82 1.8 Actual Accuracy Gain

The results are strongly supportive of the claim DTW better than Euclidean distance for most problems 76 1.6 1.4 41 40 10 42 68 52 15 In contrast, for 76 (LargeKitchenAppliances)

we had 375 train and 375 test instances, and where able to more accurately predict a large improvement. 1.2 1 65 20 709 39 27 5 38 14 29 1385 60 3 59 64

11 198 32 36 81 44 2584323 45 37 174 67 54 48 62 1884 63 242526 16 83 34 56 86 21 30 12

73 49 55 22 4 77 31 7550 17 47 61 33 46 57 66 0.8 0.8 1 7 6 28

7280 79 71 35 53 69 51 1.2 1.4 1.6 Expected Accuracy Gain 1.8

2 2.2 Can you beat 1NN-DTW? Recall the paper in SIGMOD that claimed Our STS3 approach is more accurate than DTW in our suitable scenarios. And DTW outperforms STS3 in 79.5% cases. They are claiming to be better for 1/5th of the datasets, but in essence they are only reporting one axis of the Sharpshooter plot. They did (slightly) win 20.5% of the time. They lost 79.5% of the time. Breakeven point Can you beat 1NN-DTW? Recall the paper in SIGMOD that claimed Our STS3 approach is more accurate than DTW in our suitable scenarios. And DTW outperforms STS3 in 79.5% cases. They are claiming to be better for 1/5th of the datasets, but in essence they are only reporting one axis of the Sharpshooter plot. It may be* that if we computed the Sharpshooter plot it would look like the below (for two selected points only)

Actual Accuracy Gain They did (slightly) win 20.5% of the time, but they did not predict ahead of time that they would win. They lost 79.5% of the time. Moreover, on a huge fraction of the datasets they lost on, they might have said you should use our algorithm here, we think we will win, and we would have been much worse off! Expected Accuracy Gain Lets return to our toy example The reason why people miss the flaw in the previous slides, is because they forget that they need to know ahead of time which algorithm (which person) to use. In other situations, it is obvious! If I asked you to pick someone from the below to do a high jump, who would you pick? If I asked for someone to throw a hammer? Ashton Eaton

Carl Myerscough Can you beat 1NN-DTW? Summary: I think that at least 90% of the claims to beat DTW are wrong. Of the 10% of claims that remain They are beating the simplest 1NN-DTW with w learned in a simple way. Using KNN-DTW , smoothing the data, relaxing the endpoint constraints, better methods for learning w etc, would often close some or all the gap. The improvements are so small in most cases, it takes a sophisticated and sensitive test to be sure you have a real improvement. The Texas Sharpshooter test is a great sanity check for this, but you should see the work of Anthony Bagnall and students https://arxiv.org/abs/1602.01711 for a more principled methods. Flat-tailed Horned Lizard Phrynosoma mcallii Summary DTW is an extraordinarily powerful and useful tool.

Its uses are limited only by our imaginations. We believe it will remain and important part of the data mining toolbox for decades to come. Texas Horned Lizard Phrynosoma cornutum The Texas sharpshooter plot is a recent invention It is sort of a more visual version of a contingency table, or confusion matrix ROC curves ROC = Receiver Operating Characteristic Started in electronic signal detection theory (1940s - 1950s) Has become very popular method used in machine learning/data mining applications to assess classifiers ROC curves: simplest case Consider diagnostic test for a disease

Test has two possible outcomes: positive = suggesting presence of disease negative = suggesting absence of disease An individual can test either positive or negative for the disease For positive/negative think katydid/grasshopper, or male/female, or spam/not spam etc Contingencies Warning, may be flipped in different books/papers Our Classification Was Its NOT a Heart Attack!!! Heart Attack GOLD STANDARD TRUTH Was NOT a Heart Attack

A B Was a Heart Attack C D Some Terms MODEL PREDICTED NO EVENT EVENT NO EVENT TRUE NEGATIVE

B EVENT C TRUE POSITIVE GOLD STANDARD TRUTH Some More Terms MODEL PREDICTED NO EVENT EVENT NO EVENT A

FALSE POSITIVE (Type 1 Error) EVENT FALSE NEGATIVE (Type 2 Error) D GOLD STANDARD TRUTH Accuracy What does this mean? Contingency Table Interpretation (True Positives) + (True Negatives) (True Positives) + (True Negatives) + (False Positives) + (False Negatives) Is this a good measure?

Our Classification Was Its NOT a Heart Attack GOLD STANDARD TRUTH Heart Attack!!! Was NOT a Heart Attack A B Was a Heart Attack

C D Is accuracy a good measure? Suppose that one class is very rare (True Positives) + (True Negatives) (True Positives) + (True Negatives) + (False Positives) + (False Negatives) (0) + (99,996) (0) + (99,996) + (1) + (3) Accuracy = 99.996 Our Classification Was Not bomb Our accuracy is very high, but we missed three bombs! GOLD

STAND ARD TRUTH bomb Not bomb 99,996 1 bomb 3 0 Our Classification Was Not bomb GOLD

STAND ARD TRUTH bomb Not bomb 99,996 1 bomb 3 0 Even if it means increasing this number We would like to reduce this number Specific Example without

the disease with diseas e Test Result Call these patients negative Call these patients positive Test Result Call these patients negative Call these patients positive True Positives

Test Result without the disease with the disease Call these patients negative Call these patients positive Test Result False Positives Call these patients negative Call these patients positive True negatives

Test Result Call these patients negative Call these patients positive False negatives Test Result Key Idea: Threshold Although the Test Result was some value X. If I am more afraid of false positives, I can multiple X by 1.2 to nudge it to the right. Next slide Key Idea: Threshold Although the Test Result was some value X. If I am more afraid of false positives, I can multiple X by 1.2 to nudge it to the right.

Now I get no false positives (but more false negatives) Key Idea: Threshold Although the Test Result was some value X. If I am more afraid of false negatives, I can multiple X by 0.8 to nudge it to the left. Now I get no false negatives (but more false positives) 0% 100% ROC curve True Positive Rate 100 % False Positive Rate Our Classification Was

0 % Its NOT a Heart Attack Heart Attack!!! Was NOT a Heart Attack A B Was a Heart Attack C D

ROC curve True Positive Rate 100 % 0% 0% False Positive Rate 100% ROC curve comparison A good classifier 100% 100% True Positive Rate True Positive Rate

A poor classifier 0 % 0 % False Positive Rate 100% 0 % 0 % False Positive Rate 100%

ROC curve extremes Best Test: Worst test: 100% True Positive Rate True Positive Rate 100% 0 % 0 % 0 % False Positive

Rate 100 % 0 % False Positive Rate 100 % This can happen Red: Decision Tree Blue: Nave Bayes True Positive Rate 100% 0

% 0 % False Positive Rate 100% Area under ROC curve (AUC) Overall measure of test performance For continuous data, AUC equivalent to MannWhitney U-statistic (nonparametric test of difference in location between two populations) AUC for ROC curves 100% True Positive Rate 100% 0

% True Positive Rate AUC = 100% 0 % 0 % False Positive Rate 100 % False Positive Rate

100 % 100% 100% 0 % False Positive Rate True Positive Rate AUC = 90% True Positive Rate 0 %

0 % AUC = 50% 100 % 0 % AUC = 65% 0 % False Positive Rate 100 %

But how do we adjust the threshold? Our Classification Was Not bomb GOLD STANDARD TRUTH 10 9 8 7 6 5 4 bomb Not bomb

99,996 1 bomb 3 0 If the red squares are bombs, and we want to reduce false negatives, then we can shift the line away from them 3 2 1 1 2 3

4 5 6 7 8 9 10 Our Classification Was Not bomb GOLD STANDARD TRUTH 10 9

8 7 6 5 4 3 2 1 bomb Not bomb 99,996 1 bomb 3 0

If the red squares are bombs, and we want to reduce false negatives, then we can multiply all the distances to a red square by 0.8, before finding the nearest neighbor 1 2 3 4 5 6 7 8

9 10 Our Classification Was If the red squares are bombs, and we want to reduce false negatives, then we can multiply all the probably calculations for bomb by 1.2 p(not bomb| data) = 1/3 * 3/8 3/8 p(bomb | data) = 2/5 * 5/8 * 1.2 3/8 Not bomb GOLD STANDARD TRUTH bomb

Not bomb 99,996 1 bomb 3 0 Summary For many problems, accuracy is a poor measure of how well a classifier is doing. This is especially true if one class is rare and/or the misclassification costs vary. The ROC curve allows us to mitigate this problem. Recommend Reading An introduction to ROC analysis. Tom Fawcett

Classifier Accuracy Measures The sensitivity: the percentage of correctly predicted positive data over the total number of positive data TP TP Sensitivity TP FN Positive The specificity: the percentage of correctly identified negative data over the total number of negative data. TN TN Specificity TN FP Negative The accuracy: the percentage of correctly predicted positive and negative data over the sum of positive and negative data TP TN Accuracy TP TN FP FN