- 1. Machine Learning Machine Learning 1
- 2. Machine Learning Machine Learning Up until now: how to reason in a give model Machine learning: how to acquire a model on the basis of data / experience ◦Learning parameters (e.g. probabilities) ◦Learning structure (e.g. BN graphs) ◦Learning hidden concepts (e.g. clustering) 2
- 3. Machine Learning Lingo Machine Learning Lingo 3
- 4. Occam’s Razor Occam’s Razor 4 Commonly attributed to William of Ockham (1290-1349). This was formulated about fifteen hundred years after Epicurus. ◦In sharp contrast to the principle of multiple explanations, it states: Entities should not be multiplied beyond necessity. Commonly explained as: when have choices, choose the simplest theory. Bertrand Russell: “It is vain to do with more what can be done with fewer.”
- 5. Supervised Machine Learning Supervised Machine Learning Given a training set: (x1, y1), (x2, y2), (x3, y3), … (xn, yn) Where each yi was generated by an unknown y = f (x), Discover a function h that approximates the true function f. 5
- 6. Classification Example: Spam Filter Classification Example: Spam Filter Input: x = email Output: y = “spam” or “ham” Setup: ◦Get a large collection of example emails, each labeled “spam” or “ham” ◦Note: someone has to hand label all this data! ◦Want to learn to predict labels of new, future emails Features: The attributes used to make the ham / spam decision ◦Words: FREE! ◦Text Patterns: $dd, CAPS ◦Non-text: SenderInContacts ◦… Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. … TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION EMAIL ADDRESSES FOR ONLY $99 Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened. 6
- 7. A Spam Filter A Spam Filter Naïve Bayes spam filter Data: ◦Collection of emails, labeled spam or ham ◦Note: someone has to hand label all this data! ◦Split into training, held- out, test sets Classifiers ◦Learn on the training set ◦(Tune it on a held-out set) ◦Test it on new emails Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. … TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION EMAIL ADDRESSES FOR ONLY $99 Ok, Iknow this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened. 7
- 8. Example Example SPAM OFFER IS SECRET CLICK SECRET LINK SECRET SPORTS LINK 8 HAM PLAY SPORTS TODAY WENT PLAY SPORTS SECRET SPORTS EVENT SPORT IS TODAY SPORT COSTS MONEY Questions: ◦Size of Vocabulary? ◦P(SPAM) = 13 words 3/8
- 9. Maximum Likelihood Maximum Likelihood 9 S S S H H H H H H p(S) = �� ◦�� �� �� �� �� �� �� �� �� �� �� �� �� = �� ���� �� �� =�� 1−�� ���� �� �� =�� �� ���� �� �� =�� 1−�� ���� �� �� =�� �� ���� �� �� �� �� �� �� �� �� =�� �� ���� �� �� =�� 1−�� ���� �� �� =�� 1−�� ���� �� �� �� �� �� �� �� �� =�� �� ���� �� �� =�� 1−�� ���� �� �� =�� �� ���� �� �� =�� 1−�� ���� �� �� =�� 1 1 1 0 0 0 0 0 ◦�� �� �� �� �� �� �� �� �� �� �� �� �� = �� �� �� �� �� �� �� �� �� �� �� �� �� �� �� �� �� �� ∗ 1−�� 1−�� �� 1−�� 1−�� 1−�� 1−�� 1−�� �� 1−�� �� 1−�� 1−�� �� �� 1−�� �� 1−�� 1−�� �� ◦�� �������� �������� �������� = ��=1 8 �� �� �� ��=1 ��=1 8 �� �� �� 8 ��=1 8 �� �� �� �� �� �� �� �� �� �� �� �� �� �� �� �� ��=1 8 �� �� �� = �� ����������( �� �� =1) �� �� ����������( �� �� =1) ����������( �� �� �� �� �� �� �� �� =1) �� ����������( �� �� =1) ∗ 1−�� ���������� �� �� =0 1−�� 1−�� 1−�� 1−�� ���������� �� �� =0 ���������� �� �� =0 �� �� �� �� �� �� �� �� =0 �� �� =0 1−�� ���������� �� �� =0 ◦ �� 3 �� �� 3 3 �� 3 ∗ 1−�� 5 1−�� 1−�� 1−�� 1−�� 5 5 1−�� 5 3 5
- 10. Slide61 SPAM OFFER IS SECRET CLICK SECRET LINK SECRET SPORTS LINK 10 HAM PLAY SPORTS TODAY WENT PLAY SPORTS SECRET SPORTS EVENT SPORT IS TODAY SPORT COSTS MONEY Questions: ◦P(“SECRET” | SPAM) = ◦P(“SECRET” | HAM) = 1/3 1/15
- 11. Naïve Bayes for Text Naïve Bayes for Text Bag-of-Words Naïve Bayes: ◦Predict unknown class label (spam vs. ham) ◦Assume evidence features (e.g. the words) are independent Generative model Tied distributions and bag-of-words ◦Usually, each variable gets its own conditional probability distribution P(F|Y) ◦In a bag-of-words model Each position is identically distributed All positions share the same conditional probs P(W|C) Why make this assumption? Word at position i, not ith word in the dictionary! 11
- 12. General Naïve Bayes General Naïve Bayes General probabilistic model: General naive Bayes model: We only specify how each feature depends on the class Total number of parameters is linear in n Y F1 Fn F2 |Y| parameters n x |F| x |Y| parameters |Y| x |F|n parameters 12
- 13. Slide62 SPAM OFFER IS SECRET CLICK SECRET LINK SECRET SPORTS LINK 13 HAM PLAY SPORTS TODAY WENT PLAY SPORTS SECRET SPORTS EVENT SPORT IS TODAY SPORT COSTS MONEY Questions: ◦MESSAGE M = “SPORTS” ◦P(SPAM | M) = 3/18 Applying Bayes’ Rule
- 14. Slide63 SPAM OFFER IS SECRET CLICK SECRET LINK SECRET SPORTS LINK 14 HAM PLAY SPORTS TODAY WENT PLAY SPORTS SECRET SPORTS EVENT SPORT IS TODAY SPORT COSTS MONEY Questions: ◦MESSAGE M = “SECRET IS SECRET” ◦P(SPAM | M) = 25/26 Applying Bayes’ Rule
- 15. Slide64 SPAM OFFER IS SECRET CLICK SECRET LINK SECRET SPORTS LINK 15 HAM PLAY SPORTS TODAY WENT PLAY SPORTS SECRET SPORTS EVENT SPORT IS TODAY SPORT COSTS MONEY Questions: ◦MESSAGE M = “TODAY IS SECRET” ◦P(SPAM | M) = 0 Applying Bayes’ Rule
- 16. Example: Spam Filtering Example: Spam Filtering Model: What are the parameters? Where do these tables come from? the : 0.0156 to : 0.0153 and : 0.0115 of : 0.0095 you : 0.0093 a : 0.0086 with: 0.0080 from: 0.0075 ... the : 0.0210 to : 0.0133 of : 0.0119 2002: 0.0110 with: 0.0108 from: 0.0107 and : 0.0105 a : 0.0100 ... ham : 0.66 spam: 0.33 Counts from examples! 16
- 17. Example: Overfitting Example: Overfitting Posteriors determined by relative probabilities (odds ratios): south-west : inf nation : inf morally : inf nicely : inf extent : inf seriously : inf ... What went wrong here? screens : inf minute : inf guaranteed : inf $205.00 : inf delivery : inf signature : inf ... 17
- 18. Generalization and Overfitting Generalization and Overfitting Raw counts will overfit the training data! ◦Unlikely that every occurrence of “minute” is 100% spam ◦Unlikely that every occurrence of “seriously” is 100% ham ◦What about all the words that don’t occur in the training set at all? 0/0? ◦In general, we can’t go around giving unseen events zero probability At the extreme, imagine using the entire email as the only feature ◦Would get the training data perfect (if deterministic labeling) ◦Would not generalize at all ◦Just making the bag-of-words assumption gives us some generalization, but isn’t enough To generalize better: we need to smooth or regularize the estimates 18
- 19. Estimation: Smoothing Estimation: Smoothing Maximum likelihood estimates: Problems with maximum likelihood estimates: ◦If I flip a coin once, and it’s heads, what’s the estimate for P(heads)? ◦What if I flip 10 times with 8 heads? ◦What if I flip 10M times with 8M heads? Basic idea: ◦We have some prior expectation about parameters (here, the probability of heads) ◦Given little evidence, we should skew towards our prior ◦Given a lot of evidence, we should listen to the data r g g 19
- 20. Estimation: Laplace Smoothing Estimation: Laplace Smoothing Laplace’s estimate (extended): ◦Pretend you saw every outcome k extra times c (x) is the number of occurrences of this value of the variable x. |x| is the number of values that the variable x can take on. k is a smoothing parameter. N is the total number of occurrences of x (the variable, not the value) in the sample size. ◦What’s Laplace with k = 0? ◦k is the strength of the prior Laplace for conditionals: ◦Smooth each condition independently: 20
- 21. Estimation: Linear Interpolation Estimation: Linear Interpolation In practice, Laplace often performs poorly for P(X|Y): ◦When |X| is very large ◦When |Y| is very large Another option: linear interpolation ◦Also get P(X) from the data ◦Make sure the estimate of P(X|Y) isn’t too different from P(X) ◦What if is 0? 1? 21
- 22. Real NB: Smoothing Real NB: Smoothing For real classification problems, smoothing is critical New odds ratios: helvetica : 11.4 seems : 10.8 group : 10.2 ago : 8.4 areas : 8.3 ... verdana : 28.8 Credit : 28.4 ORDER : 27.2 : 26.9 money : 26.5 ... Do these make more sense? 22
- 23. Tuning on Held-Out Data Tuning on Held-Out Data Now we’ve got two kinds of unknowns ◦Parameters: the probabilities P(Y|X), P(Y) ◦Hyperparameters, like the amount of smoothing to do: k How to learn? ◦Learn parameters from training data ◦Must tune hyperparameters on different data Why? ◦For each value of the hyperparameters, train and test on the held-out (validation)data ◦Choose the best value and do a final test on the test data 23
- 24. How to Learn How to Learn Data: labeled instances, e.g. emails marked spam/ham ◦Training set ◦Held out (validation) set ◦Test set Features: attribute-value pairs which characterize each x Experimentation cycle ◦Learn parameters (e.g. model probabilities) on training set ◦Tune hyperparameters on held-out set ◦Compute accuracy on test set ◦Very important: never “peek” at the test set! Evaluation ◦Accuracy: fraction of instances predicted correctly Overfitting and generalization ◦Want a classifier which does well on test data ◦Overfitting: fitting the training data very closely, but not generalizing well to test data Training Data Held-Out Data Test Data 24
- 25. What to Do About Errors? What to Do About Errors? Need more features– words aren’t enough! ◦Have you emailed the sender before? ◦Have 1K other people just gotten the same email? ◦Is the sending information consistent? ◦Is the email in ALL CAPS? ◦Do inline URLs point where they say they point? ◦Does the email address you by (your) name? Can add these information sources as new variables in the Naïve Bayes model 25
- 26. A Digit Recognizer A Digit Recognizer Input: x = pixel grids Output: y = a digit 0-9 26
- 27. Example: Digit Recognition Example: Digit Recognition Input: x = images (pixel grids) Output: y = a digit 0-9 Setup: ◦Get a large collection of example images, each labeled with a digit ◦Note: someone has to hand label all this data! ◦Want to learn to predict labels of new, future digit images Features: The attributes used to make the digit decision ◦Pixels: (6,8)=ON ◦Shape Patterns: NumComponents, AspectRatio, NumLoops ◦… 0 1 2 1 ?? 27
- 28.
Naïve Bayes for Digits
Naïve Bayes for Digits Simple version: ◦One feature Fij for each grid position
*◦Boolean features ◦Each input maps to a feature vector, e.g. ◦Here: lots of features, each is binary valued Naïve Bayes model: 28* - 29. Learning Model Parameters Learning Model Parameters 29
- 30. Problem: Overfitting Problem: Overfitting 2 wins!! 30
- 31. Regression Regression Start with very simple example ◦Linear regression What you learned in high school math ◦From a new perspective Linear model ◦y = m x + b ◦hw(x) = y = w1 x + w0 Find best values for parameters ◦“maximize goodness of fit” ◦“maximize probability” or “minimize loss” 31
- 32. Regression: Minimizing Loss Regression: Minimizing Loss ◦Assume true function f is given by y = f (x) = m x + b + noise where noise is normally distributed ◦Then most probable values of parameters found by minimizing squared-error loss: Loss(hw ) = Σj (yj – hw(xj))2 32
- 33. Regression: Minimizing Loss Regression: Minimizing Loss 33
- 34. Regression: Minimizing Loss Regression: Minimizing Loss y = w1 x + w0 Linear algebra gives an exact solution to the minimization problem 34
- 35. Linear Algebra Solution Linear Algebra Solution 35
- 36. Don’t Always Trust Linear Models Don’t Always Trust Linear Models 36
- 37. Regression by Gradient Descent Regression by Gradient Descent w = any point loop until convergence do: for each wi in w do: wi = wi – α ∂ Loss(w) ∂ wi 37
- 38. Multivariate Regression Multivariate Regression You learned this in math class too ◦hw(x) = w ∙ x = w xT = Σi wi xi The most probable set of weights, w* (minimizing squared error): ◦w* = (XT X)-1 XT y 38
- 39. Overfitting Overfitting To avoid overfitting, don’t just minimize loss Maximize probability, including prior over w Can be stated as minimization: ◦Cost(h) = EmpiricalLoss(h) + λ Complexity(h) For linear models, consider ◦Complexity(hw) = Lq(w) = ∑i | wi |q ◦L1 regularization minimizes sum of abs. values ◦L2 regularization minimizes sum of squares 39
- 40. Regularization and Sparsity Regularization and Sparsity L1 regularization L2 regularization Cost(h) = EmpiricalLoss(h) + λ Complexity(h) 40
- 41. Linear Separator Linear Separator - 41
- 42. Perceptron Perceptron - 42
- 43. Perceptron Algorithm Perceptron Algorithm Start with random w0, w1 Pick training example Update (α is learning rate) ◦w1 w1+α(y-f(x))x ◦w0 w0+α(y-f(x)) Converges to linear separator (if exists) Picks “a” linear separator (a good one?) 43
- 44. What Linear Separator to Pick? What Linear Separator to Pick? - 44
- 45. What Linear Separator to Pick? What Linear Separator to Pick? - Maximizes the “margin” Support Vector Machines 45
- 46. Non-Separable Data? Non-Separable Data? Not linearly separable for x1, x2 What if we add a feature? x3= x12+x22 See: “Kernel Trick” 46 X1 X2 - - X3
- 47. Nonparametric Models Nonparametric Models If the process of learning good values for parameters is prone to overfitting, can we do without parameters?
- 48. Nearest-Neighbor Classification Nearest-Neighbor Classification Nearest neighbor for digits: ◦Take new image ◦Compare to all training images ◦Assign based on closest example Encoding: image is vector of intensities: What’s the similarity function? ◦Dot product of two images vectors? ◦Usually normalize vectors so ||x|| = 1 ◦min = 0 (when?), max = 1 (when?) 48
- 49. Earthquakes and Explosions Earthquakes and Explosions Using logistic regression (similar to linear regression) to do linear classification 49
- 50. K=1 Nearest Neighbors K=1 Nearest Neighbors Using nearest neighbors to do classification 50
- 51. K=5 Nearest Neighbors K=5 Nearest Neighbors Even with no parameters, you still have hyperparameters! 51
- 52. Curse of Dimensionality Curse of Dimensionality Average neighborhood size for 10-nearest neighbors, n dimensions, 1M uniform points 52
- 53. Curse of Dimensionality Curse of Dimensionality Proportion of points that are within the outer shell, 1% of thickness of the hypercube 53
- 54. Slide65 References: ◦Peter Norvig and Sebastian Thrun, Artificial Intelligence, Stanford University http://www.stanford.edu/class/cs221/notes/cs221-lecture5- fall11.pdf 54

Thank you for your comment.