CS 1674: Intro to Computer Vision
Final Review
Prof. Adriana Kovashka
University of Pittsburgh
December 7, 2016

Final info
Format: multiple-choice, true/false, fill in the blank, short answers, apply an algorithm
Non-cumulative
I expect you to know how to do a convolution
Unlike last time, I'll have one handout with the exam questions, and a separate one where you're supposed to write the answers

Algorithms you should be able to apply

- K-means: apply a few iterations to a small example
- Mean-shift: see where a single point ends up
- Hough transform: write pseudocode only
- Hough transform: how can we use it to find the parameters (matrix) of a transformation, when we have noisy examples?
- Compute a spatial pyramid at level 1 (2x2 grid)
- Formulate the SVM objective and constraints (in math) and explain it

- Work through an example of zero-shot prediction
- Boosting: show how to increase weights
- Pedestrian detection: write high-level pseudocode

Algorithms you should be able to apply (cont'd)
- Compute neural network activations
- Compute SVM and softmax loss
- Show how to use weights to compute loss
- Show how to numerically compute the gradient
- Show one iteration of gradient descent (with the gradient computed for you)
- Apply convolution, ReLU, max pooling

- Compute output dimensions from convolution

Extra office hours
Monday, 3:30-5:30pm
Anyone for whom this does not work?

Requested topics
Convolutional neural networks (16 requests)

Hough transform (8)
Support vector machines (7)
Deformable part models (6)
Zero-shot learning (4)
Face detection (2)
Recurrent neural networks (2)
K-means / mean-shift (1)
Spatial pyramids (1)

Convolutional neural networks: requested subtopics
Backpropagation + meaning of weights and how they are computed (5 requests)
Math for neural networks + computing activations (4)
Gradients + gradient descent (3)
Convolution/non-linearity/pooling + convolution output size + architectures (3)
Losses and finding weights that minimize them (2)
Minibatch: are the training examples cycled over more than once? (1)
Effect of number of neurons and regularization (1)

Neural networks
Deep neural network
Figure from http://neuralnetworksanddeeplearning.com/chap5.html

Neural network definition
Activations: a_j = Σ_i w_ji x_i + b_j
Nonlinear activation function h (e.g. sigmoid, tanh): z_j = h(a_j)
Outputs: y = σ(a) (binary); y_k = exp(a_k) / Σ_j exp(a_j) (multiclass)

How can I write y1 as a function of x1 ... xD?
Figure from Christopher Bishop

Activation functions
Sigmoid: σ(x) = 1 / (1 + exp(-x))
tanh: tanh(x)
ReLU: max(0, x)
Adapted from Andrej Karpathy
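As a review aid, here is a minimal Python sketch (not from the slides) of computing activations for a small two-layer network with sigmoid hidden units; the layer sizes and weight values below are made-up examples.

```python
import math

def forward(x, W1, b1, W2, b2):
    """Forward propagation for a 2-layer network: a hidden layer with
    sigmoid activations, followed by linear output scores."""
    # Hidden activations: z_j = sigmoid(sum_i w_ji * x_i + b_j)
    hidden = []
    for w_row, b in zip(W1, b1):
        a = sum(w * xi for w, xi in zip(w_row, x)) + b
        hidden.append(1.0 / (1.0 + math.exp(-a)))  # sigmoid
    # Output scores: one linear combination of hidden units per class
    scores = [sum(w * h for w, h in zip(w_row, hidden)) + b
              for w_row, b in zip(W2, b2)]
    return hidden, scores

# Made-up weights: 2 inputs, 2 hidden units, 1 output
hidden, scores = forward([1, 2], [[0.5, -0.5], [1, 1]], [0, 0], [[1, 1]], [0])
```

With these numbers the hidden activations are sigmoid(-0.5) and sigmoid(3), and the single output score is their sum.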

Activation computation vs. training
When do I need to compute activations? How many times do I need to do that?
How many times do I need to train a network to extract features from it?
Activations: forward propagation (start from the inputs, compute activations from inputs to outputs)
Training: backward propagation (compute a loss at the outputs, backpropagate the error towards the inputs)

Backpropagation: Graphic example
First calculate the error of the output units and use this to change the top layer of weights.
(Figure: three-layer network with output units k, hidden units j, input units i.)

Update weights into j.
Adapted from Ray Mooney

Backpropagation: Graphic example
Next calculate the error for the hidden units based on the errors of the output units they feed into.

Adapted from Ray Mooney

Backpropagation: Graphic example
Finally update the bottom layer of weights based on the errors calculated for the hidden units.
Update weights into i.

Adapted from Ray Mooney

Loss gradients
Denoted (in different notations) as ∂L/∂W, ∇_W L, or dW: how does the loss change as a function of the weights?
We want to find those weights (change the weights in such a way) that makes the loss decrease as fast as possible.

Gradient descent
We'll update the weights by moving in the direction opposite to the gradient:
W = W - η ∇_W L   (η is the learning rate)
(Figure: loss surface over weights W_1, W_2; from the original W, step in the negative gradient direction.) Figure from Andrej Karpathy

Computing derivatives
In one dimension, the derivative of a function:
df(w)/dw = lim_{h -> 0} (f(w + h) - f(w)) / h
In multiple dimensions, the gradient is the vector of partial derivatives.
Andrej Karpathy

Computing derivatives

current W:
[0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]
loss 1.25347

W + h (first dim):
[0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]
loss 1.25322

First entry of the gradient:
(1.25322 - 1.25347) / 0.0001 = -2.5

gradient dW:
[-2.5, ?, ?, ?, ?, ?, ?, ?, ?, ...]
Andrej Karpathy
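The numeric-gradient recipe above, plus one gradient descent step, can be sketched in Python (illustrative code, not from the slides; the quadratic loss in the usage example is made up):

```python
def numerical_gradient(loss_fn, W, h=0.0001):
    """Approximate dL/dW one dimension at a time, as on the slide:
    dW[i] = (L(W + h*e_i) - L(W)) / h."""
    base = loss_fn(W)
    grad = []
    for i in range(len(W)):
        W_plus = list(W)
        W_plus[i] += h
        grad.append((loss_fn(W_plus) - base) / h)
    return grad

def gradient_descent_step(W, grad, lr=0.01):
    """One iteration of gradient descent: move opposite to the gradient."""
    return [w - lr * g for w, g in zip(W, grad)]

# Example with L(W) = sum of squares: true gradient at [1, 2] is [2, 4]
g = numerical_gradient(lambda W: sum(w * w for w in W), [1.0, 2.0])
W_new = gradient_descent_step([1.0, 2.0], g, lr=0.1)
```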

How to formulate losses?
Losses depend on the prediction functions (scores), e.g. f_W(x) = 3.2 for class cat. One set of weights for each class!
The prediction functions (scores) depend on the inputs (x) and the model parameters (W). Hence losses depend on W.
E.g. for a linear classifier the scores are f(x, W) = Wx; for a neural network, f(x, W) = W_2 max(0, W_1 x).

Linear classifier
f(x, W) = Wx (+b)
x: a [32x32x3] image stretched into a 3072x1 array of numbers 0...1
W: 10x3072 parameters, or weights
f(x, W): 10x1, i.e. 10 numbers indicating class scores
Andrej Karpathy

Neural network
In the second layer of weights, one set of weights to compute the probability of each class.

Linear classifier: SVM loss

Suppose: 3 training examples, 3 classes. With some W the scores f(x, W) are:

        cat image   car image   frog image
cat        3.2          1.3         2.2
car        5.1          4.9         2.5
frog      -1.7          2.0        -3.1

Multiclass SVM loss: given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the scores vector, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

Loss for the first (cat) image:
L_1 = max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1)
    = max(0, 2.9) + max(0, -3.9)
    = 2.9 + 0 = 2.9

Loss for the second (car) image:
L_2 = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
    = max(0, -2.6) + max(0, -1.9)
    = 0 + 0 = 0

Loss for the third (frog) image:
L_3 = max(0, 2.2 - (-3.1) + 1) + max(0, 2.5 - (-3.1) + 1)
    = max(0, 6.3) + max(0, 6.6)
    = 6.3 + 6.6 = 12.9

The full training loss is the mean over all examples in the training data:
L = (2.9 + 0 + 12.9) / 3 ≈ 5.27
Andrej Karpathy
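A runnable sketch of this computation (illustrative code, not from the slides), using the scores table above:

```python
def svm_loss(scores, correct):
    """Multiclass SVM (hinge) loss for one example:
    L_i = sum over j != y_i of max(0, s_j - s_{y_i} + 1)."""
    return sum(max(0.0, s - scores[correct] + 1.0)
               for j, s in enumerate(scores) if j != correct)

# Score columns from the slide; each is [cat, car, frog] scores
cat_img, car_img, frog_img = [3.2, 5.1, -1.7], [1.3, 4.9, 2.0], [2.2, 2.5, -3.1]
losses = [svm_loss(cat_img, 0), svm_loss(car_img, 1), svm_loss(frog_img, 2)]
# losses come out to [2.9, 0.0, 12.9]
full_loss = sum(losses) / len(losses)  # mean over training examples
```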

Linear classifier: SVM loss (weight regularization)
Full loss: L = (1/N) Σ_i L_i + λ R(W), where λ = regularization strength (hyperparameter)
In common use:
L2 regularization: R(W) = Σ_k Σ_l W_{k,l}^2
L1 regularization: R(W) = Σ_k Σ_l |W_{k,l}|
Dropout (will see later)
In the case of a neural network, regularization (dropout) turns some neurons off (they don't matter for computing an activation).
Adapted from Andrej Karpathy

Effect of regularization
Do not use the size of the neural network as a regularizer. Use stronger regularization instead:

(You can play with this demo at ConvNetJS: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html)
Andrej Karpathy

Effect of number of neurons
More neurons = more capacity
Andrej Karpathy

Softmax loss
        scores (unnormalized          unnormalized
        log probabilities)            probabilities           probabilities
cat          3.2        -- exp -->       24.5    -- normalize -->   0.13
car          5.1                        164.0                       0.87
frog        -1.7                          0.18                      0.00

L_i = -log(0.13)
Andrej Karpathy

Mini-batch gradient descent
In classic gradient descent, we compute the gradient from the loss for all training examples.
We could also use only some of the data for each gradient update, then cycle through all training samples.
Yes, we cycle through the training examples multiple times (each time we've cycled through all of them once is called an epoch).
This allows faster training (e.g. on GPUs) and parallelization.

A note on training
The more weights you need to learn, the more data you need.
That's why with a deeper network, you need more data for training than for a shallower network.
That's why if you have sparse data, you only train the last few layers of a deep net:
set the earlier layers to the already learned weights from another network, and learn the last layers on your own task.
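The softmax loss from the earlier slide, as a small Python sketch (illustrative, not from the slides):

```python
import math

def softmax_loss(scores, correct):
    """Softmax (cross-entropy) loss for one example: exponentiate the
    scores, normalize into probabilities, take -log of the correct one."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs, -math.log(probs[correct])

# Scores from the softmax slide (cat, car, frog), correct class cat:
probs, loss = softmax_loss([3.2, 5.1, -1.7], correct=0)
# probs come out to roughly [0.13, 0.87, 0.00]; loss = -log(0.13)
```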

Convolutional neural networks

Convolutional Neural Networks (CNN)
Feed-forward feature extraction:
1. Convolve input with learned filters
2. Apply non-linearity
3. Spatial pooling (downsample)
Pipeline: Input image -> Convolution (learned) -> Non-linearity -> Spatial pooling -> Output (class probs)
Adapted from Lana Lazebnik

1. Convolution
Apply learned filter weights
One feature map per filter
Stride can be greater than 1 (faster, less memory)
(Figure: input -> feature map.) Adapted from Rob Fergus

2. Non-linearity
Per-element (independent)
Options:
Tanh
Sigmoid: 1/(1+exp(-x))
Rectified linear unit (ReLU): max(0, x); avoids saturation issues
Adapted from Rob Fergus

3. Spatial pooling
Sum or max over non-overlapping / overlapping regions
Role of pooling:
Invariance to small transformations
Larger receptive fields (neurons see more of the input)
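Steps 2 and 3 (ReLU non-linearity and max pooling) can be sketched in plain Python (illustrative, with a made-up 4x4 feature map):

```python
def relu(feature_map):
    """Per-element non-linearity: max(0, x)."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Max over non-overlapping size x size regions."""
    pooled = []
    for r in range(0, len(feature_map), size):
        pooled.append([max(feature_map[r + dr][c + dc]
                           for dr in range(size) for dc in range(size))
                       for c in range(0, len(feature_map[0]), size)])
    return pooled

fmap = [[1, -2, 0, 3],
        [-1, 5, 2, -4],
        [0, 1, -1, 2],
        [3, -2, 0, 1]]
out = max_pool(relu(fmap))  # 4x4 input -> 2x2 output: [[5, 3], [3, 2]]
```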

Rob Fergus, figure from Andrej Karpathy

Convolutions: More detail
Convolution layer: 32x32x3 image, 5x5x3 filter.
1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)
Andrej Karpathy

Convolutions: More detail
Convolve (slide) the 5x5x3 filter over all spatial locations of the 32x32x3 image to get a 28x28x1 activation map.
Andrej Karpathy

Convolutions: More detail
For example, if we had 6 5x5 filters, we'll get 6 separate activation maps.
We stack these up to get a new image of size 28x28x6!

Andrej Karpathy

Convolutions: More detail
Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions.
e.g. 32x32x3 image -> CONV, ReLU (6 5x5x3 filters) -> 28x28x6 -> CONV, ReLU (10 5x5x6 filters) -> 24x24x10 -> CONV, ReLU -> ...
Andrej Karpathy

Convolutions: More detail
Preview [From recent Yann LeCun slides]
Andrej Karpathy

Convolutions with some stride
For an NxN input and an FxF filter:
Output size: (N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
Andrej Karpathy

Convolutions with some padding
In practice: common to zero pad the border.
e.g. 7x7 input, 3x3 filter, applied with stride 1, padded with a 1 pixel border => what is the output?
7x7 output!
In general, common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2 (will preserve size spatially):
F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
With padding, the output size is (N + 2*padding - F) / stride + 1.
Andrej Karpathy

Combining all three steps
Andrej Karpathy

A common architecture: AlexNet
Figure from http://www.mdpi.com/2072-4292/7/11/14680/htm
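The output-size arithmetic above can be wrapped in a small helper (illustrative sketch; the function name is made up):

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Spatial output size of a convolution:
    (N + 2*padding - F) / stride + 1.
    Raises if the filter does not fit evenly, as in the
    stride-3 example on the slide."""
    span = n + 2 * padding - f
    if span % stride != 0:
        raise ValueError("filter does not fit: non-integer output size")
    return span // stride + 1

# Examples from the slides:
# conv_output_size(7, 3, stride=1) -> 5
# conv_output_size(7, 3, stride=2) -> 3
# conv_output_size(7, 3, stride=1, padding=1) -> 7 (size preserved)
# conv_output_size(32, 5) -> 28
```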

Hough transform
(RANSAC won't be on the exam)

Least squares line fitting
Data: (x1, y1), ..., (xn, yn)
Line equation: y_i = m x_i + b
Find (m, b) to minimize
E = Σ_{i=1}^{n} (m x_i + b - y_i)^2
i.e. the squared difference, along the y axis, between where the line you found says each point is and where the point really is.
Adapted from Svetlana Lazebnik
You want to find a single line that explains all of the points in your data, but data may be noisy!

Outliers affect least squares fit
Kristen Grauman

Dealing with outliers: Voting

Voting is a general technique where we let the features vote for all models that are compatible with them.
Cycle through features, cast votes for model parameters.
Look for model parameters that receive a lot of votes.
Noise and clutter features? They will cast votes too, but typically their votes should be inconsistent with the majority of good features.
Adapted from Kristen Grauman

Finding lines in an image: Hough space
(Figure: a line y = m0 x + b0 in image space maps to the point (m0, b0) in Hough space.)
Connection between image (x, y) and Hough (m, b) spaces:
A line in the image corresponds to a point in Hough space.
Steve Seitz

Finding lines in an image: Hough space

(Figure: point (x0, y0) in image space; line in Hough space.)
Connection between image (x, y) and Hough (m, b) spaces:
A line in the image corresponds to a point in Hough space.
What does a point (x0, y0) in the image space map to?
Answer: the solutions of b = -x0 m + y0. This is a line in Hough space.
To go from image space to Hough space: given a set of points (x, y), find all (m, b) such that y = mx + b.
Adapted from Steve Seitz

Finding lines in an image: Hough space
What are the line parameters for the line that contains both (x0, y0) and (x1, y1)?
It is the intersection of the lines b = -x0 m + y0 and b = -x1 m + y1.
Steve Seitz

Finding lines in an image: Hough space

How can we use this to find the most likely parameters (m, b) for the most prominent line in the image space?
Let each edge point in image space vote for a set of possible parameters in Hough space.
Accumulate votes in a discrete set of bins; the parameters with the most votes indicate the line in image space.
Steve Seitz

Parameter space representation
P.V.C. Hough, Machine Analysis of Bubble Chamber Pictures, Proc. Int. Conf. High Energy Accelerators and Instrumentation, 1959
Use a polar representation for the parameter space:
ρ = x cos θ + y sin θ
Each line is a sinusoid in Hough parameter space.
Silvio Savarese

Algorithm outline
Initialize accumulator H to all zeros
For each feature point (x, y) in the image:
  θ = gradient orientation at (x, y)
  ρ = x cos θ + y sin θ
  H(θ, ρ) = H(θ, ρ) + 1
end
Find the value(s) of (θ*, ρ*) where H(θ, ρ) is a local maximum
The detected line in the image is given by ρ* = x cos θ* + y sin θ*
Svetlana Lazebnik
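The outline above can be turned into a runnable sketch (illustrative code, not from the slides; the accumulator binning here is a simplistic assumption):

```python
import math
from collections import Counter

def hough_lines(edge_points, orientations, rho_res=1.0):
    """Accumulate votes in (theta, rho) space; each edge point votes
    once, using its gradient orientation as theta, as in the outline."""
    H = Counter()
    for (x, y), theta in zip(edge_points, orientations):
        rho = x * math.cos(theta) + y * math.sin(theta)
        # Crude binning: round theta and quantize rho
        H[(round(theta, 2), round(rho / rho_res))] += 1
    # The parameters with the most votes indicate the detected line
    return H.most_common(1)[0]

# Points on the vertical line x = 5; its gradient orientation is theta = 0
pts = [(5, y) for y in range(10)]
(theta, rho_bin), votes = hough_lines(pts, [0.0] * len(pts))
# All 10 points vote for theta = 0, rho = 5
```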

Hough transform for circles
Circle equation: (x_i - a)^2 + (y_i - b)^2 = r^2, i.e. x = a + r cos(θ), y = b + r sin(θ)
For every edge pixel (x, y):
  θ = gradient orientation at (x, y)
  For each possible radius value r:
    a = x - r cos(θ)
    b = y - r sin(θ)
    H[a, b, r] += 1
  end
end
Modified from Kristen Grauman
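The circle-voting loops can be sketched the same way (illustrative; rounding centers to integer bins is an assumption):

```python
import math
from collections import Counter

def hough_circles(edge_points, orientations, radii):
    """For every edge pixel, vote for the circle center (a, b) consistent
    with its gradient orientation, at each candidate radius."""
    H = Counter()
    for (x, y), theta in zip(edge_points, orientations):
        for r in radii:
            a = round(x - r * math.cos(theta))
            b = round(y - r * math.sin(theta))
            H[(a, b, r)] += 1
    return H.most_common(1)[0]

# 12 points on a circle of radius 3 centered at (10, 10); the gradient
# at the point parameterized by angle t points radially, so theta = t.
angles = [2 * math.pi * k / 12 for k in range(12)]
pts = [(10 + 3 * math.cos(t), 10 + 3 * math.sin(t)) for t in angles]
(a, b, r), votes = hough_circles(pts, angles, radii=[2, 3, 4])
# All 12 points agree on center (10, 10) at radius 3
```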

Hough transform for finding transformation parameters
(Figure: matched keypoints A1...A3 in image A and B1...B3 in image B.)
Given matched points in {A} and {B}, estimate the translation of the object:
x_i^B = x_i^A + t_x
y_i^B = y_i^A + t_y
Derek Hoiem

Hough transform for finding transformation parameters
(Figure: matches A1...A6 and B1...B6, with translation (tx, ty).)
Problem: outliers, multiple objects, and/or many-to-one matches.
Hough transform solution:
1. Initialize a grid of parameter values
2. Each matched pair casts a vote for consistent values
3. Find the parameters with the most votes
Adapted from Derek Hoiem

Support vector machines

Linear classifiers
Find a linear function to separate positive and negative examples:
x_i positive: x_i · w + b ≥ 0
x_i negative: x_i · w + b < 0
Which line is best?
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Support vector machines
Discriminative classifier based on the optimal separating line (for the 2d case).
Maximize the margin between the positive and negative training examples.
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Support vector machines
Want the line that maximizes the margin.
(Figure: lines wx + b = 1, wx + b = 0, wx + b = -1; the points on the margin are the support vectors.)
x_i positive (y_i = 1):  x_i · w + b ≥ 1
x_i negative (y_i = -1): x_i · w + b ≤ -1
For support vectors: x_i · w + b = ±1
Distance between point and line: |x_i · w + b| / ||w||
For support vectors, this distance is 1/||w||, so the margin is
M = 2 / ||w||
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Finding the maximum margin line
1. Maximize margin 2/||w||
2. Correctly classify all training data points:
x_i positive (y_i = 1):  x_i · w + b ≥ 1
x_i negative (y_i = -1): x_i · w + b ≤ -1
Quadratic optimization problem:
Objective: minimize (1/2) wᵀw
Constraints: subject to y_i (w · x_i + b) ≥ 1
One constraint for each training point. Note the sign trick: multiplying by y_i puts both constraints in the same form.
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Finding the maximum margin line
Solution: w = Σ_i α_i y_i x_i
(α_i: learned weight; x_i: support vector)
b = y_i - w · x_i (for any support vector)
Classification function:
f(x) = sign(w · x + b) = sign(Σ_i α_i y_i x_i · x + b)
If f(x) < 0, classify as negative; otherwise classify as positive.
Notice that it relies on an inner product between the test point x and the support vectors x_i.
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

The Kernel Trick
The linear classifier relies on the dot product between vectors: K(x_i, x_j) = x_i · x_j
If every data point is mapped into high-dimensional space via some transformation Φ: x_i -> Φ(x_i), the dot product becomes: K(x_i, x_j) = Φ(x_i) · Φ(x_j)

The kernel trick: instead of explicitly computing the lifting transformation Φ(x), define a kernel function K such that:
K(x_i, x_j) = Φ(x_i) · Φ(x_j)
Andrew Moore

Nonlinear SVMs
Datasets that are linearly separable work out great. But what if the dataset is just too hard?
We can map it to a higher-dimensional space, e.g. x -> (x, x^2).
Andrew Moore

Nonlinear kernel: Example
Consider the mapping Φ(x) = (x, x^2):
Φ(x) · Φ(y) = (x, x^2) · (y, y^2) = xy + x^2 y^2
K(x, y) = xy + x^2 y^2
Svetlana Lazebnik

Examples of kernel functions
Linear: K(x_i, x_j) = x_iᵀ x_j
Polynomials of degree up to d: K(x_i, x_j) = (x_iᵀ x_j + 1)^d
Gaussian RBF: K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2))
Histogram intersection: K(x_i, x_j) = Σ_k min(x_i(k), x_j(k))
Andrew Moore / Carlos Guestrin

Allowing misclassifications: Before
Objective: minimize (1/2) wᵀw (i.e. maximize the margin 2/||w||)
Constraints: y_i (w · x_i + b) ≥ 1

Allowing misclassifications: After
Objective: minimize (1/2) wᵀw + C Σ_{i=1}^{N} ξ_i
(maximize the margin and minimize misclassification; C = misclassification cost, N = # data samples, ξ_i = slack variable)
Constraints: y_i (w · x_i + b) ≥ 1 - ξ_i, with ξ_i ≥ 0

Deformable part models?

Zero-shot learning

Image classification: textual descriptions
Introduction: which image shows an aye-aye?
Description: the aye-aye...
- is nocturnal
- lives in trees
- has large eyes
- has long middle fingers
We can classify based on textual descriptions.

Thomas Mensink

Introduction: Zero-shot recognition
1. Vocabulary of attributes and class descriptions: aye-ayes have properties X and Y, but no Z.
2. Train classifiers for each attribute X, Y, Z, from visual examples of related classes.
3. Make attribute predictions for the image:
   P(X|img) = 0.8
   P(Y|img) = 0.3
   P(Z|img) = 0.6
4. Combine into a decision: this image is not an aye-aye.
Thomas Mensink

Attribute-based classification
DAP: Probabilistic model

Define the attribute probability:
p(a_m = a_m^z | x) = p(a_m | x) if a_m^z = 1, and 1 - p(a_m | x) otherwise.
Assign a given image to the class z that makes these attribute probabilities most likely.
Adapted from Thomas Mensink

Example
Cat attributes:  [1 0 0 1 0]
Bear attributes: [0 1 0 0 0]
Image X's probability of the attributes:

P(attribute_i = 1 | X) = [0.7 0.4 0.8 0.5 0.1]
Probability that class(X) = cat: 0.7 x (1 - 0.4) x (1 - 0.8) x 0.5 x (1 - 0.1) = 0.0378
Probability that class(X) = bear: (1 - 0.7) x 0.4 x (1 - 0.8) x (1 - 0.5) x (1 - 0.1) = 0.0108
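A sketch of the DAP computation for this example (illustrative code, not from the slides; it assumes the attribute predictions are independent, as in the DAP model):

```python
def dap_class_probability(class_attrs, attr_probs):
    """DAP: the probability of class z given image x is the product,
    over attributes m, of p(a_m|x) if the class has attribute m,
    and 1 - p(a_m|x) otherwise."""
    p = 1.0
    for has_attr, prob in zip(class_attrs, attr_probs):
        p *= prob if has_attr else (1.0 - prob)
    return p

attr_probs = [0.7, 0.4, 0.8, 0.5, 0.1]  # P(attribute_i = 1 | X)
p_cat = dap_class_probability([1, 0, 0, 1, 0], attr_probs)
p_bear = dap_class_probability([0, 1, 0, 0, 0], attr_probs)
# p_cat = 0.0378 > p_bear = 0.0108, so predict cat
```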