CS725 Machine Learning Lecture Notes

Download PDF

/ 116

100%

116

Document text

Lecture notes on CS725 : Machine learning

Contents

1 Lecture 1 : Introdcution to Machine Learning 6

2 Lecture 2 7

2.1 Solving Least Squares in General (for Linear models) . . . . . . . . . . . . . . . . . . . 7

3 Lecture 3 : Regression 10

3.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.2 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.3 Least square solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.4 Geometrical interpretation of least squares . . . . . . . . . . . . . . . . . . . . . . . . . 11

4 Lecture 4 : Least Squares Linear Regression 13

4.1 Least Square Linear Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . . 13

4.2 Level Curves and Surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4.3 Gradient Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.4 Directional Derivative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.5 Hyperplane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4.6 Tangential Hyperplane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4.7 Gradient Descent Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4.8 Local Minimum and Local Maximum . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

5 Lecture 5 : Convex functions 19

5.1 Recap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

5.2 Point 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

5.3 Point 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

5.4 Point 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

5.5 Point 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

5.5.1 Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

5.5.2 Next problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

5.6 Point 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

6 Lecture 6 : Regularized Solution to Regression Problem 24

6.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

6.2 Duality and KKT conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

6.3 Bound on λ in the regularized least square solution . . . . . . . . . . . . . . . . . . . 26

6.4 RMS Error variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

6.5 Alternative objective function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

6.6 A review of probability theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

6.6.1 The three axioms of probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

6.6.2 Bayes’ theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

6.6.3 Independent events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

7 Lecture 7 : Probability 31

7.1 Note . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

7.2 Part of speech(pos) example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

7.3 Probability mass function(pmf) and probability density function(pdf) . . . . . . . . . 31

7.3.1 Joint distribution function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

7.3.2 Marginalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

7.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

7.5 Conditional Density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

7.6 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

7.6.1 Properties of E(x) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

7.7 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

7.8 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

7.8.1 Properties of Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

7.9 Chebyshev’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

7.10 Bernoulli Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

7.11 Binomial Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

7.12 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

7.13 Maximum Likelihood and Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

7.14 Bayesian estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

8 Lecture 8 38

8.1 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

8.2 Bayesian Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

9 Lecture 9 : Multinomial Distribution 42

9.0.1 Posterior probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

9.0.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

9.1 Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

9.1.1 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

9.1.2 Expectation for I(X=x): . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

9.1.3 Observations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

9.1.4 Properties of gaussian univariate distribution . . . . . . . . . . . . . . . . . . 45

10 Lecture 10 : Multivariate Gaussian Distribution 47

10.1 Multivariate Gaussian Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

10.1.1 Unbiased Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

10.2 Dealing with Conjugate Priors for Multivariate Gaussian . . . . . . . . . . . . . . . . 50

11 Lecture 11 52

11.1 Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

11.2 Bayes Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

11.3 Pure Bayesian - Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

11.4 Sufficient Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

11.5 Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

12 Lecture 12 : Bias-Variance tradeoff 56

12.1 Expected Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

13 Lecture 13 59

13.1 Conclude Bias-Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

13.1.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

13.1.2 Bayesian Linear Regression(BLR) . . . . . . . . . . . . . . . . . . . . . . . . . 59

13.1.3 General Problems with Standard Distribution . . . . . . . . . . . . . . . . . . 61

13.2 Emperical Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

13.2.1 First Approach: Approximate the posterior . . . . . . . . . . . . . . . . . . . . 64

13.2.2 Second Approach: Emperical Bayes . . . . . . . . . . . . . . . . . . . . . . . . 65

13.2.3 Sove the eigenvalue equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

14 Lecture 14 : Introduction to Classification 66

15 Lecture 15: Linear Models for Classification 67

15.1 Generalized linear models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

15.2 Three broad types of classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

15.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

15.3 Handling Multiclasses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

15.3.1 Avoiding ambiguities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

15.4 Least Squares approach for classification . . . . . . . . . . . . . . . . . . . . . . . . . 70

15.4.1 Limitations of Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

16 Lecture 16 72

16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

16.2 Problems of linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

16.2.1 Sensitivity to outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

16.2.2 Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

16.3 Possible solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

16.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

17 Lecture 17 77

18 Lecture 18:Perceptron 78

18.1 Fisher’s discriminant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

18.2 Perceptron training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

18.2.1 Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

19 Lecture 19 82

19.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

19.2 Margin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

19.3 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

19.4 Support Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

19.5 Objective Design in SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

19.5.1 Step 1:Perfect Separability . . . . . . . . . . . . . . . . . . . . . . . . . . 84

19.5.2 Step 2:Optimal Separating Hyperplane For Perfectly Separable Data 84

19.5.3 Step 2:Separating Hyperplane For Overlapping Data . . . . . . . . . 85

20 Lecture 20: Support Vector Machines (SVM) 88

20.1 Recap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

20.2 Distance between the points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

20.3 Formulation of the optimization problem . . . . . . . . . . . . . . . . . . . . . . . . . 89

20.4 Soft Margin SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

20.4.1 Three types of g points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

20.5 Primal and Dual Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

20.5.1 Primal formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

20.5.2 Dual Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

20.6 Duality theory applied to KKT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

21 Lecture 21:The SVM dual 94

21.1 SVM dual . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

21.2 Kernel Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

21.2.1 Generation of φ space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

21.3 Requirements of Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

21.3.1 Examples of Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

21.4 Properties of Kernel Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

22 Lecture 22: SVR and Optimization Techniques 98

22.1 Other occurance of kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

22.1.1 Some variants of SVM’s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

22.2 Support Vector Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

22.3 L1 SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

22.4 Kernel Adatron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

23 Lecture 23 104

23.1 Sequential minimization algorithm - SMO . . . . . . . . . . . . . . . . . . . . . . . . 105

23.2 Probablistic models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

24 Lecture 24: Prob. Classifiers 107

24.1 Non Parametric Density Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

24.2 Parametric Density Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

25 Lecture 25 112

25.1 Exponential Family Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

25.2 Discrete Feature Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

25.3 Naive Bayes Assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

25.4 Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

25.5 Graphical Representation of Naive Bayes . . . . . . . . . . . . . . . . . . . . . 114

25.6 Graph Factorisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

25.7 Naive Bayes Text Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

1 Lecture 1 : Introdcution to Machine Learning

This lecture was an introduction to machine learning. ...

2 Lecture 2

2.1 Solving Least Squares in General (for Linear models)

Xplot = [0, 0.1, ..., 2] Curve

Why noise ?

– Since observations are not perfect

– Owing to quantization / precision / rounding

Curve Fitting

Learn f : X → Y such that E(f, X, Y1) is minimized. Here the error function E and form of the

function to learn f is chosen by the modeler.

Consider one such form of f,

f(x) = w0 + w1x + w2x

2 + ... + wtx

The sum of squares error is given by,

E =

i=1

(f(xi) − yi)

So the expression is,

argmin

w=[w1,w2,...wt]

i=1

[(w0 + w1x + w2x

2 + ... + wtx

) − y1(i)]2

If there are m data points, then a polynomial of degree m − 1 can exactly fit the data, since the

polynomial has m degrees of freedom (where degrees of freedom=no. of coefficients)

As the degree of the polynomial increases beyond m, the curve becomes more and more wobbly,

while still passing through the points. Contrast the degree 10 fit in Figure 2.1 against the degree 5

fit in Figure 2.1. This is due to the problem of overfitting (overspecification)

Now E is a convex function. To optimize it, we need to set ∇wE = 0. The ∇ operator is also

called gradient.

Solution is given by,

X = (φ

tφ)

−1φ

If m << t then

φ becomes singular and the solution cannot be found OR

The column vectors in φ become nearly linearly dependent

RMS (root mean sqare) error is given by :

RMS =

Lecture 2

Figure 1: Fit for degree 5 polynomial.

Generally, some test data (which potentially could have been part of the training data) is held

out for evaluating the generalized performance of the model. Another held out fraction of the

training data, called the validation dataset is typically used to find the most appropriate degree

tbest for f.

Lecture 2

Figure 2: Fit for degree 10 polynomial. Note how wobbly this fit is.

3 Lecture 3 : Regression

This lecture was about regression. It started with formally defining a regression problem. Then a

simple regression model called linear regression was discussed. Different methods for learning the

parameters in the model were next discussed. It also covered least square solution for the problem

and its geometrical interpretation.

3.1 Regression

Suppose there are two sets of variables x ∈ <n and y ∈ <k

such that x is independent and y

is dependant. The regression problem is concerned with determining y in terms of x. Let us

assume that we are given m data points D = hx1, y1i,hx2, y2i, ..,hxm, ymi. Then the problem is

to determine a function f

∗

such that f

∗

(x) is the best predictor for y, with respect to D. Suppose

ε(f, D) is an error function, designed to reflect the discrepancy between the predicted value f(x

)

of y

0 and the actual value y

for any hx

, y

i ∈ D, then

∗ = argmin

f∈F

ε(f, D) (1)

where, F denotes the class of functions over which the optimization is performed.

3.2 Linear regression

Depending on the function class we consider, there are many types of regression problems. In Linear

regression we consider only linear functions, functions that are linear in the basis function. Here

F is of the form {

i=1 wiφi(x)}. φi

: R

n → R

k Here, the φi

’s are called the basis functions (for

example, we can consider φi(x) = x

, i.e., polynomial basis functions) .

Any function in F is characterized by its parameters, the wi

’s. Thus, in (1) we have to find

f(w∗

) where

w∗ = argmin

ε(w, D)

3.3 Least square solution

The error function ε plays a major role in the accuracy and tractability of the optimization problem.

The error function is also called the loss function. The squared loss is a commonly used loss

function. It is the sum of squares of the differences between the actual value and the predicted

value.

ε(f, D) = X

hxi,yii∈D

(f(xi) − yi)

So the least square solution for linear regression is given by

w∗ = argmin

j=1

i=1

(wiφi(xj ) − yj

The minimum value of the squared loss is zero. Is it possible to achieve this value ? In other

words is ∀j, Xp

i=1

wiφi(xj ) = yj possible ?

3.4 Geometrical interpretation of least squares

C(φ)

yˆ

Figure 3: Least square solution ˆy is the orthogonal projection of y onto column space of φ

The above equality can be written as ∀u, φT

(xu)w = yu

or equivalently φw = y where

φ =







φ1(x1) ··· φp(x1)

φ1(xm) ··· φp(xm)







and y =













It has a solution if y is in the column space (the subspace of R

n formed by the column vectors)

of φ. It is possible that there exists no w which satisfies the conditions? In such situations we can

solve the least square problem.

3.4 Geometrical interpretation of least squares

Let yˆ be a solution in the column space of φ. The least squares solution is such that the distance

between yˆ and y is minimized. From the diagram it is clear that for the distance to be minimized,

the line joining yˆ to y should be orthogonal to the column space. This can be summarized as

1. φw = yˆ

2. ∀v ∈ {1, ..p}, (y − yˆ)

T φv = 0 or (yˆ − y)

T φ = 0

yˆ

T φ = y

T φ

ie, (φw)

T φ = y

T φ

ie, wT φ

T φ = y

T φ

ie, φT φw = φ

T y

∴ w = (φ

T φ)

−1y

In the last step, please note that, φ

T φ is invertible only if φ has full column rank.

4 Lecture 4 : Least Squares Linear Regression

In this lecture we discussed how to minimize the error function ε(w, D) that we used for the

least square linear regression model in the last lecture. To do this, some basic concepts regarding

minimization of a function were discussed and we applied these to our error function.

4.1 Least Square Linear Regression Model

In the least squares regression model, we determine the value of w for which our error function ε

attains the minimum value. Here, D =< x1, y1 >, < x2, y2 >, .., < xm, ym > is the training data

set, and φi

’s are the basis functions.

w∗ = argmin

j=1

(f (xj , w) − yj )

= argmin

j=1 Xp

i=1

wiφi(xj ) − yj

φ =







φ1(x1) φ2(x1) ...... φp(x1)

φ1(xm) φ2(xm) ...... φp(xm)







y =













w =













ε = min

j=1

(xj )w − yj

= min

||φw − y||2

= min

(φw − y)

= min

wT φ

T φw − 2y

T φw + y

T y

(2)

4 Lecture 4 : Least Squares Linear Regression

How to minimize a function?

Following are some basic concepts which help in minimizing or maximizing a function:

4.2 Level Curves and Surfaces

A level curve of a function f(x) is defined as a curve along which the value of the function remains

unchanged while we change the value of it’s argument x. Note that there can be as many level

curves for any function as the number of different values it can attain.

Figure 4: 10 level curves for the function f(x1, x2) = x1e

(Figure 4.12 from [1])

Level surfaces are similarly defined for any n-dimensional function f(x1, x2, ..., xn) as a collection

of points in the argument space on which the value of the function is same at all points while we

change the argument values.

Formally we can define a level curve as :

Lc(f) =

x|f(x) = c

where c is a constant.

Refer to Fig. 4.15 in class notes [1] for example.

4 Lecture 4 : Least Squares Linear Regression

Figure 5: 3 level surfaces for the function f(x1, x2, x3) = x

1 +x

2 +x

3 with c = 1, 3, 5. The gradient

at (1, 1, 1) is orthogonal to the level surface f(x1, x2, x3) = x

1 + x

2 + x

3 = 3 at (1, 1, 1) (Fig. 4.14

from [1]).

4.3 Gradient Vector

The gradient vector of a function f at a point x is defined as follows:

∇fx∗ =







∂f(x)

∂x1

∂f(x)

∂x2

∂f(x)

∂xn







The direction of the gradient vector gives the direction of maximum rate of change of the value of

the function at a point. Also the magnitude of the gradient vector gives that maximum value of

rate of change.

Refer to Definition 23 in the class notes [1] for more details.

4.4 Directional Derivative

Directional Derivative gives the rate of change of the function value in a given direction at a point.

The directional derivative of a function f in the direction of a unit vector v at a point x can be

defined as :

Dv(f) = lim

h→0

f(x + hv) − f(x)

||v|| = 1

Note: The maximum value of directional derivative of a function f at any point is always the

magnitude of it’s gradient vector at that point.

4 Lecture 4 : Least Squares Linear Regression

Figure 6: The level curves from Figure 4 along with the gradient vector at (2, 0). Note that the

gradient vector is perpenducular to the level curve x1e

x2 = 2 at (2, 0) (Figure 4.13 from [1])

Refer to Definition 22 and Theorem 58 in the class notes [1] for more details.

4.5 Hyperplane

A Hyperplane is a set of points whose direction w.r.t. a point p is orthogonal to a vector v. It can

be formally defined as :

Hv,p =

q | (p − q)

T v = 0

4.6 Tangential Hyperplane

There are two definitions of tangential hyperplane (T Hx∗ ) to level surface (Lf(x∗)(f)) of f at x

∗

1. Plane consisting of all tangent lines at x

∗

to any parametric curve c(t) on level surface.

2. Plane orthogonal to the gradient vector at x

∗

T Hx∗ =

p | (p − x

∗

)

T ∇f(x

∗

) = 0

Note: By definition, T Hx∗ is n − 1 dimensional.

Refer to Definition 24 and Theorem 59 in class notes [1] for more details.

4.7 Gradient Descent Algorithm

Gradient Descent Algorithm is used to find minimum value attained by a real valued function f(x).

We first start at an intial point x

(0) and make a sequence of steps proportional to negative of

gradient of the function at the point. Finally we stop at a point x

(∗) where a desired convergence

4 Lecture 4 : Least Squares Linear Regression 17

criterion (see notes on Convex Optimization) will be attained.

The idea of gradient descent algorithm is based on the fact that if a real-valued function f(x) is

defined and differentiable at a point x

, then f(x) decreases fastest when you move in the direction

of the negative gradient of the function at that point, which is −∇f(x).

Here we describe the method of Gradient Descent Algorithm to find the parameter vector w

which minimizes the error function, ε(w, D)

∆w(k) = − ∇ε(w

(k)

) from equation (2)

= −∇(wT φ

T φw − 2y

T φw + y

T y)

= −(2φ

T φw − 2y

T φ + 0)

= 2(φ

T y − φ

T φw)

so we got

w(k+1) = w(k) + 2t

(k)

(φ

T y − φ

T φw(k)

)

Gradient Descent Algorithm :

Find starting point w(0)D

repeat

1. ∆wk = −∇ε(w(k)

)

2. Choose a step size t

(k) > 0 using exact or backtracking ray search.

3. Obtain w(k+1) = w(k) + t

(k)∆w(k)

4. Set k = k + 1.

until stopping criterion (such as ∇ε(x

(k+1)) ≤ ) is satisfied

Exact Line Search Algorithm :

(k) = argmin

w(k) + 2t

T y − φ

T φw(k)

In general

(k) = argmin

w(k+1)

Refer to section 4.5.1 in the class notes [1] for more details.

4.8 Local Minimum and Local Maximum

Critical Point : x is a called a critical point w.r.t to a function f, if ∇f(x) = 0 i.e. the gradient

vanishes at x or the gradient fails to exist at x.

Local Minimum (or Maximum):

If ∇f(x

∗

) = 0 then x

∗

can be a point of local minimum (or maximum). [Neccessary Condition]

If ∇2f(x

∗

) is positive (negative) definite then x

∗

is a point of local minimum (maximum). [Sufficient Condition]

Note: ∇2f(x

∗

) is positive definite means :

∀x 6= 0 xT ∇2

f(x

∗

)x > 0

λi(∇2

f(x

∗

)) > 0

i.e. matrix eigen values are positive.

Refer to definition 27, theorem 61 and fig. 4.23, 4.24 in the class notes [1] for more details.

Figure 7: Plot of f(x1, x2) = 3x

1 − x

1 − 2x

2 + x

, showing the various local maxima and minima

of the function (fig. 4.16 from [1])

5 Lecture 5 : Convex functions

In this lecture the concepts of convex sets and functions were introduced.

5.1 Recap

We recall that the problem was to find w such that

w∗ = argmin

||φw − y||2

(3)

= argminw(wT φ

T φw − 2wT φy − y

T y) (4)

5.2 Point 1

If ∇f(x

∗

) is defined & x

∗

is local minimum/maximum, then ∇f(x

∗

) = 0

(A necessary condition) (Cite : Theorem 60)[2]

Given that

f(w) = argmin

(wT φ

T φw − 2wT φy − y

T y) (5)

=⇒ ∇f(w) = 2φ

T φw − 2φ

T y (6)

we would have

∇f(w∗

) = 0 (7)

=⇒ 2φ

T φw∗ − 2φ

T y = 0 (8)

=⇒ w∗ = (φ

T φ)

−1φ

T y (9)

5.3 Point 2

Is ∇2f(w∗

) positive definite ?

i.e. ∀x 6= 0, is x

T ∇f(w∗

)x > 0? (A sufficient condition for local minimum)

(Note : Any positive definite matrix is also positive semi-definite)

(Cite : Section 3.12 & 3.12.1)[3]

∇2

f(w∗

) = 2φ

T φ (10)

=⇒ x

T ∇2f(w∗

)x = 2x

T φ

T φx (11)

= 2(φx)

T φx (12)

= 2 ||φx||2 ≥ 0 (13)

And if φ has full column rank ,

φx = 0 iff x = 0 (14)

5 LECTURE 5 : CONVEX FUNCTIONS 20

∴ If x 6= 0, x

T ∇2f(w∗

)x > 0

Example where φ doesn’t have a full column rank,

φ =







x1 x

1 x

x2 x

2 x

xn x

n x







(15)

This is the simplest form of linear correlation of features, and it is not at all desirable.

5.4 Point 3

Definition of convex sets and convex functions (Cite : Definition 32 and 35)[2]

Figure 8: A sample convex function

∴ f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) (16)

Some convex functions : (Cite : Table 4.1, pg-54)[2]

To prove : Verify that a hyperplane is a convex set.

Proof :

5 LECTURE 5 : CONVEX FUNCTIONS 21

A Hyperplane H is defined as {x|a

T x = b, a 6= 0}. Let x and y be vectors that belong to the

hyperplane. Since they belong to the hyperplane, a

T x = b and a

T y = b. In order to prove the

convexity of the set we must show that :

θx + (1 − θ)y ∈ H, where θ ∈ [0, 1] (17)

In particular, it will belong to the hyperplane if it’s true that :

(θx + (1 − θ)y) = b (18)

=⇒ a

θx + a

(1 − θ)y = b (19)

=⇒ θa

T x + (1 − θ)a

T y = b (20)

(21)

And, we also have a

T x = b and a

T y = b. Hence θb + (1 − θ)b = b. [Hence Proved]

So a hyperplane is a convex set.

Q. Is ||φw − y||2

convex?

A. To check this, we have (Cite : Theorem 75)[2] but it is not very practical. We would use

(Cite : Theorem 79)[2] to check for the convexity of our function. So the condition that has

our focus is -

∇2

f(w∗

) is positive semi − def inite, if ∀x 6= 0, x

T ∇2f(w∗

)x ≥ 0 (22)

We have,

∇2

f(w) = 2φ

T φ (23)

So, ||φw − y||2

is convex, since the domain for w is R

n and is convex.

Q. Is ||φw − y||2

strictly convex?

A. Iff φ has full column rank.

(Side note : Weka1

)

5.5 Point 4

To prove: If a function is convex, any point of local minima ≡ point of global minima

Proof - (Cite : Theorem 69)[2]

To prove : If a function is strictly convex, it has a unique point of global minima

Proof - (Cite : Theorem 70)[2]

Since ||φw − y||2

is strictly convex for linearly independent φ,

∇f(w∗

) = 0 for w∗ = (φ

T φ)

−1φ

T y (24)

1http://www.cs.waikato.ac.nz/ml/weka/

5 LECTURE 5 : CONVEX FUNCTIONS 22

Thus, w∗

is a point of global minimum. One can also find a solution to (φ

T φw = φ

T y) by Gauss

elimination.

5.5.1 Overfitting

Figure 9: train-RMS and test-RMS values vs t(degree of polynomial) graph

Too many bends (t=9 onwards) in curve ≡ high values of some w

Train and test errors differ significantly

5.5.2 Next problem

Find

w∗ = argminw ||φw − y||2

s.t. ||w||p ≤ ζ, (25)

where

||w||p =

i=1

|wi

(26)

5.6 Point 5

Q. How to solve constrained problems of the above-mentioned type?

A. General problem format :

M inimize f(w) s.t. g(w) ≤ 0 (27)

5 LECTURE 5 : CONVEX FUNCTIONS 23

Figure 10: p-Norm curves for constant norm value and different p

Figure 11: Level curves and constraint regions

At the point of optimality,

Either g(w∗

) < 0 & ∇f(w∗

) = 0 (28)

Or g(w∗

) = 0 & ∇f(w∗

) = α∇g(w∗

) (29)

If w∗

is on the border of g, i.e., g(w∗

) = 0,

∇f(w∗

) = α∇g(w∗

) (30)

(Duality Theory) (Cite : Section 4.4, pg-72)[2]

Intuition: If the above didn’t hold, then we would have ∇f(w∗

) = α1∇g(w∗

)+α2∇⊥g(w∗

), where

by moving in direction ±∇⊥g(w∗

), we remain on boundary g(w∗

) = 0, while decreasing/increasing

value of f, which is not possible at the point of optimality.

6 Lecture 6 : Regularized Solution to Regression Problem

In last lecture, we derived solution for the regression problem formulated in least-squares sense

which was aimed at minimizing rms error over observed data points. We also analysed conditions

under which the obtained solution was guaranteed to be a global minima. However, as we observed,

increasing the order of the model yielded larger rms error over test data, which was due to large

fluctuations in model learnt and consequently due to very high values of model coefficients (weights).

In this lecture, we discuss how the optimization problem can be modified to counter very large

magnitudes of coefficients. Subsequently, solution of this problem is provided through lagrange

dual formulation followed by discussion over obtained solution and impact over test data. Towards

the end of the lecture, a very gentle introduction to axiomatic probability is provided.

6.1 Problem formulation

In order to cease coefficients from becoming too large in magnitude, we may modify the problem

to be a constrained optimization problem. Intuitively, for achieving this criterion, we may impose

constraint on magnitude of coefficients. Any norm for this purpose might give good working solution. However, for mathematical convenience, we start with the euclidean (L2) norm. The overall

problem with objective function and constraint goes as follows:

minimize

(Φw − Y )

such that ||w||2

2 ≤ ξ

(31)

As observed in last lecture, the objective function, namely f (w) = (Φw−Y )

(Φw−Y ) is strictly

convex. Further to this, the constraint function, g(w) =k w k

2 −ξ, is also a convex function. For

convex g(w), the set S = {w|g(w) ≤ 0}, can be proved to be a convex set by taking two elements

w1 ∈ S and w2 ∈ S such that g(w1) ≤ 0 and g(w2) ≤ 0. Since g(w) is a convex function, we have

the following inequality:

g(θw1 + (1 − θ)w2) ≤ θg(w1) + (1 − θ)g(w2)

≤ 0; ∀θ ∈ [0, 1], w1, w2 ∈ S

(32)

As g(θw1 + (1 − θ)w2) ≤ 0; ∀θ ∈ S, ∀w1, w2 ∈ S, θw1 + (1 − θ)w2 ∈ S, which is both sufficient

and necessary for S to be a convex set. Hence, function g(w) imposes a convex constraint over the

solution space.

6.2 Duality and KKT conditions

Given convex objective and constraint functions, minima, w∗

, can occur in one of the following two

ways:

1. g(w∗

) = 0 and Of(w∗

) = αOg(w∗

)

2. g(w∗

) < 0 and Of(w∗

) = 0

This fact might be easily visualized from Figure 1. As we can see, the first condition occurs when

minima lies on the boundary of function g. In this case, gradient vectors corresponding to function

6 LECTURE 6 : REGULARIZED SOLUTION TO REGRESSION PROBLEM 25

Figure 12: Two plausible scenario for minima to occur: a) When minima is on constraint function

boundary, in which case gradients point in the same direction upto constant and b) When minima

is inside the constraint space (shown in yellow shade), in which case Of(w∗

) = 0.

f and function g, at w∗

, point in the same direction barring multiplication by a constant α ∈ R.

Second condition depicts the case when minima lies inside the constraint space (interior of epigraph

of function g). This space is shown shaded in Figure 1. Clearly, for this case Of(w∗

) = 0 for minima

to occur. This primal problem can be converted to dual using lagrange multiplier. According to

which, we can convert this problem to objective function augmented by weighted sum of constraint

functions in order to get corresponding lagrangian. Such lagrangian might be depicted in the

following manner:

L(w, λ) = f(w) + λg(w); λ ∈ R

(33)

Here, we wish to penalize higher magnitude coefficients, hence, we wish g(w) to be negative while

minimizing the lagrangian. In order to maintain such direction, we must have λ ≥ 0. Also, for

solution w∗

to be feasible, Og(w∗

) ≤ 0. Due to complementary slackness condition, we further have

λg(w∗

) = 0, which roughly suggests that lagrange multiplier is zero unless constraint is active at

minimum point. As w∗ minimizes lagrangian L(w, λ), gradient must vanish at this point and hence

we have Of(w∗

) + λOg(w∗

) = 0. In general, optimization problem with inequality and equality

constraints might be depicted in the following manner:

minimizef(w)

subject to gi(w) ≤ 0;i = 1, 2, . . . , m

hj (w) = 0; j = 1, 2, . . . , p

(34)

Here, w ∈ R

n and domain is intersection of all functions. Lagrangian for such case might be

depicted in the following manner:

L(w, λ, µ) = f(w) +Xm

i=1

λigi(w) +Xp

j=1

µjhj (w)

(35)

6 LECTURE 6 : REGULARIZED SOLUTION TO REGRESSION PROBLEM 26

Lagrange dual function is given as the minimum value of the lagrangian over λ ∈ R

m, µ ∈ R

. Such

function might be given in the following manner:

z(λ, µ) = inf

L(w, λ, µ)

(36)

The dual function always yields lower bound for minimizer of the primal formulation. Such dual

function is used in characterization of a dual gap, which depicts suboptimality of the solution.

Duality gap may be denoted as the gap between primal and dual objectives, f(w) − z(λ, µ). When

functions f and gi

, ∀i ∈ [1, m] are convex and hj , ∀j ∈ [1, p] are affine, Karush-Kuhn-Tucker (KKT)

conditions are both necessary and sufficient for points to be both primal and dual optimal with zero

duality gap. For above mentioned formulation of the problem, KKT conditions for all differentiable

functions (i.e. f, gi

, hj ) with wˆ primal optimal and (λ, ˆ µˆ) dual optimal point may be given in the

following manner:

1. Of(wˆ ) + Pm

i=1 λˆ

iOgi(wˆ ) + Pp

j=1 µˆjOhj (wˆ ) = 0

2. gi(wˆ ) ≤ 0;i = 1, 2, . . . , m

3. λˆ

i ≥ 0;i = 1, 2, . . . , m

4. λˆ

igi(wˆ ) = 0;i = 1, 2, . . . , m

5. hj (wˆ ) = 0; j = 1, 2, . . . , p

6.3 Bound on λ in the regularized least square solution

As discussed earlier, we need to minimize the error function subject to constraint w≤ ξ . Applying

KKT conditions to this problem, if w∗

is a global optimum then from the first KKT condition we

get,

∇w∗ (f(w) + λg(w)) = 0 (37)

(38)

where, f (w) = (Φw − Y )

(Φw − Y ) and g(w) = kwk

2 − ξ

Solving we get,

2(ΦT Φ)w∗ − 2ΦT − 2λw∗ = 0

i.e.

w∗ = (ΦT Φ + λI)

−1Φ

T y

(39)

From the second KKT condition we get,

kw∗

2 ≤ ξ

(40)

From the third KKT condition,

λ ≥ 0

(41)

From the fourth condition

λkw∗

2 = λξ

(42)

6 LECTURE 6 : REGULARIZED SOLUTION TO REGRESSION PROBLEM 27

Thus values of w∗ and λ which satisfy all these equations would yield an optimum solution.

Consider equation (39),

w∗ = (ΦT Φ + λI)

−1Φ

T y

Premultiplying with (ΦT Φ + λI) on both sides we have,

(ΦT Φ + λI)w∗ = ΦT y

∴ (ΦT Φ)w∗ + (λI)w∗ = ΦT y

∴ k(ΦT Φ)w∗ + (λI)w∗

k = kΦ

T yk

By triangle inequality,

k(ΦT Φ)w∗

k + (λ)kw∗

k ≥ k(ΦT Φ)w∗ + (λI)w∗

k = kΦ

T yk (43)

Now , (ΦT Φ) is a nxn matrix which can be determined as Φ is known .

k(ΦT Φ)w∗k ≤ αkw∗k for some α for finite |(ΦT Φ)w∗k. Substituting in previous equation,

(α + λ)kw∗

k ≥ kΦ

T yk

i.e.

λ ≥

kΦ

T yk

kw∗k

− α (44)

Note that when kw∗k → 0, λ → ∞. This is obvious as higher value of λ would focus more on

reducing value of kw∗k than on minimizing the error function.

kw∗

2 ≤ ξ

Eliminating kw∗k from the equation (14) we get,

∴ λ ≥

kΦ

T yk

√

− α (45)

This is not the exact solution of λ but the bound (15) proves the existance of λ for some ξ and Φ.

6.4 RMS Error variation

Recall the polynomial curve fitting problem we considered in earlier lectures. Figure 2 shows RMS

error variation as the degree of polynomial (assumed to fit the points) is increased. We observe

that as the degree of polynomial is increased till 5 both train and test errors decrease. For degree

> 7, test error shoots up. This is attributed to the overfitting problem (The datasize for train set

is 8 points.)

Now see Figure 3 where variation in RMS error and Lagrange multiplier λ has been explored

(keeping the polynomial degree constant at 6). Given this analysis, what is the optimum value of λ

that must be chosen? We have to choose that value for which the test error is minimum (Identified

as optimum in the figure.).

6.5 Alternative objective function

Consider equation (37). If we substitute g(w) = kwk

2 − ξ, we get

∇w∗ (f(w) + λ · (kwk

2 − ξ)) = 0 (46)

This is equivalent to finding

min(k Φw − y k

2 +λ k w k

) (47)

6 LECTURE 6 : REGULARIZED SOLUTION TO REGRESSION PROBLEM 28

Figure 13: RMS error Vs degree of polynomial for test and train data.

For same λ these two solutions are the same.This form or regression is known as Ridge regression.

If we use L1 norm then it’s called as ’Lasso’. Note that w∗

form that we had derived is valid only

for L2 norm.

6.6 A review of probability theory

Let’s now review some basics of the probability theory. More details will be covered in the next

lecture.

Definition 1. Sample space (S) : A sample space is defined as a set of all possible outcomes of an experiment. Example of an experiment would be a coin pair toss. In this case S = {HH, HT, TH, TT}.

Definition 2. Event (E) : An event is defined as any subset of the sample space. Total number of

distinct events possible is 2

S, where S is the number of elements in the sample space. For a coin

pair toss experiment some examples of events could be

for at least one head, E = {HH, HT}

for all tails, E = {T T}

for either a head or a tail or both, E = {HH, HT, T H, T T}

Definition 3. Random variable (X) : A random variable is a mapping (or function) from set of

events to a set of real numbers. Continuous random variable is defined thus

X : 2S → R

On the other hand a discrete random variable maps events to a countable set (e.g. discrete real

numbers)

X : 2S → Discrete R

6 LECTURE 6 : REGULARIZED SOLUTION TO REGRESSION PROBLEM 29

Figure 14: RMS error Vs 10λ

for test and train data (at Polynomial degree = 6).

6.6.1 The three axioms of probability

Probability P r is a number corresponding to events . It satisfies following three axioms,

Axiom 1. For every event E, P r(E) ∈ [0, 1]

Axiom 2. P r(S) = 1 (Equivalently, P(∅) = 0)

Axiom 3. If E1, E2, . . . , En is a set of pairwise disjoint events, then

P r(

i=1

Ei) = Xn

i=1

P r(Ei)

6.6.2 Bayes’ theorem

Let B1, B2, ..., Bn be a set of mutually exclusive events that together form the sample space S. Let

A be any event from the same sample space, such that P(A) > 0. Then,

P r(Bi/A) = P r(Bi ∩ A)

P r(B1 ∩ A) + P r(B2 ∩ A) + · · · + P r(Bn ∩ A)

(48)

Using the relation P(Bi ∩ A) = P(Bi) · P(A/Bi)

P r(Bi/A) = P r(Bi) · P r(A/Bi)

j=1 P r(Bj ) · P r(A/Bj )

(49)

Example 1. A lab test is 99% effective in detecting a disease when in fact it is present. However,

the test also yields a false positive for 0.5% of the healthy patients tested. If 1% of the population

has that disease, then what is the probability that a person has the disease given that his/her test is

positive?

6 LECTURE 6 : REGULARIZED SOLUTION TO REGRESSION PROBLEM 30

Solution 1. Let, H be the event that a tested person is actually healthy.

D be the event that a tested person does have the disease.

T be the event that the test comes out positive for a person.

We want to find out P r(D/T)

H and D are disjoint events. Together they form the sample space.

Using Bayes’ theorem,

P(D/T) = P r(D) · P r(T /D)

P r(D) · P r(T /D) + P r(H) · P r(T /H)

(50)

Now, Pr(D) = 0.01 (Given)

Since Pr(D)+Pr(H)=1, Pr(H)=0.99

The lab test is 99% effective when the disease is present. Hence, Pr(T/D)=0.99

There is 0.5% chance that the test will give false positive for a healthy person. Hence, Pr(T/H)=0.005

Plugging these values in equation (50) we get,

P r(D/T) = 0.01 ∗ 0.99

0.01 ∗ 0.99 + 0.99 ∗ 0.005

What does this mean? It means that there is 66.66% chance that a person with positive test

results is actually having the disease. For a test to be good we would have expected higher certainty.

So, despite the fact that the test is 99% effective for a person actually having the disease, the false

positives reduce the overall usefulness of the test.

6.6.3 Independent events

Two events E1 and E2 are called independent iff their probabilities satisfy

P(E1 E2) = P(E1) · P(E2) (51)

where P(E1 E2)meansP(E1 ∩ E2)

In general, events belonging to a set are called as mutually independent iff, for every finite

subset, E1, · · · , En, of this set

P r(

i=1

Ei) = Yn

i=1

P r(Ei) (52)

7 Lecture 7 : Probability

This lecture gives an overview of the probability theory. It discusses distribution functions; notion

of expectation, variance; bernoulli and binomial random variables; and central limit theorem

7.1 Note

Pr - probability in general of an event

F - cumulative distribution function

p - probability distribution function(pdf) or probability mass function(pmf)

pdf – continuous random variable case

pmf – discrete random variable case

7.2 Part of speech(pos) example

Problem Statement:-

A set of ’n’ words, each of a particular part of speech(noun/verb/etc) is picked. Probability that a

word is of part of speech type ’k’ is pk. Assuming the picking of words is done independently, find

probability that the set contains a ’noun’ given that it contains a ’verb’.

Solution

Let Ak be the probability that the set contains pos type ’k’.

P r(Ak) = 1 − (1 − pk)

where (1 − pk)

n is that all ’n’ words are not of pos of type ’k’.

P r(Anoun/Averb) = P r(Anoun T

Averb)

P r(Averb)

P r(Ak1

Ak2) = 1 − (1 − pk1)

n − (1 − pk2)

n + (1 − pk1 − pk2)

P r(Anoun/Averb) = 1−(1−pnoun)

n−(1−pverb)

n+(1−pnoun−pverb)

1−(1−pverb)n

7.3 Probability mass function(pmf) and probability density function(pdf)

pmf :- It is a function that gives the probability that a discrete random variable is exactly equal to

some value(Src: wiki).

pX(a) = P r(X = a)

Cumulative distribution function(Discrete case)

F(a) = P r(X <= a)

pdf :- A probability density function of a continuous random variable is a function that describes the

relative likelihood for this random variable to occur at a given point in the observation space(Src:

7 LECTURE 7 : PROBABILITY 32

wiki).

P r(X ∈ D) = R

p(x)dx

where D is set of reals and p(x) is density function.

Cumulative distribution function(Continuous case)

F(a) = P r(X <= a) = R a

−∞ p(x)dx

f(a) = dF (x)

dx |x=a

7.3.1 Joint distribution function

If p(x,y) is a joint pdf i.e. for continuous case:

F(a, b) = P r(X <= a, Y <= b) = R b

−∞

R a

−∞ p(x, y)dxdy

p(a, b) = ∂

2F (x,y)

∂x∂y |a,b

For discrete case i.e. p(x,y) is a joint pmf:

F(a, b) = P

x<=a

y<=b

p(x, y)

7.3.2 Marginalization

Marginal probability is then the unconditional probability P(A) of the event A; that is, the probability of A, regardless of whether event B did or did not occur. If B can be thought of as the event

of a random variable X having a given outcome, the marginal probability of A can be obtained

by summing (or integrating, more generally) the joint probabilities over all outcomes for X. For

example, if there are two possible outcomes for X with corresponding events B and B’, this means

that P(A) = P(A

B) + P(A

). This is called marginalization.

Discrete case:

P(X = a) = P

p(a, y)

Continuous case:

Px(a) = R ∞

−∞ p(a, y)dy

7.4 Example

Statement :- X and Y are independent continuous random variables with same density functions.

p(x) = (

−x

if x > 0;

0 otherwise.

7 LECTURE 7 : PROBABILITY 33

Find density X

Note:- They are indepedent.

Solution

F X

(a) = P r(

Y <= a)

R ∞

R ya

p(x, y)dxdy

R ∞

R ya

−x

−ydxdy

= 1 −

a+1

f X

(a) = derivative of F X

(a) w.r.t a

(a+1)2 > 0

7.5 Conditional Density

Discrete case:

pX(

Y =y

) = P(

X=x

Y =y

) = P (X=x,Y =y)

P (Y =y)

Continuous case:

pX(

Y =y

) = pX,Y ( X

)

pY (y) =

pX,Y ( X

)

R ∞

−∞ p(x,y)dx

7.6 Expectation

Discrete case: Expectation is equivalent to probability weighted sums of possible values.

E(X) = ΣixiP r(xi) where X is a random variable

Continuous case: Expectation is equivalent to probability density weighted integral of possible

values.

E(X) = R ∞

−∞ xp(x)dx

If the random variable is a function of x, then Discrete case:

E(X) = Σif(xi)P r(xi) where X is a random variable

Continuous case:

E(X) = R ∞

−∞ f(x)p(x)dx

7 LECTURE 7 : PROBABILITY 34

7.6.1 Properties of E(x)

E[X + Y ] = E[X] + E[Y ]

For any constant c and any random variable X

E[(X − c)

] ≥ E[(X − µ)

]

where µ = E[X]

E[cX] = cE[X]

7.7 Variance

For any random variable X, variance is defined as follows:

V ar[X] = E[(X − µ)

]

⇒ V ar[X] = E[X2

] − 2µE[X] + µ

⇒ V ar[X] = E[X2

] − (E[X])2

V ar[αX + β] = α

2V ar[X]

7.8 Covariance

For random variables X and Y, covariance is defined as:

Cov[X, Y ] = E[(X − E(X))(Y − E(Y ))] = E[XY ] − E[X]E[Y ]

If X and Y are independent then their covariance is 0, since in that case

E[XY ] = E[X]E[Y ]

However, covariance being 0 does not necessarily imply that the variables are independent.

7.8.1 Properties of Covariance

Cov[X + Z, Y ] = Cov[X, Y ] + Cov[Z, Y ]

Cov[ΣiXi

, Y ] = ΣiCov[Xi

, Y ]

Cov[X, X] = V ar[X]

7.9 Chebyshev’s Inequality

Chebyshev’s inequality states that

if X is any random variable with mean µ and variance σ then ∀k > 0

P r[|X − µ| ≥ k] ≤

7 LECTURE 7 : PROBABILITY 35

If n tends to infinity, then the data mean tends to converge to µ, giving rise to the weak law

of large numbers.

P r[|

X1+X2+..+Xn

n − µ| ≥ k] tends to 0 as n tends to ∞

7.10 Bernoulli Random Variable

Bernoulli random variable is a discrete random variable taking values 0,1

Say, P r[Xi = 0] = 1 − q where q[0, 1]

Then P r[Xi = 1] = q

E[X] = (1 − q) ∗ 0 + q ∗ 1 = q

V ar[X] = q − q

2 = q(1 − q)

7.11 Binomial Random Variable

Binomial random variable is discrete variable where the distribution is a series of n experiments

with 0,1 value. The probability that the outcome of a particular experiment is 1 being q.

P r[X = k] = n

qk(1 − q)

n−k

E[X] = ΣiE[Yi

] where Yi

is a bernoulli random variable E[X] = nq

V ar[X] = ΣiV ar[Yi

] (since Yi

’s are independent)

V ar[X] = nq(1 − q)

An example of Binomial distribution can be a coin tossed n times and counting the number of

times heads shows up.

7.12 Central Limit Theorem

If X1, X2, .., Xm is a sequence of i.i.d. random variables each having mean µ and variance σ

Then for large m, X1 + X2 + .. + Xm is approximately normally distributed with mean mµ and

variance mσ

If XN˜(µ, σ2

)

Then P[x] = 1

√2

2π

−(x−µ)

2σ2

It can be shown by CLT

X1+X2+..+Xn−nµ

√2 n N˜(0, 1)

Sample Mean: ˆµN˜(µ, σ

m )

7 LECTURE 7 : PROBABILITY 36

7.13 Maximum Likelihood and Estimator

Estimator is a function of random space which is meant to approximate a parameter.

µˆ is an estimator for µ

Maximum Likelihood estimator ˆqMLE for q

qˆMLE for q (Parameter of Bern(q) from which we get sample data)

L(Xˆ

1, Xˆ

2, .., Xˆ

n|q) = q

Xˆ1 (1 − q)

1−Xˆ1

..qXˆn (1 − q)

1−Xˆn

qˆMLE = argmaxqL(Xˆ

1, Xˆ

2, .., Xˆ

n|q)

E.g.: Bernoulli Random Variable

p(x) = µ

(1 − µ)

(1−x)

D = X1, X2, .., Xmis a random sample

L(Dkµ) = Πiµ

(1 − µ)

(1−xi)

GOAL :

µˆMLE = argmaxµL(D|µ)

Equivalently :

µˆMLE = argmaxµlogL(D|µ)

dlogL(D|µ)

dµ = 0 gives ˆµMLE =

ΣXi

Summary :

1. µˆMLE is a function of the random sample

2. It is called an estimator in terms of Xi

’s

3. It is called an estimate in terms of xi

4. It coincides with sample mean

Recall from CLT

for large m ΣiXi

is similar to N(mµ, mσ2

) if each Xi has

E(Xi) = µ

V (Xi) = σ

Thus :

7 LECTURE 7 : PROBABILITY 37

ΣiXi−mµ

√2 m N˜(0, 1) and ˆµMLEN˜(µ, σ

m )

Question : Given an instantiation of X1, X2, .., Xm called Data D x1, x2, .., xm

You have MLE estimate

ΣXi

m , which is a point estimate

How confident can you be that the actual µ is ΣXi

m ± z for some z, this is called the interval estimate

7.14 Bayesian estimator

L(X1, X2, .., Xn|µ) = Πiµ

Xi (1 − µ)

1−Xi

p(µ) = 1

∀µ(0, 1)θ ≥ 0

R 1

p(µ) dµ = 1 , implies that θ = 1

Posterior = P(µ|x1, x2, .., xn) (Bayesian Posterior)

L(X1,X2,..,Xnkµ)p(µ)

R 1

0 L(X1,X2,..,Xn|µ)p(µ)|,dµ

Σixi 1−µ

Σi1−xi

R 1

µΣixi 1−µΣi1−xi dµ

Bayes Estimate E(µ|x1, x2, .., xn) = R

µP(µ|x1, x2, .., xn) dµ

Expected µ under posterior = Σm

i xi+1

m+2

Expected value under posterior of parameter µ is called bayes estimate

Beta Distribution is the conjugate prior for Bernoulli distribution

β(µ|a, b) = Γ(a+b)µ

(a−1)(1−µ)

(

b−1)

Γ(a)Γ(b)

Note : Prior and Likelihood should have same form for posterior to have same form as prior. If so

,the chosen prior is called a conjugate prior.

8 LECTURE 8 38

8 Lecture 8

8.1 Bernoulli Distribution

The general formula for probability of a random variable ’x’ is

p(x) = µ

(1 − µ)

1−x

The likelihood of the data given µ is

L(D|µ) = Ym

i=1

(1 − µ)

1−xi

Our goal is to find the maximum likelihood estimate ˆµMLE = argmaxµ L(D|µ)

Since log is a monotonically increasing function, we can write ˆµMLE equivalently as :

µˆMLE = argmaxµ LL(D|µ)

where, LL represents the log of the likelihood of the data given µ.

This can also be represented as

µˆMLE = argmaxµ log(L(D|µ))

⇒ µˆMLE = argmaxµ

i=1

[Xi

ln µ + (1 − Xi) ln(1 − µ)]

⇒ µˆMLE = argmaxµ ln µ (

i=1

Xi) + ln(1 − µ)

i=1

(1 − Xi)

such that, 0 ≤ µ ≤ 1

To find the maxima for LL(D|µ), we put d LL(D|µ)

dµ = 0

⇒ µˆMLE =

P i=1 Xi

i=1 Xi +

i=1(1−Xi) =

i=1 Xi

Thus, we know that

(1) ˆµMLE is a function of the random sample.

(2) It is called an estimator in terms of Xis.

(3) It is called an estimate in terms of xis.

(4) It coincides with the sample mean.

From central-limit theorem, we know that for large m

i=1

Xi ∼ N(mµ, mσ2

)

If each Xi has E[Xi

] = µ and V [Xi

] = σ

i.e Xis are normally distributed over m with mean µ and variance σ

8 LECTURE 8 39

Thus,

i=1 Xi − mµ

√

∼ N(0, 1)

and,

µˆMLE ∼ N(µ,

)

Question. Given an instantiation of ( X1, X2, ...., Xm ) called training data D (x1, x2, ...., xm ),

you have Maximum Likelihood Estimation

xi/m

(point estimate). How confident can you be that actual µ is within (Pxi)/m ± Z for some Z.

Answer: Here, we are looking for an interval estimate.

µˆMLE ± Z

µˆ ∗ (1 − µˆ)

where we lookup Z from a table for standard normal distribution value of α for given Z such that

Pr(x≥Z) = α

µ ∈ (ˆµMLE ± Z

µˆ∗(1−µˆ)

m )is(1 − 2α)

8.2 Bayesian Estimation

Likelihood L( X1, X2, ...., Xm|µ) = Qm

i=1 µ

xi ∗ (1 − µ)

1−xi

p(µ) = 1

θ ≥ 0 for all µ ∈ [0, 1]

8 LECTURE 8 40





p(µ) dµ = 1 ⇒ θ = 1

Posterior (Bayesian Posterior) is:

pr(µ|X1, X2, ...., Xm) = L(x1,x2,....,xm|µ)∗p(µ)

R 1

0 L(x1,x2,....,xm|µ)p(µ)dµ

Σxi ∗ (1 − µ)

Σ(1−xi) ∗ 1

R 1

µΣxi ∗ (1 − µ)

Σ(1−xi)dµ

(m + 1)!µ

Σxi∗(1−µ)

Σ(1−xi)

Pxi

!(m −

Pxi)!

Expectation E(µ|x1, x2, ...., xm) =



µ ∗ p(µ|x1, x2, ...., xm)d(µ)

Expected is under posterior =

i=1 xi+1

m+2

Thus if we tossed a coin 2 times and Xi = 1, X2 = 1 then

E(µ|1, 1) = 2+1

2+2 =

µˆB = E[µ|D] = p(µ|D)

Beta(µ|a, b) = Γ(a + b)

Γ(a)Γ(b)

a−1µ

b−1

Beta is conjugate prior to bernoulli distribution and Γ(a+b)

Γ(a)Γ(b)

is a normalization constant

L(x1...xn) = µ

i=1xi

(1 − µ)

i=1(1−xi)

Prior should have the same form as the likelihood with some normalization const

p(µ|D) ∝ L(D|µ)p(µ)

If µprior ≈ Beta(µ|a, b)

p(µ|x1...xn) = L(x1...xn|µ)P(µ) = L(x1...xn|µ)Beta(a, b) = Z 1

L(x1...xn|µ)Beta(µ|a, b)du

Γ(m + a + b)µ

i=1xi+a−1

(1 − µ

i=1(1−xi)+b−1

)

Γ(Σn

i=1xi + a)Γ(Σn

i=1(1 − xi) + b)

≈ Beta(Σn

i=1xi + a, Σ

i=1(1 − xi) + b)

EBeta(a,b)(µ|x1...xn) = a

a + b

EBeta(a+Σn

i=1x,b+Σn

i=1(1−xi))(µ|x1...xn) = Σ

i=1xi + a

Σn

i=1(1 − xi) + Σn

i=1xi + a + b

for large m, a and b m

8 LECTURE 8 41

µˆBayes → µˆMLE

Let us say we make k more obsrvations: y1...yk

≈ Beta(Σn

i=1xi + Σn

i=1yi + a, Σ

i=1(1 − xi) + Σn

j=1(1 − yj ) + b)

Multinomial

Say : observation are dice tossing, there are n possible outcomes and each modeled by a vector Xk

= (010203.....1k.....)

i = 1 (Note: Xk

k = 1 and Xk

i = 0 for all i 6= k )

Also, p(X = x

) = yk ≈ such that Pn

k=1 µk = 1

Then, p(X = x) = Qn

i=1 µ

E(x) = Xn

k=1

∗ p(Xk

) =







µ1

µ2

µ3

µn







9 LECTURE 9 : MULTINOMIAL DISTRIBUTION 42

9 Lecture 9 : Multinomial Distribution

This lecture was a continuation of the previous discussion, where we were discussing Multinomial

Distribution.Here we discuss about conjugate prior and posterior of Multinomial Distribution. Then

we extend the discussion to Gaussian Distribution.

Question: What will be conjugate prior αi

’s, which are params of Multinomial?

Answer: Joint Distribution.

P (µ1, . . . µn|α1, . . . αn) ∝ π

i=1µ

αi−1

(53)

Note: For normalising constant, if P(x) ∝ f(x)

P(x) = 1

f(x)dx f(x)

Since

R n

i=1 P (µ1, . . . µn|α1, . . . αn) = 1

By Integrating, we get normalisation constant.

P (µ1, . . . µn|α1, . . . αn) = Γ(Pn

i=1 αi)

i=1Γ(αi)

i=1µ

αi−1

(54)

which follows Dir(α1 . . . αn)

Dirichlet distribution is a generalisation of Beta distribution, just as multinomial is generalisation

of Bernouli distribution.

9.0.1 Posterior probability

P (µ1, . . . µn|X1, . . . Xm) = P (X1,...Xm|µ1,...µn)P (µ1,...µn)

P (X1,...Xm)

⇒ P (µ1, . . . µn|X1, . . . Xm) = Γ(Pn

i=1 αi + m)

i=1Γ(αi +

k=1 Xk,i)

i=1µ

(αi−1+Pm

k=1 Xk,i)

(55)

9.0.2 Summary

For multinomial, the mean at maximum likelihood is given by:

µˆiMLE =

k=1 Xk,i

(56)

Conjugate prior follows Dir(α1 . . . αn)

Posterior is Dir(. . . αi +

k=1 Xk,i . . .)

The expectation of µ for Dir(α1 . . . αn) is given by:

E [µ]Dir(α1...αn) =

α1

. . . P

α1

αi

(57)

9 LECTURE 9 : MULTINOMIAL DISTRIBUTION 43

The expectation of µ for Dir(. . . αi +

k=1 Xk,i . . .) is given by:

E [µ]Dir(...αi+

k=1 Xk,i...) =

α1 +

k Xk,1

α1 + m

. . .

αj +

k Xk,j

αi + m

. . .

(58)

Observations:

α1 . . . αn = 1 ⇒ Uniform Distribution

As m→ ∞ ⇒ µˆBayes → µˆMLE

If m=0, ˆµBayes = µprior

9.1 Gaussian Distribution

9.1.1 Information Theory

Let us denote I(X=x) as the measure of information conveyed in knowing value of X=x.

Figure 15: Figure showing curve where Information is not distributed all along.

Figure 16: Figure showing curve where Information is distributed.

Question:Consider the following two graphs, say you know probability function p(x), then when is

knowing value of X more useful(carries more information).

Ans: It is more useful in the case(2), because more information is conveyed in Figure 15 than in

Figure 16.

9 LECTURE 9 : MULTINOMIAL DISTRIBUTION 44

9.1.2 Expectation for I(X=x):

If X and Y are independant random variables from same distribution.

I(X = x, Y = x) = I(X = x) + I(Y = y) (59)

The above equation can be equivalently stated as follows:

I(P(x)P(y)) = I(P(x)) + I(P(y)) (60)

where P(x),P(y) are the probability functions respectively.

If p(x)>P(y) , then

I(p(x))<I(p(y))

There is only one function which satisfies the above two properties.

I(p(x)) = −c log(p(x)) (61)

The Entropy in the case of discrete random variable can be defined as:

EP [I(p(x))] = X

−c log[p(x)] (62)

In the case of continuous random variable it is,

EP [I(p(x))] = Z

−c log[p(x)] (63)

The constant ’C’ in the above two equations is traditionally 1.

9.1.3 Observations:

For Discrete random variable(∼ countable domain), the information is maximum for Uniform

distribution.

For Continuous random variable ( ∼ Finite mean and finite variance), the information for

Gaussian Distribution.

Finding argmaxpEp ∼ Infinite domain, subject to

xp(x)dx = µ, R

(x − µ)

2p(x)dx = σ

The solution would be

p(x) = e

−(x−µ)

2σ2

√

2πσ2

9 LECTURE 9 : MULTINOMIAL DISTRIBUTION 45

9.1.4 Properties of gaussian univariate distribution

If X ∼ N(µ, σ2

)

p(x) = 1

√

2Π e

−(x−µ)

2σ2 where − ∞ < x < ∞

then w1X + w0 ∼ N(w1µ + w0, w2

1σ

)

(can prove this using moment generating function)

Φ(N(µ, σ2

)) = EN(µ,σ2)

tx] = e

µt+

(σt)

Recall

E(X) = dφ(p)

var(x) = d

2φ(p)

dt2

EN(µ,σ2)

t(w1x+w0)

] = (w1µt + w0t +

(σt)

2 × w

) ∼ N(w1µ + w0, w2

1σ

)

Sum of i.i.d X1, X2, ......, Xn ∼ N(µ, σ2

) is also normal(gaussian)

X1 + X2 + ...... + Xn ∼ N(nµ, nσ2

)

In genaral if Xi ∼ N(µi

, σ2

) =⇒

i=1 Xi ∼ N(

Pµi

Pσ

)

Corollary from (1) If X ∼ N(µ, σ2

)

z =

X−µ

σ ∼ N(0, 1) (Useful in setting interval estimate)

(take w1 =

and w0 =

)

Maximum Likelihood estimate for µ and σ

Given X1, X2, ....Xm..... Random Sample.

µˆMLE = argmaxµ

i=1[

√

2π

−(Xi−µ)

2σ2 ]

= argmaxµ

√

2π

−

P(Xi−µ)

2σ2

µˆMLE =

i=1 Xi

m = sample mean

With out relaying on central limit theorem Properties (2) and (1)

9 LECTURE 9 : MULTINOMIAL DISTRIBUTION 46

i.e. Sum of i.i.d’s X1, X2, ......, Xn ∼ N(µ, σ2

)

µˆMLE = N(µ, σ

m )

Similarly

σˆ

MLE =

i=1(Xi−µˆMLE)

m is χ

2 distrbution

∼ χ

Note:- If X1, X2, ....Xm ∼ N(0, 1)

i X2

i ∼ χ

m m-degree of freedom

Coming up with conjugate prior of N(µ, σ2

)

Case (1) σ

is fixed and prior on µ

⇒ µ ∼ N(µ0, σ2

)

Case (2) µ is fixed and σ

2 has prior

⇒ σ

2 ∼ Γ

case (3) if µ and σ

2 both having the prior

⇒ (µ, σ2

) ∼ Normal gamma distribution ∼ Students-t distribution

10 Lecture 10 : Multivariate Gaussian Distribution

We start the lecture by discussing the question given in the previous lecture and then move over to

Multivariate Gaussian Distribution.

The question was : If X1...Xm ∼ N(µ, σ2

) . Then assuming σ

is known

µˆML =

i=1

ˆσ

2ML =

i=1

(Xi − µ)

Here ˆσ

2ML follows the chi − squared distribution

Figure 10 : Figure showing the nature of the (chi − square) distribution of ˆσ

2ML

LL(D|µ, σ) =

i=1

(Xi − µ)

2σ

− ln(σ

√

2π)

ie, L(D|µ, σ) = 1

√

2π

i=1

exp(

−(Xi − µ)

2σ

)

Question : What is the conjugate prior p(µ) if σ

is known?

Answer : Gaussian distribution.

Question : What is the conjugate prior p(σ

) if µ is known?

Answer : Gamma distribution.

Question : What is the conjugate prior p(µ, σ2

) if both are unknown?

10 Lecture 10 : Multivariate Gaussian Distribution 48

Answer : If Xi ∼ N(0, 1) then

i=1

Xi ∼ X 2

m and

y = X

∼ tn . (where z ∼ N (0, 1))

Here y follows students-t distribution

Figure 10: Figure showing students-t distribution of y

10.1 Multivariate Gaussian Variable

Definition : If X ∼ N µΣ) (where x ∈ Rn)) then

p(x) = 1

(2π)

2 |Σ|

exp

−(x − µ)

T Σ

−1

(x − µ)

(Note : In pattern recognition, (x − µ)

T Σ

−1

(x − µ) is called the Mahalanobis distance between x

and µ. If Σ = I, it is called Eucledian distance. )

i) Σ can be assumed to be symmetric.

A = Asym + Aantisym

ii) If Σ is symmetric

Σ = Xn

i=1

λi(qiq

) where qi

’s are orthogonal.

Here Σ−1 =

i=1

λi

qiq

0 = Q(x − µ) (Q is a matrix of q

s as column vectors)

p(x

0) = Yn

j=1

(

2πλj

)exp

−

i=1

)

λi

2πλj

Here the joint distribution has been decomposed as a product of marginals in a shifted and rotated co-ordinate system.

10 Lecture 10 : Multivariate Gaussian Distribution 49

ie, x

s are independent.

LL(x1...xm|µ, Σ) = −n

ln(2π) −

ln[|Σ|] −

−1

i=1

((xi − µ)

T Σ

−1

(xi − µ))

Set ∇µLL = 0, ∇ΣLL = 0

∇µLL = [−1

X2(xi − µ)]Σ−1 = 0

Since Σ is invertible,

X(xi − µ) = 0

ie, µ =

Xxi

ΣˆML =

i=1

(xi − µˆML)(xi − µˆML)

Here ΣˆML is called emperical co-variance matrix in statistics.

µˆML ∼ N(µ, Σ)

E[ˆµML] = µ

Here ˆµML is an unbiased estimator.

10.1.1 Unbiased Estimator

An estimator e(θ) is called unbiased estimator of θ if E[e(θ)] = θ

If ei(θ), e2(θ), ..., ek(θ) are unbiased estimators and X

i=1

λi = 1 then X

i=1

λiei(θ) is also unbiased

estimator.

Since E(ΣˆML) = m − 1

Σ, ΣˆML is a biased estimator.

An unbiased estimator for Σ is therefore 1

m − 1

i=1

(xi − µˆML)(xi − µˆML)

Question : If ∼ N(0, σ2

)

y = wT φ(x) + where w, φ(x) ∈ Rn

10 Lecture 10 : Multivariate Gaussian Distribution 50

then y ∼ N(wT φ(x), σ2

)

p(y|x, w) = 1

√

2πσ2

exp(

(y − φ

(x)w)

2σ

)

E[Y (w, x)] = wT φ(x) = wT

0 + wT

1 φ1(x) + ... + wT

n φn(x)

φ(x) = [1 φ1(x) ... φn(x)]

Given random sample D

D =







y1 φ1(x)

y2 φ2(x)

: :

yn φn(x)







LL(y1...xm|x1...xm, w, σ2

) = −m

ln(2πσ2

) −

2σ

i=1

(wT φ(x)i − yi)

Given σ

wˆ ML = argmax LL(y1...ym|x1...xm, w, σ2

)

= argmax Xm

i=1

(wT φ(x)i − yi)

10.2 Dealing with Conjugate Priors for Multivariate Gaussian

The congjugate prior for multivariate gaussian distibution if µ, σ2 are known is given as

P(µ) = N (µ0, σ2

)

P(x) ∼ N (µ, σ2

)

P(µ|x1...xm) = N (µm, σ2

µm = ( σ

mσ2

0 + σ

µ0) + ( mσ2

mσ2

0 + σ

µˆML)

11 Lecture 11

11.1 Recall

For bayesian estimation in the univariate case with fixed σ where µ ∼ N (µ0, σ2

0) and x ∼ N (µ, σ2

)

µm

µˆmle + µ0

such that p(x|D) ∼ N (µm, σm

2). m/σ2 is due to noise in observation while 1/σ2

is due to

uncertainity in µ. For the Bayesian setting for the multivariate case with fixed Σ

x ∼ N (µ, Σ), µ ∼ N(µ0, Σ0) & p(x|D) ∼ N (µm, Σm)

−1

m = mΣ

−1 + Σ−1

−1

m µm = mΣ

−1µˆmle + Σ−1

0 µ

11.2 Bayes Linear Regression

The Bayesian interpretation of probability can be seen as an extension of logic that enables reasoning

with uncertain statements. Bayesian linear regression is a Bayesian alternative to ordinary least

squares regression.

y = w

T φ(x) + ε

ε ∼ N(0, σ2

)

w ∼ N(0, Σ0)

wˆMLE = (φ

T φ)

−1y

Finding µm and Σm :

−1

m µm = Σ−1

0 µ0 + φ

y/σ2

−1

m = Σ0 +

T φ

Setting Σ0 = αI and µ0 = 0

−1

m µm = φ

y/σ2

−1

m µm =

I + φ

T φ/σ2

µm =

(I/α + φ

T φ/σ2

)

−1φ

T y

But since σ

2/α is nothing but the λ in ridge regression, this can be written as

µm = (λI + φ

T φ/σ2

)

−1φ

−1

m =

T φ

11 Lecture 11 53

which are similar to the results obtained in ridge regression.

What is the Bayes Estimator here?

wˆBayes = Ep(w|X1...Xm)

[w]

= (φ

T φ +

)

−1φ

which is the least square solution for ridge regression.

wˆMAP = argmaxw p(w|X1...Xn)

which is the point w at which posterior distribution peak (mode).

For Gaussian distribution, mode is the same as the mean.

mean = ˆwBayes. (64)

Find P(y|X1...Xm) ∼ for linear regression

In the context of multivariate gaussian.

p(x|D) = p(X|X1...Xm)

p(x|µ)p(µ|D)dµ

∼ N(µn, Σ + Σn)

Point? p(x|D)

MLE ˆθMLE = argmaxθ LL(D|θ) p(x|θMLE)

Bayes Estimator ˆθB = Ep(θ|D)E[θ] p(x|θMLE)

MAP ˆθMAP = argmaxθ p(θ|D) p(x|θMAP )

Pure Bayesian p(θ|D) = p(D|θ)p(θ)

p(D|θ)p(θ)dθ

p(D|θ) = Ym

i=1

p(xi

|θ)

p(x|D) = Z

p(x|θ)p(θ|D)dθ

where θ is the prior vector.

11 Lecture 11 54

11.3 Pure Bayesian - Regression

p(y|X1...Xm) = Z

p(y|w)p(w|D)dw

where

y = w

T φ(X) + ε

ε ∼ N(0, σ2

)

w ∼ N(0, αI)

∼ N(µ

mφ(x), σ2

Also we have

m = σ

2 + φ

T Σmφ

−1

m =

T φ

µm =

Σmφ

T y

µm = (φ

T φ +

)

−1φ

Since x ∼ N(µ, σ2

), we have αx + β = N(αµ + β, α2σ

). And since we know that ε ∼ N(0, σ2

Using y = w

T x + ε, we get

y ∼ N(w

T x, σ2

)

11.4 Sufficient Statistic

A Statistic is called sufficient for θ if p(D|s, θ) is independent of θ. It can be proved that a statistic

s is sufficient for θ iff p(D|θ) can be written as p(D|θ) = g(s, θ)h(D). For case of gaussian we have

g(s, µ) = exp(−m/2µ

T Σ

−1µ + µ

T Σ

−1Xn

i=1

xi)

h(x1, x2...xm) = 1/2π

nm/2

|Σ|

m/2

exp(−1/2

i=1

i Σ

−1xi)

Thus, we see that for the normal distribution, p(D|µ) = g(s, µ)h(D).

11.5 Lasso

We have Y = wT φ(x) + ε where ε ∼ N(0, σ2

). Here w ∼ Laplace distribution. Then it turns out

that ˆwMAP with laplace prior is ˆwMAP = argmaxw

i=1

||w

T φ(xi) − yi

||2 + λ||w||l1

Here λ||w||l1 is basically the penalty function, also called lasso. Recall ˆwMAP with guassian prior

is ˆwMAP = argmaxw

i=1

||w

T φ(xi) − yi

||2 + λ||w||2

11 Lecture 11 55

Here λ||w||2

is basically penalty function also called ridge regression.

Refer Section 7 from Sir’s notes for more details on this topic.

Lasso generally yields sparser solutions. What this means is that if you have φ1(x), φ2(x)....φk(x)

each of them with weights w1, w2...wk We may want to reduce the number of non-zero k. In general,

if we use ridge regression those parameter which are irrelevant may have some small weights but

lasso tends to set have has zero weights

12 LECTURE 12 : BIAS-VARIANCE TRADEOFF 56

12 Lecture 12 : Bias-Variance tradeoff

Key terms : Bias, Variance, Expected loss, Noise

When we design a particular machine learning model (function), we wish that it should be able

to predict the output accurately and it should be able to do it independent of the sample training

data it has been trained with. In the following section we are going to show that there is a trade

off between the two main objectives. Let us understand the terms bias and variance. Variance of

a machine learning model is the variance in the prediction of a value when trained over different

training data. Bias is related to how much the prediction varies from the actual observed values in

the training data.

12.1 Expected Loss

Suppose we are interested in finding the expected loss value of the function. We are given a training

data TD, a distribution over x and target variable y say P(x, y) and desired fuction f(x) (here by

f(x) , we actually mean f(x, TD) , i.e. a function trained with respect to the training data TD)to

approximate y. We want to find out the expected loss of a model with respect to all the data we

have in our distribution. Usually squared error is chosen as a measure of loss the model. First we

will derive the loss for a training data and then for the whole distribution.

L(y, x, f(x)) = (f(x) − y)

We want to find a function f(.) whose expected value of the loss over the instances in the training

data, is minimum. Find

argminf Ef,P (x,y)

[L(y, x, f(x))]

where Ef,P (x,y)

[L(y, x, f(x))] denotes expectation of the loss with a fixed f() and P(,).

Ef,P (x,y)

[L] = Z

L(y, x, f)P(x, y)dxdy

= Z

(f(x) − y)

2P(x, y)dxdy

= Z

(f(x) − E[y/x] + E[y/x] − y)

2P(x, y)dxdy

= Z

(f(x) − E[y/x])2P(x, y)dxdy +

(E[y/x] − y)

2P(x, y)dxdy

+ Z

2(f(x) − E[y/x])(E[y/x] − y)P(x, y)dxdy

Let us rewrite the third term.

2(f(x) − E[y/x])(E[y/x] − y)P(x, y)dxdy =

2(f(x) − E[y/x]) R

(E[y/x] − y)P(y/x)dyP(x)dx(65)

Since R

yP(y/x)dy = E[y/x], the inner integral in (65) is 0. So

Ef,P (x,y)

[L] = Z

(f(x) − E[y/x])2P(x, y)dxdy +

(E[y/x] − y)

2P(x, y)dxdy (66)

Here the second term is independent of f(). We have to consult only the first term to find out the

f() which will minimize th expecation. From the first term, it is clear that minimum loss will

12 LECTURE 12 : BIAS-VARIANCE TRADEOFF 57

be obtained when f(x) equals E[y/x]. The minimum expected loss is given by

Ef,P (x,y)

[L] = Z

(E[y/x] − y)

2P(x, y)dxdy (67)

This is the minimum loss we can expect for a given training data. Now let us find out the

expected loss over different training data. Let

ETD [f(x, TD)] = Z

f(x, TD)p(TD)dTD

Earlier we have found that the only tweakable component in expected loss is (f(x)−E[y/x])2

. Now

we will find out the expected loss over all the trainind data by finding expectation of the expression

over the distribution of training data.

(f(x, TD) − E[y/x])2

p(TD)dTD = ETD [(f(x, TD) − E[y/x])2

]

= ETD

f(x, TD) − ETD [f(x, TD)] + ETD [f(x, TD)] − E[y/x]

= ETD

f(x, TD) − ETD [f(x, TD)] 2

ETD [f(x, TD)] − E[y/x]

− 2

f(x, TD) − ETD [f(x, TD)] ETD [f(x, TD)] − E[y/x]

Since

ETD

f(x, TD)

= ETD [f(x, TD)]

and the other factors are independent of TD, the third term vanishes. Finally

(f(x) − E[y/x])2

p(TD)dTD = ETD

f(x, TD) − ETD [f(x, TD)] 2

ETD [f(x, TD)] − E[y/x]

= ETD

f(x, TD) − ETD [f(x, TD)] 2

ETD [f(x, TD)] − E[y/x]

= V ariance + Bias2

Variance of f(x, D) = ETD

f(x, TD) − ETD [f(x, TD)] 2

and Bias = ETD [f(x, TD)] − E[y/x]

Putting back in (66) the Expectedloss = V ariance + Bias2 + Noise

Let us try to understand what this means. Consider the case of regression. The loss of the

prediction depends on many factors such as complexity of the model (linear, ..) the parameters and

the measurements etc. The noise in the measurement can cause loss of prediction. That is given

by third term. Similarly the complexity of the model can contribute to the loss.

If we were to take the linear regression with a low degree polynomial, we are introducing a bias

that the dependency of the predicted variable is simple. Similarly when we add a regularizer term,

we are implicitly telling that the weights are not big, is also a kind of bias. The prediction we

otained may not be accurate. In these cases the predicted values may not have much correlation

with the sample points we took. So the predictions remains more or less the same over different

samples. That is for different samples the predicted values does not vary much. The prediction is

more generalizable over the samples.

Suppose we complicate our regression model by increasing degree of the polynomial used. As we

have seen in previous classes, we used to obtain a highly wobbly curve which pass through almost

all points in the training data. This is an example for less bias. For a given training data our

prediction could be very good. If we were to take another sample data we would have obtained

12 LECTURE 12 : BIAS-VARIANCE TRADEOFF 58

another curve which pass through all the new points but with a drastic difference from the current

curve. Our predictions are accurate for the training sample chosen, but at the same time they are

highly correlated to the sample we have chosen. For different training data chosen, the variance of

the prediction is very high. So the model is not generalizable over the samples.

We saw that when we decrease the bias the variance increases and vice versa. The more complex

the model is, the less bias we have and more the variance. Both are contrary to each other. The

ideal complexity of the model should be related with the complexity of the actual relation between

the dependent and independent variable.

I recommend the reference [4] for a good example.

13 LECTURE 13 59

13 Lecture 13

These are the topics discussed in today’s lecture:

1. Conclude Bias-Variance

2. Shrinkage - Best Subset

3. Mixture models

4. Empirical Bayes

13.1 Conclude Bias-Variance

13.1.1 Summary

(f(x) − E[y/x])2

dx dy dTD =

(

ETD

f(x, TD) − E[y/x]

(1)

+ ETD

f(x, TD) − ETD

f(x, TD)

)

dx (2)

−

E[y/x] − y

dx dy (3)

In the above equation, TD represents the random sample. x, y represent the data distribution.

E[Y/X] is the expected value optimal with respect to least squares.

(1) represents the bias,

(2) represents the variance and

(3) represents the intrinsic noice. More the flexbility of the hypothesis function,

less is the bias and more is the variance

Now, does this analysis apply to Bayesian Regression?

E[X/TD] is something like a posterior distribution and expected value is there. But

there is no concept of f(x, TD) which is the posterior estimate.

13.1.2 Bayesian Linear Regression(BLR)

P(y/x, yT

, α, σ2

) = N(µm

T φ(x), σ2m)

where

µm =

σmφ

T yT and

13 LECTURE 13 60

Σm

−1 =

I +

T φ

equation

The Basis function:

T =







φ1(x1) φ2(x1) .... ....

φ1(x2) φ2(x2) .... ....

.... .... .... ....







If two points are far way from mean, but close to each other then φ

T φ will increase and Σm

−1 will also increase. Therefore variance will decrease.

This can also be interpreted as, points which are far apart are positively contributing

by giving less variance i.e; less uncertainity.

Also, Uncertainity with the prior is represented by 1

I, and uncertainity in the

estimator that we are explicitly modelling for, is represented by 1

α2 φ

T φ.

Assume, the peak points of the two

gaussian distributions are at φ(x1) and φ(x2).

The point represented in red has

equal contributions from both the

gaussians.

Now, If φ(x1)

T φ(x2) is large, assuming

φ is normalized, standard deviation increases. In gaussian distribution, at regions away from the mean, φ(xi) will be

13 LECTURE 13 61

large.

13.1.3 General Problems with Standard

Distribution

1. Single Mode - N(µ, Σ)

2. A non-trivial Σ =⇒ nCr parameters. As dimension of data n grows, n

2 will grow

in Σ

3. Search space is sparse.

Mixture of Gaussians

p(x) = Xn

i=1

αipi(x|z = i) (pi(x) is a different distribution)

i=1

αi = 1

p =⇒ p is a convex combination of individual distribution.

What is form of distribution ?

For the i

th distribution taking αi as the mean where αi

is a multinomial

Ex : Mixture of Gaussians :

13 LECTURE 13 62

p(x) = Σn

i=1αiN(x|µi

, Σi)

X ∼ N(µi

, Σi)

p(x) = p(x|µi

, Σi)

Issues

1. Number of K’s

2. Estimating µi

’s and Σi

’s

3. Estimating αi

’s

Classification Perspective

Assume data :







X1, 1

X2, 1

.... ..

Xm, 3







(z is a class label)

Question : What is MLE ??

=⇒ It is a Classification Problem

If z value is given :

P(z = t|y) = P(x|z = t)P(z = t)/P(x)

It is a supervised classfication problem.

13 LECTURE 13 63

Given data :







....







We have to estimate using hidden variables as z is not explicitly given. (EM algorithm which can be shown to converge).

Target

1. Implicit estimation of z E[x|xi

]

2. Estimate µi

’s with E[z|xi

] in place of zi

i.e







X1, E[z|x1]

X2, E[z|x2]

.... ..

Xm, E[z|xm]







Estimating number of K’s is hard.It also can be classified as a modelling problem

and thus

number of K’s depend on the model we chose.

EM Algorithm (Source : Wikipedia)

The Expectation-maximization algorithm can be used to compute the parameters

of a parametric mixture model distribution (the ai

’s and θi

’s). It is an iterative algorithm with two steps: an ”expectation step” and a ”maximization step”..

The expectation step

With initial guesses for the parameters of our mixture model, ”partial membership” of each data point in each constituent distribution is computed by calculating

[[expectation value]]s for the membership variables of each data point. That is, for

13 LECTURE 13 64

each data point xj and distribution Yi

, the membership value yi,j is:

yi,j =

aifY (xj

; θi)

fX(xj )

The maximization step

With expectation values in hand for group membership, “plug-in estimates“ are

recomputed for the distribution parameters.

The mixing coefficients ai are the arithmetic mean’s of the membership values

over the N data points.

ai =

j=1

yi,j

The component model parameters θi are also calculated by expectation maximization using data points xj that have been weighted using the membership values. For

example, if θ is a mean µ

µi =

yi,jxj

yi,j

With new estimates for ai and the θi

’s, the expectation step is repeated to recompute new membership values. The entire procedure is repeated until model

parameters converge.

13.2 Emperical Bayes

There are two approaches to solve the equation

Pr(y | D) = Pr(y |< y1, φ(x1) >, < y2, φ(x2) >, ... < yn, φ(xn) >)

RRR Pr(y | w, σ2

) Pr(w | y, α, σ ¯

) Pr(α, σ2

| y¯), dw dα dσ2

where ¯y is the data D

13.2.1 First Approach: Approximate the posterior

The first approach involves approximating the posterior i.e, the second term Pr(w |

y, α, σ ¯

) as wMAP , i.e, as the mode of the posterior distribution of w which is gaus-

13 LECTURE 13 65

sian. Note that as the number of data points keep increasing φ

T φ keeps increasing,

hence from the relation

−

m1 = Σ−

1 + 1

σ2 φ

T φ

it is clear that the posterior variance decreases, hence the distribution of w peaks.

13.2.2 Second Approach: Emperical Bayes

The second approach is to emperically assume some value of the hyperparameters α

and σ

, say ˆα and ˆσ

for which the posterior will peak. i.e, we have

Pr(y | D) ≈

2 Pr(y | wMAP , σ2

) Pr(α, σ2

| y¯) Pr(w | y, α, σ ¯

), dw dα dσ2

≈

Pr(y | w, σ2

) Pr(w | y, ¯ α, ˆ

ˆσ

) dw for the chosen ˆα and ˆσ

≈ N(φ

T µmσ

m)N(µm, Σm)

Emperical Bayes finds the ˆα and ˆσ

2 such that Q

i Pr(yi

| TD) i.e, conditional

likelihood is maximised.

13.2.3 Sove the eigenvalue equation

(

T φ)ui = λiui

define parameter γ as γ =

λi

α+λi

then emperical ˆα is ˆα =

µTmµm

and emperical ˆσ

2 is ˆσ

2 =

m−γ

i=1(Yi − µ

mφ(xm))2

14 LECTURE 14 : INTRODUCTION TO CLASSIFICATION 66

14 Lecture 14 : Introduction to Classification

The goal in classification is to take an input vector x and assign it to one of D

discrete classes

f(x) : R

n → D

There are many techniques of performing the task of classification. The two main

types are

1. Using a discriminant function: For e.g.: Consider D = {+, -}.

y =

(

+ if f(x) ≥ 0

− if f(x) < 0

Here y : {f(x) = 0} is called disciminant / decision surface (decision boundary)

Examples: Least square, Fischer disciminant, Support Vector

Machines, Perceptron, Neural Networks

2. Probabilistic Classification

(a) Disciminative models: Here we model P r(y D | x) directly. For e.g.: we

can say that P r(D = + | data) comes from a multinomial distribution.

Examples: Logistic Regression, Maximum Entropy models, Conditional Random Fields

(b) Generative models: Here we model P r(y D | x) by modeling P r(x | y D)

For Example:

P r(x | y = c1) ∼ N(µ1, Σ1)

P r(x | y = c2) ∼ N(µ2, Σ2)

We can find P r(y | x) as

P r(y | x) = P r(x | y)P r(y)

ΣyP r(x | y)P r(y)

Here, P r(y = c1), P r(y = c2) are called priors or mixture components.

Examples: Naive Bayes, Bayes Nets, Hidden Markov Models

15 LECTURE 15: LINEAR MODELS FOR CLASSIFICATION 67

15 Lecture 15: Linear Models for Classification

The goal in classification is to take an input vector x and assign it to one of K

discrete classes Ck where k = 1, ..., K. In most cases, the classes are taken to be

disjoint, so that each input is assigned to one and only one class. The input space is

thus divided into decision regions whose boundaries are called decision boundaries

or decision surfaces. We consider only linear models for classification in this lecture,

which means that the decision surfaces are linear functions of the input vector x

and hence are defined by (D −1)-dimensional hyperplanes within the D-dimensional

input space.

The simplest method of classification (for 2 classes) is to design a function f such

that

f(xi) = (

vc+

if xi ∈ C+

vc− if xi ∈ C−

15.1 Generalized linear models

In these models we adopt linear regression to model the classification problem. This

is done by modeling a function f as follows:

f(x) = g(w

T φ(x))

where g is known as activation function and φ the vector of basis functions. Classification is achieved by:

g(θ) = (

vc+

if θ > 0

vc− if θ < 0

The decision surface in this case is given by wT φ(x) = 0

15.2 Three broad types of classifiers

1. The first method involves explicit construction of w for wT φ(x) = 0 as the

decision surface.

2. The second method is to model P(x|C+) and P(x|C−) together with the prior

probabilities P(Ck) for the classes, from which we can compute the posterior

probabilities using Bayes’ theorem

P(Ck|x) = P(x|Ck)P(Ck)

P(x)

These types of models are called generative models.

15 LECTURE 15: LINEAR MODELS FOR CLASSIFICATION 68

3. The third method is to model P(C+|x) and P(C−|x) directly. These types of

models are called discriminative models. In this case P(C+|x) = P(C−|x) gives

the required decision boundary.

15.2.1 Examples

An example of generative model is as follows:

P(x|C+) = N (µ+, Σ)

P(x|C−) = N (µ−, Σ)

With prior probabilities P(C+) and P(C−) known, we can derive P(C+|x) and

P(C+|x). In this case it can be shown that the decision boundary P(C+|x) = P(C−|x)

is a hyperplane.

An example of discriminative model is

P(C+|x) = e

wT φ(x)

1 + ewT φ(x)

P(C−|x) = 1

1 + ewT φ(x)

Examples of first model (which directly construct the classifier) include

Linear Regression

Perceptron

Fisher’s Discriminant

15.3 Handling Multiclasses

We now consider the extension of linear discriminants to K > 2 classes. One solution

is to buid a K-class discriminant by combining a number of two-class discriminant

functions.

one-versus-the-rest: In this approach, K − 1 classifiers are constructed, each of

which separtes the points in a particular class Ck from points not in that classes

one-versus-one: In this method, KC2 binary discriminant functions are introduced, one for every possible pair of classes.

Attempting to construct a K class discriminant from a set of two class discriminants leads to ambiguous regions. The problems with the first two approaches are

illustrated in the figures 1 and 2, where there are ambiguous regions marked with

’?’.

15 LECTURE 15: LINEAR MODELS FOR CLASSIFICATION 69

Figure 17: Illustrates the ambiguity in one-versus-rest case

Figure 18: Illustrates the ambiguity in one-versus-one case

15 LECTURE 15: LINEAR MODELS FOR CLASSIFICATION 70

15.3.1 Avoiding ambiguities

We can avoid above mentioned difficulties by considering a single K-class discriminant comprising K functions gCk

(x). Then x is assigned to a class Ck that has the

maximum value for gCk

(x)

If gCk

(x) = wT

φ(x) the decision boundary between class Cj and class Ck is given

by gCk

(x) = gCj

(x) and hence corresponds to

Ck − w

)φ(x) = 0

15.4 Least Squares approach for classification

We now apply the Least squares method to the classification problem. Consider a

classification problem with K classes. Then the target values are represented by a

K component target vector t. Each class is described by its own model

yk(x) = w

k φ(x)

where k ∈ {1, ...K}. We can conveniently group these together using vector notation

so that

y(x) = WT φ(x)

where W is a matrix whose k

th column comprises the unknown parameters wk and

φ(x) is the vector of basis function values evaluated at the input vector x. The

procedure for classification is then to assign a new input vector x to the class for

which the output yk = wT

k φ(x) is largest.

We now determine the parameter matrix W by minimizing a sum-of-squares error

function. Consider a training data set {xn, tn} where n ∈ {1, .., N}, where xn is input

and tn is corresponding target vector. We now define a matrix Φ whose n

th row is

given by φ(xn).

Φ =





φ0(x1) φ1(x1) . . . φK−1(x1)

φ0(x2) φ1(x2) . . . φK−1(x2)

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

φ0(xN ) φ1(xN ) . . . φK−1(xN )





We further define a matrix T whose n

th row is given by the vector t

. Now, the

sum-of-squares error function can then be written as

err(W) = 1

T r{(ΦW − T)

(ΦW − T)}

We can now minimize the error by setting the derivative with respect to W to zero.

The solution we obtain for W is then of the form

W = (Φ

T Φ)

−1Φ

15 LECTURE 15: LINEAR MODELS FOR CLASSIFICATION 71

Figure 19: Data from two classes classified by least squares (magenta) and logistic (green)

Figure 20: Response of the classifiers to addition of outlier points

15.4.1 Limitations of Least Squares

Even as the least-squares approach gives a closed form solution for the discriminant

function parameters, it suffers from problems such as lack of robustness to outliers.

This is illustrated in figures 3 and 4 where we see that introduction of additional data

points in the figure 4 produce a significant change in the location of the decision

boundary, even though these points would be correctly classified by the original

boundary in figure 3. For comparison, least squares approach is contrasted with

logisitc regression, which remains unaffected due to the additional points.

16 LECTURE 16 72

16 Lecture 16

16.1 Introduction

We will discuss the problems of the Linear regression model for classifications. We

will also look at some of the possible solutions of these problems. Our main focus is

on two class classification problem.

16.2 Problems of linear regression

The following are the problems with linear regression model for classification:

1. Sensitivity to outliers

2. Masking

16.2.1 Sensitivity to outliers

Outliers : They are points which have noise and adversely affect the classification.

Figure 21: Outliers

In the right hand figure , the separating hyperplane has changed because of the

outliers.

16.2.2 Masking

It is seen empirically that linear regression classifier may mask a given class. This

is shown in the left hand figure. We had 3 classes one in between the other two. The

between class points are not classified.

16 LECTURE 16 73

Figure 22: Masking

The right hand figure is the desirable classification.

The equation of the classifier between class C1(red dots) and class C2(green dots)

(ω1 − ω2)

T φ(x) = 0

and the equation of the classifier between the classes C2(green dots) and C3(blue

dots) is

(ω2 − ω3)

T φ(x) = 0

16.3 Possible solutions

1. Mapping to new space

We will transform the original dimensions to new dimensions. New dimensions are function of original dimensions. This is a work around solution.

(x) = σ1(φ1, φ2)

(x) = σ2(φ1, φ2)

Here we try to determine the transformations φ

and φ

such that we can

get a linear classifier in this new space. When we map back to the original

dimensions , the separators may not remain linear.

16 LECTURE 16 74

Figure 23: Mapping back to original dimension class separator not linear

Problem : Exponential blowup of number of parameters (w

s) in order O(n

k−1

2. Decision surface perpendicular bisector to the mean connector.

Figure 24: Class separator perpendicular to the line joining mean

Decision surface is the perpendicular bisector of the line joining mean of class

C1(m1) and mean of class C2(m2).

m1 = (1/N1)

n∈C1

xn where m1 is the mean of class C1 and N1 is the number

of points in class C1.

m2 = (1/N2)

n∈C2

xn where m2 is the mean of class C2 and N2 is the number

16 LECTURE 16 75

of points in class C2.

||φ(x) − m1|| < ||φ(x) − m2|| => x ∈ C1

||φ(x) − m2|| < ||φ(x) − m1|| => x ∈ C2

Comment : This is solving the masking problem but not the sensitivity

problem as this does not capture the orientation(eg: spread of the data points)

of the classes.

3. Fisher Discrimant Analysis.

Here we consider the mean of the classes , within class covariance and global

covariance.

Aim : To increase the separation between the class means and to minimize

within class variance. Considering two classes.

SB is Inter class covariance and SW is Intra class covariance.

m1 = (1/N1)

n∈C1

xn where m1 is the mean of class C1 and N1 is the number

of points in class C1.

m2 = (1/N2)

n∈C2

xn where m2 is the mean of class C2 and N2 is the number

of points in class C2.

N1 + N2 = N where N is the total number of training points.

SB = (m2 − m1)(m2 − m1)

SW =

n∈C1

(xn − m1)(xn − m1)

T +

n∈C2

(xn − m2)(xn − m2)

J(w) = (w

T SBw)/(w

T Sww)

By maximizing J(w) we get the following:

wαS−1

w (m2 − m1)

16 LECTURE 16 76

16.4 Summary

Sensitivity to outliers Masking

Perpendicular Bisector of means connector Does not solve Solves

Fischer Discriminant Does not solve Solves

We have seen that the fisher discriminant analysis is better compared to the other

two possible solutions. Fisher Discriminant analysis for k classes is discussed in the

next lecture.

17 LECTURE 17 77

17 Lecture 17

Not submitted

18 LECTURE 18:PERCEPTRON 78

18 Lecture 18:Perceptron

Was Fisher’s discriminant robust to noise?

Perceptron training

18.1 Fisher’s discriminant

From the figure, it can be seen that Fischer Discriminant method is not robust to

noise. The difference in inclination of blue (without noise) and magenta (with noise)

lines shows that the outlying points affect the classifier more than desired. This is

because Fischer discriminant does not take into account the distance of the data

points from the hyperplane. Perceptron, unlike Fischer, considers the distance and

is thus robust to noise.

18.2 Perceptron training

Explicitly account for signed distribution of points(misclassified points) from

hyperplane

T φ(x) = 0

Distance from hyperplane can be calculated as follows

φ(x)

T φ(x) = 0

φ(x0)

D = w

(φ(x) − φ(x0))

Since w

(φ(x0)) = 0 we get distance = w

(φ(x))

Perceptron works for two classes only. We label them as y=1 and y=-1. A point

is misclassified if w

(φ(x)) < 0

Perceptron Algorithm:

INITIALIZE: w=ones()

REPEAT:

18 LECTURE 18:PERCEPTRON 79

– If given < x, y >, wTΦ(x).y ≤ 0

– then, w = w + Φ(x).y

– endif

18.2.1 Intuition

ywT

k+1φ(x) = y(wk + yφ(x)

T φ(x)

= ywT

k φ(x) + y

kφ(w)k

> ywT

k φ(x)

Note: We applied the update for this point,

Since ywT

k φ(x) ≤ 0

We have ywT

k φ(x) > ywT

k φ(x) So we have more hope that this point is classified

correctly now.

More formally, perceptron tries to minimize the error function

E = −

x∈M

yφT

(x)ω

where M is the set of misclassified examples.

Perceptron algorithm is similar (Its not exactly equivalent) to a gradient descent

algorithm, which can be shown as follows:

Gradient Descent (Batch Perceptron) Algorithm

Since ∇E is given by,

∇E = −

x∈M

yφ(x)

So,

wk+1 = wk − η∇E

= wk + η

x∈M

yφ(x) (This takes all misclassified points at a time)

But what we are doing in standard Perceptron Algorithm, is basically Stochastic

Gradient Descent:

∇E = −

x∈M

yφ(x) = −

x∈M

∇E(x) , where E(x) = yφ(x)

wk+1 = wk − η∇E(x)

= wk + ηyφ(x) (for any x ∈ M)

18 LECTURE 18:PERCEPTRON 80

Earlier it was intuition, now, Formally,:-

If ∃ an optimal separating hyperplane with parameters w

∗

such that,

(x)w

∗ = 0

then perceptron algorithm converges.

Proof:-

lim

k→∞

kwk+1 − ρw∗

2 = 0 (68)

(If this happens for some constant ρ, we are fine.)

kwk+1 − ρw∗

2 = kwk − ρw∗

2 + kyφ(x)k

2 + 2y(wk − ρw∗

)

T φ(x) (69)

Now, we want L.H.S. to be less than R.H.S. at every step, although by some small

value, so that perceptron will converge overtime.

So, if we can obtain an expression of the form:

kwk+1 − ρw∗

2 < kwk − ρw∗

2 − θ

(70)

Then, kwk+1 − ρw∗k

is reducing by atleast θ

2 at every iteration.

So, from the above expressions (2) and (3), we need to find θ such that,

kφ(x)k

2 + 2y(wk − ρw∗

)

T φ(x) < −θ

(Here, kyφ(x)k

2 = kφ(x)k

2 because kyk = 1, y is either +1 or −1)

So, the no. of iterations would be: O

kw0−ρw∗k

Observations:-

1. ywT

k φ(x) ≤ 0 (∵ x was misclassified)

2. Γ

2 = max

x∈D

kφ(x)k

3. δ = max

x∈D

−2yw∗T

φ(x)

Here, margin = w

∗T φ(x) = dist. of closest point from hyperplane

and, D is the set of all points, not just misclassifed ones.

δ = max

x∈D

−2yw∗T

φ(x)

= min

x∈D

yw∗T

φ(x)

Since, w∗T φ(x) ≥ 0, so, δ ≤ 0.

So, what we are interested in, is the ’least negative’ value of δ

From the observations, and eq.(2), we have:

0 ≤ kwk+1 − ρw∗

2 ≤ kwk − ρw∗

2 + Γ2 + ρδ

18 LECTURE 18:PERCEPTRON 81

Taking, ρ =

2Γ2

−δ

, then,

0 ≤ kwk+1 − ρw∗

2 ≤ kwk − ρw∗

2 − Γ

Hence, we got, Γ2 = θ

, that we were looking for in eq.(3).

∴ kwk+1 − ρw∗k

2 decreases by atleast Γ2 at every iteration.

Here is the notion of convergence:-

wk converges to ρw∗ by making atleast some decrement at each step.

Thus, for k → ∞, kwk − ρw∗k → 0,

Hence, eq.(1) is proved.

19 LECTURE 19 82

19 Lecture 19

19.1 Introduction

In this lecture,we extend the margin-concept towards our goal of classification and

introduce ourselves to Support Vector Machines(SVM)

19.2 Margin

Given w

,the unsigned minimum distance of x from the hyperplane:

φ(x) = 0 (71)

is given by:

min

yw?

φ(x) (72)

where y = ±1.Here,y is the corresponding target classifier value for the case of 2-

class classifiers.Note that multiplication with y makes the distance unsigned. This

classification is greedy.

Figure 25: H3(green) doesn’t separate the 2 classes. H1(blue) does, with a small margin and

H2(red) with the maximum margin. [5]

Intuitively, a good separation is achieved by the hyperplane that has the largest

distance to the nearest training data points of any class, since in general the larger

the margin the lower the generalization error of the classifier. [5]

19 Lecture 19 83

19.3 Support Vector Machines

The idea in a Support Vector Machine(SVM) is to Maximize the Minimum

Unsigned Distance (ywT φ(x)) of a point x from the hyperplane

T φ(x) = 0 (73)

The factor y which is also the target classifier ensures that the unsigned distance

is positive semi-definite. Posing this as an optimization problem where:

φ = [1, φ0, φ1, ....φn] (74)

w = [w0, w1, ....wn] (75)

where 1 and w0 together denote the bias parameters and [w1....wn] denote the

slope.

Optimization Goal: To adjust the Slope and Bias to Maximize the Minimum

unsigned distance from the hyperplane.

Figure 26: Maximum Margin Hyperplanes and Margins for SVM trained with 2 classes.Samples on

the margin are called the support vectors. [5]

19.4 Support Vectors

The seperating hyperplane is dependent only on the support vectors. Those points

which are not support vectors are irrelevant.

19 Lecture 19 84

19.5 Objective Design in SVM

Keeping the Bias(w0) and Slope(w) separate,

y(x) = w

T φ(x) + w0 (76)

y(x) represents the Separating Hyperplane.

19.5.1 Step 1:Perfect Separability

The condition for existence of a Separating Hyperplane(Perfect Separability) is:

∃w, w0∀yi

, xiD wTφ(xi) + w0 > 0 if yi = +1 (77)

wTφ(xi) + w0 < 0 if yi = −1 (78)

19.5.2 Step 2:Optimal Separating Hyperplane For Perfectly Separable Data

max

w,w0

δ (79)

wTφ(xi) + w0 > δ if yi = +1 (80)

wTφ(xi) + w0 < −δ if yi = −1 (81)

δ ≥ 0 (82)

But for two distinct w1, w2 defining the same Hyperplane,signed distance can be

different for the same point depending on the value of ||w||.

Thus fixing ||w||,the Optimization Objective is:

max

w,w0

δ (83)

yi(wTφ(xi) + w0) > δ (84)

δ ≥ 0 (85)

||w|| = θ (86)

(87)

Without restriction on ||w||,δ goes unbounded.

We can get rid of the assumption that ||w|| = θ by replacing the conditions with

the following while redefining w

0 =

θw0

||w|| to get:

19 Lecture 19 85

max

w,w0

δ (88)

||w||yi(wTφ(xi) + w0

) > δ (89)

δ ≥ 0 (90)

or Equivalenty,

max

w,w0

δ (91)

yi(wTφ(xi) + w0

) > δ ||w||

(92)

δ ≥ 0 (93)

Since for any w and w

satisfying these inequalities, any positive multiple satisfies

them too, we can arbitrarily set ||w|| =

. Thus, the equivalent problem is:

max

w,w0

||w|| (94)

yi(wTφ(xi) + w0

) > 1 (95)

Or Equivalently:

min

w,w0

||w||2

(96)

yi(wTφ(xi) + w0

) > 1 (97)

The above transformation is fruitful as the determination of model parameters

reduces to a Convex Optimization problem and hence,any locally optimum solution

is globally optimum.

[6]

19.5.3 Step 2:Separating Hyperplane For Overlapping Data

Unlike the previous case wherein Data was either ”Black” or ”White”,herein we have

a Region of ”Gray”.

Earlier,we implicitly used an error function that gave infinite error if a data point was

misclassified and zero error if it was classified correctly. We now modify this approach

so that data points are allowed to be on the ”wrong side” of the margin boundary, but with a penalty that increases with the distance from that boundary.

Thus, the objective to account for the Noise.Hence, we shall introduce a slack

variable ζi

19 Lecture 19 86

Now the Optimization Objective is:

min

w,w0

||w||2

(98)

yi(wTφ(xi) + w0

) > 1 − ζi (99)

ζi ≥ 0 (100)

or Equivalently:

min

w,w0

||w||2

2 + c

i=1

(101)

yi(wTφ(xi) + w0

) > 1 − ζi (102)

ζi ≥ 0 (103)

Thus,the objective is analogous to the Minimization of Error subject to Regulariser.

Here,the Error is:

E = c

i=1

(104)

And the Regulariser is:

R =

||w||2

(105)

Both EandR can be proved to be convex.

Partially differentiating E w.r.t. ζj

δE

δζj

= 2ζj (106)

δζ2

= 2 > 0 (107)

Thus E is convex w.r.t ζj∀1 ≤ j ≤ n

Consider,R:

5(R) = 2w (108)

(R) = 2I (109)

19 LECTURE 19 87

2I is clearly positive definite with the eigenvalues 2(twice).

Thus R is convex.

Therefore,E + R is convex and the Optimization again reduces to a Convex Optimization Problem.

20 LECTURE 20: SUPPORT VECTOR MACHINES (SVM) 88

Proof.

Figure 27: Margins in SVM

20 Lecture 20: Support Vector Machines (SVM)

This lecture formulates the primal expression of SVM. Then it applies KKT conditions on top of that. Then we proceed to form the dual problem.

20.1 Recap

The expression from previous day:

yi(φ

(xi)w + w

) ≥

||w||

(110)

So, any multiple of w and w

0 would not change the inequality.

20.2 Distance between the points

Important Result 1. The distance between the points x1 and x2 in 20.2 is 2

||w||

The distance between the points x1 and x2 in 20.2 turns out to be:

||φ(x1) − φ(x2)|| = ||rw|| (111)

We have,

T φ(x1) + w0 = −1 (112)

and

T φ(x2) + w0 = 1 (113)

Subtracting Equation 113 from Equation 112 we get,

20 LECTURE 20: SUPPORT VECTOR MACHINES (SVM) 89

(φ(x1) − φ(x2)) = −2 (114)

=⇒ w

rw = −2 (115)

=⇒ r = −

||w||2

(116)

=⇒ ||rw|| = −

||w|| (117)

Hence proved.

20.3 Formulation of the optimization problem

max

||w|| (118)

s.t.∀i (119)

yi(φ

(xi)w + w

) ≥ 1 (120)

This means that if we maximize the separation between margin planes and at the

same time ensure that the respective points are not crossing corresponding margin

plane (the constraint), we are done.

It can be proved that,

if max ||distance of closest point from hyperplane||p

then, max

||w||q

s.t.∀i

yi(φ

(xi)w + w

) ≥ 1

where1

= 1

In our case p = 2 so q = 2. For p = 1, q = ∞ means maximum.

20 LECTURE 20: SUPPORT VECTOR MACHINES (SVM) 90

Figure 28: Different types of points

20.4 Soft Margin SVM

min

w,w0

||w||2 + c

ξi (121)

s.t.∀i (122)

yi(φ

(xi)w + w

) ≥ 1 − ξi (123)

where, (124)

∀iξi ≥ 0 (125)

In soft margin we account for the the errors. The above formulation is one of the

many formulation of soft SVMs. In the above formulation, large value of c means

overfitting.

20.4.1 Three types of g points

In subsubsection 20.4.1 we can see three types of points. They are:

1. Correctly classified but ξi > 0 or violates margin

2. Correctly classified but ξi = 0 or on the margin

3. Inorrectly classified but ξi > 1

20 LECTURE 20: SUPPORT VECTOR MACHINES (SVM) 91

20.5 Primal and Dual Formulation

20.5.1 Primal formulation

∗ = min f(x) (126)

x ∈ D (127)

s.t.g(x) ≤ 0 (128)

i = 1, . . . , m (129)

20.5.2 Dual Formulation

∗ = max

λ∈R

min

x∈D

f(x) +Xm

i=1

λigi(x)

(130)

s.t.λi ≥ 0 (131)

Equation 130 is and convex optimization problem. Also, d

∗ ≤ p

∗ and (p

∗ − d

∗

) is

called the duality gap.

If for some (x

∗

, λ∗

) where x

∗

is primal feasible and λ

∗

is dual feasible and we see

the KKT conditions are satisfied and f is and all gi are convex then x

∗

is optimal

solution to primal and λ

∗

to dual.

Also, the dual optimization problem becomes,

∗ = max

λ∈Rm

L(x

∗

, λ) (132)

s.t. (133)

λi ≥ 0∀i (134)

where, (135)

L(x, λ) = f(x) +Xm

i=1

λigi(x) (136)

∗

(λ) = min

x∈D

L(x, λ) (137)

= max

λ∈R

L(KKT(x), λ) (138)

λi ≥ 0∀i (139)

It happens to be,

∗ = d

∗

(140)

20 LECTURE 20: SUPPORT VECTOR MACHINES (SVM) 92

20.6 Duality theory applied to KKT

L( ¯w, ¯ξ, w0, α, ¯ λ¯) = 1

||w||2 + c

i=1

ξi +

i=1

αi

1 − ξi − yi

(xi)w + w0

−

i=1

λiξi

(141)

Now we check for KKT conditions at the point of optimality,

KKT 1.a

∇wL = 0 (142)

=⇒ w −

j=1

αjyjφ

(xj ) = 0 (143)

KKT 1.b

∇xiiL = 0 (144)

=⇒ c − αi − λi = 0 (145)

KKT 1.c

∇w0L = 0 (146)

=⇒ w −

i=1

αiyi = 0 (147)

KKT 2

∀i (148)

(xi)w + w0

≥ 1 − ξi (149)

ξi ≥ 0 (150)

KKT 3

Lj ≥ 0andλk ≥ 0 (151)

∀j, k = 1, . . . , m (152)

KKT 4

(xj )w + w0

− 1 + ξj

= 0 (153)

λkξk = 0 (154)

(a)

∗ =

j=1

αjyiφ(xj ) (155)

∗

is weighted linear combination of points φ(x)s.

20 LECTURE 20: SUPPORT VECTOR MACHINES (SVM) 93

(b)

If 0 < αj < c then, by Equation 145

0 < λj < c and by Equation 154, ξj = 0 and yi

(xj )w + w0

= 1

If however, αj = c then λj = 0 and yi

(xj )w + w0

≤ 1.

If α0 then λj = c and ξj = 0, we get yi

(xj )w + w0

≥ 1. Then αj = 0

21 LECTURE 21:THE SVM DUAL 94

21 Lecture 21:The SVM dual

21.1 SVM dual

SVM can be formulated as the following optimization problem,

min

{

kwk

2 + C

i=0

ξi}

subject to constraint,

∀i : yi(φ

(xi)w + w0) ≥ 1 − ξi

The dual of the SVM optimization problem can be stated as,

max{−1

i=1

j=1

yiyjαiαjφ

(xi)φ(xj ) +Xm

j=1

αj}

subject to constraints,

∀i :

αiyi = 0

∀i : 0 ≤ αi ≤ c

The duality gap = f(x

∗

) − L

∗

(λ

∗

) = 0, as shown in last lecture. Thus, as is

evident from the solution of the dual problem,

∗ =

i=1

∗

yiφ(xi)

To obtain w

∗

, we can use the fact (as shown in last lecture) that, if αi ∈ (0, C),

yi(φ

(xi)w + w0) = 1. Thus, for any point xi such that, αi ∈ (0, C), that is, αi

is a

point on the margin,

∗

o =

1 − yi(φ

(xi)w

∗

)

= yi − φ

(xi)w

∗

The decision function,

g(x) = φ

(x)w

∗ + w

∗

i=0

αiyiφ

(x)φ(xi) + w

∗

21.2 Kernel Matrix

A kernel matrix

21 LECTURE 21:THE SVM DUAL 95

K =







(x1)φ(x1) φ

(x1)φ(x2) . . . . . . φT

(x1)φ(xn)

(x2)φ(x1) φ

(x2)φ(x2) . . . . . . φT

(x2)φ(xn)

. . . . . . . . . . . . . . . .

(xn)φ(x1) φ

(xn)φ(x2) . . . . . . φT

(xn)φ(xn)







In other words, Kij = φ

(xi)φ(xj ). The SVM dual can now be re-written as,

max{−1

T Kyα + α

ones(m, 1)}

subject to constraints,

αiyi = 0

0 ≤ αi ≤ c

Thus, for αi ∈ (0, C)

∗

0 = yi − φ

(xi)w

= yi −

j=0

∗

yjφ

(xi)φ(xj )

= yi −

j=0

∗

yjKij

21.2.1 Generation of φ space

For a given x = [x1, x2, . . . , xn] → φ(x) = [x

, xd

, . . . , xd−1

1 x2, . . . ].

For n = 2, d = 2, φ(x) = [x

, x1x2, x2x1, x2

], thus,

(x).φ(¯x) = Xm

i=1

j=1

xixj

.x¯ix¯j

= (Xm

i=1

xix¯i).(

j=1

xjx¯j )

= (Xm

i=1

xix¯i)

= (x

T x¯)

In general, for n ≥ 1 and d ≥ 1, φ

(x).φ(¯x) = (x

T x¯)

21 LECTURE 21:THE SVM DUAL 96

A polynomial kernel, in general, is defined as Kij = (x

i xj )

21.3 Requirements of Kernel

1. Since

Kij = φ

(xi)φ(xj )

= φ

(xj )φ(xi)

Hence K should be a Symmetric Matrix.

2. The Cauchy Schwarz Inequality

(φ

(x)φ(¯x))2 ≤ kφ

(x)k

2kφ(¯x)k

⇒ Kij

2 ≤ KiiKjj

3. Positivity of Diagonal

K = V ΛV

Where V is the eigen vector matrix (an orthogonal matrix), and Λ is the Diagonal matrix of eigen values.

Goal is to construct a φ. Which can be constructed as

φ(xi) = √

λiVi (λi ≥ 0)

Kii = λikVik

Hence K must be

1. Symmetric.

2. Positive Semi Definite.

3. Having non-negative Diagonal Entries.

21.3.1 Examples of Kernels

1. Kij = (xi

T xj )

2. Kij = (xi

T xj + 1)d

21 LECTURE 21:THE SVM DUAL 97

3. Gaussian or Radial basis Function (RBF)

Kij = e

−

kxi−xjk

2σ2 (σ ∈ R, σ 6= 0)

4. The Hyperbolic Tangent function

Kij = tanh(σxT

i xj + c)

21.4 Properties of Kernel Functions

If K0 and K00 are Kernels then K is also a Kernel if either of the following holds

1. Kij = K0

ij + K00

2. Kij = αK0

ij (α ≥ 0)

3. Kij = K0

ijK00

Proof : (1) and (2) are left as an exercise.

(3)

Kij = K0

ijK00

= φ

)φ

) ∗ φ

00T

)φ

00(x

)

Define φ(xi) = φ

)φ

00T

). Thus, Kij = φ(xi)φ(xj ).

Hence, K is a valid kernel.

22 LECTURE 22: SVR AND OPTIMIZATION TECHNIQUES 98

22 Lecture 22: SVR and Optimization Techniques

Topics covered in this lecture

Variants of SVM (other occurance of kernel)

Support Vector Regression

L1 SVM

Projection Method (kernel Adatron)

22.1 Other occurance of kernel

for regression, we have

W = (φ

T φ)φ

f(X) = WT φ(x) = φ(φ

T φ)

T φ(x)y

For perceptron

W =

i=i

αiφ(xi)yi

αi = no of times update on w was made for < xi

, yi >

g(xj ) = φ

(xj ).w =

αiφ

(xj )φ(xi)yi =

αikijyi

22.1.1 Some variants of SVM’s

We have considered some variants of SVM,

minw,ξ

||w||2 + C

ξi

st., ∀i yi(φ

(xi)w + w0) ≥ 1 − ξi and ξi > 0

This is 1-Norm SVM, i.e norm of the slack (ξ). We can also formulate withou ξ > 0

minw,ξ

||w||2 + C

if C = 0 , w=0 is a trivial solution

22 LECTURE 22: SVR AND OPTIMIZATION TECHNIQUES 99

Dual for this problem:-

max −

αiαj (kij +

δij )yiyj +

αi

s.t., X

αiyi = 0

δij = 1 if i = j

and = 0 otherwise

also 0 ≤ αi ≤ C

The constraint X

αiyj = 0 keep appearing again and again. This appear because

of w0 being explicit. It will dissapear if w0 were absorbed in w.

In linear regression (error)

and W = (φ

T φ)

−1φ

In ridge regression (error + λ||w||2

)

and W = (φ

T φ − λI)

−1φ

22.2 Support Vector Regression

The Support Vector method can also be applied to the case of regression (apart

from classification problem), maintaining all the main features that characterise the

maximal margin algorithm, preserving the property of sparseness.

A non-linear function is learned by a linear learning machine in a kernel-induced

feature space while the capacity of the system is controlled by a parameter that does

not depend on the dimensionality of the space. The figure below shows a situation

for a non-linear regression function.

The insensitive band (slackness) for a non-linear regression function.

22 LECTURE 22: SVR AND OPTIMIZATION TECHNIQUES 100

As long as points lie inside the margin, they do not contribute to the error.

We can define the -insensitive loss function L

(x, y, f) as:-

For linear :

(x, y, f) = |y − f(x)|ε = max(0, |y − f(x)| − ε)

The linear -insensitive loss for zero and non-zero .

For Quadratic :

(x, y, f) = |y − f(x)|

The quadratic -insensitive loss for zero and non-zero .

In adapted ridge regression when slackness is introduced we have:-

min

||w||2 + C

Xξ

st.(ΦT

(xi)w + w0) − yi = ξi

22 LECTURE 22: SVR AND OPTIMIZATION TECHNIQUES 101

we can optimise the generalisation of our regressor by minimising the sum of the

quadratic -insensitive losses. For SVR-2 norm

min

||w||2 + C

X(ξ

i + ξ

)

st, ∀i(ΦT

(xi)w + w0) − yi ≤ ε + ξi

and, yi − (ΦT

(xi)w + w0) ≥ ε + ξ

‘

also ξiξ

‘

i = 0

and for 1-norm we have

X(ξi + ξ

‘

) and

ξi ≥ 0, ξ‘

i ≥ 0

For 2-norm case, the dual problem can be derived using the standard method and

taking into account that ξiξ

‘

i = 0 and therefore that the same relation αiα

‘

i holds for

the corresponding Lagrange multipliers:

maximiseX

i=i

yi(α

‘

i − αi) − ε

i=i

yi(α

‘

i + αi)−

i=1

j=1

yi(α

‘

i − αi)(α

‘

j − αj )(Kij +

δij )

subjectto :

i=i

(α

‘

i − αi) = 0

‘

i ≥ 0, αi ≥ 0, α‘

iα = 0

The corresponding Karush - Kuhn - Tucker complementarity conditions are [7]

αi(< w.φ(xi) > +b − yi − ε − ξi) = 0

‘

(yi− < w − φ(xi) > −b − −ε − ξ

‘

) = 0

ξiξ

‘

i = 0 αiα

‘

i = 0

By substituting β = α

‘−α and using the relation αiα

‘

i = 0, it is possible to rewrite

the dual problem in a way that more closely resembles the classification case

maximiseX

i=1

yiαi − ε

i=1

|αi

|−1

i,j=1

αiαj (K(φ(xi).φ(xj )) + 1

δij )

subjecttoX

i=1

βi =0

Notes:-

Ridge Regression has 1 parameter - λ,

SVM 2-norm has 2 parameters - (C − 1/λ) and ε

22 LECTURE 22: SVR AND OPTIMIZATION TECHNIQUES 102

SVR with 2 norm and ε = 0 ≡ Ridge Regression

if ε = 0 and as C → ∞,

SVR 2-norm → linear regression

SVR 2-norm and SVM 2-norm have [ 1

δij ] added to kernel matrix in dual.

22.3 L1 SVM

Let training datum be xi(i = 1, ..., M) and its label be yi = 1 if xi belongs to Class

1, and yi = −1 if Class 2. In SVMs, to enhance linear separability, the input space is

mapped intoa high dimensional feature space using the mapping function g(x). To

obtain the optimal separating hyperplane of the L1-SVM in the feature space, we

consider the following optimization problem:

minimize

kWk

2 + C

i=0

ξi

subject to yi(Wt

g(xi) + b) > 1 − ξi

for i = 1, . . . , M,

where W is a weight vector, C is the margin parameter that determines the tradeoff

between the maximization of the margin and the minimization of the classification

error, ξi (i = 1, ..., M) are the nonnegative slack variables and b is a bias term.

Introducing the Lagrange multipliers αi

, we obtain the following dual problem:

maximize

Q(α) = X

i=0

αi −

i=0

αiαjyiyjg(xi)

g(xj ),

subject to X

i=0

yiαi = 0, 0 ≤ αi ≤ C

We use the mapping function that satisfies

H(x, x0

) = g(x)

g(x

where H(x, x0

) is a kernel function. By this selection, we need not treat the variables

in the feature space explicitly. Solving the above dual problem, we obtain the decision

function:

D(x) = X

i=0

∗

yiH(xi

, x) + b

∗

22 LECTURE 22: SVR AND OPTIMIZATION TECHNIQUES 103

L1 SVM is like Lasso and it gives sparse solution.

22.4 Kernel Adatron

The ”Perceptron with optimal stability” has been the object of extensive theoretical

and exper- imental work, and a number of simple iterative procedures have been

proposed, aimed at finding hyperplanes which have ”optimal stability” or maximal

margin. One of them, the Adatron, comes with theoretical guarantees of convergence

to the optimal solution, and of a rate of convergence exponentially fast in the number of iterations, provided that a solution exists. Such models can be adapted,

with the introduction of kernels, to operate in a high- dimensional feature space,

and hence to learning non- linear decision boundaries. This provides a procedure

which emulates SV machines but doesn’t need to use the quadratic programming

toolboxes.The Adatron is a an on-line algorihm for learning perceptrons which has an

attractive xed point cor- responding to the maximal-margin consistent hyper- plane,

when this exists.By writing the Adatron in the data-dependent repre- sentation, and

by substituting the dot products with kernels, we obtain the following algorithm:

Kernel Adatron Algoritm

1. Initialise αi = 1.

2. Starting from pattern i = 1, for labeled points (xi

, yi calculate zi = yi

j=0

αiyiK(x, x0

3. For all patterns i calculate γi = yizi and execute steps 4 to 5 below. 4. Let

δαi = η(1 − γ

) be the proposed change to the multipliers α

5.1. If (α

i + δαi

)≤ 0 then the proposed change to the multipliers would result in a

negative α

. Consequently to avoid this problem we set α

i = 0. 5.2 If (α

i +δαi

) ≥ 0

then the multipliers are updated through the addition of the δαi

i.e. α

i ← α

i + δαi

6. Calculate the bias b from b =

(min(z

) + max(z

−

))

where z

those patterns i with class label +1 and z

−

are those with class label −1.

7. If a maximum number of presentations of the pat- tern set has been exceeded

then stop, otherwise return to step 2.

Every stable point for adatron algorithm is a maximal margin point and vice

versa. The algorithm converges in a finite number of steps to a stable point if a

solution exists. The primal problem for adatron is given below.

minimize

αiαjKijyiyj −

Xαi

subject to Xαiyi = 0, 0 ≤ αi ≤ C

23 LECTURE 23 104

23 Lecture 23

Key terms : SMO, Probabilistic models, Parzen window

In previous classes we wrote the convex formulation for maximum margin classification, Lagrangian of the formulation etc. Then dual of the program was obtained

by first minimizing the lagrangian with respect to weights w. The dual will be maximization of it with respect to α which is same as minimizing the negative of the

objective function under the same constraints. The dual problem, given in equation

(156) is an optimization with respect to α and its solution will correspond to optimal

of original objective when the KKT conditions are satisfied. We are interested in

solving dual of the objective because we have already seen that most of the dual

variable will be zero in the solution and hence it will give a sparse solution (based

on the KKT conidtion).

Dual: min

−

Xαi +

αiαjyiyjKij (156)

s.t.

αiyi = 0

αi ∈ [0, c]

The above program is a quadratic program. Any quadratic solvers can be used

for solving (156), but a generic solver will not take consider speciality of the solution

and may not be efficient. One way to solve (156) is by using projection methods(also

called Kernel adatron). You can solve the above one using two ways - chunking

methods and decomposition methods.

The chunking method is as follows

1. Initialize αis arbitrarily

2. Choose points(I mean the components αi) that violate KKT condition

3. Consider only K working set and solve the dual for the variables in working set

∀α ∈ working set

min

−

αiinW S

αi +

i∈W S

j∈W S

αiαjyiyjKij (157)

s.t.

i∈W S

αiyi = −

j /∈W S

αjyj

αi ∈ [0, c]

4. set α

new = [α

new

W S , αold

nonW S]

23 LECTURE 23 105

Decompsition methods follow almost the same procedure except that in step 2 we

always take a fixed number of points which violate the KKT conditions the most.

23.1 Sequential minimization algorithm - SMO

We can’t take just one point at a time in working set and optimize with respect to it,

because the second last constraint can’t be satisified. So choose 2 points and optimize

with respect to that. This is what is done in SMO, that is SMO is a decompostion

method with 2 points taken at a time for opimization. The details are given below

Without loss of generality take the points α1 and α2 in the working set. Then the

program (157) can be rewritten as

min

α1,α2

− α1 − α2 −

i6=1,2

αi +

1K11 + α

2K22 + α1α2K12y1y2

+ α1y1

i6=1,2

K1iαiyi + α2y2

i6=1,2

K2iαiyi (158)

s.t. α1y1 + α2y2 = −

j6=1,2

αjyj = α

old

1 + α

old

α1, α2 ∈ [0, c]

From the second last constraint, we can write α1 in terms of α2.

α1 = −α2

+ α

old

1 + α

old

Then the objective is just a function of α2, let the objective is −D(α2). Now the

program reduces to

min

α2

− D(α2)

s.t. α2 ∈ [0, c]

Find α

∗

such that ∂D(α2)

∂α2

= 0. We have to ensure that α1 ∈ [0, c]. So based on

that we will have to clipp α2 , ie, shift it to certain interval. The condition is as

follows

0 <= −α2

+ α

old

1 + α

old

<= c

case 1: y1 = y2

α2 ∈ [max(0, −c + α

old

1 + α

old

), min(c, αold

1 + α

old

)]

case 2: y1 = −y2

α2 ∈ [max(0, αold

2 − α

old

), min(c, c − α

old

1 + α

old

)]

If α2 is already in the inerval then there is no problem. If it is more than the

maximum limit then reset it to the maximum limit. This will ensure the optimum

value of the objective constrained to this codition. Similarly if α2 goes below the

lower limit then reset it to the lower limit.

23.2 Probablistic models

In one of the previous lectures probablistic models were mentioned. They are of two

types conditional and generative based on the variable over which the distribution

is defined. Conditional models define a distribution over class given the input and

the Generative models define a joint distribution of both dependent and independent

variables.

The classification models can again be divided into two - parametric and nonparametric. The parametric forms assumes a distribution over the class or the input

which are controlled by parameters, for example the class output ∼ (N)(w

T φ(x), σ)

where σ,w are parameters . During the training phase the parameters would be

learned.

For a classification task we will have a scoring function gk(x) based on which we

will dot classification. The point x will be classified to argmaxk

gk(x).

For a discriminative model the function gk(x) = ln(p(Ck|x)), ie it models the conitional probability of the class variable with respect to the input.

For a generative models gk(x) = ln(p(x|Ck))p(Ck) − ln(p(x))(can be obtained by

Bayes rule). The generative model model a joint distribution of the input and

class variables.

24 LECTURE 24: PROB. CLASSIFIERS 107

24 Lecture 24: Prob. Classifiers

24.1 Non Parametric Density Estimation

The models in which the form of the model is not specified a priori but is instead

determined from data are called Non parametric models. The term non-parametric

is not meant to imply that such models completely lack parameters but that the

number and nature of the parameters are flexible and not fixed in advance. Non

Parametric models are generally generative methods.

The probability P that a vector x will fall in a region R is given by

P =

p(x) dx (159)

Suppose that n iid samples are randomly drawn according to the probability distribution p(x). Probability that k of these n fall in R is

Pk =

(1 − P)

(n−k)

(160)

Expected value of k is

E[k] = nP (161)

If n is very large then k

n will be a good estimate for the probability P.

p(x

) dx

0 ' p(x)V (162)

where x is a point within R and V is the volume enclosed by R. Combining Eqs

(159) & (162) we get:

p(x) '

(163)

Kernel Density Estimation

One technique for nonparametric density estimation is the kernel density estimation where, effectively, V is held fixed while K, the number of sample points

24 LECTURE 24: PROB. CLASSIFIERS 108

lying withing V is estimated. The density is given by treating K as a kernel function, e.g., a Gaussian function, centered on each data point and then adding the

functions together. The quality of the estimate depends crucially on the kernel

function. For the Gaussian case, the non-parametric density estimator is known

as the Parzen-window density estimator.

In the Eq (163) we could keep V fixed & estimate K. For example, we could

consider V to be a hypercube of length d & dimension n:

P(x) = Σx0D

K(x, x

)

n ∗ |D|

(164)

K(x, x

) = (

1 if kxi − x

ik ≤ d ∀ i ∈ [1 : n]

0 if kxi − x

ik > d for some i ∈ [1 : n]

(165)

For smooth kernel density estimation, we could use

K(x, x

) = e

−kx−x

2σ2 (166)

Since K is smooth we need not specify volume (V ); σ implicitly defines V .

P(x) = 1

|D|

Σx0DK(x, x

) (167)

Parzen Window Classifier:

– Given r classes, for each class build a density model.

D = D1 ∪ D2...... ∪ Dr

P(x|Ci) = Pi(x) = 1

|Di

|Σx0DiK(x, x

)

– estimate P(Ci) = |Di|

|D|

– The class chosen is the one that maximizes the posterior distribution, i.e.,

argmax

ai(x) = log[P(x|Ci)P(Ci)]

*(assuming the same σ for all classes)

A potential problem with kernel density classifiers can be that they could be

biased toward more populated classes, owing to the class prior.

24 LECTURE 24: PROB. CLASSIFIERS 109

P(Ci) = |Di|

|D|

A more severe issue is that σ is fixed for all classes.

K-Nearest Neighbours(K-NN)

The idea behind this class of kernel desity estimators is to hold K constant and

instead determine the volume V of the tighest sphere that encompasses the K

samples. The volumne is a non-decreasing function of K. K-NN is non-smooth.

P(Ci

|x) = Ki

where Ki

is number of points that fall in class Ci out of K points nearest to a

given point x. The steps in K-NN density estimation are as follows:

– Given a point x, find the set of K nearest neighbours (Dk)

– For each class Ci compute Ki = |{x |x ∈ Dk and x ∈ Ci}|, that is Ki

the number of points from class Ci that belong to the nearest neighbour set

Dk.

– Classify x into Cj = argmaxCi

The decision boundaries of K-NN are highly non-linear.

24.2 Parametric Density Estimation

Parametric methods assume a form for the probability distribution that generates the data and estimate the parameters of the distribution. Generally

parametric methods make more assumptions than non-parametric methods.

– Gaussian Discriminant:

P(x|Ci) = N (φ(x), µi

, Σi) = 1

(2π)

n/2

|Σi

|1/2

exp(

−(φ(x) − µi)Σ−1

(φ(x) − µi)

)

(168)

Given point x,classify it to class Ci such that,

Ci = argmaxi

log[P(x|Ci)P(Ci)]

let ai = log[P(x|Ci)] + log[P(Ci)] = w

i φ(x) + wi0

where,

24 LECTURE 24: PROB. CLASSIFIERS 110

wi = Σiµi

wi0 =

−1

i Σ

−1

i µi + ln[P(Ci)] −

ln[(2π)

|Σi

|] −

φ(x)

TΣ

−1

i φ(x)

Maximum Likelihood Estimation:

(µ

MLE

, Σ

MLE

) = argmaxµ,Σ LL(D, µ, Σ) = argmaxµ,Σ ΣiΣxDi

log[N (x, µi

, Σi)]

(169)

1. Σ ’s are common across all classes i.e., Σi = Σ ∀i

Maximum Likelihood estimates using (169) are:

MLE

i = ΣxDi

φ(x)

|Di

MLE = Σk

i=1

|D|

ΣxDi

(φ(x) − µi)(φ(x) − µi)

2. Σ

s are also parameters

Maximum Likelihood estimates are:

MLE

i = ΣxDi

φ(x)

|Di

MLE

i = ΣxDi

|Di

(φ(x) − µi)(φ(x) − µi)

We could do this for exponential family as well.

– Exponential Family:

For a given vector of functions φ(x) = [φ1(x), . . . , φk(x)] and a parameter

vector η<

, the exponential family of distributions is defined as

P(x, η) = h(x)exp{η

T φ(x) − A(η)} (170)

where the h(x) is a conventional reference function and A(η) is the log

normalization constant designed as

A(η) = log[

xRange(x)

exp{η

T φ(x)}h(x)dx]

Example:

* Gaussian Distribution: Gaussian Distribution X ∼ N (µ, σ) can be

expressed as

η = [ µ

σ2

−1

2σ2

]

φ(x) = [x, x2

]

25 LECTURE 25 112

25 Lecture 25

25.1 Exponential Family Distribution

Considering The Exponential Family Distribution:

p(φ(x)|ηk) = h(φ(x)) exp(η

k φ(x) − A(ηk))(for class k). (171)

25.2 Discrete Feature Space

φ(x) is the feature space. Considering the case when it is discrete valued.

φ(x) = [attr1, attr2, ..] (172)

Let φ(x) have n attributes and each of the n attributes can take c different (discrete) values.

Total number of possible values of φ(x) is c

φ(x) =





nval1 nval2 . . . .nvaln

(1)(x) . . . . .

(2)(x) . . . . .

: . . . . . .

(x) . . . . . .





(173)

Size of the above table is c

For example, when n = 3 and c = 2 ({0, 1} ∀ attributes), table showing all possible

values of φ(x) will be:

25 LECTURE 25 113





0 0 0

0 0 1

0 1 0

0 1 1

1 0 0

1 0 1

1 1 0

1 1 1





(174)

Thus size of the table of p(φ

(x)|ηk) will also be c

p(φ(x)|ηk) =





P robaility | Conf iguration

p1 | φ

(1)(x)

p2 | φ

(2)(x)

: | :

pcn | φ

(x)





(175)

whereΣ

pi = 1 (176)

It is clear that p(φ(x)|ηk) ≡ p(x|ηk)

25.3 Naive Bayes Assumption

As the size(c

) is exponential in the dimension of feature space, it is not feasible to

work with the full table even in a

moderately large dimension feature space.

One possible way out is to approximate the probability distribution so that this size

is reduced considerably.

Naive Bayes is one such approximation in which the assumption is:

p(φ(x)|ηk) = p(φ1(x)|ηk)p(φ2(x)|ηk). . . p(φn(x)|ηk) (177)

25 LECTURE 25 114

where, φi(x) denotes the i-th attribute of φ(x) in the feature space.

Thus, what Naive Bayes Assumption essentially says is that each attribute is independent of other attributes given the class.

So the original probability distribution table of p(φ(x)|ηk) of size c

n will get replaced

by n tables

(One per attribute) each of size cx1 as follows :

p(φj (x)|ηk) =





pj1 | φj (x) = µj1

pj2 | φj (x) = µj2

: | :

pjc | φj (x) = µjc





(178)

where µj1..µjc are c discrete values that φj (x) can take

andΣ

pji = 1 (179)

NOTE:This assumption does NOT say that :

p(φ(x)) = p(φ1(x)). . . p(φn(x))

25.4 Graphical Models

Discussion on Graphical Models was done from the slides [8], [9].

25.5 Graphical Representation of Naive Bayes

c = φ1(x)φ2(x). . . φn(x) (180)

Figure 29: Graphical model representation of Naive Bayes

The fact that there is no Edge between φ1(x) and φ2(x) denotes Conditional

Independence.

p(φ, c) = p(φ|c).p(c)

= p(φ2|φ1, φ3, c).p(φ1|φ3, c).p(φ3|c).p(c)

≈ p(φ2|φ1).p(φ1|c).p(φ3|c).p(c)(F romF igure29)

= Πp(x|π(x))(π(x) ≡ Set of P arents of x)

25.6 Graph Factorisation

Think of p(φ(x), c)(Factorised) as Specifying a Family of Distributions.

b | c (a is conditionally independent of b given c) means:

p(a|b, c) = p(a|c) (181)

p(b|a, c) = p(b|c) (182)

25.7 Naive Bayes Text Classification

Naive Bayes Text Classification was discussed from the slides [10], [11].

References

[1] “Class notes: Basics of convex optimization,chapter.4.”

[2] “Convex Optimization.”

[3] “Linear Algebra.”

[4] “Bias variance tradeoff.” [Online]. Available: http://www.aiaccess.net/English/

Glossaries/GlosMod/e gm bias variance.htm

[5] Steve Renals, “Support Vector Machine.”

[6] Christopher M. Bishop, “Pattern Recognition And Machine Learning.”

[7] Nello Cristianini and John Shawe-Taylor , “An Introduction to Support Vector

Machines and Other Kernel-based Learning Methods.”

[8] ClassNotes, “Graphical Models,Class Notes.”

[9] ——, “Graphical Case Study of Probabilistic Models,Class Notes.”

[10] Andrew McCallum,Kamal Nigam, “A Comparison of Event Models for Naive

Bayes Text Classification.”

[11] Steve Renals, “Naive Bayes Text Classification.”