Machine learning (ML) is the field of study that gives computers the ability to learn without being explicitly programmed.

Supervised vs. Unsupervised Machine Learning

Supervised Learning

  x         ----->      y
input              output label

Learns from being given “right answers” (we provide enough examples of input x along with the corresponding output label y, from which the machine learns; eventually, when we give a new input without an output label, it generates the output by itself based on what it learned)

Regression

Type of Supervised Learning that predicts a number (There could be infinitely many possible outputs)

Housing price prediction (given a set of house sizes with the corresponding prices, we have the algorithm fit a curve to the given points; when we need the price for a new house of some input size, we read the value off the corresponding point on the curve)

Classification

We use classification to predict only a small number of possible outcomes/categories (e.g., predict whether a tumor is malignant or benign based on its size)

  • Classification predicts categories (cat or dog; benign or malignant; 0 or 1 or 2). The learning algorithm sets a boundary line that separates the given data into categories

Unsupervised Learning

Find something interesting in unlabeled data (we only provide inputs x, not output labels y). We let the algorithm search for structure in the input data

Clustering

The algorithm groups similar unlabeled input data into clusters.

Following are some examples:

  1. Google News (Clustering of similar news)
  2. DNA microarray (Clustering of people with similar DNA characteristics)
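
A rough illustration of clustering (scikit-learn and its KMeans class are my own choice here, not something covered in these notes):

```python
import numpy as np
from sklearn.cluster import KMeans  # assumption: scikit-learn is available

# Unlabeled data: only inputs x, no output labels y
X = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]])

# Group the points into 2 clusters based on similarity
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # two groups, e.g. [0 0 1 1] or [1 1 0 0]
```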

Anomaly Detection

Find unusual data points

Dimensionality Reduction

Compress data using fewer numbers

Regression Model

Linear Regression Model

Linear regression is a supervised learning model and one of the regression models: it predicts numbers.

Terminology

Data used to train the model = Training Set
x = “input” variable (feature)
y = “output” variable (target)
ŷ = model's predicted value
y = true or observed value of the dependent variable
m = number of training examples
(x, y) = single training example
(x⁽ⁱ⁾, y⁽ⁱ⁾) = iᵗʰ training example

                 +-------------------+
                 |   Training Set    |
                 +-------------------+
                           |
                           v
                 +-------------------+
                 | Learning Algorithm|
                 +-------------------+
                           |
                           v
                     +----------------+
          x  ---->   |  f(x) Function |  ---->  ŷ
      (feature)      +----------------+       (prediction)

Example:

                  +-----+
size (x)  ---->   |  f  |  ---->  price (ŷ - estimated)
                  +-----+

Cost Function: Squared Error cost function

A mathematical function that measures the difference between a model’s predictions and the actual values

The cost function J(w, b) is defined as:

$$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2 = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2$$

The goal is to find w, b such that J(w, b) is as small as possible:

$$\min_{w, b} J(w, b)$$
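
A minimal NumPy sketch of this cost function for a single feature x (the function and variable names are my own, not from the notes):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w, b) for univariate linear regression."""
    m = x.shape[0]                  # number of training examples
    f_wb = w * x + b                # model prediction ŷ for every example
    return np.sum((f_wb - y) ** 2) / (2 * m)

# Example: parameters that fit the data exactly give zero cost
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([300.0, 500.0, 700.0])
print(compute_cost(x_train, y_train, w=200.0, b=100.0))  # 0.0
```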

Visualizing the cost function

  • When we represent the cost function with two input parameters (w, b) we’ll get a 3D shape
  • The height of a point corresponding to given w and b values is the cost function value (the higher the point, the higher the value of the cost function)
  • To represent the cost function in 2D, we use the Contour plot graph which appears to be the sliced version of the 3D plot (The smallest oval shape represents the minimum value of the Cost Function)
  • For linear regression models there is only one local minimum, which is also the global minimum. For other models there can be multiple local minima, and the one with the lowest value is the global minimum

Gradient Descent

Algorithm to minimize a function (Cost Function in our case)

  • Start with some w, b (Set w=0, b=0 as the initial guess)
  • Keep changing w, b to reduce J(w, b) (Simultaneously update w and b)
  • Until we settle at or near a minimum
  • In the algorithm below, we use Batch Gradient Descent because each step of gradient descent uses all the training examples (i.e., i = 1 to m)

Gradient Descent Algorithm

Repeat until convergence (updating w and b simultaneously):

$$w := w - \alpha \frac{\partial}{\partial w} J(w, b)$$

$$b := b - \alpha \frac{\partial}{\partial b} J(w, b)$$

Where:

  • α is the learning rate
  • ∂J(w, b)/∂w is the derivative of the cost function with respect to w
  • ∂J(w, b)/∂b is the derivative of the cost function with respect to b

Formula after calculating the derivatives (repeat until convergence):

$$w := w - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x^{(i)}$$

$$b := b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)$$

Where:

  • f_{w,b}(x) is the hypothesis (the model's prediction function)
  • α is the learning rate
  • m is the number of training examples
  • (x⁽ⁱ⁾, y⁽ⁱ⁾) is the iᵗʰ training example
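
A minimal sketch of batch gradient descent for the single-feature model f(x) = wx + b (the function name, learning rate, and iteration count are illustrative choices, not from the notes):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, num_iters=10_000):
    """Batch gradient descent for f(x) = w*x + b."""
    m = x.shape[0]
    w, b = 0.0, 0.0                  # initial guess
    for _ in range(num_iters):
        err = (w * x + b) - y        # prediction error on all m examples
        dj_dw = np.sum(err * x) / m  # ∂J/∂w
        dj_db = np.sum(err) / m      # ∂J/∂b
        w -= alpha * dj_dw           # simultaneous update of w and b
        b -= alpha * dj_db
    return w, b

w, b = gradient_descent(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))
print(w, b)  # approximately 2.0 and 0.0
```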

Multiple Linear Regression

Multiple Features

We represent each training example with multiple input features. Let:

  • x_j = the jᵗʰ feature
  • n = the number of features
  • x⃗⁽ⁱ⁾ = the features of the iᵗʰ training example
  • x_j⁽ⁱ⁾ = the value of feature j in the iᵗʰ training example

Hypothesis for Multiple Linear Regression

$$f_{\vec{w},b}(\vec{x}) = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = \vec{w} \cdot \vec{x} + b$$

Vectorization

  1. Makes the code shorter
  2. Makes the code run much faster (the NumPy library uses parallel hardware in the CPU or GPU)

Let’s consider the following data

import numpy as np

w = np.array([1.0, 2.5, -3.3])   # parameters
b = 4                            # bias
x = np.array([10, 20, 30])       # input features

Code to find f(x) without vectorization (Option 1)

f = w[0] * x[0] + \
    w[1] * x[1] + \
    w[2] * x[2] + b

Code to find f(x) without vectorization (Option 2)

f = 0
n = w.shape[0]        # number of features
for j in range(n):
    f = f + w[j] * x[j]
f = f + b

Code with Vectorization

f = np.dot(w,x) + b

Gradient Descent for Multiple Linear Regression

Gradient descent now becomes as follows (repeat until convergence, updating all parameters simultaneously):

$$w_j := w_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} \quad \text{for } j = 1, \dots, n$$

$$b := b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)$$
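
A vectorized NumPy sketch of these updates (names and hyperparameters are illustrative, not from the notes):

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.1, num_iters=10_000):
    """Batch gradient descent for f(x) = w·x + b with n features."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(num_iters):
        err = X @ w + b - y          # prediction error for all m examples
        dj_dw = (X.T @ err) / m      # vector of ∂J/∂w_j, j = 1..n
        dj_db = np.sum(err) / m      # ∂J/∂b
        w -= alpha * dj_dw           # simultaneous update
        b -= alpha * dj_db
    return w, b

# Tiny made-up example: y was generated from w = [1, 2], b = 0
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([5.0, 4.0, 9.0])
print(gradient_descent_multi(X, y))  # close to [1. 2.] and 0.0
```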

Alternative to gradient descent

Normal Equation

  • Only for linear regression
  • Solves for w, b without iterations

Disadvantages

  • Doesn't generalize to other learning algorithms
  • Slow when the number of features is large (> 10,000)

What we need to know

  • The normal equation method may be used in machine learning libraries that implement linear regression
  • Gradient descent is the recommended method for finding the parameters w, b

Gradient Descent in Practice

Feature Scaling

  • When the possible range of values of a feature is large (House size in sq. feet x1 = 300 to 2,000), it’s more likely that a good model will choose a small parameter value (w1 = 0.1)
  • When the possible range of values of a feature is small (No. of bedrooms x2 = 0 to 5), it’s more likely that a good model will choose a large parameter value (w2 = 50)
  • However, when different features take on very different ranges of values, gradient descent can run slowly. Rescaling the features so that they all take on comparable ranges of values speeds up gradient descent considerably

Divide by the maximum

Divide each value of a feature by the maximum value of that feature, so the scaled values fall roughly between 0 and 1:

$$x_{1,\text{scaled}} = \frac{x_1}{\max(x_1)}$$

Mean Normalization

For each feature, compute the mean μ of all its values in the training set (for example, x₁ ranging from 300 to 2,000 and x₂ from 0 to 5), then rescale:

$$x_j := \frac{x_j - \mu_j}{\max(x_j) - \min(x_j)}$$

After mean normalization, the values of each feature fall roughly in the range [-1, 1], centered around 0.

Z-score Normalization

For each feature, compute the mean μ and standard deviation σ over the training set, then rescale:

$$x_j := \frac{x_j - \mu_j}{\sigma_j}$$

After z-score normalization, each feature has mean 0 and standard deviation 1, so all features take on comparable ranges of values.
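
A small NumPy sketch of z-score normalization applied per feature (the example values reuse the size and bedroom ranges mentioned above; the function name is my own):

```python
import numpy as np

def zscore_normalize(X):
    """Z-score normalize each column (feature) of X."""
    mu = np.mean(X, axis=0)             # per-feature mean
    sigma = np.std(X, axis=0)           # per-feature standard deviation
    return (X - mu) / sigma, mu, sigma  # keep mu, sigma to rescale future inputs

X = np.array([[2000.0, 5.0], [1200.0, 3.0], [300.0, 2.0]])
X_norm, mu, sigma = zscore_normalize(X)
print(X_norm.mean(axis=0))  # ~[0, 0]
print(X_norm.std(axis=0))   # ~[1, 1]
```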

Feature Engineering

Using intuition to design new features, by transforming or combining original features

  • Let's take the problem of finding the price of a house given two features, x₁ (frontage) and x₂ (depth)
  • We can include a third feature x₃ (area) = x₁ (frontage) × x₂ (depth) to improve the model's performance, as sketched below
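
A tiny sketch of this feature-engineering step (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical training inputs: frontage (x1) and depth (x2) of each lot
frontage = np.array([40.0, 30.0, 55.0])
depth = np.array([100.0, 80.0, 120.0])

# Engineered third feature: lot area = frontage * depth
area = frontage * depth

# Feature matrix with columns [x1, x2, x3]
X = np.column_stack((frontage, depth, area))
print(X.shape)  # (3, 3)
```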

Polynomial Regression

Fitting non-linear functions (curves). Given a single feature x, we can add quadratic, cubic, etc. terms for a better fit (giving a curve instead of a straight-line model)

Quadratic Model

$$f_{w,b}(x) = w_1 x + w_2 x^2 + b$$

Cubic Model

$$f_{w,b}(x) = w_1 x + w_2 x^2 + w_3 x^3 + b$$

Square Root Model

$$f_{w,b}(x) = w_1 x + w_2 \sqrt{x} + b$$
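
A short sketch of building polynomial features from a single input feature (values are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # single original feature, e.g. house size

# The model stays linear in the parameters, but the engineered columns
# x, x**2, x**3 let it fit a curve instead of a straight line
X_poly = np.column_stack((x, x ** 2, x ** 3))

# Note: x, x**2 and x**3 have very different ranges, so feature scaling
# (e.g. z-score normalization) matters even more here
print(X_poly)
```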

Classification with Logistic Regression

Binary classification: the output can only be one of two values (classes/categories), e.g., either yes or no

Logistic Regression

We want outputs between 0 and 1, so we use the sigmoid (logistic) function:

$$g(z) = \frac{1}{1 + e^{-z}}, \quad 0 < g(z) < 1$$

Define:

$$z = \vec{w} \cdot \vec{x} + b$$

Then the logistic regression model is:

$$f_{\vec{w},b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b) = \frac{1}{1 + e^{-(\vec{w} \cdot \vec{x} + b)}}$$
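
A minimal NumPy sketch of the sigmoid and the resulting model (function names are my own):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Logistic regression model f(x) = g(w·x + b) for each row of X."""
    return sigmoid(X @ w + b)

print(sigmoid(0.0))     # 0.5
print(sigmoid(100.0))   # ~1.0
print(sigmoid(-100.0))  # ~0.0
```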


The output of the logistic regression model is interpreted as the probability that the output label is 1:

$$f_{\vec{w},b}(\vec{x}) = P(y = 1 \mid \vec{x}; \vec{w}, b)$$

If:

$$f_{\vec{w},b}(\vec{x}) = 0.7$$

Then there's a 70% chance that y = 1 (i.e., the tumor is malignant).

Since this is binary classification:

$$P(y = 0) + P(y = 1) = 1$$


Decision Boundary

Set a threshold (commonly 0.5): if f(x) is at or above the threshold, predict ŷ = 1; otherwise predict ŷ = 0.

Here is how we can decide on the decision boundary: g(z) ≥ 0.5 whenever z ≥ 0, so the model predicts ŷ = 1 whenever w⃗·x⃗ + b ≥ 0. The decision boundary is therefore the line (or surface) where:

$$\vec{w} \cdot \vec{x} + b = 0$$
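
A small self-contained sketch of applying the 0.5 threshold (parameters and inputs are made up; with w = 1.5 and b = -3, the decision boundary sits at x = 2, where wx + b = 0):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([1.5]), -3.0
X = np.array([[1.0], [2.5], [4.0]])

probs = sigmoid(X @ w + b)          # model outputs between 0 and 1
y_hat = (probs >= 0.5).astype(int)  # predict 1 at or above the 0.5 threshold
print(probs, y_hat)                 # approx [0.18 0.68 0.95] -> [0 1 1]
```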

Cost function for Logistic Regression

  • The squared error cost function we used for linear regression (which gives a bowl-shaped, convex surface) doesn't work well for logistic regression: it gives a non-convex surface with many local minima

Logistic Cost Function

$$J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} L\left( f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)} \right)$$

Logistic Loss Function

$$L\left( f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)} \right) = \begin{cases} -\log\left( f_{\vec{w},b}(\vec{x}^{(i)}) \right) & \text{if } y^{(i)} = 1 \\ -\log\left( 1 - f_{\vec{w},b}(\vec{x}^{(i)}) \right) & \text{if } y^{(i)} = 0 \end{cases}$$

Simplified Cost Function

$$J(\vec{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left( f_{\vec{w},b}(\vec{x}^{(i)}) \right) + \left( 1 - y^{(i)} \right) \log\left( 1 - f_{\vec{w},b}(\vec{x}^{(i)}) \right) \right]$$

Gradient Descent Implementation

The updates have the same form as for linear regression, except that f is now the sigmoid of w⃗·x⃗ + b (repeat until convergence, updating simultaneously):

$$w_j := w_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

$$b := b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)$$
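
A compact NumPy sketch of the simplified cost and one gradient descent step for logistic regression (names are my own; no numerical safeguards against log(0)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(X, y, w, b):
    """Simplified (cross-entropy) cost for logistic regression."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)
    return -np.sum(y * np.log(f) + (1 - y) * np.log(1 - f)) / m

def logistic_gradient_step(X, y, w, b, alpha):
    """One simultaneous gradient descent update of w and b."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y        # f(x) - y for every example
    w_new = w - alpha * (X.T @ err) / m
    b_new = b - alpha * np.sum(err) / m
    return w_new, b_new
```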

The problem of Overfitting

A model that overfits (high variance) fits the training data very well (low cost) but fails to generalize to new examples; a model that underfits (high bias) does not fit the training data well in the first place.

Addressing Overfitting

  1. Collect more training data (The learning algorithm will learn to fit better)
  2. Select features to include/ exclude (Feature selection : Choose the most relevant features)
  3. Regularization (Shrink the values of parameters instead of altogether removing the features like discussed in the previous step)

Cost Function with Regularization

$$J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$

  • The first term: Mean squared error (MSE) – fits the data.
  • The second term: L2 regularization – penalizes large weights​ to reduce overfitting.
  • lambda: Regularization strength – balances fit vs simplicity.
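
A small NumPy sketch of this regularized cost (names are my own; lambda_ stands for the regularization strength λ):

```python
import numpy as np

def regularized_cost(X, y, w, b, lambda_):
    """Squared error cost plus L2 penalty on the weights (b is not regularized)."""
    m = X.shape[0]
    err = X @ w + b - y
    fit_term = np.sum(err ** 2) / (2 * m)          # fits the data
    reg_term = lambda_ * np.sum(w ** 2) / (2 * m)  # penalizes large weights
    return fit_term + reg_term
```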

Regularized Linear Regression

🔹 Regularized Cost Function

$$J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$

🔹 Gradient Descent Updates

$$w_j := w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j \right]$$

For the bias term b (not regularized):

$$b := b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)$$
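
A sketch of one regularized gradient descent step for linear regression (names are my own):

```python
import numpy as np

def regularized_linear_step(X, y, w, b, alpha, lambda_):
    """One simultaneous update of w and b with L2 regularization."""
    m = X.shape[0]
    err = X @ w + b - y                          # f(x) - y for all examples
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w  # extra (lambda/m)*w_j term from the penalty
    dj_db = np.sum(err) / m                      # b is not regularized
    return w - alpha * dj_dw, b - alpha * dj_db
```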

Regularized Logistic Regression

🔹 Regularized Cost Function

$$J(\vec{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left( f_{\vec{w},b}(\vec{x}^{(i)}) \right) + \left( 1 - y^{(i)} \right) \log\left( 1 - f_{\vec{w},b}(\vec{x}^{(i)}) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$

🔹 Gradient Descent Updates

The updates look identical to regularized linear regression, except that f is the sigmoid of w⃗·x⃗ + b:

$$w_j := w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j \right]$$

For the bias term b (not regularized):

$$b := b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)$$
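
The analogous sketch for regularized logistic regression, where only the prediction changes to the sigmoid (names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_step(X, y, w, b, alpha, lambda_):
    """One simultaneous update of w and b with L2 regularization."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y                 # only this line differs from linear regression
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w
    dj_db = np.sum(err) / m                      # b is not regularized
    return w - alpha * dj_dw, b - alpha * dj_db
```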