Machine learning (ML) is the field of study that gives computers the ability to learn without being explicitly programmed.

Supervised vs. Unsupervised Machine Learning

Supervised Learning

  x         ----->      y
input              output label

Learns from being given “right answers” (we provide enough examples of input x along with the corresponding output label y, from which the machine learns; eventually, when we give a new input without an output label, it generates the output by itself based on what it learned)

Regression

Type of Supervised Learning that predicts a number (There could be infinitely many possible outputs)

Housing price prediction (given a set of house sizes with the corresponding prices, we have the algorithm fit a curve to the given points; when we need the price for a new house of some input size, we read the value off the corresponding point on the curve)

Classification

We use classification to predict only a small number of possible outcomes/categories (e.g., predict whether a tumor is malignant or benign based on its size)

  • Classification predicts categories (cat or dog; benign or malignant; 0 or 1 or 2). The learning algorithm sets a boundary line that separates the given data into categories

Unsupervised Learning

Find something interesting in unlabeled data (we only provide inputs x, not output labels y). We let the algorithm search for structure in the input data

Clustering

The algorithm groups similar unlabeled input data into clusters.

Following are some examples:

  1. Google News (Clustering of similar news)
  2. DNA microarray (Clustering of people with similar DNA characteristics)
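
A rough illustration of clustering (scikit-learn and its KMeans class are my own choice here, not something covered in these notes):

```python
import numpy as np
from sklearn.cluster import KMeans  # assumption: scikit-learn is available

# Unlabeled data: only inputs x, no output labels y
X = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]])

# Group the points into 2 clusters based on similarity
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # two groups, e.g. [0 0 1 1] or [1 1 0 0]
```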

Anomaly Detection

Find unusual data points

Dimensionality Reduction

Compress data using fewer numbers

Regression Model

Linear Regression Model

Linear regression is a supervised learning model and one of the regression models: it predicts numbers.

Terminology

Data used to train the model = Training Set
x = “input” variable (feature)
y = “output” variable (target)
ŷ = model's predicted value
y = true or observed value of the dependent variable
m = number of training examples
(x, y) = single training example
(x⁽ⁱ⁾, y⁽ⁱ⁾) = iᵗʰ training example

                 +-------------------+
                 |   Training Set    |
                 +-------------------+
                           |
                           v
                 +-------------------+
                 | Learning Algorithm|
                 +-------------------+
                           |
                           v
                     +----------------+
          x  ---->   |  f(x) Function |  ---->  ŷ
      (feature)      +----------------+       (prediction)

Example:

                  +-----+
size (x)  ---->   |  f  |  ---->  price (ŷ - estimated)
                  +-----+

Cost Function: Squared Error cost function

A mathematical function that measures the difference between a model’s predictions and the actual values

The cost function J(w, b) is defined as:

$$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2 = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2$$

The goal is to find w, b such that J(w, b) is as small as possible:

$$\min_{w, b} J(w, b)$$
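
A minimal NumPy sketch of this cost function for a single feature x (the function and variable names are my own, not from the notes):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w, b) for univariate linear regression."""
    m = x.shape[0]                  # number of training examples
    f_wb = w * x + b                # model prediction ŷ for every example
    return np.sum((f_wb - y) ** 2) / (2 * m)

# Example: parameters that fit the data exactly give zero cost
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([300.0, 500.0, 700.0])
print(compute_cost(x_train, y_train, w=200.0, b=100.0))  # 0.0
```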

Visualizing the cost function

  • When we represent the cost function with two input parameters (w, b) we’ll get a 3D shape
  • The height of a point corresponding to given w and b values is the cost function value (the higher the point, the higher the value of the cost function)
  • To represent the cost function in 2D, we use the Contour plot graph which appears to be the sliced version of the 3D plot (The smallest oval shape represents the minimum value of the Cost Function)
  • For linear regression models there is only one local minimum, which is also the global minimum. For other models there can be multiple local minima, and the one with the lowest value is the global minimum

Gradient Descent

Algorithm to minimize a function (Cost Function in our case)

  • Start with some w, b (Set w=0, b=0 as the initial guess)
  • Keep changing w, b to reduce J(w, b) (Simultaneously update w and b)
  • Until we settle at or near a minimum
  • In the algorithm below, we use Batch Gradient Descent because each step of gradient descent uses all the training examples (i.e., i = 1 to m)

Gradient Descent Algorithm

Repeat until convergence (updating w and b simultaneously):

$$w := w - \alpha \frac{\partial}{\partial w} J(w, b)$$

$$b := b - \alpha \frac{\partial}{\partial b} J(w, b)$$

Where:

  • α is the learning rate
  • ∂J(w, b)/∂w is the derivative of the cost function with respect to w
  • ∂J(w, b)/∂b is the derivative of the cost function with respect to b

Formula after calculating the derivatives (repeat until convergence):

$$w := w - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x^{(i)}$$

$$b := b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)$$

Where:

  • f_{w,b}(x) is the hypothesis (the model's prediction function)
  • α is the learning rate
  • m is the number of training examples
  • (x⁽ⁱ⁾, y⁽ⁱ⁾) is the iᵗʰ training example
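
A minimal sketch of batch gradient descent for the single-feature model f(x) = wx + b (the function name, learning rate, and iteration count are illustrative choices, not from the notes):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, num_iters=10_000):
    """Batch gradient descent for f(x) = w*x + b."""
    m = x.shape[0]
    w, b = 0.0, 0.0                  # initial guess
    for _ in range(num_iters):
        err = (w * x + b) - y        # prediction error on all m examples
        dj_dw = np.sum(err * x) / m  # ∂J/∂w
        dj_db = np.sum(err) / m      # ∂J/∂b
        w -= alpha * dj_dw           # simultaneous update of w and b
        b -= alpha * dj_db
    return w, b

w, b = gradient_descent(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))
print(w, b)  # approximately 2.0 and 0.0
```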

Multiple Linear Regression

Multiple Features

We represent each training example with multiple input features. Let:

  • x_j = the jᵗʰ feature
  • n = the number of features
  • x⃗⁽ⁱ⁾ = the features of the iᵗʰ training example
  • x_j⁽ⁱ⁾ = the value of feature j in the iᵗʰ training example

Hypothesis for Multiple Linear Regression

$$f_{\vec{w},b}(\vec{x}) = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = \vec{w} \cdot \vec{x} + b$$

Vectorization

  1. Makes the code shorter
  2. Makes the code run much faster (the NumPy library uses parallel hardware in the CPU or GPU)

Let’s consider the following data

import numpy as np

w = np.array([1.0, 2.5, -3.3])   # parameters
b = 4                            # bias
x = np.array([10, 20, 30])       # input features

Code to find f(x) without vectorization (Option 1)

f = w[0] * x[0] + \
    w[1] * x[1] + \
    w[2] * x[2] + b

Code to find f(x) without vectorization (Option 2)

f = 0
n = w.shape[0]        # number of features
for j in range(n):
    f = f + w[j] * x[j]
f = f + b

Code with Vectorization

f = np.dot(w,x) + b

Gradient Descent for Multiple Linear Regression

Gradient descent now becomes as follows (repeat until convergence, updating all parameters simultaneously):

$$w_j := w_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} \quad \text{for } j = 1, \dots, n$$

$$b := b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)$$
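
A vectorized NumPy sketch of these updates (names and hyperparameters are illustrative, not from the notes):

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.1, num_iters=10_000):
    """Batch gradient descent for f(x) = w·x + b with n features."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(num_iters):
        err = X @ w + b - y          # prediction error for all m examples
        dj_dw = (X.T @ err) / m      # vector of ∂J/∂w_j, j = 1..n
        dj_db = np.sum(err) / m      # ∂J/∂b
        w -= alpha * dj_dw           # simultaneous update
        b -= alpha * dj_db
    return w, b

# Tiny made-up example: y was generated from w = [1, 2], b = 0
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([5.0, 4.0, 9.0])
print(gradient_descent_multi(X, y))  # close to [1. 2.] and 0.0
```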

Alternative to gradient descent

Normal Equation

  • Only for linear regression
  • Solves for w, b without iterations

Disadvantages

  • Doesn't generalize to other learning algorithms
  • Slow when the number of features is large (> 10,000)

What we need to know

  • The normal equation method may be used in machine learning libraries that implement linear regression
  • Gradient descent is the recommended method for finding the parameters w, b

Gradient Descent in Practice

Feature Scaling

  • When the possible range of values of a feature is large (House size in sq. feet x1 = 300 to 2,000), it’s more likely that a good model will choose a small parameter value (w1 = 0.1)
  • When the possible range of values of a feature is small (No. of bedrooms x2 = 0 to 5), it’s more likely that a good model will choose a large parameter value (w2 = 50)
  • However, when different features take on very different ranges of values, gradient descent can run slowly. Rescaling the features so that they all take on comparable ranges of values speeds up gradient descent considerably

Divide by the maximum

Divide each value of a feature by the maximum value of that feature, so the scaled values fall roughly between 0 and 1:

$$x_{1,\text{scaled}} = \frac{x_1}{\max(x_1)}$$

Mean Normalization

For each feature, compute the mean μ of all its values in the training set (for example, x₁ ranging from 300 to 2,000 and x₂ from 0 to 5), then rescale:

$$x_j := \frac{x_j - \mu_j}{\max(x_j) - \min(x_j)}$$

After mean normalization, the values of each feature fall roughly in the range [-1, 1], centered around 0.

Z-score Normalization

For each feature, compute the mean μ and standard deviation σ over the training set, then rescale:

$$x_j := \frac{x_j - \mu_j}{\sigma_j}$$

After z-score normalization, each feature has mean 0 and standard deviation 1, so all features take on comparable ranges of values.
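
A small NumPy sketch of z-score normalization applied per feature (the example values reuse the size and bedroom ranges mentioned above; the function name is my own):

```python
import numpy as np

def zscore_normalize(X):
    """Z-score normalize each column (feature) of X."""
    mu = np.mean(X, axis=0)             # per-feature mean
    sigma = np.std(X, axis=0)           # per-feature standard deviation
    return (X - mu) / sigma, mu, sigma  # keep mu, sigma to rescale future inputs

X = np.array([[2000.0, 5.0], [1200.0, 3.0], [300.0, 2.0]])
X_norm, mu, sigma = zscore_normalize(X)
print(X_norm.mean(axis=0))  # ~[0, 0]
print(X_norm.std(axis=0))   # ~[1, 1]
```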

Feature Engineering

Using intuition to design new features, by transforming or combining original features

  • Let's take the problem of finding the price of a house given two features, x₁ (frontage) and x₂ (depth)
  • We can include a third feature x₃ (area) = x₁ (frontage) × x₂ (depth) to improve the model's performance, as sketched below
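
A tiny sketch of this feature-engineering step (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical training inputs: frontage (x1) and depth (x2) of each lot
frontage = np.array([40.0, 30.0, 55.0])
depth = np.array([100.0, 80.0, 120.0])

# Engineered third feature: lot area = frontage * depth
area = frontage * depth

# Feature matrix with columns [x1, x2, x3]
X = np.column_stack((frontage, depth, area))
print(X.shape)  # (3, 3)
```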

Polynomial Regression

Fitting non-linear functions (curves). Given a single feature x, we can add quadratic, cubic, etc. terms for a better fit (giving a curve instead of a straight-line model)

Quadratic Model

$$f_{w,b}(x) = w_1 x + w_2 x^2 + b$$

Cubic Model

$$f_{w,b}(x) = w_1 x + w_2 x^2 + w_3 x^3 + b$$

Square Root Model

$$f_{w,b}(x) = w_1 x + w_2 \sqrt{x} + b$$
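
A short sketch of building polynomial features from a single input feature (values are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # single original feature, e.g. house size

# The model stays linear in the parameters, but the engineered columns
# x, x**2, x**3 let it fit a curve instead of a straight line
X_poly = np.column_stack((x, x ** 2, x ** 3))

# Note: x, x**2 and x**3 have very different ranges, so feature scaling
# (e.g. z-score normalization) matters even more here
print(X_poly)
```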

Classification with Logistic Regression

Binary classification: the output can only be one of two values (classes/categories), e.g., either yes or no

Logistic Regression

We want outputs between 0 and 1, so we use the sigmoid (logistic) function:

$$g(z) = \frac{1}{1 + e^{-z}}, \quad 0 < g(z) < 1$$

Define:

$$z = \vec{w} \cdot \vec{x} + b$$

Then the logistic regression model is:

$$f_{\vec{w},b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b) = \frac{1}{1 + e^{-(\vec{w} \cdot \vec{x} + b)}}$$
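
A minimal NumPy sketch of the sigmoid and the resulting model (function names are my own):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Logistic regression model f(x) = g(w·x + b) for each row of X."""
    return sigmoid(X @ w + b)

print(sigmoid(0.0))     # 0.5
print(sigmoid(100.0))   # ~1.0
print(sigmoid(-100.0))  # ~0.0
```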


The output of the logistic regression model is interpreted as the probability that the output label is 1:

$$f_{\vec{w},b}(\vec{x}) = P(y = 1 \mid \vec{x}; \vec{w}, b)$$

If:

$$f_{\vec{w},b}(\vec{x}) = 0.7$$

Then there's a 70% chance that y = 1 (i.e., the tumor is malignant).

Since this is binary classification:

$$P(y = 0) + P(y = 1) = 1$$


Decision Boundary

Set a threshold (commonly 0.5): if f(x) is at or above the threshold, predict ŷ = 1; otherwise predict ŷ = 0.

Here is how we can decide on the decision boundary: g(z) ≥ 0.5 whenever z ≥ 0, so the model predicts ŷ = 1 whenever w⃗·x⃗ + b ≥ 0. The decision boundary is therefore the line (or surface) where:

$$\vec{w} \cdot \vec{x} + b = 0$$
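
A small self-contained sketch of applying the 0.5 threshold (parameters and inputs are made up; with w = 1.5 and b = -3, the decision boundary sits at x = 2, where wx + b = 0):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([1.5]), -3.0
X = np.array([[1.0], [2.5], [4.0]])

probs = sigmoid(X @ w + b)          # model outputs between 0 and 1
y_hat = (probs >= 0.5).astype(int)  # predict 1 at or above the 0.5 threshold
print(probs, y_hat)                 # approx [0.18 0.68 0.95] -> [0 1 1]
```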

Cost function for Logistic Regression

  • The squared error cost function we used for linear regression (which gives a bowl-shaped, convex surface) doesn't work well for logistic regression: it gives a non-convex surface with many local minima

Logistic Cost Function

$$J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} L\left( f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)} \right)$$

Logistic Loss Function

$$L\left( f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)} \right) = \begin{cases} -\log\left( f_{\vec{w},b}(\vec{x}^{(i)}) \right) & \text{if } y^{(i)} = 1 \\ -\log\left( 1 - f_{\vec{w},b}(\vec{x}^{(i)}) \right) & \text{if } y^{(i)} = 0 \end{cases}$$

Simplified Cost Function

$$J(\vec{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left( f_{\vec{w},b}(\vec{x}^{(i)}) \right) + \left( 1 - y^{(i)} \right) \log\left( 1 - f_{\vec{w},b}(\vec{x}^{(i)}) \right) \right]$$

Gradient Descent Implementation

The updates have the same form as for linear regression, except that f is now the sigmoid of w⃗·x⃗ + b (repeat until convergence, updating simultaneously):

$$w_j := w_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

$$b := b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)$$
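
A compact NumPy sketch of the simplified cost and one gradient descent step for logistic regression (names are my own; no numerical safeguards against log(0)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(X, y, w, b):
    """Simplified (cross-entropy) cost for logistic regression."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)
    return -np.sum(y * np.log(f) + (1 - y) * np.log(1 - f)) / m

def logistic_gradient_step(X, y, w, b, alpha):
    """One simultaneous gradient descent update of w and b."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y        # f(x) - y for every example
    w_new = w - alpha * (X.T @ err) / m
    b_new = b - alpha * np.sum(err) / m
    return w_new, b_new
```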

The problem of Overfitting

A model that overfits (high variance) fits the training data very well (low cost) but fails to generalize to new examples; a model that underfits (high bias) does not fit the training data well in the first place.

Addressing Overfitting

  1. Collect more training data (The learning algorithm will learn to fit better)
  2. Select features to include/ exclude (Feature selection : Choose the most relevant features)
  3. Regularization (Shrink the values of parameters instead of altogether removing the features like discussed in the previous step)

Cost Function with Regularization

$$J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$

  • The first term: Mean squared error (MSE) – fits the data.
  • The second term: L2 regularization – penalizes large weights​ to reduce overfitting.
  • lambda: Regularization strength – balances fit vs simplicity.
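
A small NumPy sketch of this regularized cost (names are my own; lambda_ stands for the regularization strength λ):

```python
import numpy as np

def regularized_cost(X, y, w, b, lambda_):
    """Squared error cost plus L2 penalty on the weights (b is not regularized)."""
    m = X.shape[0]
    err = X @ w + b - y
    fit_term = np.sum(err ** 2) / (2 * m)          # fits the data
    reg_term = lambda_ * np.sum(w ** 2) / (2 * m)  # penalizes large weights
    return fit_term + reg_term
```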

Regularized Linear Regression

🔹 Regularized Cost Function

$$J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$

🔹 Gradient Descent Updates

$$w_j := w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j \right]$$

For the bias term b (not regularized):

$$b := b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)$$
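
A sketch of one regularized gradient descent step for linear regression (names are my own):

```python
import numpy as np

def regularized_linear_step(X, y, w, b, alpha, lambda_):
    """One simultaneous update of w and b with L2 regularization."""
    m = X.shape[0]
    err = X @ w + b - y                          # f(x) - y for all examples
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w  # extra (lambda/m)*w_j term from the penalty
    dj_db = np.sum(err) / m                      # b is not regularized
    return w - alpha * dj_dw, b - alpha * dj_db
```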

Regularized Logistic Regression

🔹 Regularized Cost Function

$$J(\vec{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left( f_{\vec{w},b}(\vec{x}^{(i)}) \right) + \left( 1 - y^{(i)} \right) \log\left( 1 - f_{\vec{w},b}(\vec{x}^{(i)}) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$

🔹 Gradient Descent Updates

The updates look identical to regularized linear regression, except that f is the sigmoid of w⃗·x⃗ + b:

$$w_j := w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j \right]$$

For the bias term b (not regularized):

$$b := b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)$$
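
The analogous sketch for regularized logistic regression, where only the prediction changes to the sigmoid (names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_step(X, y, w, b, alpha, lambda_):
    """One simultaneous update of w and b with L2 regularization."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y                 # only this line differs from linear regression
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w
    dj_db = np.sum(err) / m                      # b is not regularized
    return w - alpha * dj_dw, b - alpha * dj_db
```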