Introduction
In this post I'll show you how to perform linear regression in order to predict trends in a given dataset.
Linear regression is a simple but very powerful technique used in a wide range of disciplines like economics, physics and data science.
Given a set of data, it allows us to detect patterns (linear relationships) between those data points so we can later use them to predict new values or detect trends.
Linear regression is a core technique in machine learning. Many of its concepts extend to more complex techniques like neural networks, support vector machines and even deep learning, so it's a great topic to start with.
Example
Suppose you are a long term investor trying to predict future prices of the Standard & Poor's 500 US index.
On a daily basis prices fluctuate a lot, but in the long term you might be able to detect some trends.
By studying how prices have changed in the past, you may be able to derive a pattern or formula that lets you predict prices in the future.
Linear regression is a useful machine learning technique for this type of situation.
Finding the best line
In high school we learn that the equation that describes a line in 2D is:
\[y = ax+b\]
Where:
- \(a\): slope of the line
- \(b\): the intercept (where the line crosses the Y axis)
- \(x\) and \(y\) are the points in our line
So finding the best line means tweaking \(a\) and \(b\) coefficients in order to cover as much points as possible.
Ideally we would try to find a line that contains all the points, but in practice that's usually not possible.
So we need to find a line that is good enough, or in other words a line that minimizes its distance to each point.
Comparing lines
If we have two potential equations for our line, how do we compare them? How do we pick the best one? One way to do that is to select the one that is "closer" to most of the points: we can compute the distance of each point to each line, and the line with the smallest overall distance is the best one.
How do we compute the distance between a point and the line? Given a point \(P(x,y)\), we plug \(x\) into the line equation and obtain an estimate of \(y\) (let's call it \({y}'\)):
- \(P(x, y)\): the point we want to test
- \({y}'\): an estimated value of y after using x in our equation \(y = ax+b\)
- distance = \(y-{y}'\), in other words the real \(y\) vs the estimated \({y}'\)
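For example, with the line \(y = 2x + 1\) and the point \(P(3, 8)\), the estimate is \({y}' = 2 \cdot 3 + 1 = 7\), so the distance is \(8 - 7 = 1\).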
That distance will be positive for some points and negative for others, and we don't want them cancelling each other out. So one way to compute the overall distance of all points is to use the "least squares criterion":
\[\frac{1}{m}\sum_{i=1}^{m}(y-{y}')^2\]
or, replacing \({y}'\) with our line equation:
\[\frac{1}{m}\sum_{i=1}^{m}(y-(ax+b))^2\]
Where \(m\) is the number of points we have.
With this formula we compute the squared difference between each real point and its estimate, and then average them all by dividing by the number of points (\(m\)).
We can compute that with the following code:
```java
public class Point {
    public double x;
    public double y;

    public Point(double x, double y) {
        this.x = x;
        this.y = y;
    }
}

public double computeLeastSquareError(List<Point> points, double a, double b) {
    int m = points.size();
    if (m == 0) return 0;
    double errorSq = 0;
    for (Point p : points) {
        double estimatedY = a * p.x + b; // y' = ax + b
        double diff = p.y - estimatedY;  // y - y'
        errorSq += diff * diff;
    }
    return errorSq / m; // average of the squared differences
}
```
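As a quick sanity check, we can feed it a few points that lie exactly on the line \(y = 2x + 1\) (an illustrative dataset made up for this example; assumes `java.util.Arrays` and `java.util.List` are imported):

```java
List<Point> points = Arrays.asList(new Point(0, 1), new Point(1, 3), new Point(2, 5));
double error = computeLeastSquareError(points, 2, 1); // 0.0: the line fits perfectly
double worse = computeLeastSquareError(points, 1, 0); // a larger error for a worse line
```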
That distance is also called the estimation error of our line equation, the cost function or \(J(\theta)\).
So now our problem becomes finding \(a\) and \(b\) in order to produce the minimum square error.
\[\min\left(\frac{1}{m}\sum_{i=1}^{m}(y-(ax+b))^2\right)\]
Minimizing a function
Finding the minimum of a function is a very common mathematical problem. There are ways to achieve this for linear regression using a closed-form solution based on its derivative, which is called the normal equation approach.
However, in this post I want to explore a more algorithmic approach that will be useful to introduce other machine learning techniques. Finding the minimum of our expression is, in the end, a search problem: we have a bunch of possible values for \(a\) and \(b\) and we want to pick the ones that satisfy our problem the best (minimum error, or minimum distance to each point).
Let's first attack this searching problem with a brute force approach:
```java
public class Equation {
    public double a;
    public double b;

    public Equation(double a, double b) {
        this.a = a;
        this.b = b;
    }
}

public Equation findBestEquationBruteForce(List<Point> points, double rangeStart, double rangeEnd, double step) {
    Equation bestEquation = null;
    double minError = Double.MAX_VALUE;
    // Try every combination of a and b within the range, keeping the one with the lowest error
    for (double a = rangeStart; a < rangeEnd; a += step) {
        for (double b = rangeStart; b < rangeEnd; b += step) {
            double error = computeLeastSquareError(points, a, b);
            if (error < minError) {
                minError = error;
                bestEquation = new Equation(a, b);
            }
        }
    }
    return bestEquation;
}
```
We can try the previous code as:
```java
Equation eq = findBestEquationBruteForce(points, -1000000, 1000000, 0.1);
```
This approach is simple, but it explores a large number of combinations that are probably unnecessary. Also, setting the right range with the right step is not a trivial task: ideally you would try a very large range with a very small step so you don't miss any potentially good solution, but running that may take more time than you can afford. Instead you could try an iterative approach:
- Start with a large range (like -1,000,000 to 1,000,000) and a not so small step (ex: 10)
- Once you find a first solution reduce the step size and narrow the range within its proximity.
- Keep doing that a few times, until the step is smaller than a certain threshold or until you've found a good enough solution (see the sketch below).
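Here's a minimal sketch of that iterative refinement, reusing `findBestEquationBruteForce` from above (the initial range, step and threshold are illustrative values, not tuned constants):

```java
public Equation findBestEquationIterative(List<Point> points) {
    double rangeStart = -1000000;
    double rangeEnd = 1000000;
    double step = 1000;
    Equation best = null;
    while (step > 0.001) {
        best = findBestEquationBruteForce(points, rangeStart, rangeEnd, step);
        // Narrow the range around the current best coefficients and refine the step
        rangeStart = Math.min(best.a, best.b) - step;
        rangeEnd = Math.max(best.a, best.b) + step;
        step /= 10;
    }
    return best;
}
```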
However, even with this refinement, it's clear that a better way is needed to avoid blindly checking all these combinations.
Local search
Another approach is to use a local search algorithm:
- You start from some initial random equation.
- You explore a few random choices nearby.
- Pick the one that improves the most (minimum error).
- Continue this process a few times until a certain threshold is reached.
```java
public Equation findBestEquationLocalSearch(List<Point> points, Equation initialEq, int iterations, int mutations, double step) {
    Equation bestEq = initialEq;
    double minError = computeLeastSquareError(points, initialEq.a, initialEq.b);
    for (int i = 0; i < iterations; i++) {
        // Explore a few random mutations around the current best equation
        for (int j = 0; j < mutations; j++) {
            Equation eq = new Equation(
                    randMutation(bestEq.a, step),
                    randMutation(bestEq.b, step));
            double error = computeLeastSquareError(points, eq.a, eq.b);
            if (error < minError) {
                minError = error;
                bestEq = eq;
            }
        }
    }
    return bestEq;
}

private double randMutation(double v, double step) {
    // Move v a random amount, up to step, in a random direction
    double sign = Math.random() < 0.5 ? 1 : -1;
    return v + sign * Math.random() * step;
}
```
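We can run it, for example, as follows (the initial equation and parameter values here are arbitrary choices):

```java
Equation eq = findBestEquationLocalSearch(points, new Equation(0, 0), 1000, 10, 1.0);
```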
Given a certain equation, there are infinite variations (or mutations) we could try to see if they improve the overall error; the previous code randomly tries a few of them and picks the one with the minimum error. With this approach we may find a suitable solution trying fewer combinations than with brute force. However, we'll suffer the typical problems of local search algorithms, like getting stuck in a local optimum.
*A local search starting on point 2 will get stuck in a local minimum, while starting on point 1 will find the global minimum.*
The solution we get will be very sensitive to the initial point where we start. To avoid this problem we can launch the search multiple times using different random initial equations.
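A minimal sketch of these random restarts, assuming the helpers defined above and `java.util.Random` (the coefficient range [-10, 10] and the parameter values are illustrative assumptions):

```java
public Equation findBestEquationWithRestarts(List<Point> points, int restarts) {
    Random rand = new Random();
    Equation best = null;
    double minError = Double.MAX_VALUE;
    for (int i = 0; i < restarts; i++) {
        // Launch each local search from a different random starting equation
        Equation initial = new Equation(rand.nextDouble() * 20 - 10, rand.nextDouble() * 20 - 10);
        Equation candidate = findBestEquationLocalSearch(points, initial, 1000, 10, 1.0);
        double error = computeLeastSquareError(points, candidate.a, candidate.b);
        if (error < minError) {
            minError = error;
            best = candidate;
        }
    }
    return best;
}
```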
Gradient descent
Most machine learning algorithms use some form of gradient descent to perform a minimum search. Gradient descent is a local search technique like the previous one but it takes advantage of the gradient of the function being minimized in order to pick the next step.
The algorithm proceeds as follows:
- Start from some initial random equation.
- Compute the gradient of the function we are trying to minimize for each coefficient of our equation (\(a\) and \(b\) in our case).
- Subtract each partial gradient from its corresponding coefficient.
- Iterate these steps a few times until a certain threshold or a good enough solution is reached.
In our case we are trying to minimize the formula: \(J(\theta)=\frac{1}{m}\sum_{i=1}^{m}(y-(ax+b))^2\)
This formula depends on two variables, \(a\) and \(b\). Let's assume for a moment that \(b=0\) so we can focus only on finding \(a\), and plot how our cost function changes for different values of \(a\). We can think of the gradient as the slope of this curve at any given point; the idea behind gradient descent is to take small steps, always in the opposite direction of the slope:
*Plot of how the cost function changes for different values of \(a\).*
- If the slope is negative then the minimum is to the right, so we increase the value of \(a\).
- If the slope is positive then we've already passed the minimum, so we decrease \(a\).
However, our cost function has two variables, \(a\) and \(b\), so in order to obtain the gradient we need to compute two partial derivatives:
\[J(\theta)=\frac{1}{m}\sum_{i=1}^{m}(y-(ax+b))^2\]
Applying the chain rule, the partial gradient for \(a\) is:
\[\frac{\partial J}{\partial a}=-\frac{2}{m}\sum_{i=1}^{m}(y-(ax+b))\cdot x\]
And the partial gradient for \(b\):
\[\frac{\partial J}{\partial b}=-\frac{2}{m}\sum_{i=1}^{m}(y-(ax+b))\]
It's common practice to divide the original cost function by 2 so the factor of 2 cancels out of the gradient:
Updated cost function:
\[J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(y-(ax+b))^2\]
Simplified gradient:
\[\frac{\partial J}{\partial a}=-\frac{1}{m}\sum_{i=1}^{m}(y-(ax+b))\cdot x\]
\[\frac{\partial J}{\partial b}=-\frac{1}{m}\sum_{i=1}^{m}(y-(ax+b))\]
In each step of gradient descent, each variable is updated by subtracting its own gradient:
\[a = a - \alpha \cdot \frac{\partial J}{\partial a}\]
\[b = b - \alpha \cdot \frac{\partial J}{\partial b}\]
\(\alpha\) (the learning rate) is a configuration parameter of our algorithm: it dictates how big or small the steps we take in each iteration are (more about this later).
Let's see how can we implement gradient descent:
```java
public Equation findBestEquationGradientDescent(List<Point> points, Equation initialEq, int iterations, double alpha) {
    Equation bestEq = new Equation(initialEq.a, initialEq.b);
    double m = points.size();
    for (int i = 0; i < iterations; i++) {
        // Accumulate the gradients over all points
        double gradA = 0;
        double gradB = 0;
        for (Point p : points) {
            double estimatedY = bestEq.a * p.x + bestEq.b;
            gradA += (estimatedY - p.y) * p.x; // partial derivative with respect to a
            gradB += estimatedY - p.y;         // partial derivative with respect to b
        }
        // Update both variables in the opposite direction of the gradient
        bestEq.a -= alpha * gradA / m;
        bestEq.b -= alpha * gradB / m;
    }
    return bestEq;
}
```
We can execute the previous function as:
```java
Equation eq = findBestEquationGradientDescent(points, new Equation(0.5, 0.5), 1000, 0.1);
```
Normalize input
The previous code is sensitive to the magnitude of the values of the points we use to find our equation. The larger they are, the longer gradient descent will take to converge to the minimum cost.
One way to avoid this problem is to normalize all the inputs, also known as feature scaling.
A simple approach is to scale all values to a range of [0, 1] using the following expression:
\[{x}'=\frac{x-min(x)}{max(x)-min(x)}\]
```java
public List<Point> normalizeInput(List<Point> points) {
    List<Point> normPoints = new ArrayList<>(points.size());
    double minX = Double.MAX_VALUE;
    double minY = Double.MAX_VALUE;
    // Note: Double.MIN_VALUE is the smallest *positive* double, so it can't be used as a starting maximum
    double maxX = -Double.MAX_VALUE;
    double maxY = -Double.MAX_VALUE;
    for (Point p : points) {
        minX = Math.min(minX, p.x);
        minY = Math.min(minY, p.y);
        maxX = Math.max(maxX, p.x);
        maxY = Math.max(maxY, p.y);
    }
    // Scale every coordinate to the [0, 1] range
    for (Point p : points) {
        double normX = (p.x - minX) / (maxX - minX);
        double normY = (p.y - minY) / (maxY - minY);
        normPoints.add(new Point(normX, normY));
    }
    return normPoints;
}
```
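We would then run gradient descent on the normalized points, for example:

```java
List<Point> normPoints = normalizeInput(points);
Equation eq = findBestEquationGradientDescent(normPoints, new Equation(0.5, 0.5), 1000, 0.1);
```

Keep in mind that the resulting coefficients live in the normalized space, so predictions have to be mapped back to the original scale.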
Conclusion
Linear regression is one of the simplest yet most powerful techniques in machine learning. We've seen how, in the end, it boils down to a search algorithm that tries to find the line equation that is as close as possible to all our points.
In the next post we'll see how to make use of regularization in order to avoid over-fitting our equation to the training data and we'll also explore how to evolve this technique to solve problems with multiple variables.