Basic Mathematics for Machine Learning
There are many reasons why mathematics is important for machine learning. Some of them are listed below:
- Selecting the right algorithm, which includes weighing accuracy, training time, model complexity, number of parameters and number of features.
- Choosing parameter settings and validation strategies.
- Identifying underfitting and overfitting by understanding the Bias-Variance tradeoff.
- Estimating the right confidence interval and uncertainty.
Calculus for Deep Learning
Scalar derivative rules
Introduction to vector calculus and partial derivatives
Neural network layers are not single functions of a single parameter, f(x). So, let’s move on to functions of multiple parameters such as f(x,y). For example, what is the derivative of xy (i.e., the multiplication of x and y)?
Well, it depends on whether we are changing x or y. We compute derivatives with respect to one variable (parameter) at a time, giving us two different partial derivatives for this two-parameter function (one for x and one for y). Instead of using the operator d/dx, the partial derivative operator is ∂/∂x (a stylized d, not the Greek letter δ). So ∂(xy)/∂x and ∂(xy)/∂y are the partial derivatives of xy; often, these are just called the partials.
The partial derivative with respect to x is just the usual scalar derivative, simply treating any other variable in the equation as a constant. Consider the function f(x,y) = 3x²y. The partial derivative with respect to x is written ∂(3x²y)/∂x. There are three constants from the perspective of ∂/∂x: 3, 2, and y. Therefore, ∂(3x²y)/∂x = 3y ∂(x²)/∂x = 3y(2x) = 6xy. The partial derivative with respect to y treats x like a constant and we get ∂(3x²y)/∂y = 3x².
So, from the above example, if f(x,y) = 3x²y, then ∂f/∂x = 6xy and ∂f/∂y = 3x².
The gradient of f(x,y) is simply the vector of its partials: ∇f(x,y) = [∂f/∂x, ∂f/∂y] = [6xy, 3x²].
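As a quick check, here is a minimal SymPy sketch (assuming SymPy is installed; it is not part of the original post) that reproduces these partials and the gradient:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = 3 * x**2 * y

# Partial derivatives of f(x, y) = 3x^2 * y
print(sp.diff(f, x))  # 6*x*y
print(sp.diff(f, y))  # 3*x**2

# The gradient is just the vector of the partials.
grad_f = [sp.diff(f, v) for v in (x, y)]
print(grad_f)  # [6*x*y, 3*x**2]
```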
Matrix calculus
When we move from derivatives of one function to derivatives of many functions, we move from the world of vector calculus to matrix calculus. Let us bring in one more function, g(x,y) = 2x + y⁸. The gradient of g(x,y) is ∇g(x,y) = [∂g/∂x, ∂g/∂y] = [2, 8y⁷].
Gradient vectors organize all of the partial derivatives for a specific scalar function. If we have two functions, we can also organize their gradients into a matrix by stacking the gradients. When we do so, we get the Jacobian matrix (or just the Jacobian), where the gradients are rows: J = [∇f(x,y); ∇g(x,y)] = [[6xy, 3x²], [2, 8y⁷]].
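A small sketch of the same idea in SymPy (assuming it is available): stacking f and g and taking the Jacobian with respect to (x, y) gives the matrix above.

```python
import sympy as sp

x, y = sp.symbols('x y')
f = 3 * x**2 * y
g = 2 * x + y**8

# Stack the two scalar functions and differentiate with respect to (x, y);
# each row of the result is the gradient of one function.
J = sp.Matrix([f, g]).jacobian([x, y])
print(J)  # Matrix([[6*x*y, 3*x**2], [2, 8*y**7]])
```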
Generalization of the Jacobian
To define the Jacobian matrix more generally, let’s combine multiple parameters into a single vector argument: f(x,y,z) ⇒ f(x). Lowercase letters in bold font such as x are vectors and those in italics like x are scalars. xi is the ith element of vector x and is in italics because a single vector element is a scalar. We also have to define an orientation for vector x. We’ll assume that all vectors are column (vertical) vectors of size n × 1 by default.
With multiple scalar-valued functions, we can combine them all into a vector just as we did with the parameters. Let y = f(x) be a vector of m scalar-valued functions that each take a vector x of length n = |x|, where |x| is the cardinality (count) of elements in x. Each function fi within f returns a scalar, just as in the previous section.
Generally speaking, though, the Jacobian matrix is the collection of all m × n possible partial derivatives (m rows and n columns), which is the stack of the m gradients with respect to x: row i of the Jacobian is the gradient of fi, so the entry in row i and column j is ∂fi(x)/∂xj.
Derivatives of vector element-wise binary operators
By “element-wise binary operations” we simply mean applying an operator to the first item of each vector to get the first item of the output, then to the second items of the inputs for the second item of the output, and so forth. We can generalize element-wise binary operations with the notation y = f(w) ○ g(x), where m = n = |y| = |w| = |x|. The ○ symbol represents any element-wise operator (such as +) and not the ∘ function-composition operator.
That’s quite a furball, but fortunately the Jacobian is very often a diagonal matrix, a matrix that is zero everywhere but the diagonal.
Vector sum reduction
Summing up the elements of a vector is an important operation in deep learning (it shows up, for example, in a network’s loss function), but we can also use it as a way to simplify computing the derivative of the vector dot product and other operations that reduce vectors to scalars.
Let y = sum(f(x)) = Σ fi(x). Notice we were careful here to leave the parameter as a vector x because each function fi could use all values in the vector, not just xi. The sum is over the results of the functions and not the parameter. The gradient (1 × n Jacobian) of vector summation is ∂y/∂x = [Σi ∂fi(x)/∂x1, Σi ∂fi(x)/∂x2, …, Σi ∂fi(x)/∂xn]. In the simple case where f(x) = x, so that fi(x) = xi, this reduces to a row vector of ones.
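As a quick numerical illustration (a sketch using NumPy and finite differences, not part of the original text), the gradient of a plain vector sum y = sum(x) indeed comes out as a row of ones:

```python
import numpy as np

def sum_reduce(x):
    return np.sum(x)

x = np.array([1.0, 2.0, 3.0])
eps = 1e-6

# Finite-difference estimate of dy/dx_i for each element of x.
grad = np.array([(sum_reduce(x + eps * e) - sum_reduce(x)) / eps
                 for e in np.eye(len(x))])
print(grad)  # approximately [1. 1. 1.]
```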
The Chain Rules
We can’t compute partial derivatives of very complicated functions using just the basic matrix calculus rules. Part of our goal here is to clearly define and name three different chain rules and indicate in which situation they are appropriate.
The chain rule is conceptually a divide and conquer strategy (like Quicksort) that breaks complicated expressions into sub-expressions whose derivatives are easier to compute. Its power derives from the fact that we can process each simple sub-expression in isolation yet still combine the intermediate results to get the correct overall result.
The chain rule comes into play when we need the derivative of an expression composed of nested subexpressions. For example, we need the chain rule when confronted with expressions like d(sin(x²))/dx.
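For instance, here is a minimal SymPy sketch (assuming SymPy is installed) checking that the chain rule gives d(sin(x²))/dx = 2x·cos(x²):

```python
import sympy as sp

x = sp.symbols('x')

# Differentiate the nested expression directly.
direct = sp.diff(sp.sin(x**2), x)

# Chain rule by hand: with u = x**2, dy/du = cos(u) and du/dx = 2x.
manual = sp.cos(x**2) * 2 * x

print(direct)                        # 2*x*cos(x**2)
print(sp.simplify(direct - manual))  # 0, so the two agree
```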
- Single-variable chain rule :- Chain rules are typically defined in terms of nested functions, such as y = f(u) where u = g(x), so y = f(g(x)), for single-variable chain rules.
To deploy the single-variable chain rule, follow these steps:
- Introduce intermediate variables for nested sub-expressions and sub-expressions for both binary and unary operators; for example, × is binary, while sin(x) and other trigonometric functions are usually unary because there is a single operand. This step normalizes all equations to single operators or function applications.
- Compute derivatives of the intermediate variables with respect to their parameters.
- Combine all derivatives of intermediate variables by multiplying them together to get the overall result.
- Substitute intermediate variables back in if any are referenced in the derivative equation.
- Single-variable total-derivative chain rule :- The total derivative assumes all variables are potentially codependent whereas the partial derivative assumes all variables but x are constants.
This chain rule that takes into consideration the total derivative degenerates to the single-variable chain rule when all intermediate variables are functions of a single variable.
A word of caution about terminology on the web. Unfortunately, the chain rule given in this section, based upon the total derivative, is universally called “multivariable chain rule” in calculus discussions, which is highly misleading! Only the intermediate variables are multivariate functions.
- Vector chain rule :- The vector chain rule for vectors of functions and a single parameter mirrors the single-variable chain rule.
The goal is to convert the above vector of scalar operations into a single vector operation, so the right-hand-side matrix can also be written as a product of two matrices.
That means that the Jacobian is the multiplication of two other Jacobians. To make this formula work for multiple parameters or a vector x, we just have to change x to the vector x in the equation. The effect is that ∂g/∂x and the resulting Jacobian, ∂f/∂x, are now matrices instead of vertical vectors. Our complete vector chain rule is: ∂/∂x f(g(x)) = (∂f/∂g)(∂g/∂x).
Please note that matrix multiplication does not commute, so the order in (∂f/∂g)(∂g/∂x) matters. For completeness, here are the two Jacobian components:
where m = |f|, n = |x| and k = |g|. The resulting Jacobian is m × n (an m × k matrix multiplied by a k × n matrix).
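The following sketch (with arbitrarily chosen functions, purely for illustration) checks the vector chain rule in SymPy by comparing the Jacobian of a composition against the product of the two component Jacobians:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')

# Intermediate vector function g(x) and the composition f(g(x)).
g = sp.Matrix([x1 + x2, x1 * x2])
f_of_g = sp.Matrix([g[0]**2, g[1] + 1])

# Jacobian of the composition, computed directly.
direct = f_of_g.jacobian([x1, x2])

# Vector chain rule: (df/dg) evaluated at g(x), times (dg/dx).
g1, g2 = sp.symbols('g1 g2')
f = sp.Matrix([g1**2, g2 + 1])
df_dg = f.jacobian([g1, g2]).subs({g1: g[0], g2: g[1]})
dg_dx = g.jacobian([x1, x2])

print(sp.simplify(direct - df_dg * dg_dx))  # zero matrix, so they agree
```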
We can simplify further because, for many applications, the Jacobians are square (m = n) and the off-diagonal entries are zero.
Resources
1. The original paper.
2. Online tools that can differentiate a matrix for you.
3. More matrix calculus.
Linear Algebra
Linear algebra is a form of continuous rather than discrete mathematics, so many computer scientists have little experience with it. A good understanding of linear algebra is essential for understanding and working with many machine learning algorithms, especially deep learning algorithms.
Why Math?
Linear algebra, probability and calculus are the ‘languages’ in which machine learning is formulated. Learning these topics contributes to a deeper understanding of the underlying algorithmic mechanics and allows the development of new algorithms.
At its core, everything behind deep learning is math. So it is essential to understand basic linear algebra before getting started with deep learning and programming it.
The core data structures behind deep learning are scalars, vectors, matrices and tensors. Let’s work through the basic linear algebra operations on each of these programmatically.
Scalars
Scalars are single numbers and are an example of a 0th-order tensor. The notation x ∈ ℝ states that x is a scalar belonging to the set of real-valued numbers, ℝ.
There are different sets of numbers of interest in deep learning. ℕ represents the set of positive integers (1,2,3,…). ℤ designates the integers, which combine positive, negative and zero values. ℚ represents the set of rational numbers that may be expressed as a fraction of two integers.
A few of Python’s built-in scalar types are int, float, complex, bytes and str (Unicode). NumPy, a Python library, adds 24 fundamental data types to describe different kinds of scalars. For information regarding data types, refer to the documentation here.
Defining Scalars and a Few Operations in Python:
The following code snippet shows a few arithmetic operations on scalars.
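Since the original snippet is not reproduced here, below is a minimal sketch of what such scalar arithmetic might look like:

```python
# Defining scalars and a few arithmetic operations on them.
a = 5        # int scalar
b = 7.5      # float scalar

print(a + b)   # 12.5
print(a - b)   # -2.5
print(a * b)   # 37.5
print(a / b)   # 0.666...
```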
The following code snippet checks whether a given variable is a scalar or not.
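A minimal sketch of such a check using NumPy’s np.isscalar:

```python
import numpy as np

print(np.isscalar(3.1))          # True  - a plain float is a scalar
print(np.isscalar([3.1]))        # False - a list is not a scalar
print(np.isscalar(np.array(3)))  # False - a 0-d array is not considered a scalar
```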
Vectors
Vectors are ordered arrays of single numbers and are an example of a 1st-order tensor. Vectors are members of objects known as vector spaces. A vector space can be thought of as the entire collection of all possible vectors of a particular length (or dimension). The three-dimensional real-valued vector space, denoted by ℝ³, is often used to represent our real-world notion of three-dimensional space mathematically.
To identify the necessary component of a vector explicitly, the ith scalar element of a vector is written as x[i].
In deep learning, vectors usually represent feature vectors, with each component specifying how relevant a particular feature is. Such elements could include the relative importance of the intensity of a set of pixels in a two-dimensional image or historical price values for a cross-section of financial instruments.
Defining Vectors and a Few Operations in Python:
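A minimal NumPy sketch of defining vectors and a few element-wise operations (the exact snippet from the original is not shown here):

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

print(x + y)   # [5 7 9]    element-wise addition
print(x - y)   # [-3 -3 -3] element-wise subtraction
print(3 * x)   # [3 6 9]    scalar multiplication
print(x[1])    # 2          indexing a single element
```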
Vector multiplication
There are three common types of vector multiplication: the dot product, the cross product and the Hadamard product.
Dot product
The dot product of two vectors is a scalar. The dot product of vectors and matrices (matrix multiplication) is one of the most important operations in deep learning.
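A minimal NumPy sketch of the dot product:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# 1*4 + 2*5 + 3*6 = 32, a single scalar
print(np.dot(x, y))  # 32
print(x @ y)         # 32, the same thing with the @ operator
```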
Hadamard product
The Hadamard product is element-wise multiplication, and it outputs a vector.
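A minimal sketch of the Hadamard product of two vectors in NumPy:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# Element-wise multiplication produces another vector.
print(x * y)              # [ 4 10 18]
print(np.multiply(x, y))  # [ 4 10 18], equivalent
```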
Cross product
The cross product is a multiplication of two vectors that outputs another vector. It is not commutative but anti-commutative: a × b = −(b × a).
The cross product a × b can be written as the determinant of a 3 × 3 matrix whose first row holds the unit vectors i, j, k and whose remaining rows hold the components of a and b. This determinant can be computed using Sarrus’s rule or cofactor expansion. Using Sarrus’s rule, it expands to a × b = (a₂b₃ − a₃b₂)i + (a₃b₁ − a₁b₃)j + (a₁b₂ − a₂b₁)k. Using cofactor expansion along the first row instead, it expands to a × b = (a₂b₃ − a₃b₂)i − (a₁b₃ − a₃b₁)j + (a₁b₂ − a₂b₁)k, which gives the components of the resulting vector directly.
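A minimal NumPy sketch of the cross product, checked against the component formula above:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(np.cross(a, b))  # [-3  6 -3]

# Same result from the component formula
# (a2*b3 - a3*b2, a3*b1 - a1*b3, a1*b2 - a2*b1).
manual = np.array([a[1]*b[2] - a[2]*b[1],
                   a[2]*b[0] - a[0]*b[2],
                   a[0]*b[1] - a[1]*b[0]])
print(manual)          # [-3  6 -3]

# Anti-commutativity: b x a = -(a x b).
print(np.cross(b, a))  # [ 3 -6  3]
```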
Vector fields
A vector field shows how far the point (x,y) would hypothetically move if we applied a vector function to it, such as addition or multiplication. Given a point in space, a vector field shows the magnitude and direction of our proposed change at a variety of points in a graph.
This vector field is an interesting one since it moves in different directions depending on the starting point. The reason is that the vector function behind this field involves terms like 2x or x² instead of scalar values like -2 and 5. For each point on the graph, we plug the x-coordinate into 2x or x² and draw an arrow from the starting point to the new location. Vector fields are extremely useful for visualizing machine learning techniques like gradient descent.
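As an illustrative sketch (not from the original post), the following Matplotlib code draws the negative-gradient vector field of f(x, y) = x² + y², which is the direction gradient descent would move at each point:

```python
import numpy as np
import matplotlib.pyplot as plt

x, y = np.meshgrid(np.linspace(-2, 2, 20), np.linspace(-2, 2, 20))

# Negative gradient of f(x, y) = x^2 + y^2 is (-2x, -2y).
u, v = -2 * x, -2 * y

plt.quiver(x, y, u, v)
plt.title("Negative gradient field of f(x, y) = x^2 + y^2")
plt.show()
```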
Matrices
Matrix dimensions
We describe the dimensions of a matrix in terms of rows by columns.
Matrices are rectangular arrays consisting of numbers and are an example of 2nd-order tensors. If m and n are positive integers, that is m, n ∈ ℕ, then the m × n matrix contains m·n numbers, arranged in m rows and n columns.
The full m×n matrix can be written as:
It is often useful to abbreviate the full matrix component display into the following expression: A = [aij]m×n, where aij is the entry in the ith row and jth column.
In Python, we use the NumPy library, which helps us create n-dimensional arrays, which are essentially matrices. We define a matrix by passing lists of values to the matrix (or array) method.
Defining Matrices and a Few Operations in Python:
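A minimal sketch of defining a matrix with NumPy (the original snippet is not reproduced here):

```python
import numpy as np

# A 2 x 3 matrix built from a list of lists.
A = np.array([[1, 2, 3],
              [4, 5, 6]])

print(A.shape)  # (2, 3) - 2 rows, 3 columns
print(A[0, 1])  # 2      - entry in row 0, column 1
```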
Matrix Addition
Matrices can be added to scalars, vectors and other matrices. Each of these operations has a precise definition. These techniques are used frequently in machine learning and deep learning so it is worth familiarising yourself with them.
Matrix-Matrix Addition
C = A + B (Shape of A and B should be equal)
The shape method returns the shape of a matrix, and add takes two matrices as arguments and returns their sum. If the shapes of the matrices are not the same, it raises an error saying addition is not possible.
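A minimal NumPy sketch of matrix-matrix addition with a shape check:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

if A.shape == B.shape:
    C = np.add(A, B)   # same as A + B
    print(C)           # [[ 6  8] [10 12]]
else:
    print("addition not possible")
```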
Matrix Scalar Multiplication
Multiplies the given scalar by every element of the matrix.
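A minimal sketch in NumPy:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])

# Every element of A is multiplied by the scalar 3.
print(3 * A)  # [[ 3  6] [ 9 12]]
```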
Matrix Hadamard product
Hadamard product of matrices is an elementwise operation. Values that correspond positionally are multiplied to produce a new matrix.
In numpy you can take the Hadamard product of a matrix and vector as long as their dimensions meet the requirements of broadcasting.
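A minimal sketch of both cases in NumPy:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Element-wise (Hadamard) product of two matrices of the same shape.
print(A * B)  # [[ 5 12] [21 32]]

# Hadamard product of a matrix and a vector via broadcasting:
# the (2,) vector is stretched across the rows of the (2, 2) matrix.
v = np.array([10, 100])
print(A * v)  # [[ 10 200] [ 30 400]]
```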
Matrix Multiplication
Multiplying A of shape (m × n) by B of shape (n × p) gives C of shape (m × p).
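A minimal NumPy sketch:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])    # shape (2, 3)
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])       # shape (3, 2)

C = A @ B                    # same as np.matmul(A, B)
print(C)        # [[ 4  5] [10 11]]
print(C.shape)  # (2, 2)
```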
Matrix Transpose
With transposition you can convert a row vector to a column vector and vice versa:
A = [aij]m×n
Aᵀ = [aji]n×m
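A minimal NumPy sketch:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)

print(A.T)        # [[1 4] [2 5] [3 6]]
print(A.T.shape)  # (3, 2)
```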
Numpy broadcasting
In numpy the dimension requirements for elementwise operations are relaxed via a mechanism called broadcasting. Two matrices are compatible if the corresponding dimensions in each matrix (rows vs rows, columns vs columns) meet the following requirements:
- The dimensions are equal, or
- One dimension is of size 1
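For example, here is a minimal sketch showing both rules at work:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])      # shape (2, 3)
row = np.array([10, 20, 30])   # shape (3,)   -> broadcast across rows
col = np.array([[1], [2]])     # shape (2, 1) -> broadcast across columns

print(A + row)  # [[11 22 33] [14 25 36]]
print(A * col)  # [[ 1  2  3] [ 8 10 12]]
```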
Tensors
The more general entity of a tensor encapsulates the scalar, vector and the matrix. It is sometimes necessary — both in the physical sciences and machine learning — to make use of tensors with order that exceeds two.
We use Python libraries like TensorFlow or PyTorch to declare tensors, rather than nesting matrices.
To define a simple tensor in PyTorch:
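A minimal sketch (assuming PyTorch is installed):

```python
import torch

# A 2 x 2 x 2 tensor (a 3rd-order tensor) built from nested lists.
t = torch.tensor([[[1, 2], [3, 4]],
                  [[5, 6], [7, 8]]])

print(t.shape)  # torch.Size([2, 2, 2])
```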
A Few Arithmetic Operations on Tensors in Python:
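A minimal sketch of a few arithmetic operations on tensors in PyTorch:

```python
import torch

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[5., 6.], [7., 8.]])

print(a + b)   # element-wise addition
print(a * b)   # element-wise (Hadamard) product
print(a @ b)   # matrix multiplication: [[19., 22.], [43., 50.]]
```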
Basic Probability Theory and Statistics
I want to discuss some very fundamental terms and concepts related to probability and statistics that one often comes across in literature related to machine learning and AI.
Random Experiment
A random experiment is a physical situation whose outcome cannot be predicted until it is observed.
Sample Space
A sample space is the set of all possible outcomes of a random experiment.
Random Variables
A random variable is a variable whose possible values are numerical outcomes of a random experiment. There are two types of random variables.
1. A discrete random variable is one which may take on only a countable number of distinct values such as 0, 1, 2, 3, 4, … Discrete random variables are usually (but not necessarily) counts.
2. A continuous random variable is one which takes an infinite number of possible values. Continuous random variables are usually measurements.
Probability
Probability is the measure of the likelihood that an event will occur in a Random Experiment. Probability is quantified as a number between 0 and 1, where, loosely speaking, 0 indicates impossibility and 1 indicates certainty. The higher the probability of an event, the more likely it is that the event will occur.
Example
A simple example is the tossing of a fair (unbiased) coin. Since the coin is fair, the two outcomes (“heads” and “tails”) are both equally probable; the probability of “heads” equals the probability of “tails”; and since no other outcomes are possible, the probability of either “heads” or “tails” is 1/2 (which could also be written as 0.5 or 50%).
Conditional Probability
Conditional Probability is a measure of the probability of an event given that (by assumption, presumption, assertion or evidence) another event has already occurred. If the event of interest is A and the event B is known or assumed to have occurred, “the conditional probability of A given B”, is usually written as P(A|B).
Independence
Two events are said to be independent of each other if the probability that one event occurs in no way affects the probability of the other event occurring; in other words, having an observation about one event does not affect the probability of the other. For independent events A and B, the following is true: P(A ∩ B) = P(A) · P(B).
Example
Let’s say you rolled a die and flipped a coin. The probability of getting any number face on the die in no way influences the probability of getting a head or a tail on the coin.
Conditional Independence
Two events A and B are conditionally independent given a third event C precisely if the occurrence of A and the occurrence of B are independent events in their conditional probability distribution given C. In other words, A and B are conditionally independent given C if and only if, given knowledge that C already occurred, knowledge of whether A occurs provides no additional information on the likelihood of B occurring, and knowledge of whether B occurs provides no additional information on the likelihood of A occurring.
Example
A box contains two coins: a regular coin and a fake two-headed coin (P(H) = 1). I choose a coin at random and toss it twice.
Let
A = the first coin toss results in an H.
B = the second coin toss results in an H.
C = coin 1 (the regular coin) has been selected.
If C is already observed, i.e. we already know whether the regular coin was selected or not, then the events A and B become independent, as the outcome of one toss does not affect the outcome of the other.
Expectation
The expectation of a random variable X is written as E(X). If we observe N random values of X, then the mean of the N values will be approximately equal to E(X) for large N. In more concrete terms, the expectation is what you would expect the outcome of an experiment to be on average if you repeated the experiment a large number of times.
Variance
The variance of a random variable X is a measure of how concentrated the distribution of X is around its mean. It is defined as Var(X) = E[(X − E(X))²].
The square root of the variance is known as the standard deviation.
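A minimal NumPy sketch estimating the mean, variance and standard deviation from samples (the numbers are chosen only for illustration):

```python
import numpy as np

samples = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(np.mean(samples))  # 5.0 - sample estimate of E(X)
print(np.var(samples))   # 4.0 - sample estimate of Var(X)
print(np.std(samples))   # 2.0 - standard deviation, sqrt of the variance
```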
Probability Distribution
A probability distribution is a mathematical function that maps all possible outcomes of a random experiment to their associated probabilities. Its form depends on whether the random variable X is discrete or continuous.
1. Discrete Probability Distribution: the mathematical definition of a discrete probability function, p(x), is a function that satisfies p(x) ≥ 0 for every outcome x and Σ p(x) = 1. This is referred to as the probability mass function (PMF).
2. Continuous Probability Distribution: the mathematical definition of a continuous probability function, f(x), is a function that satisfies f(x) ≥ 0 for all x and ∫ f(x) dx = 1 over the whole range. This is referred to as the probability density function (PDF).
Joint Probability Distribution
If X and Y are two random variables, the probability distribution that defines their simultaneous behaviour during outcomes of a random experiment is called a joint probability distribution. The joint distribution function of X and Y is defined as F(x, y) = P(X ≤ x, Y ≤ y).
Conditional Probability Distribution (CPD)
If Z is a random variable that is dependent on other variables X and Y, then the distribution P(Z|X,Y) is called the CPD of Z with respect to X and Y. It means that for every possible combination of the random variables X and Y, we represent a probability distribution over Z.
Example
There is a student who has a property called ‘Intelligence’, which can be either low (I_0) or high (I_1). He or she enrolls in a course; the course has a property called ‘Difficulty’, which can take the binary values easy (D_0) or difficult (D_1). The student gets a ‘Grade’ in the course based on his or her performance, and the grade can take three values: G_1 (best), G_2, or G_3 (worst). Then the CPD P(G|I,D) specifies a probability distribution over the grade for each of the four combinations of I and D.
There are a number of operations that one can perform over any probability distribution to get interesting results. Some of the important operations are as below.
1. Conditioning/Reduction
If we have a probability distribution over n random variables X1, X2, …, Xn and we make an observation that k of the variables took certain values a1, a2, …, ak, it means we already know their assignment. Then the rows in the joint distribution which are not consistent with the observation can simply be removed, and that leaves us with a smaller number of rows. This operation is known as reduction.
2. Marginalisation
This operation takes a probability distribution over a large set of random variables and produces a probability distribution over a smaller subset of the variables. This operation is known as marginalising a subset of random variables. It is very useful when we have a large set of random variables as features and we are interested in a smaller set of variables and how it affects the output. For example, marginalising Y out of a joint distribution P(X, Y) gives P(X) = Σ_y P(X, Y = y).
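A minimal sketch of marginalisation over a small, made-up joint table:

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) for X in {0, 1} and Y in {0, 1, 2},
# stored as a 2 x 3 table whose entries sum to 1.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])

# Marginalising Y out: P(X) = sum over y of P(X, Y = y).
p_x = p_xy.sum(axis=1)
print(p_x)  # [0.4 0.6]
```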
Factor
A factor is a function or a table which takes a number of random variables {X_1, X_2, …, X_n} as arguments and produces a real number as output. The set of input random variables is called the scope of the factor. For example, a joint probability distribution is a factor which takes all possible combinations of random variables as input and produces a probability value for that assignment, which is a real number. Factors are the fundamental building block for representing distributions in high dimensions, and they support all the basic operations that joint distributions can be operated on with, such as product, reduction and marginalisation.
Factor Product
We can take the product of factors, and the result is also a factor. For example, multiplying a factor φ₁(A, B) with a factor φ₂(B, C) gives a factor φ₃(A, B, C) = φ₁(A, B) · φ₂(B, C), whose scope is the union of the two input scopes.
Bayes’ Rule
Bayes’ theorem is a formula that describes how to update the probabilities of hypotheses when given evidence. It follows simply from the axioms of conditional probability, but can be used to powerfully reason about a wide range of problems involving belief updates. We often find ourselves in a situation where we know P(y | x) and need to know P(x | y). Fortunately, if we also know P(x), we can compute the desired quantity as P(x | y) = P(y | x) P(x) / P(y).
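A minimal numeric sketch of the formula with made-up probabilities:

```python
# Hypothetical values, purely for illustration.
p_y_given_x = 0.9   # likelihood  P(y | x)
p_x = 0.01          # prior       P(x)
p_y = 0.05          # evidence    P(y)

p_x_given_y = p_y_given_x * p_x / p_y   # Bayes' rule
print(p_x_given_y)  # approximately 0.18
```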
Common Probability Distributions
Some of the common probability distributions used in machine learning are as follows:
Bernoulli Distribution : It is a distribution over a single binary random variable. It is controlled by a single parameter φ ∈ [0,1], which gives the probability of the random variable being equal to 1.
Multinoulli Distribution : The multinoulli, or categorical, distribution is a distribution over a single discrete variable with k different states, where k is finite. Multinoulli distributions are often used to refer to distributions over categories of objects.
Gaussian Distribution : The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution.
The two parameters µ ∈ ℝ and σ ∈ (0, ∞) control the normal distribution. The parameter µ gives the coordinate of the central peak; this is also the mean of the distribution: E[x] = µ. The standard deviation of the distribution is given by σ, and the variance by σ².
Prerequisites Tests