💊 Most machine learning research is about going from mathematical modeling to ML model implementation. Here’s how to go from conditional probability to a neural architecture.
Let's start by defining a simple conditional probability problem. Consider a supervised learning task where we have input data X and target data Y, and we want to model the conditional probability P(Y | X), meaning the probability of Y given X.
A common way to model this in machine learning is to assume that this probability follows some parametric form and then use the data to estimate the parameters of this model.
For instance, we could assume that P(Y | X) is a Gaussian distribution with mean µ(X) and standard deviation σ(X). These could be any functions of X (with σ(X) > 0), but in order to learn them from data, we usually assume they are governed by a set of parameters θ and are differentiable with respect to those parameters.
This is where neural networks come in. A neural network is just a function approximator that's highly flexible and differentiable, making it suitable to represent these functions µ(X) and σ(X).
Let's assume that our neural network is a simple feed-forward network with parameters θ. Then we can write our model as:
µ(X; θ) = NN_µ(X; θ)
σ(X; θ) = NN_σ(X; θ)
P(Y | X; θ) = N(Y; NN_µ(X; θ), NN_σ(X; θ)^2)
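Here, NN_µ and NN_σ are two neural networks which take the same input X, θ collects all of their parameters (in practice they are often two output heads on one shared network), and N is the Gaussian distribution. Their outputs represent the mean and standard deviation of the Gaussian distribution of Y given X. Because a standard deviation must be positive, NN_σ typically ends with a positivity-enforcing function such as softplus or exp.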
To train this model, we would use a method called maximum likelihood estimation (MLE), which aims to find the parameters θ that maximize the likelihood of the observed data.
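For our Gaussian model, this corresponds to minimizing the negative log-likelihood, which for each example is log NN_σ(X; θ) + (Y - NN_µ(X; θ))^2 / (2 NN_σ(X; θ)^2), plus a constant. Only when σ is held fixed does this reduce to minimizing the mean squared error between Y and NN_µ(X; θ).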
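A quick numeric check of the link between the Gaussian likelihood and mean squared error, using only the Python standard library. The data values and the helper names `gaussian_nll` and `mse` are made up for illustration. The sketch assumes σ is the same fixed constant for every example, in which case the average negative log-likelihood is a constant plus MSE / (2σ^2).

```python
import math


def gaussian_nll(y, mu, sigma):
    """Average negative log-likelihood of targets y under N(mu_i, sigma_i^2)."""
    total = 0.0
    for yi, mi, si in zip(y, mu, sigma):
        total += 0.5 * math.log(2 * math.pi * si**2) + (yi - mi) ** 2 / (2 * si**2)
    return total / len(y)


def mse(y, mu):
    """Mean squared error between targets and predicted means."""
    return sum((yi - mi) ** 2 for yi, mi in zip(y, mu)) / len(y)


# Illustrative values only (not from any real dataset)
y = [1.0, 2.0, 3.0]
mu = [1.5, 1.5, 2.0]
sigma_fixed = 0.5

nll = gaussian_nll(y, mu, [sigma_fixed] * len(y))

# With a fixed sigma: NLL = 0.5 * log(2 * pi * sigma^2) + MSE / (2 * sigma^2)
constant = 0.5 * math.log(2 * math.pi * sigma_fixed**2)
nll_from_mse = constant + mse(y, mu) / (2 * sigma_fixed**2)
```

Since the constant and the 1 / (2σ^2) factor do not depend on θ, minimizing this loss and minimizing MSE pick the same µ. Once σ is also predicted by the network, the two objectives no longer coincide.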
Below, you can see how we might implement this in code using PyTorch.
In this code, we have a neural network that outputs two values for each input: a mean and a standard deviation. The loss function is defined as the negative log-likelihood of the Gaussian distribution, which we try to minimize using gradient descent.
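Here is a minimal PyTorch sketch of the model described above. It is one possible implementation, not a definitive one. Assumptions that are not in the original post: the class name `GaussianMLP`, the shared trunk with two heads, the softplus used to keep σ positive, the synthetic toy data, and the use of Adam (a gradient-based optimizer) in place of plain gradient descent.

```python
import torch
import torch.nn as nn


class GaussianMLP(nn.Module):
    """Feed-forward net with a shared trunk and two heads: mean and std."""

    def __init__(self, in_dim: int = 1, hidden: int = 32):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.mean_head = nn.Linear(hidden, 1)
        self.raw_std_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.trunk(x)
        mean = self.mean_head(h)
        # softplus keeps the standard deviation strictly positive
        std = nn.functional.softplus(self.raw_std_head(h)) + 1e-6
        return mean, std


def gaussian_nll(y, mean, std):
    """Average negative log-likelihood of y under N(mean, std^2)."""
    return -torch.distributions.Normal(mean, std).log_prob(y).mean()


torch.manual_seed(0)

# Synthetic toy data (illustrative assumption): noise grows with |x|
x = torch.linspace(-2.0, 2.0, 256).unsqueeze(1)
y = torch.sin(2.0 * x) + (0.1 + 0.2 * x.abs()) * torch.randn_like(x)

model = GaussianMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

losses = []
for step in range(500):
    optimizer.zero_grad()
    mean, std = model(x)
    loss = gaussian_nll(y, mean, std)  # maximizing likelihood = minimizing NLL
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

After training, `model(x)` returns a predicted mean and standard deviation for each input, so the network reports its own uncertainty alongside the point estimate.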
💡 Get technical insights just like this to help you become a better ML practitioner here:
lnkd.in/dSgjuK5Z
#machinelearning #conditionalprobability