Math is All You Need
Last updated: April 12, 2024 (evening)
Study notes on the mathematics behind ML
Contents
math4ML
Vector Calculus
Partial Differentiation and Gradients
Gradient:
Chain Rule
A function of two variables x1, x2, where both x1 and x2 are themselves functions of t.
If x1 is itself a function of multiple variables, the chain rule takes the same form.
Therefore, the derivatives with respect to the variables can be written in matrix form.
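As a sketch in standard notation (not reproduced from the original equations), the matrix form of the chain rule for the two-variable case above reads:

$$
\frac{\mathrm{d}f}{\mathrm{d}t}
= \frac{\partial f}{\partial \boldsymbol{x}}\frac{\partial \boldsymbol{x}}{\partial t}
= \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} \end{bmatrix}
  \begin{bmatrix} \dfrac{\partial x_1}{\partial t} \\[4pt] \dfrac{\partial x_2}{\partial t} \end{bmatrix}
= \frac{\partial f}{\partial x_1}\frac{\partial x_1}{\partial t}
+ \frac{\partial f}{\partial x_2}\frac{\partial x_2}{\partial t}
$$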
Gradients of Vector-Valued Functions
Now we consider vector-valued functions.
scaling
Reparametrization trick?
The reparametrization trick is a technique commonly used in deep learning models such as Variational Autoencoders (VAEs) to deal with the problem that gradients cannot be backpropagated through random nodes.
In VAEs, we want to optimize the parameters of the latent variable's distribution via backpropagation. However, since the latent variable is sampled from some distribution (e.g. a Gaussian), the sampling step is random and cannot be backpropagated through directly.
The idea of the reparametrization trick is to turn this random step into a deterministic one. Concretely, suppose we want the latent variable z to follow a Gaussian with parameters μ and σ. We first sample ε from a standard normal distribution and then set z = μ + σε. The problem of backpropagating through z then becomes backpropagating to μ and σ, which can be optimized by backpropagation. (In fact, these mappings all operate column-to-column.)
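Below is a minimal PyTorch-style sketch of the trick. It is illustrative only; the names `sample_latent`, `mu`, and `log_var` are my own assumptions, not taken from the text above.

```python
import torch

def sample_latent(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Reparametrization trick: z = mu + sigma * eps, with eps ~ N(0, I).

    The randomness lives entirely in eps, so gradients can flow back to mu and log_var.
    """
    std = torch.exp(0.5 * log_var)   # sigma = exp(log_var / 2)
    eps = torch.randn_like(std)      # eps ~ N(0, I), no gradient needed
    return mu + std * eps            # differentiable w.r.t. mu and log_var

# Usage: mu and log_var would normally come from an encoder network.
mu = torch.zeros(4, requires_grad=True)
log_var = torch.zeros(4, requires_grad=True)
z = sample_latent(mu, log_var)
z.sum().backward()                   # gradients reach mu and log_var
```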
e.g. Gradient of a Least-Squares Loss
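For reference, a standard way of writing this example (my own notation, since the original equations are not shown): with the error e = y − Φθ,

$$
L(\boldsymbol{\theta}) = \lVert \boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{\theta} \rVert^{2},
\qquad
\frac{\partial L}{\partial \boldsymbol{\theta}}
= \frac{\partial L}{\partial \boldsymbol{e}}\frac{\partial \boldsymbol{e}}{\partial \boldsymbol{\theta}}
= -2\,(\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{\theta})^{\top}\boldsymbol{\Phi}
$$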
Gradients of Matrices
An interesting point is that, since matrices are linear mappings, we can treat them as vectors (e.g. by stacking their columns) when computing gradients.
There are two ways to understand this:


For the first case, consider the following:
To solve it,
so
For
Backpropagation and Automatic Differentiation
Probability and Distributions
Discrete and Continuous Probabilities
Quantity of interest: some characteristic of the distribution of a population random variable.
Posterior
Product rule: a joint probability distribution can be factorized into conditionals.
In Bayesian statistical analysis, after observing some random variables we are interested in inferring the latent (unobserved) random variables. That is, we assume we have prior knowledge p(x) about an unobserved random variable x; the quantity of interest is then the posterior.
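In symbols (standard notation, added for completeness), the product rule and Bayes' theorem for a latent x and an observation y are

$$
p(\boldsymbol{x}, \boldsymbol{y}) = p(\boldsymbol{y} \mid \boldsymbol{x})\, p(\boldsymbol{x}),
\qquad
p(\boldsymbol{x} \mid \boldsymbol{y}) = \frac{p(\boldsymbol{y} \mid \boldsymbol{x})\, p(\boldsymbol{x})}{p(\boldsymbol{y})}
$$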
Summary Statistics and Independence
Means and Covariances
multivariate
Covariance
The variance is also a special case of the covariance matrix.
Empirical Means and Covariances
In ML, we need to learn from empirical observations of data. There are two conceptual steps to go from population statistics to the realization of empirical statistics:
1. Use a finite dataset to construct an empirical statistic that is a function of a finite number of identical random variables.
2. Observe the data.
Empirical Mean and Covariance:
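A small NumPy sketch of how these empirical statistics are computed in practice; the dataset `X` below is a made-up placeholder:

```python
import numpy as np

# N observations of a D-dimensional random variable, one per row (placeholder data)
X = np.random.default_rng(0).normal(size=(1000, 3))   # shape (N, D)

mean_emp = X.mean(axis=0)                              # empirical mean, shape (D,)
centered = X - mean_emp
cov_emp = centered.T @ centered / X.shape[0]           # empirical covariance, shape (D, D)

# np.cov defaults to the unbiased 1/(N-1) normalization; bias=True gives 1/N
assert np.allclose(cov_emp, np.cov(X, rowvar=False, bias=True))
```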
Sums and Transformations of Random Variables
Conditional independence: two random variables X and Y are conditionally independent given Z if and only if p(x, y | z) = p(x | z) p(y | z) for all z.
Gaussian Distribution
Multivariate Gaussian
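The density, written out in its standard form for reference:

$$
p(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
= (2\pi)^{-\frac{D}{2}}\, \lvert\boldsymbol{\Sigma}\rvert^{-\frac{1}{2}}
  \exp\!\left( -\tfrac{1}{2}\, (\boldsymbol{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) \right)
$$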
Marginals and Conditionals of Gaussians are Gaussians

Here, the distribution of y is fixed, and we are looking at the distribution of x given y.
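As a reminder of the standard result (notation mine, since the original equations are missing): if x and y are jointly Gaussian with mean blocks μx, μy and covariance blocks Σxx, Σxy, Σyy, the conditional is again Gaussian,

$$
p(\boldsymbol{x} \mid \boldsymbol{y})
= \mathcal{N}\!\left(
\boldsymbol{\mu}_x + \boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}_{yy}^{-1}(\boldsymbol{y} - \boldsymbol{\mu}_y),\;
\boldsymbol{\Sigma}_{xx} - \boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}_{yy}^{-1}\boldsymbol{\Sigma}_{yx}
\right)
$$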
Product of Gaussian Densities
given that
Useful Properties of Common Functions
DL-math
When Models Meet Data
One of the guiding principles of machine learning is that good models should perform well on unseen data.
Learning is Finding Parameters
- When the model is used as a function, the prediction phase is called prediction; when the model is a probabilistic model, the prediction phase is called inference.
- Training or parameter estimation: finding the best predictor based on some measure of quality (a point estimate), or using Bayesian inference.
Empirical Risk Minimization
Hypothesis Class of Functions
idea: What is the set of functions we allow the predictor to take?
Consider the following:
We also make the following assumption:
- The set of examples is independent and identically distributed (i.i.d.).
Under this assumption, the expected risk can be approximated by the empirical mean of the loss over the training data.
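Concretely, in standard ERM notation (added here as a sketch, not copied from the original): with predictor f, loss ℓ, and N i.i.d. examples,

$$
R_{\text{emp}}(f, \boldsymbol{X}, \boldsymbol{y})
= \frac{1}{N} \sum_{n=1}^{N} \ell\big(y_n, f(\boldsymbol{x}_n)\big)
\;\approx\; \mathbb{E}_{(\boldsymbol{x}, y)}\big[\ell\big(y, f(\boldsymbol{x})\big)\big]
$$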
Regularization to Reduce Overfitting: the training set is used to fit the model, while the test set is used to evaluate generalization performance.
Parameter Estimation
MLE
It is a distribution that models the uncertainty of the data for a given parameter setting. For a given dataset x, the likelihood allows us to express preferences about different settings of the parameters θ, and we can choose the setting that is more likely to have generated the data.
The maximum likelihood estimate
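In formulas (the standard form, added for completeness): with i.i.d. data, maximizing the likelihood is equivalent to minimizing the negative log-likelihood,

$$
\boldsymbol{\theta}_{\text{ML}}
= \arg\max_{\boldsymbol{\theta}} p(\mathcal{X} \mid \boldsymbol{\theta})
= \arg\min_{\boldsymbol{\theta}} \left( - \sum_{n=1}^{N} \log p(\boldsymbol{x}_n \mid \boldsymbol{\theta}) \right)
$$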
Maximum A Posteriori Estimation
When prior knowledge about the parameters is available (i.e. a prior distribution p(θ) on θ), we maximize the posterior instead of the likelihood.
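In standard notation (added for reference):

$$
\boldsymbol{\theta}_{\text{MAP}}
= \arg\max_{\boldsymbol{\theta}} p(\boldsymbol{\theta} \mid \mathcal{X})
= \arg\max_{\boldsymbol{\theta}} p(\mathcal{X} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})
$$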
Probabilistic Modeling and Inference
To make this task more tractable, we often build models that describe the generative process that generates the observed data.
Probabilistic Models
In probabilistic modeling, the joint distribution p(x, θ) of the observed variables x and the hidden parameters θ is of central importance. It encapsulates information from the following:
1. The prior and the likelihood.
2. The marginal likelihood p(x), which will play an important role in model selection and can be computed by taking the joint distribution and integrating out the parameters.
3. The posterior, which can be obtained by dividing the joint by the marginal likelihood.
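In symbols (standard relations, added here for completeness):

$$
p(\boldsymbol{x}, \boldsymbol{\theta}) = p(\boldsymbol{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta}),
\qquad
p(\boldsymbol{x}) = \int p(\boldsymbol{x}, \boldsymbol{\theta})\, \mathrm{d}\boldsymbol{\theta},
\qquad
p(\boldsymbol{\theta} \mid \boldsymbol{x}) = \frac{p(\boldsymbol{x}, \boldsymbol{\theta})}{p(\boldsymbol{x})}
$$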
But focusing solely on some statistic of the posterior distribution leads to a loss of information.
math4diffusion
math-base
base
Parameters in a probabilistic model: we separate the random variable X from the parameters.
Information
Information is what removes random uncertainty; the amount of information is measured by how much uncertainty it removes. The amount of information is inversely related to the probability of the event: the higher the probability, the less information.
Information content (self-information):
Information entropy: the expectation of the information content over all outcomes.
We define the second term as the cross-entropy, so that KL divergence = cross-entropy − information entropy.
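The standard definitions, written out for reference:

$$
I(x) = -\log p(x), \qquad
H(p) = -\sum_{x} p(x)\log p(x), \qquad
H(p, q) = -\sum_{x} p(x)\log q(x)
$$

$$
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)} = H(p, q) - H(p)
$$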
Note
- The true probability distribution p(x) is fixed; it is the label distribution we specify, and we use it to supervise the predicted distribution q(x).
- We use the value of the KL divergence to measure the difference between the true distribution p(x) and the predicted distribution q(x); the smaller the value, the better the prediction. Since the KL divergence is just the cross-entropy minus a constant (the information entropy), directly minimizing the cross-entropy is equivalent to minimizing the KL divergence (see the small numerical check below).
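A tiny NumPy check of this relationship; the distributions `p` and `q` below are arbitrary placeholders:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])           # "true" label distribution (placeholder)
q = np.array([0.5, 0.3, 0.2])           # predicted distribution (placeholder)

entropy = -np.sum(p * np.log(p))         # H(p), a constant w.r.t. q
cross_entropy = -np.sum(p * np.log(q))   # H(p, q)
kl = np.sum(p * np.log(p / q))           # D_KL(p || q)

# KL = cross-entropy - entropy, so minimizing H(p, q) over q minimizes the KL
assert np.isclose(kl, cross_entropy - entropy)
```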
Jensen's inequality
Convex functions
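The inequality in standard form (added for reference):

$$
f\big(\mathbb{E}[X]\big) \le \mathbb{E}\big[f(X)\big] \quad \text{for convex } f,
$$

with the inequality reversed for concave functions such as the logarithm.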
Energy-based models (EBMs)
For a given dataset, we do not know the form of the underlying distribution, so we cannot write down the likelihood function. However, any probabilistic model can be converted into an energy-based model, so using the energy-based form is a general way of learning a probability distribution.
A commonly used E can be written as a function of the data x and the parameters θ, i.e. E_θ(x).
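In the standard energy-based (Boltzmann/Gibbs) form, added here for reference:

$$
p_{\boldsymbol{\theta}}(\boldsymbol{x}) = \frac{\exp\!\big(-E_{\boldsymbol{\theta}}(\boldsymbol{x})\big)}{Z(\boldsymbol{\theta})},
\qquad
Z(\boldsymbol{\theta}) = \int \exp\!\big(-E_{\boldsymbol{\theta}}(\boldsymbol{x})\big)\, \mathrm{d}\boldsymbol{x}
$$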
Evidence lower bound (ELBO)
For an expression that cannot be maximized directly, we look for a strict lower bound: something with a simpler form that is always less than or equal to the original expression. Because there is an integral inside the logarithm, the expression cannot be expanded inside the log, and the integral itself is intractable, so the log-likelihood cannot be maximized directly to solve for the parameters.
Note:
Such a lower-bound function does exist, and the lower bound becomes equal to the log-likelihood when the variational distribution q(z) is chosen to be the true posterior p(z | x); one can also consider factorizing the joint distribution through the prior.
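One standard way to write the bound (using Jensen's inequality from above; q(z) is an arbitrary distribution over the latent z, and the notation is mine):

$$
\log p_{\boldsymbol{\theta}}(\boldsymbol{x})
= \log \int p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})\, \mathrm{d}\boldsymbol{z}
= \log \mathbb{E}_{q(\boldsymbol{z})}\!\left[\frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})}{q(\boldsymbol{z})}\right]
\;\ge\; \mathbb{E}_{q(\boldsymbol{z})}\!\left[\log \frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})}{q(\boldsymbol{z})}\right]
=: \mathrm{ELBO}(q, \boldsymbol{\theta})
$$

The inequality is Jensen's (the logarithm is concave), and it becomes an equality when q(z) = p_θ(z | x).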
score-base
We need to find a score function (SF) that closely matches that of the true data distribution.
The second term can be simplified to

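The original equation is not shown here; for reference, the standard score-matching identity (Hyvärinen, 2005), which is presumably the simplification referred to above, is as follows. With the score defined as s(x) = ∇x log p(x) and a model score s_θ,

$$
\frac{1}{2}\,\mathbb{E}_{p(\boldsymbol{x})}\!\left[\left\| \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}) - \nabla_{\boldsymbol{x}}\log p(\boldsymbol{x}) \right\|^{2}\right]
= \mathbb{E}_{p(\boldsymbol{x})}\!\left[\frac{1}{2}\left\| \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}) \right\|^{2}
+ \operatorname{tr}\!\big(\nabla_{\boldsymbol{x}} \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x})\big)\right] + \text{const},
$$

where the cross term is rewritten via integration by parts, removing the dependence on the unknown true score.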
model
math4DL