Math is All you need


Notes on mathematics for machine learning

MML



math4ML

Vector Calculus

Partial Differentiation and Gradients

Gradient: for $f:\mathbb{R}^n \to \mathbb{R}$ we collect the partial derivatives into a row vector, $\nabla_{\mathbf{x}} f = \frac{\mathrm{d}f}{\mathrm{d}\mathbf{x}} = \left[\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right] \in \mathbb{R}^{1\times n}$. Differentiation with respect to $\mathbf{x}$ follows the usual rules of differentiation (sum, product, and chain rules).

Chain Rule

Consider a function of two variables $x_1, x_2$, where both of them are functions of $t$; then $\frac{\mathrm{d}f}{\mathrm{d}t} = \frac{\partial f}{\partial x_1}\frac{\mathrm{d}x_1}{\mathrm{d}t} + \frac{\partial f}{\partial x_2}\frac{\mathrm{d}x_2}{\mathrm{d}t}$.

If $x_1$ and $x_2$ are themselves functions of several variables (say $s$ and $t$), the chain rule takes the same form.

Therefore, the derivative with respect to the variables can be written in matrix form: $\frac{\mathrm{d}f}{\mathrm{d}(s,t)} = \frac{\partial f}{\partial \mathbf{x}}\,\frac{\partial \mathbf{x}}{\partial (s,t)}$.
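A minimal numerical sanity check of the chain rule, assuming numpy is available; the function $f$ and the curve $x(t)$ below are my own toy choices, not from the text:

```python
import numpy as np

# Check the multivariate chain rule for f(x1, x2) = x1^2 + 2*x2
# with x1 = sin(t), x2 = cos(t).
def f(x):
    return x[0] ** 2 + 2 * x[1]

def x_of_t(t):
    return np.array([np.sin(t), np.cos(t)])

t = 0.7
x = x_of_t(t)
df_dx = np.array([2 * x[0], 2.0])          # gradient of f w.r.t. x (row vector)
dx_dt = np.array([np.cos(t), -np.sin(t)])  # derivative of x w.r.t. t
analytic = df_dx @ dx_dt                   # chain rule: (df/dx)(dx/dt)

# Finite-difference approximation of d f(x(t)) / dt
h = 1e-6
numeric = (f(x_of_t(t + h)) - f(x_of_t(t - h))) / (2 * h)

print(analytic, numeric)   # the two values should agree to ~1e-6
```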

Gradients of Vector-Valued Functions

Now we consider a vector-valued function $\mathbf{f}:\mathbb{R}^n \to \mathbb{R}^m$, and the corresponding vector of function values is $\mathbf{f}(\mathbf{x}) = [f_1(\mathbf{x}), \dots, f_m(\mathbf{x})]^\top$. Each $f_i:\mathbb{R}^n\to\mathbb{R}$ maps to one dimension, and its gradient can be written as a $1\times n$ row vector. The partial derivative of the vector-valued function with respect to a single $x_i$ is an $m$-dimensional column vector; collecting all of these (equivalently, stacking the row-vector gradients of the $f_i$) gives the final form, the Jacobian $\mathbf{J} = \frac{\mathrm{d}\mathbf{f}}{\mathrm{d}\mathbf{x}} \in \mathbb{R}^{m\times n}$: the collection of all first-order partial derivatives. The amount of scaling due to the transformation of a variable is provided by the determinant of the Jacobian.

Scaling example: suppose a linear map sends the unit-square basis vectors $\mathbf{b}_1, \mathbf{b}_2$ to new vectors $\mathbf{c}_1, \mathbf{c}_2$, and the Jacobian of this map satisfies $|\det(\mathbf{J})| = 3$; then the area of the parallelogram spanned by $C$ is three times greater than the area of the square spanned by $B$. This scaling conclusion is exactly what the Jacobian (determinant) describes.
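A small sketch of this, assuming numpy; the particular vectors $\mathbf{c}_1 = (-2, 1)$, $\mathbf{c}_2 = (1, 1)$ are my own illustrative choice, picked so that $|\det(\mathbf{J})| = 3$:

```python
import numpy as np

# A linear map J sends the unit square spanned by b1=(1,0), b2=(0,1) to the
# parallelogram spanned by the columns c1, c2 of J.
# |det(J)| is the factor by which areas are scaled.
J = np.array([[-2.0, 1.0],
              [ 1.0, 1.0]])   # columns are c1, c2; chosen so |det| = 3

area_unit_square = 1.0
scaling = abs(np.linalg.det(J))
print(scaling)                        # 3.0
print(scaling * area_unit_square)     # area of the image parallelogram
```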

Reparametrization trick?

The reparametrization trick is a technique commonly used in deep learning models such as Variational Autoencoders (VAEs) to solve the problem that gradients cannot be backpropagated through a random (sampling) node.

In VAEs, we want to optimize the parameters of the latent-variable distribution via backpropagation. However, the latent variable is obtained by sampling from some distribution (e.g., a Gaussian), and this sampling step is random, so it cannot be backpropagated through directly.

The idea of the reparametrization trick is to turn this random process into a deterministic one. Concretely, suppose we want the sampled latent variable z to follow a Gaussian with parameters μ and σ. We can first sample ε from a standard normal distribution and then obtain z via z = μ + σε. In this way, the problem of backpropagating through z turns into backpropagating into μ and σ, and μ and σ can be optimized by backpropagation.
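A minimal sketch of the trick, assuming PyTorch is available; the downstream loss here is just a placeholder, not any particular VAE objective:

```python
import torch

# Reparametrization trick: instead of sampling z ~ N(mu, sigma^2) directly
# (non-differentiable w.r.t. mu, sigma), sample eps ~ N(0, 1) and set
# z = mu + sigma * eps, which is deterministic in (mu, sigma).
mu = torch.tensor([0.5], requires_grad=True)
log_sigma = torch.tensor([-1.0], requires_grad=True)
sigma = log_sigma.exp()

eps = torch.randn_like(sigma)        # all randomness is isolated in eps
z = mu + sigma * eps                 # differentiable w.r.t. mu and sigma

loss = (z ** 2).mean()               # placeholder downstream loss of z
loss.backward()                      # gradients flow into mu and log_sigma
print(mu.grad, log_sigma.grad)
```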

In practice, these mappings all work column to column (columns to columns).

Example: Gradient of a Least-Squares Loss

With the squared error $L(\mathbf{e}) := \|\mathbf{e}\|^2$ and $\mathbf{e} := \mathbf{y} - \boldsymbol{\Phi}\boldsymbol{\theta}$, the chain rule gives $\frac{\partial L}{\partial \boldsymbol{\theta}} = \frac{\partial L}{\partial \mathbf{e}}\frac{\partial \mathbf{e}}{\partial \boldsymbol{\theta}} = -2(\mathbf{y} - \boldsymbol{\Phi}\boldsymbol{\theta})^\top\boldsymbol{\Phi} \in \mathbb{R}^{1\times D}$, and the $d$th element is $\left[\frac{\partial L}{\partial \boldsymbol{\theta}}\right]_{1,d} = -2\sum_{n=1}^{N}\big(y_n - \boldsymbol{\phi}_n^\top\boldsymbol{\theta}\big)\,\Phi_{nd}$.
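A quick numpy check of this gradient against finite differences; the data below are randomly generated for illustration only:

```python
import numpy as np

# Check dL/dtheta = -2 (y - Phi @ theta)^T Phi for L = ||y - Phi theta||^2.
rng = np.random.default_rng(0)
N, D = 20, 3
Phi = rng.normal(size=(N, D))
y = rng.normal(size=N)
theta = rng.normal(size=D)

def L(th):
    e = y - Phi @ th
    return e @ e

grad_analytic = -2 * (y - Phi @ theta) @ Phi        # shape (D,)

grad_numeric = np.zeros(D)
h = 1e-6
for d in range(D):
    e_d = np.zeros(D); e_d[d] = h
    grad_numeric[d] = (L(theta + e_d) - L(theta - e_d)) / (2 * h)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))   # True
```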

Gradients of Matrices

An interesting point: since matrices are linear mappings, for a function that maps $\mathbb{R}^{m\times n}$ to $\mathbb{R}^{p\times q}$ the gradient is, in principle, a four-dimensional tensor in $\mathbb{R}^{p\times q\times m\times n}$. In practice, we can directly flatten the matrices into vectors ($\mathbb{R}^{mn}$ and $\mathbb{R}^{pq}$); with this substitution the Jacobian reduces to an ordinary two-dimensional matrix.

There are two ways to understand this (figures jacob1 and jacob2): either keep the gradient as a multidimensional tensor and collect the partial derivatives directly, or first flatten the matrices into vectors and compute an ordinary Jacobian matrix, reshaping back at the end.

For the first case, let's consider $\mathbf{f} = \mathbf{A}\mathbf{x}$ with $\mathbf{f}\in\mathbb{R}^M$, $\mathbf{A}\in\mathbb{R}^{M\times N}$, $\mathbf{x}\in\mathbb{R}^N$, so that the gradient $\frac{\mathrm{d}\mathbf{f}}{\mathrm{d}\mathbf{A}}\in\mathbb{R}^{M\times(M\times N)}$.

To solve it, note that $f_i = \sum_{j=1}^{N} A_{ij}x_j$, so $\frac{\partial f_i}{\partial A_{iq}} = x_q$ and $\frac{\partial f_i}{\partial A_{kq}} = 0$ for $k\neq i$.

So $\frac{\partial f_i}{\partial \mathbf{A}}$ is a $1\times(M\times N)$ slice whose entries corresponding to the $i$th row of $\mathbf{A}$ equal $\mathbf{x}^\top$ and are zero everywhere else.

Taking the derivative with respect to $\mathbf{A}$ thus introduces a transpose of $\mathbf{x}$, and the full gradient is then assembled by stacking the partial derivatives $\frac{\partial f_i}{\partial \mathbf{A}}$ for $i = 1,\dots,M$.
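A small numpy verification of the "$\mathbf{x}^\top$ in the $i$th row, zeros elsewhere" structure, using finite differences; sizes and data are arbitrary illustrations:

```python
import numpy as np

# For f = A x, the partial derivative of f_i w.r.t. A is an MxN matrix whose
# i-th row is x^T and whose other entries are zero.
rng = np.random.default_rng(1)
M, N = 3, 4
A = rng.normal(size=(M, N))
x = rng.normal(size=N)

i = 1                                 # which output component to check
h = 1e-6
dfi_dA = np.zeros((M, N))
for k in range(M):
    for q in range(N):
        A_pert = A.copy(); A_pert[k, q] += h
        dfi_dA[k, q] = ((A_pert @ x)[i] - (A @ x)[i]) / h

expected = np.zeros((M, N)); expected[i, :] = x
print(np.allclose(dfi_dA, expected, atol=1e-4))   # True
```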

Backpropagation and Automatic Differentiation

In a deep network with layers $\mathbf{f}_i = \sigma_i(\mathbf{A}_{i-1}\mathbf{f}_{i-1} + \mathbf{b}_{i-1})$, the collection $\boldsymbol{\theta} = \{\mathbf{A}_0, \mathbf{b}_0, \dots\}$ is the parameter to be found. Let's define the output $\mathbf{f}_i$ of each layer; then we only need to compute, at every layer, the derivative of the layer's output with respect to its input, and the derivative of the layer's output with respect to its own parameters. Multiplying these factors together along the chain rule completes the backpropagation process.
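A minimal numpy sketch of manual backpropagation through a toy two-layer network; the network, activation, and loss here are my own illustration, not a specific model from the text:

```python
import numpy as np

# Manual backpropagation through a tiny two-layer network
#   f = W2 @ tanh(W1 @ x),  loss = 0.5 * ||f - y||^2.
# Each layer contributes two local derivatives: output w.r.t. its input
# (to keep propagating) and output w.r.t. its own parameters.
rng = np.random.default_rng(2)
x = rng.normal(size=3)
y = rng.normal(size=2)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

# Forward pass
a = W1 @ x          # pre-activation
h = np.tanh(a)      # hidden activation
f = W2 @ h          # output
loss = 0.5 * np.sum((f - y) ** 2)

# Backward pass (chain rule, layer by layer)
dL_df = f - y                          # dL/df
dL_dW2 = np.outer(dL_df, h)            # dL/dW2
dL_dh = W2.T @ dL_df                   # propagate to the hidden layer
dL_da = dL_dh * (1 - np.tanh(a) ** 2)  # through the tanh nonlinearity
dL_dW1 = np.outer(dL_da, x)            # dL/dW1

print(loss, dL_dW1.shape, dL_dW2.shape)
```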

Probability and Distributions

Discrete and Continuous Probabilities

Quantity of interest: some characteristic of the distribution of a population random variable.

The posterior $p(x \mid y)$ is the quantity of interest in Bayesian statistics, i.e., what we know about $x$ after having observed $y$.

Product rule: the joint distribution can be factorized into a conditional, $p(x, y) = p(y \mid x)\,p(x)$.

In Bayesian statistical analysis, after observing some random variables we are interested in inferring latent (unobserved) random variables. That is, assume we have prior knowledge $p(x)$ about an unobserved random variable $x$, and a known likelihood $p(y \mid x)$ that relates $x$ to a second random variable $y$. If we can observe $y$, we can draw a conclusion about $x$ via the posterior $p(x \mid y) = \frac{p(y \mid x)\,p(x)}{p(y)}$ (Bayes' theorem). Note:
  1. $p(x)$ is the prior, which encapsulates our subjective prior knowledge of the latent variable $x$ before observing any data; i.e., $x$ is the variable whose distribution we have to assume.
  2. The likelihood $p(y \mid x)$ describes how $x$ and $y$ are related. Note that the likelihood is not a distribution in $x$, but only in $y$.
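A toy discrete Bayes update as a sketch; the numbers below are illustrative only and not from the text:

```python
import numpy as np

# Latent x in {0, 1} with prior p(x); observation y with likelihood p(y | x).
prior = np.array([0.7, 0.3])                  # p(x)
likelihood_y1 = np.array([0.2, 0.9])          # p(y=1 | x) for x = 0, 1

evidence = np.sum(likelihood_y1 * prior)      # p(y=1), the marginal likelihood
posterior = likelihood_y1 * prior / evidence  # p(x | y=1), Bayes' theorem
print(posterior, posterior.sum())             # posterior sums to 1
```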


Summary Statistics and Independence

Means and Covariances

For a multivariate random variable $\mathbf{x} \in \mathbb{R}^D$, the mean is defined element-wise: $\mathbb{E}[\mathbf{x}] = \big[\mathbb{E}[x_1], \dots, \mathbb{E}[x_D]\big]^\top \in \mathbb{R}^D$.

Covariance

The variance is a special case of the covariance matrix: for a multivariate random variable $\mathbf{x}$, $\mathbb{V}[\mathbf{x}] = \mathrm{Cov}[\mathbf{x}, \mathbf{x}] = \mathbb{E}\big[(\mathbf{x}-\mathbb{E}[\mathbf{x}])(\mathbf{x}-\mathbb{E}[\mathbf{x}])^\top\big]$.

Empirical Means and Covariances

In ML, we need to learn from empirical observations of data. There are two conceptual steps to go from population statistics to the realization of empirical statistics:

  1. use a finite dataset to construct an empirical statistic that is a function of a finite number of identical random variables;
  2. observe the data.

Empirical Mean and Covariance: given observations $\mathbf{x}_1, \dots, \mathbf{x}_N$, the empirical mean is $\bar{\mathbf{x}} := \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n$ and the empirical covariance is $\boldsymbol{\Sigma} := \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^\top$.
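A short numpy sketch computing these two statistics on random data (illustrative data only):

```python
import numpy as np

# Empirical mean and covariance of a dataset X with N rows (observations)
# and D columns (dimensions), matching the definitions above.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))

mean = X.mean(axis=0)                         # (1/N) sum_n x_n
centered = X - mean
cov = centered.T @ centered / X.shape[0]      # (1/N) sum_n (x_n - mean)(x_n - mean)^T

print(mean, cov)
print(np.allclose(cov, np.cov(X.T, bias=True)))   # matches numpy's biased estimator
```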

Sums and Transformations of Random Variables

For random variables $\mathbf{x}, \mathbf{y}$: $\mathbb{E}[\mathbf{x}+\mathbf{y}] = \mathbb{E}[\mathbf{x}] + \mathbb{E}[\mathbf{y}]$ and $\mathbb{V}[\mathbf{x}+\mathbf{y}] = \mathbb{V}[\mathbf{x}] + \mathbb{V}[\mathbf{y}] + \mathrm{Cov}[\mathbf{x},\mathbf{y}] + \mathrm{Cov}[\mathbf{y},\mathbf{x}]$. For an affine transformation $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$: $\mathbb{E}[\mathbf{y}] = \mathbf{A}\,\mathbb{E}[\mathbf{x}] + \mathbf{b}$ and $\mathbb{V}[\mathbf{y}] = \mathbf{A}\,\mathbb{V}[\mathbf{x}]\,\mathbf{A}^\top$.

Conditional Independence: two random variables $X$ and $Y$ are conditionally independent given $Z$ if and only if $p(x, y \mid z) = p(x \mid z)\,p(y \mid z)$. The interpretation of the equation can be understood as "given knowledge about $z$, the distribution of $x$ and $y$ factorizes". Equivalently, once $z$ is given, additionally conditioning on $y$ gives the same distribution for $x$ as conditioning on $z$ alone: $p(x \mid y, z) = p(x \mid z)$.
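A toy numerical check of the factorization, with conditional tables I made up for illustration:

```python
import numpy as np

# Build a joint p(x, y, z) in which x and y are independent given z,
# then verify p(x, y | z) = p(x | z) p(y | z).
p_z = np.array([0.4, 0.6])                 # p(z)
p_x_given_z = np.array([[0.8, 0.2],        # p(x | z): rows indexed by z
                        [0.3, 0.7]])
p_y_given_z = np.array([[0.5, 0.5],        # p(y | z)
                        [0.1, 0.9]])

# joint[z, x, y] = p(z) p(x|z) p(y|z)
joint = p_z[:, None, None] * p_x_given_z[:, :, None] * p_y_given_z[:, None, :]

for z in range(2):
    p_xy_given_z = joint[z] / joint[z].sum()
    factorized = np.outer(p_x_given_z[z], p_y_given_z[z])
    print(np.allclose(p_xy_given_z, factorized))   # True for both values of z
```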

Gaussian Distribution

Multivariate Gaussian: $p(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-D/2}\,|\boldsymbol{\Sigma}|^{-1/2} \exp\!\big(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\big)$.

Marginals and Conditionals of Gaussians are Gaussians

Conditional distribution

Here the value of $\mathbf{y}$ is fixed (observed), and we look at the distribution of $\mathbf{x}$ given $\mathbf{y}$. For a joint Gaussian over $(\mathbf{x}, \mathbf{y})$ with mean $(\boldsymbol{\mu}_x, \boldsymbol{\mu}_y)$ and covariance blocks $\boldsymbol{\Sigma}_{xx}, \boldsymbol{\Sigma}_{xy}, \boldsymbol{\Sigma}_{yx}, \boldsymbol{\Sigma}_{yy}$, the conditional is again Gaussian: $p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}\big(\boldsymbol{\mu}_x + \boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}_{yy}^{-1}(\mathbf{y} - \boldsymbol{\mu}_y),\ \boldsymbol{\Sigma}_{xx} - \boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}_{yy}^{-1}\boldsymbol{\Sigma}_{yx}\big)$.
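A small numpy sketch of these conditional formulas; the block values are arbitrary illustrative numbers:

```python
import numpy as np

# Conditional of a joint Gaussian over (x, y):
#   mean_cond = mu_x + Sxy Syy^{-1} (y - mu_y)
#   cov_cond  = Sxx - Sxy Syy^{-1} Syx
mu_x, mu_y = np.array([0.0]), np.array([1.0])
Sxx = np.array([[2.0]])
Sxy = np.array([[0.8]])
Syy = np.array([[1.0]])

y_obs = np.array([2.0])
Syy_inv = np.linalg.inv(Syy)
mu_cond = mu_x + Sxy @ Syy_inv @ (y_obs - mu_y)
S_cond = Sxx - Sxy @ Syy_inv @ Sxy.T

print(mu_cond, S_cond)    # [0.8], [[1.36]]
```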

Product of Gaussian Densities

Given two Gaussian densities $\mathcal{N}(\mathbf{x} \mid \mathbf{a}, \mathbf{A})$ and $\mathcal{N}(\mathbf{x} \mid \mathbf{b}, \mathbf{B})$, their product can be expressed as a scaled Gaussian $c\,\mathcal{N}(\mathbf{x} \mid \mathbf{c}, \mathbf{C})$ with $\mathbf{C} = (\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1}$, $\mathbf{c} = \mathbf{C}(\mathbf{A}^{-1}\mathbf{a} + \mathbf{B}^{-1}\mathbf{b})$, and scaling constant $c = \mathcal{N}(\mathbf{a} \mid \mathbf{b}, \mathbf{A} + \mathbf{B})$.

Useful Properties of Common Functions

DL-math

When Models Meet Data

One of the guiding principles of machine learning is that good models should perform well on unseen data.

Learning is Finding Parameters

  1. When the model is viewed as a function, using it is called prediction; when the model is a probabilistic model, the prediction phase is called inference.
  2. Training or parameter estimation: finding the best predictor based on some measure of quality (point estimate), or using Bayesian inference.

Empirical Risk Minimization

Hypothesis Class of Functions

idea: What is the set of functions we allow the predictor to take?

Considering that we want $f(\mathbf{x}_n, \boldsymbol{\theta}^*) \approx y_n$ for all $n = 1, \dots, N$.

We also make the following assumption:

  1. The set of examples is independent and identically distributed (i.i.d.), so the expected risk can be approximated by the empirical mean of the loss over the training set (the empirical risk; see the sketch below).
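A minimal sketch of empirical risk minimization, assuming a linear model with squared loss and using numpy's least-squares solver; the data are synthetic and purely illustrative:

```python
import numpy as np

# Empirical risk = average squared loss over a finite i.i.d. training set,
# minimized here in closed form for a linear model.
rng = np.random.default_rng(4)
N, D = 50, 2
X = rng.normal(size=(N, D))
theta_true = np.array([1.5, -0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)

def empirical_risk(theta):
    return np.mean((X @ theta - y) ** 2)     # (1/N) sum_n loss(f(x_n), y_n)

theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # minimizer of the empirical risk
print(theta_hat, empirical_risk(theta_hat))
```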

Regularization to Reduce Overfitting: the training set is used to fit the model, while the test set is used to evaluate generalization performance. Regularization penalizes overly complex models, e.g., by adding a term such as $\lambda\|\boldsymbol{\theta}\|^2$ to the empirical risk.

Parameter Estimation

MLE

The likelihood $p(\mathbf{x} \mid \boldsymbol{\theta})$ is a distribution that models the uncertainty of the data for a given parameter setting. For a given dataset $\mathbf{x}$, the likelihood allows us to express preferences about different settings of the parameters $\boldsymbol{\theta}$, and we can choose the setting that more "likely" has generated the data.

The maximum likelihood estimate possesses the following properties:

  - Asymptotic consistency: the MLE converges to the true value in the limit of infinitely many observations, plus a random error that is approximately normal.
  - The sample size necessary to achieve these properties can be quite large.
  - The error's variance decays as 1/N, where N is the number of data points.
  - In particular, in the "small data" regime, maximum likelihood estimation can lead to overfitting.
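A quick numpy sketch of maximum likelihood for a univariate Gaussian (my own toy example); the MLE here is the sample mean and the biased sample variance:

```python
import numpy as np

# MLE for a univariate Gaussian: the maximizers of the likelihood are the
# sample mean and the (biased, divide-by-N) sample variance.
rng = np.random.default_rng(5)
data = rng.normal(loc=2.0, scale=0.5, size=1000)

mu_mle = data.mean()
var_mle = np.mean((data - mu_mle) ** 2)      # note: 1/N, not 1/(N-1)

print(mu_mle, var_mle)   # close to 2.0 and 0.25 as N grows (asymptotic consistency)
```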

Maximum A Posteriori Estimation

Given prior knowledge $p(\boldsymbol{\theta})$ about the parameters, after observing data $\mathbf{x}$ we maximize the posterior instead of the likelihood: $\boldsymbol{\theta}_{\text{MAP}} = \arg\max_{\boldsymbol{\theta}} p(\boldsymbol{\theta} \mid \mathbf{x}) = \arg\max_{\boldsymbol{\theta}} p(\mathbf{x} \mid \boldsymbol{\theta})\,p(\boldsymbol{\theta})$, since the evidence $p(\mathbf{x})$ does not depend on $\boldsymbol{\theta}$.
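A small sketch contrasting MAP with MLE, assuming a conjugate Gaussian prior on the mean with known noise variance (all numbers are illustrative):

```python
import numpy as np

# MAP estimate of a Gaussian mean with prior N(mu0, s0^2) and known noise
# variance s^2: the log-posterior is quadratic, so the MAP estimate is a
# precision-weighted average of the prior mean and the data mean.
rng = np.random.default_rng(6)
s, s0, mu0 = 1.0, 0.5, 0.0            # noise std, prior std, prior mean
data = rng.normal(loc=1.2, scale=s, size=20)

n = len(data)
precision = n / s**2 + 1 / s0**2
theta_map = (data.sum() / s**2 + mu0 / s0**2) / precision

theta_mle = data.mean()
print(theta_mle, theta_map)           # MAP is pulled from the MLE toward mu0
```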

Probabilistic Modeling and Inference

To make this task more tractable, we often build models that describe the generative process that generates the observed data.

Probabilistic Models

In probabilistic modeling, the joint distribution p(x, θ) of the observed variables x and the hidden parameters θ is of central importance: it encapsulates information about the following:

  1. The prior and the likelihood.
  2. The marginal likelihood p(x), which will play an important role in model selection and can be computed by taking the joint distribution and integrating out the parameters.
  3. The posterior, which can be obtained by dividing the joint by the marginal likelihood.

However, focusing solely on some statistic of the posterior distribution leads to a loss of information.


math4diffusion

Mathematical understanding of diffusion models (AIGC)

math-base

base

Parameters in probabilistic models: the semicolon separates the random variables from the parameters; what comes before the semicolon are random variables, what comes after are parameters. For example, P(y=1 | x; θ) means "the probability that y = 1 given x, with parameters θ". It is really just P(y=1 | x), "the probability that y = 1 given x", except that this distribution is parameterized by θ. The parameters are usually omitted when writing probabilities because they are fixed. The semicolon also has the highest precedence, i.e., it splits the expression at the top level into the probability statement and its parameterization.

Information

Information is what removes random uncertainty; the amount of information is measured by the degree to which it removes uncertainty. The amount of information is inversely related to the probability of the event: the higher the probability, the less information. Information content (self-information): $I(x) = -\log p(x)$.

Entropy: the expectation of the information content, $H(p) = -\sum_x p(x)\log p(x)$. Relative entropy (KL divergence): the difference between two probability distributions, $D_{\mathrm{KL}}(p\,\|\,q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$. Note: it is an asymmetric measure between two distributions. Cross entropy: $H(p, q) = -\sum_x p(x)\log q(x)$.

Expanding the KL divergence, $D_{\mathrm{KL}}(p\,\|\,q) = \sum_x p(x)\log p(x) - \sum_x p(x)\log q(x)$; we define the second term as the cross entropy, which gives KL divergence = cross entropy − entropy.

Note
  1. The true distribution p(x) is fixed; it is the label distribution we assume, and we use it to supervise the predicted distribution q(x).
  2. We use the KL divergence to measure the difference between the true distribution p(x) and the predicted distribution q(x); the smaller the value, the better the prediction. Since the KL divergence is just the cross entropy minus a constant (the entropy of p), directly minimizing the cross entropy is equivalent to minimizing the KL divergence.
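A short numpy sketch of these three quantities and the identity KL = cross entropy − entropy, using two made-up discrete distributions:

```python
import numpy as np

# Discrete entropy, cross entropy and KL divergence.
p = np.array([0.7, 0.2, 0.1])     # "true" (label) distribution
q = np.array([0.5, 0.3, 0.2])     # predicted distribution

H_p = -np.sum(p * np.log(p))              # entropy of p
H_pq = -np.sum(p * np.log(q))             # cross entropy H(p, q)
kl = np.sum(p * np.log(p / q))            # KL divergence

print(H_p, H_pq, kl)
print(np.isclose(kl, H_pq - H_p))          # True: KL = cross entropy - entropy
```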

Jensen's inequality

A convex function satisfies $f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y)$ for $\lambda \in [0, 1]$. Jensen's inequality extends this property to a whole set of points: for weights $\lambda_i \ge 0$ with $\sum_i \lambda_i = 1$, $f\!\left(\sum_i \lambda_i x_i\right) \le \sum_i \lambda_i f(x_i)$; in probabilistic form, $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$ for convex $f$.
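A quick numerical illustration, assuming numpy, for the convex function $f(x) = x^2$:

```python
import numpy as np

# Jensen's inequality for convex f(x) = x^2: f(E[X]) <= E[f(X)].
rng = np.random.default_rng(7)
samples = rng.normal(loc=1.0, scale=2.0, size=100000)

lhs = samples.mean() ** 2           # f(E[X])
rhs = (samples ** 2).mean()         # E[f(X)]
print(lhs, rhs, lhs <= rhs)         # True
```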

Energy-Based Models (EBMs)

For a given dataset, we do not know the form of the underlying distribution, so we cannot write down the likelihood function. However, any probability model can be converted into an energy-based model of the form $p(\mathbf{x}) = \frac{\exp(-E(\mathbf{x}))}{Z}$ with $Z = \int \exp(-E(\mathbf{x}))\,\mathrm{d}\mathbf{x}$, so using the energy-based form is a general way of learning probability distributions.

A common choice is to write $E$ as a second-order (quadratic) form in $\mathbf{x}$; latent variables can also be added to make the fitted energy more expressive. A Restricted Boltzmann Machine (RBM) has only a single layer of latent variables, and the latent and visible variables form a bipartite graph (see the sketch below).
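A minimal sketch of the usual RBM energy function, assuming numpy; the weights and states here are random placeholders for illustration only:

```python
import numpy as np

# RBM energy for binary visible v and hidden h:
#   E(v, h) = -v^T W h - b^T v - c^T h,   p(v, h) ∝ exp(-E(v, h)).
# The bipartite structure means there are no v-v or h-h interaction terms.
rng = np.random.default_rng(8)
n_visible, n_hidden = 4, 3
W = rng.normal(size=(n_visible, n_hidden))
b = rng.normal(size=n_visible)
c = rng.normal(size=n_hidden)

def energy(v, h):
    return -(v @ W @ h + b @ v + c @ h)

v = rng.integers(0, 2, size=n_visible)
h = rng.integers(0, 2, size=n_hidden)
print(energy(v, h), np.exp(-energy(v, h)))   # energy and unnormalized probability
```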

Evidence Lower Bound (ELBO)

For an expression that cannot be maximized directly, we look for a strict lower bound: something with a simple form that is always less than or equal to the original expression. Here, because an integral sits inside the logarithm, the expression inside the log cannot be expanded, and the integral itself is intractable, so the log-likelihood cannot be maximized directly to solve for the parameters. Introducing a variational distribution $q(\mathbf{z} \mid \mathbf{x})$ and splitting the joint into prior and posterior factors, the log-likelihood decomposes as

$$
\log p(\mathbf{x}) = \underbrace{\mathbb{E}_{q(\mathbf{z}\mid\mathbf{x})}\!\left[\log \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z}\mid\mathbf{x})}\right]}_{\text{ELBO}} + \mathrm{KL}\big(q(\mathbf{z}\mid\mathbf{x})\,\|\,p(\mathbf{z}\mid\mathbf{x})\big).
$$

Note: the lower-bound function does exist, and it equals the log-likelihood exactly when we set $q(\mathbf{z}\mid\mathbf{x}) = p(\mathbf{z}\mid\mathbf{x})$, i.e., when the KL term vanishes and the two distributions agree on the observations.

We can also decompose the bound using the prior: $\mathrm{ELBO} = \mathbb{E}_{q(\mathbf{z}\mid\mathbf{x})}[\log p(\mathbf{x}\mid\mathbf{z})] - \mathrm{KL}\big(q(\mathbf{z}\mid\mathbf{x})\,\|\,p(\mathbf{z})\big)$, i.e., a reconstruction term minus a regularization term toward the prior.

score-base

Reference links: NCSN, 苏剑林 (Su Jianlin), Yang Song

We need to find a score function $s_\theta(\mathbf{x})$ that matches the score of the true data distribution, $\nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x})$, as closely as possible, e.g., by minimizing $\mathbb{E}_{p_{\text{data}}}\big[\|s_\theta(\mathbf{x}) - \nabla_{\mathbf{x}}\log p_{\text{data}}(\mathbf{x})\|^2\big]$.

Expanding the square, the second (cross) term, which involves the unknown data score, can be simplified by integration by parts into a term that depends only on the model: $\mathbb{E}_{p_{\text{data}}}\big[\operatorname{tr}\big(\nabla_{\mathbf{x}} s_\theta(\mathbf{x})\big)\big]$. This yields the score matching objective $\mathbb{E}_{p_{\text{data}}}\big[\operatorname{tr}(\nabla_{\mathbf{x}} s_\theta(\mathbf{x})) + \tfrac{1}{2}\|s_\theta(\mathbf{x})\|^2\big]$, up to a constant independent of $\theta$.

A humorous transformation?

model

math4DL

st =>start:Start
o1 =>operation:Step 1
c1 =>condition:Yes or No
e1 =>end:End

st->o1->c1
c1(yes)->e1
c1(no)->o1
