Deeplearning

This article was last updated on the evening of February 13, 2024.

花书 (the "flower book", i.e. *Deep Learning* by Goodfellow, Bengio, and Courville)

Deep Feedforward Networks

We can think of φ as providing a set of features describing x, or as providing a new representation for x. The question is then how to choose the mapping φ:

1. Use a generic φ. If φ is of high enough dimension, we can always have enough capacity to fit the training set, but generalization to the test set often remains poor: generic feature mappings are usually based only on the principle of local smoothness and do not encode enough prior information to solve advanced problems.
2. Manually engineer φ.
3. The strategy of deep learning is to learn φ.

Gradient-Based Learning

Neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value. For feedforward neural networks, it is important to initialize all weights to small random values. The biases may be initialized to zero or to small positive values.
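As a sketch of this initialization advice (the layer sizes and the 0.01 scale below are my own illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out, scale=0.01):
    """Small random Gaussian weights, biases initialized to zero."""
    W = rng.normal(0.0, scale, size=(n_in, n_out))  # small random values
    b = np.zeros(n_out)                             # zero biases
    return W, b

W, b = init_layer(784, 128)
```

Breaking the symmetry between units is the point of the random weights; the biases carry no symmetry to break, so zero is a safe default.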

Cost Functions

One recurring theme throughout neural network design is that the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm. Functions that saturate (become very flat) undermine this objective because they make the gradient become very small. In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate. The negative log-likelihood helps to avoid this problem for many models.
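A quick numeric sketch (my own, assuming a single sigmoid output unit) of why the log in the negative log-likelihood rescues the gradient where a squared-error cost saturates:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_mse(z, y):
    # d/dz of 0.5 * (sigmoid(z) - y)^2: the sigmoid's derivative
    # s * (1 - s) vanishes when the unit saturates
    s = sigmoid(z)
    return (s - y) * s * (1.0 - s)

def grad_nll(z, y):
    # d/dz of -[y*log(s) + (1-y)*log(1-s)]: the log undoes the
    # saturating exp, leaving a gradient that stays large
    return sigmoid(z) - y

# a badly saturated unit: the target is 1 but the logit is very negative
g_mse, g_nll = grad_mse(-10.0, 1.0), grad_nll(-10.0, 1.0)
```

Here `g_mse` is on the order of 1e-5 while `g_nll` is close to -1, so only the negative log-likelihood still pushes the unit strongly toward the target.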

One unusual property of the cross-entropy cost used to perform maximum likelihood estimation is that it usually does not have a minimum value when applied to the models commonly used in practice.

Moreover, we often want to learn the full distribution of y given the statistics of x, not just a point estimate. A neural network is typically learning to represent any function f from a wide class of functions, with this class being limited only by features such as continuity and boundedness, rather than by having a specific parametric form. We can therefore view the cost function as being a functional rather than just a function. A functional is a mapping from functions to real numbers. We can thus think of learning as choosing a function rather than merely choosing a set of parameters, and we can design our cost functional to have its minimum occur at some specific function we desire.

Calculus of variations (变分理论):

Solving the optimization problem yields the prediction f*(x) = E_{y∼p_data(y|x)}[y], so long as this function lies within the class we optimize over. In other words, given unlimited samples from the true data distribution, minimizing MSE yields a function that predicts the mean of y for each x.
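This can be checked numerically. The sketch below (the distribution parameters are arbitrary illustrative choices) fixes one x, draws many y from an assumed p_data(y | x), and scans constant predictions for the one minimizing MSE:

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for p_data(y | x) at a fixed x: here a Gaussian with mean 3
y = rng.normal(loc=3.0, scale=2.0, size=100_000)

candidates = np.linspace(0.0, 6.0, 601)           # constant predictions to try
mse = np.array([np.mean((y - c) ** 2) for c in candidates])
best = candidates[np.argmin(mse)]                 # the MSE-optimal prediction
# best lands on (the grid point nearest to) the sample mean of y
```

The minimizer coincides with the sample mean of y, matching the variational result above.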

But even so, saturation still frequently arises in this setting.

Output Units

Linear Units for Gaussian Output Distributions: linear output units are often used to produce the mean of a conditional Gaussian distribution, and maximizing the log-likelihood is then equivalent to minimizing the mean squared error. The maximum likelihood framework can also directly learn the covariance of the Gaussian, but the covariance must be constrained to be a positive definite matrix, a property that is difficult to satisfy with a linear output unit.
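The equivalence can be seen directly from the Gaussian log-likelihood with the mean given by the network output ŷ and a fixed identity covariance:

$$-\log \mathcal{N}(y;\,\hat{y},\,I) = \frac{1}{2}\lVert y-\hat{y}\rVert^2 + \frac{m}{2}\log(2\pi),$$

where m is the output dimension. The second term does not depend on ŷ, so maximizing the log-likelihood over ŷ is the same as minimizing the mean squared error.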

Sigmoid Units for Bernoulli Output Distributions

A sigmoid output unit can be used as the parameter of the Bernoulli distribution: ŷ = σ(z) = P(y = 1 | x), where z is the linear output (the logit).
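A small sketch (the function names are mine) of the numerically stable way to evaluate the Bernoulli negative log-likelihood directly from the logit z rather than from σ(z):

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), computed stably
    return np.logaddexp(0.0, x)

def bernoulli_nll(z, y):
    # -log P(y | x) for y in {0, 1} with P(y = 1 | x) = sigmoid(z);
    # equals softplus((1 - 2y) * z), which never overflows
    return softplus((1.0 - 2.0 * y) * z)

# agrees with the naive -log(sigmoid(z)) where the naive form is stable
naive = -np.log(1.0 / (1.0 + np.exp(-2.0)))
stable = bernoulli_nll(2.0, 1.0)
```

Working on the logit keeps the loss finite even for extreme z, where the naive formula would first underflow the sigmoid and then take log(0).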

Softmax Units for Multinoulli Output Distributions: softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes, where ŷ_i = P(y = i | x), and the function is given by softmax(z)_i = exp(z_i) / Σ_j exp(z_j).
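A minimal implementation sketch of that formula, with the standard max-subtraction trick (softmax is invariant to adding a constant to every z_i, so the shift changes nothing but avoids overflow):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)    # shift for numerical stability; result unchanged
    e = np.exp(z)
    return e / e.sum()   # exp(z_i) / sum_j exp(z_j)

p = softmax(np.array([1.0, 2.0, 3.0]))
```

The output sums to 1 and preserves the ordering of the logits, and the shifted version stays finite even for logits in the thousands, where a direct `exp` would overflow.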


http://example.com/2024/02/12/deeplearning/
Author: NGC6302
Published on February 12, 2024
License