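Last updated: 2022-04-01 09:52:03
Authors: [龙心尘](http://blog.csdn.net/longxinchen_ml?viewmode=contents) and [寒小阳](http://blog.csdn.net/han_xiaoyang?viewmode=contents)
### 1. Introduction: Don't Learn to Swim from the Shore
Machine learning is a very hands-on discipline. It is like learning to swim: rehearsing the prescribed strokes on the shore teaches you less than jumping into the water and getting a feel for it. In our own experience of learning machine learning, it does not matter if many of the lofty-sounding concepts make no sense at first. Write something, run it, and get a feel for it; the concepts and theory come much faster afterwards. And if someone has already built the wheel, using it directly is faster still. This article therefore takes the code by [Michael Nielsen](http://michaelnielsen.org/) ([GitHub repository](https://github.com/mnielsen/neural-networks-and-deep-learning.git), [zip archive](https://github.com/mnielsen/neural-networks-and-deep-learning/archive/master.zip)) as its example and walks through the general process of a neural network analysis: loading the data, training a model, tuning the model, and building a heuristic understanding of it.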
### 2. The Problem: Handwritten Digit Recognition


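The dataset used in this article is **the famous MNIST dataset**. One of its collectors is Yann LeCun, a renowned scientist in the field of artificial intelligence. The dataset contains 60,000 training samples and 10,000 test samples. The single-hidden-layer neural network shown in this article is enough to reach about 96% accuracy on it.
### 3. In Pictures: How We Approach the Problem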


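However, once we have produced several models, how do we pick the best one? A natural idea is to compare the errors of the different models on the test set and choose the model with the smallest error. This looks harmless, but the more models you try, the less trustworthy a model selected on the test set becomes. For example, adding a node to a hidden layer of the network introduces new weights and therefore yields a new model. We do not know in advance how many nodes are appropriate, so the thorough approach is to try different node counts x ∈ {1, 2, 3, 4, …, 100}, observe the test error of each resulting model, and pick the one with the smallest error. At that point the model has effectively gained one more parameter, x, and selecting a model amounts to finding the optimal value of x. This is the same line of reasoning as training the ordinary parameters above, except that it is carried out on the test set rather than the training set. By the same logic, is the number of layers of the network another new parameter y ∈ {1, 2, 3, 4, …, 100} that has to be "trained" in the same way?
We find that many things we held fixed while building the model can in fact be varied, and these are new parameters too: the number of training iterations, the step size of gradient descent, the regularization parameter, the number of epochs, the mini-batch size, and so on. We call them hyperparameters. Hyperparameters are parameters that influence the final values of the parameters being learned. They are the framework-level parameters of a machine learning model and can be thought of as parameters of the parameters. They are usually set by hand and adjusted by trial and error, or chosen by enumerating a set of candidate combinations (grid search). Either way, the result still comes from repeatedly validating and optimizing against one and the same dataset, and what works best on that dataset will not necessarily keep working on new data. To evaluate how well the model really recognizes digits, we therefore need to examine it on a fresh test set and take that result as the assessment of the model. We call this fresh set simply the "test set", and the earlier set that was used to select hyperparameters the "cross-validation set". The process of selecting a model is in fact the process of cross-validation.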
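
The selection procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: it swaps the neural network for a toy k-nearest-neighbour classifier on synthetic 1-D data, and the hyperparameter `k` stands in for something like the hidden-node count x. The names `make_data`, `knn_predict` and `accuracy` are hypothetical helpers, not part of Michael Nielsen's code.

```python
import random

def make_data(n, rng):
    """Toy 1-D data: label is 1 when x > 0.5, with 10% of labels flipped as noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    neighbours = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if 2 * votes > k else 0

def accuracy(train, data, k):
    """Fraction of examples in `data` that are classified correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

rng = random.Random(0)
data = make_data(600, rng)
# Three disjoint sets: fit on `train`, choose k on `validation`, report on `test`.
train, validation, test = data[:400], data[400:500], data[500:]

candidates = [1, 3, 5, 9, 15]                      # the hyperparameter grid
scores = {k: accuracy(train, validation, k) for k in candidates}
best_k = max(candidates, key=lambda k: scores[k])  # selected on validation data only
test_accuracy = accuracy(train, test, best_k)      # test set is touched once, at the end

print("validation scores:", scores)
print("chosen k:", best_k, "test accuracy:", test_accuracy)
```

The point of the design is that `test` plays no part in choosing `best_k`, so the final number is not biased by the search itself.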


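### 4. Run It First: A First Pass Through the Code
Michael Nielsen's code is well encapsulated: the following five lines are enough to build a neural network, train it, and test it, reaching 94.76% accuracy.

```python
import mnist_loader
import network
# Split the data into three sets: training, cross-validation, and test
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
# Create the network object: three layers with (784, 30, 10) nodes
net = network.Network([784, 30, 10])
# Train the network (weights and biases) with mini-batch gradient descent
# and report the test results.
# epochs = 30, mini-batch size = 10, learning rate = 3.0
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
```

- The second command, `network.Network([784, 30, 10])`, creates the neural network object: a three-layer network whose layers have 784, 30, and 10 nodes in that order.
- The `net.SGD(...)` command sets three hyperparameters: training epochs = 30, mini-batch size for stochastic gradient descent = 10, and step size (learning rate) = 3.0.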
```
Epoch 0: 9045 / 10000
Epoch 1: 9207 / 10000
Epoch 2: 9273 / 10000
Epoch 3: 9302 / 10000
Epoch 4: 9320 / 10000
Epoch 5: 9320 / 10000
Epoch 6: 9366 / 10000
Epoch 7: 9387 / 10000
Epoch 8: 9427 / 10000
Epoch 9: 9402 / 10000
Epoch 10: 9400 / 10000
Epoch 11: 9442 / 10000
Epoch 12: 9448 / 10000
Epoch 13: 9441 / 10000
Epoch 14: 9443 / 10000
Epoch 15: 9479 / 10000
Epoch 16: 9459 / 10000
Epoch 17: 9446 / 10000
Epoch 18: 9467 / 10000
Epoch 19: 9470 / 10000
Epoch 20: 9459 / 10000
Epoch 21: 9484 / 10000
Epoch 22: 9479 / 10000
Epoch 23: 9475 / 10000
Epoch 24: 9482 / 10000
Epoch 25: 9489 / 10000
Epoch 26: 9489 / 10000
Epoch 27: 9478 / 10000
Epoch 28: 9480 / 10000
Epoch 29: 9476 / 10000
```
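
Each progress line reports how many of the 10,000 test images were classified correctly after that epoch, so the quoted 94.76% is simply the last line, 9476 / 10000. As a small sketch (the helper `parse_epochs` is a hypothetical name, and only the last five lines of the log are pasted in), the figures can be read off programmatically:

```python
import re

# Last five progress lines of the run above, copied verbatim.
log = """Epoch 25: 9489 / 10000
Epoch 26: 9489 / 10000
Epoch 27: 9478 / 10000
Epoch 28: 9480 / 10000
Epoch 29: 9476 / 10000"""

def parse_epochs(text):
    """Return a list of (epoch, correct, total) tuples from SGD progress lines."""
    pattern = re.compile(r"Epoch (\d+): (\d+) / (\d+)")
    return [tuple(int(g) for g in m.groups()) for m in pattern.finditer(text)]

rows = parse_epochs(log)
last_epoch, correct, total = rows[-1]
final_accuracy = 100.0 * correct / total

# max() keeps the first maximum, so ties resolve to the earlier epoch.
best_epoch, best_correct, _ = max(rows, key=lambda r: r[1])

print("final accuracy: %.2f%%" % final_accuracy)
print("best epoch in this excerpt: %d (%d correct)" % (best_epoch, best_correct))
```

Note that the accuracy is not monotonic: the last epoch is slightly below the best one in the excerpt.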
### 5. How a Neural Network Recognizes Handwritten Digits: A Heuristic Understanding




### 6. How the Neural Network Is Trained: Reading the Code More Closely

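- The key parameters to be learned are the network's weights (`self.weights`) and biases (`self.biases`).
- The hyperparameters are: hidden-layer nodes = 30, training epochs (`epochs`) = 30, mini-batch size for stochastic gradient descent (`mini_batch_size`) = 10, and step size (`eta`) = 3.0.
- The parameters are adjusted with stochastic gradient descent, implemented in `SGD()` and `update_mini_batch()`.
- Backpropagation computes the gradients (partial derivatives) that stochastic gradient descent needs: `backprop()`.
- The training error is measured as the output vector minus the label vector: `cost_derivative()` = `output_activations - y`.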
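
To give the parameters and hyperparameters listed above some scale, the sketch below counts how many weights and biases a `[784, 30, 10]` network has and how many mini-batch updates one run performs. It mirrors the shapes used when the network is created (one weight per pair of nodes in adjacent layers, one bias per non-input node). The training-set size of 50,000 is an assumption about what `load_data_wrapper()` returns after setting the validation data aside; `count_parameters` and `updates_per_epoch` are hypothetical helper names.

```python
import math

def count_parameters(sizes):
    """Return (number of weights, number of biases) for the given layer sizes."""
    weights = sum(x * y for x, y in zip(sizes[:-1], sizes[1:]))
    biases = sum(sizes[1:])  # the input layer has no biases
    return weights, biases

def updates_per_epoch(n_training, mini_batch_size):
    """Number of mini-batches, and hence parameter updates, in one epoch."""
    return math.ceil(n_training / mini_batch_size)

weights, biases = count_parameters([784, 30, 10])
print(weights, biases, weights + biases)

n_training = 50000  # assumed size of training_data
per_epoch = updates_per_epoch(n_training, 10)
print(per_epoch, 30 * per_epoch)
```

With these numbers, a few hand-set hyperparameters steer the fitting of 23,860 learned parameters.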
```python
"""
network.py
~~~~~~~~~~

A module to implement the stochastic gradient descent learning
algorithm for a feedforward neural network.  Gradients are calculated
using backpropagation.  Note that I have focused on making the code
simple, easily readable, and easily modifiable.  It is not optimized,
and omits many desirable features.
"""

#### Libraries
# Standard library
import random

# Third-party libraries
import numpy as np

class Network(object):

    def __init__(self, sizes):
        """The list ``sizes`` contains the number of neurons in the
        respective layers of the network.  For example, if the list
        was [2, 3, 1] then it would be a three-layer network, with the
        first layer containing 2 neurons, the second layer 3 neurons,
        and the third layer 1 neuron.  The biases and weights for the
        network are initialized randomly, using a Gaussian
        distribution with mean 0, and variance 1.  Note that the first
        layer is assumed to be an input layer, and by convention we
        won't set any biases for those neurons, since biases are only
        ever used in computing the outputs from later layers."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The ``training_data`` is a list of tuples
        ``(x, y)`` representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If ``test_data`` is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            # Shuffle, then cut the training data into mini-batches
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the
        gradient for the cost function C_x.  ``nabla_b`` and
        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
        to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        """Return the number of test inputs for which the neural
        network outputs the correct result.  Note that the neural
        network's output is assumed to be the index of whichever
        neuron in the final layer has the highest activation."""
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

#### Miscellaneous functions
def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
```
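
Written as equations, the update in `update_mini_batch` and the quantities computed in `backprop` look roughly as follows. This is a summary read off the code rather than a quotation from it: $m$ denotes the mini-batch size, $\odot$ the element-wise product, $L$ the output layer, and $\delta^l$ is an auxiliary symbol for the error in layer $l$. The quadratic cost in the second line is inferred from `cost_derivative` returning `output_activations - y`.

```latex
\begin{aligned}
w &\leftarrow w - \frac{\eta}{m} \sum_{x \in \text{mini-batch}} \frac{\partial C_x}{\partial w},
\qquad
b \leftarrow b - \frac{\eta}{m} \sum_{x \in \text{mini-batch}} \frac{\partial C_x}{\partial b} \\
C_x &= \tfrac{1}{2} \lVert a^L - y \rVert^2
\;\Longrightarrow\;
\frac{\partial C_x}{\partial a^L} = a^L - y \\
\delta^L &= (a^L - y) \odot \sigma'(z^L) \\
\delta^l &= \bigl( (w^{l+1})^{T} \delta^{l+1} \bigr) \odot \sigma'(z^l) \\
\frac{\partial C_x}{\partial b^l} &= \delta^l,
\qquad
\frac{\partial C_x}{\partial w^l} = \delta^l \, (a^{l-1})^{T} \\
\sigma(z) &= \frac{1}{1 + e^{-z}},
\qquad
\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)
\end{aligned}
```

The first line is the step that `eta` and `mini_batch_size` control; the remaining lines are what `backprop` evaluates for a single training example.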
### 7. How to Optimize the Network: Tuning Hyperparameters and Comparing Models
```python
net = network.Network([784, 100, 10])
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
```

```
Epoch 0: 6669 / 10000
Epoch 1: 6755 / 10000
Epoch 2: 6844 / 10000
Epoch 3: 6833 / 10000
Epoch 4: 6887 / 10000
Epoch 5: 7744 / 10000
Epoch 6: 7778 / 10000
Epoch 7: 7876 / 10000
Epoch 8: 8601 / 10000
Epoch 9: 8643 / 10000
Epoch 10: 8659 / 10000
Epoch 11: 8665 / 10000
Epoch 12: 8683 / 10000
Epoch 13: 8700 / 10000
Epoch 14: 8694 / 10000
Epoch 15: 8699 / 10000
Epoch 16: 8715 / 10000
Epoch 17: 8770 / 10000
Epoch 18: 9611 / 10000
Epoch 19: 9632 / 10000
Epoch 20: 9625 / 10000
Epoch 21: 9632 / 10000
Epoch 22: 9651 / 10000
Epoch 23: 9655 / 10000
Epoch 24: 9653 / 10000
Epoch 25: 9658 / 10000
Epoch 26: 9653 / 10000
Epoch 27: 9664 / 10000
Epoch 28: 9655 / 10000
Epoch 29: 9672 / 10000
```
```python
net = network.Network([784, 30, 10])
net.SGD(training_data, 30, 10, 100.0, test_data=test_data)
```

```
Epoch 0: 1002 / 10000
Epoch 1: 1002 / 10000
Epoch 2: 1002 / 10000
Epoch 3: 1002 / 10000
Epoch 4: 1002 / 10000
Epoch 5: 1002 / 10000
Epoch 6: 1002 / 10000
Epoch 7: 1002 / 10000
Epoch 8: 1002 / 10000
Epoch 9: 1002 / 10000
Epoch 10: 1002 / 10000
Epoch 11: 1002 / 10000
Epoch 12: 1001 / 10000
Epoch 13: 1001 / 10000
Epoch 14: 1001 / 10000
Epoch 15: 1001 / 10000
Epoch 16: 1001 / 10000
Epoch 17: 1001 / 10000
Epoch 18: 1001 / 10000
Epoch 19: 1001 / 10000
Epoch 20: 1000 / 10000
Epoch 21: 1000 / 10000
Epoch 22: 999 / 10000
Epoch 23: 999 / 10000
Epoch 24: 999 / 10000
Epoch 25: 999 / 10000
Epoch 26: 999 / 10000
Epoch 27: 999 / 10000
Epoch 28: 999 / 10000
Epoch 29: 999 / 10000
```
```python
net = network.Network([784, 100, 10])
net.SGD(training_data, 30, 10, 0.001, test_data=test_data)
```

```
Epoch 0: 790 / 10000
Epoch 1: 846 / 10000
Epoch 2: 854 / 10000
Epoch 3: 904 / 10000
Epoch 4: 944 / 10000
Epoch 5: 975 / 10000
Epoch 6: 975 / 10000
Epoch 7: 975 / 10000
Epoch 8: 975 / 10000
Epoch 9: 974 / 10000
Epoch 10: 974 / 10000
Epoch 11: 974 / 10000
Epoch 12: 974 / 10000
Epoch 13: 974 / 10000
Epoch 14: 974 / 10000
Epoch 15: 974 / 10000
Epoch 16: 974 / 10000
Epoch 17: 974 / 10000
Epoch 18: 974 / 10000
Epoch 19: 976 / 10000
Epoch 20: 979 / 10000
Epoch 21: 981 / 10000
Epoch 22: 1004 / 10000
Epoch 23: 1157 / 10000
Epoch 24: 1275 / 10000
Epoch 25: 1323 / 10000
Epoch 26: 1369 / 10000
Epoch 27: 1403 / 10000
Epoch 28: 1429 / 10000
Epoch 29: 1451 / 10000
```
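
The runs above change only the hidden-layer size and the step size, yet the results range from about 10% (no better than guessing) to over 96%. Why the step size matters so much can be felt on a much simpler problem. The sketch below is a toy, not the MNIST network: it minimizes f(w) = w² by gradient descent, and the step sizes 0.001, 0.1 and 1.5 are chosen for this toy only. In the real network the details differ, but the qualitative picture is similar: too large a step overshoots, too small a step barely moves in 30 iterations.

```python
def gradient_descent(eta, steps=30, w=1.0):
    """Minimise f(w) = w**2 by repeatedly stepping against the gradient 2*w."""
    for _ in range(steps):
        w = w - eta * 2.0 * w
    return w

for eta in (0.001, 0.1, 1.5):
    # The minimum is at w = 0, so |w| measures how far we still are from it.
    print("eta=%-6s |w| after 30 steps: %.3g" % (eta, abs(gradient_descent(eta))))
```

Only the middle step size gets close to the minimum within 30 steps; the smallest one is still near its starting point and the largest one has blown up.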
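The relevant code is also included in Michael Nielsen's files. Just import it and run one method.

```python
import mnist_svm
mnist_svm.svm_baseline()
```

```
Baseline classifier using an SVM.
9435 of 10000 values correct.
```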
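In reality, however, this baseline does not settle the comparison, because we only used the default parameters that scikit-learn provides for support vector machines. An SVM likewise has a whole set of tunable hyperparameters that can improve the model. According to [this blog post](http://peekaboo-vision.blogspot.de/2010/09/mnist-for-ever.html) by [Andreas Mueller](http://peekaboo-vision.blogspot.ca/), an SVM with well-tuned hyperparameters can reach 98.5% accuracy, 1.8 percentage points higher than our best neural network above.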
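
The figures quoted so far can be put side by side with a few lines of arithmetic. The counts are the final lines of the runs in this article and the 98.5% is the value quoted from the blog post; the dictionary labels are just descriptive names chosen here.

```python
# Correct classifications out of 10,000 test images, taken from the runs above.
results = {
    "SVM, scikit-learn defaults": 9435,
    "network [784, 30, 10]": 9476,
    "network [784, 100, 10]": 9672,
}
total = 10000
accuracies = {name: 100.0 * correct / total for name, correct in results.items()}

tuned_svm = 98.5  # percent, as quoted from the blog post
best_network = max(v for k, v in accuracies.items() if k.startswith("network"))
gap = tuned_svm - best_network

for name, acc in sorted(accuracies.items(), key=lambda item: item[1]):
    print("%-28s %.2f%%" % (name, acc))
print("tuned SVM minus best network: %.1f percentage points" % gap)
```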
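> simple support vector machine < shallow neural network < tuned support vector machine < deep neural network
A reminder, though: fancy algorithms matter, but a good dataset sometimes matters more than the algorithm. Michael Nielsen wrote a formula specifically to express the relationship: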
> sophisticated algorithm ≤ simple learning algorithm + good training data.
### 8. Summary and Preview of the Next Installment