反向传播
反向传播(英语:Backpropagation,意为误差反向传播,缩写为BP)是对多层人工神经网络进行梯度下降的算法,也就是用链式法则以网络每层的权重为变数计算损失函数的梯度,以更新权重来最小化损失函数。
简单例子计算
符号 | |
---|---|
$ w_{ij} $ | 权重 |
$ z_i $ | 输入 |
$ y_i $ | 输出 |
$ E = \frac{1}{2}(y_p-y_a)^2$ | 损失 |
$ f(x) = \frac{1}{1+e^{-x}}$ | 激活函数 |
例如更新$w_{53}$
主要思想就是我们无法找到$E$和$w_{53}$的直接关系,所以使用链式求导法则来间接求梯度: $$ w_{53}(new) = w_{53}(old) - \Delta w_{53} $$
$$ \begin{cases} E = \frac{1}{2}(y_5-y_a)^2\\ y_5 = f(z_5)\\ z_5 = w_{53} * y_3 + w_{54} * y_4 \end{cases} $$
$$ \begin{align} \Delta w_{53} &= \frac{\partial E}{\partial w_{53}} \\ &= \frac{\partial E}{\partial y_{5}} \frac{\partial y_5}{\partial z_5} \frac{\partial z_5}{\partial w_{53}} \end{align} $$
代码实现
from math import exp, sqrt
def f(x):
"""returns sigmoid(x)"""
return 1 / (1 + exp(-x))
def d_f(x):
"""return d(sigmoid(x))/dx"""
return f(x) * (1 - f(x))
def E(yp, ya):
"""return the error in squre"""
return 1/2*(yp-ya)**2
def d_E(yp, ya):
"""return dE/d(yp)"""
return yp - ya
ya = 0.5
z = {1: 0.35, 2: 0.9, 3: 0.0, 4: 0.0, 5: 0.0}
w = {(1, 3): 0.1, (2, 3): 0.8, (1, 4): 0.4,
(2, 4): 0.6, (3, 5): 0.3, (4, 5): 0.9}
y = {i: 0.0 for i in range(1, 6)}
s = {(1, 3): 0.0, (2, 3): 0.0, (1, 4): 0.0,
(2, 4): 0.0, (3, 5): 0.0, (4, 5): 0.0}
def forward():
"""正向计算"""
y[1], y[2] = z[1], z[2]
z[3], z[4] = w[1, 3] * y[1] + w[2, 3] * \
y[2], w[1, 4] * y[1] + w[2, 4] * y[2]
y[3], y[4] = f(z[3]), f(z[4])
z[5] = w[3, 5] * y[3] + w[4, 5] * y[4]
y[5] = f(z[5])
e = E(y[5], ya)
print(f'yp = {y[5]}, Error = {e}')
return e
def update(n, g):
"""update weights"""
w[n] -= g
def back():
"""back propagation"""
g = d_E(y[5], ya) * d_f(z[5]) * y[3]
update((3, 5), g)
g = d_E(y[5], ya) * d_f(z[5]) * y[4]
update((4, 5), g)
g = d_E(y[5], ya) * d_f(z[5]) * w[3, 5] * d_f(z[3]) * y[1]
update((1, 3), g)
g = d_E(y[5], ya) * d_f(z[5]) * w[3, 5] * d_f(z[3]) * y[2]
update((2, 3), g)
g = d_E(y[5], ya) * d_f(z[5]) * w[4, 5] * d_f(z[4]) * y[1]
update((1, 4), g)
g = d_E(y[5], ya) * d_f(z[5]) * w[4, 5] * d_f(z[4]) * y[2]
update((2, 4), g)
if __name__ == '__main__':
step = 1
while True:
print(f'step = {step}:', end=' ')
if forward() < 1e-7:
break
step += 1
back()
运行代码,发现迭代111次后,损失才达到预期。
➜ python3 main.py
step = 1: yp = 0.6902834929076443, Error = 0.01810390383656677
step = 2: yp = 0.6820312027460466, Error = 0.016567679386586154
step = 3: yp = 0.6739592936119999, Error = 0.015130917916992996
step = 4: yp = 0.6660860653430867, Error = 0.013792290550574028
...
...
...
step = 110: yp = 0.500455314229855, Error = 1.036555239542297e-07
step = 111: yp = 0.5004302844701667, Error = 9.2572362633313e-08
动量法优化
def update(n, g):
"""update weights"""
lr = 0.1
beta = 0.999
epsilon = 0.01
s[n] = beta * s[n] + (1 - beta) * g**2
w[n] -= lr / (sqrt(s[n]) + epsilon) * g
代码如上,使用动量法优化后,仅迭代11步损失就达到了预期。
➜ python3 main.py
step = 1: yp = 0.6902834929076443, Error = 0.01810390383656677
step = 2: yp = 0.6118790538896405, Error = 0.006258461349620539
step = 3: yp = 0.5593494607066376, Error = 0.0017611792430843605
step = 4: yp = 0.5301730589054645, Error = 0.00045520674185631563
step = 5: yp = 0.5151484953395415, Error = 0.00011473845552605596
step = 6: yp = 0.5075810359788381, Error = 2.873605325621872e-05
step = 7: yp = 0.5037909668833773, Error = 7.185714955431785e-06
step = 8: yp = 0.5018953359787233, Error = 1.7961492361214424e-06
step = 9: yp = 0.5009475298860605, Error = 4.48906442488917e-07
step = 10: yp = 0.5004736747645289, Error = 1.1218389127573745e-07
step = 11: yp = 0.5002367820786155, Error = 2.803287637675012e-08
使用 PyTorch 实现
import torch
import torch.nn as nn
class Net(nn.Module):
def __init__(self) -> None:
super(Net, self).__init__()
hidden = nn.Linear(2, 2, bias=False)
output = nn.Linear(2, 1, bias=False)
hidden.weight.data = torch.tensor([[0.1, 0.8], [0.4, 0.6]])
output.weight.data = torch.tensor([0.3, 0.9])
self.net = nn.Sequential(
hidden,
nn.Sigmoid(),
output,
nn.Sigmoid()
)
def forward(self, x):
return self.net(x)
def MSE(pred, act):
return 1/2*torch.mean((pred - act)**2)
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.tensor([0.35, 0.9]).to(device)
y = torch.tensor(0.5).to(device)
net = Net().to(device)
optimizer = torch.optim.RMSprop(
net.parameters(), lr=0.1, alpha=0.999, eps=1e-2)
def train():
optimizer.zero_grad()
pred = net(x)
loss = MSE(pred, y)
loss.backward()
optimizer.step()
err = loss.detach().item()
print(f'pred = {pred}, error = {err}')
return err
if __name__ == '__main__':
step = 1
while True:
print(f'step = {step}:', end=' ')
if train() < 1e-7:
break
step += 1
References
https://zh.m.wikipedia.org/zh-cn/%E5%8F%8D%E5%90%91%E4%BC%A0%E6%92%AD%E7%AE%97%E6%B3%95