pytorch/transformers: Why the Last Layer Has No Activation Function

Updated: 2023-01-07 10:27:44  Author: 浪漫的数据分析

This article explains why BERT models do not put an activation function after the last layer: the reason lies in the choice of loss function. Example code is included throughout.

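While reading BERT and its many variants, I noticed that the last layer of the model is always a plain FC (fully connected) Linear layer. This article explains why.
Experiment: I tried adding a softmax activation after the last layer for multi-class classification, and the model failed to converge. With the activation removed, it converged well.
That suggested the softmax did not belong there, so I looked into it more closely.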
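One plausible explanation for the failed experiment, assuming the loss was CrossEntropyLoss, is that softmax ends up being applied twice: once by the added layer and once inside the loss (see section I). The plain-Python sketch below is only an illustration, with made-up logits and hypothetical helper names. It shows that once the loss receives probabilities instead of logits, it cannot drop below a fixed floor, however confident the model is.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target):
    """What a softmax-based cross-entropy computes: -log(softmax(logits)[target])."""
    return -math.log(softmax(logits)[target])

# Hypothetical 3-class example where the model is already very confident in class 0.
logits = [10.0, 0.0, 0.0]
target = 0

# Correct setup: raw logits go straight into the loss.
loss_on_logits = cross_entropy_from_logits(logits, target)

# Mistaken setup: the model ends in softmax, and the loss applies softmax again.
loss_on_probs = cross_entropy_from_logits(softmax(logits), target)

# With inputs confined to [0, 1], the best possible input is a one-hot vector,
# so the loss can never drop below -log(e / (e + K - 1)) for K classes.
num_classes = len(logits)
loss_floor = -math.log(math.e / (math.e + num_classes - 1))

print(f"loss on logits:        {loss_on_logits:.6f}")
print(f"loss on probabilities: {loss_on_probs:.6f}")
print(f"floor with 2x softmax: {loss_floor:.6f}")
```

The first loss is close to zero, while the second stays above the floor. This is consistent with the poor training behavior observed above, although the sketch does not prove it is the only cause.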

Preface

For classification problems in PyTorch, why is the last layer always a Linear layer with no activation function?

I. The reason lies in the CrossEntropy loss

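CrossEntropy: this loss function combines log_softmax and nll_loss. So when an FC layer is followed by CrossEntropy, the output does in fact pass through softmax; the softmax is simply built into the CrossEntropy loss function. As the PyTorch documentation puts it: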

This criterion combines `log_softmax` and `nll_loss` in a single
    function.
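The statement quoted above can be checked directly. The following is a minimal sketch with a made-up batch (4 samples, 3 classes, random logits); it compares a single call to F.cross_entropy against the explicit two-step form, F.log_softmax followed by F.nll_loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical batch: 4 samples, 3 classes (raw FC-layer outputs, i.e. logits).
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])

# One call: cross-entropy applied directly to the logits.
loss_combined = F.cross_entropy(logits, targets)

# Two steps: log_softmax over the class dimension, then negative log-likelihood.
log_probs = F.log_softmax(logits, dim=1)
loss_two_step = F.nll_loss(log_probs, targets)

print(loss_combined.item(), loss_two_step.item())
print(torch.allclose(loss_combined, loss_two_step))
```

Both losses should agree up to floating-point tolerance, which is why the model itself only needs to output logits.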

II. Why CrossEntropy uses log_softmax rather than softmax

1. The definition of CrossEntropy:

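$$H(p, q) = -\sum_{i} p(i) \log q(i)$$

Here p is the true distribution and q is the predicted distribution.
In a classification problem, the label has exactly one class (call it z) whose component is 1, and every other class is 0. Substituting this into the CrossEntropyLoss formula, only one term survives the summation:

$$H(p, q) = -\log q(z)$$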

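Here $q(z)$ is the probability obtained by applying softmax to the output of the model's FC layer. The formula can therefore be written as -log(softmax(FC output)). For that reason the two steps are fused into a single function, log_softmax, which makes CrossEntropy convenient to compute.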
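The reduction to a single term, and one commonly cited practical benefit of fusing the two steps, can be sketched in plain Python. The helper names and logits below are made up for illustration, and this is not PyTorch's actual implementation. The fused version uses the log-sum-exp trick, whereas computing softmax first and taking the log afterwards overflows for large logits (plain Python raises OverflowError; with floating-point tensors the result would typically be inf or nan instead).

```python
import math

def log_softmax(logits):
    """log(softmax(x)) via the log-sum-exp trick: x_i - (m + log(sum(exp(x_j - m))))."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum_exp for v in logits]

def naive_log_softmax(logits):
    """Softmax first, then log; exp() overflows for large logits."""
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

# Moderate logits: the full sum over a one-hot label equals the single term -log q(z).
logits = [2.0, 1.0, -1.0]
z = 0
one_hot = [1.0 if i == z else 0.0 for i in range(len(logits))]
log_q = log_softmax(logits)
full_sum = -sum(p * lq for p, lq in zip(one_hot, log_q))
single_term = -log_q[z]
print(full_sum, single_term)

# Large logits: the naive version overflows, the fused version stays finite.
big_logits = [1000.0, 999.0, 0.0]
try:
    naive_log_softmax(big_logits)
    naive_ok = True
except OverflowError:
    naive_ok = False
stable = log_softmax(big_logits)
print(naive_ok, stable)
```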

2. To get the probabilities output by the model, manually apply F.softmax() to the FC layer's output

The code is as follows (example):

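import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

# Two Gaussian clusters of 100 points each, centered at (2, 2) and (-2, -2)
n_data = torch.ones(100, 2)
x0 = torch.normal(2 * n_data, 1)
y0 = torch.zeros(100)  # labels for class 0
x1 = torch.normal(-2 * n_data, 1)
y1 = torch.ones(100)  # labels for class 1

x = torch.cat((x0, x1), 0).type(torch.FloatTensor)  # concatenate the features
y = torch.cat((y0, y1), 0).type(torch.LongTensor)  # CrossEntropyLoss expects integer class labels

# Variable has been deprecated since PyTorch 0.4; plain tensors are used directly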

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)
        self.out = torch.nn.Linear(n_hidden, n_output)
    
    def forward(self, x):
        x = F.relu(self.hidden(x))
        x = self.out(x)
        return x

net = Net(2, 10, 2)

optimizer = torch.optim.SGD(net.parameters(), lr = 0.012)
for t in range(100):
    out = net(x)
    loss = torch.nn.CrossEntropyLoss()(out, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if (t+1) % 20 == 0:
        plt.cla()
        prediction = torch.max(F.softmax(out, dim=1), 1)[1]  # index of the largest probability along dim 1
        pred_y = prediction.detach().numpy().squeeze()
        target_y = y.detach().numpy()
        plt.scatter(x.detach().numpy()[:, 0], x.detach().numpy()[:, 1], c=pred_y, s=100, lw=0, cmap='RdYlGn')
        accuracy = (pred_y == target_y).sum() / len(target_y)
        plt.text(1.5, -4, 'Accu=%.2f' % accuracy, fontdict={'size': 20, 'color': 'red'}) 
        plt.pause(0.1)

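In the code above, F.softmax(out, dim=1) gives the probabilities output by the model.
torch.max(F.softmax(out, dim=1), 1)[1] takes, along dimension 1, the column with the highest probability as the predicted label. The result is a label, not a probability.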
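Because softmax preserves the ordering of the values in each row, it is needed only when the probabilities themselves are wanted; the predicted label can be read from the raw logits. A small sketch with made-up logits (3 samples, 2 classes) illustrates this:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for 3 samples and 2 classes.
out = torch.tensor([[2.0, -1.0],
                    [0.3, 0.8],
                    [-4.0, -0.5]])

probs = F.softmax(out, dim=1)            # each row sums to 1
labels_from_probs = probs.argmax(dim=1)  # label taken from the probabilities
labels_from_logits = out.argmax(dim=1)   # same label, taken from the raw logits

print(probs.sum(dim=1))
print(labels_from_probs, labels_from_logits)
```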

3. The output end of a BERT model

The code is as follows (example):

import torch.nn as nn
from transformers import BertModel

class Model(nn.Module):

    def __init__(self, config):
        super(Model, self).__init__()
        self.bert = BertModel.from_pretrained(config.bert_path)
        for param in self.bert.parameters():
            param.requires_grad = True
        self.fc = nn.Linear(config.hidden_size, config.num_classes)

    def forward(self, x):
        context = x[0]  # the input sentence
        mask = x[2]  # mask for the padding, same size as the sentence; padding positions are 0, e.g. [1, 1, 1, 1, 0, 0]
        bert_out = self.bert(context, attention_mask=mask, output_hidden_states=False)
        out = self.fc(bert_out.pooler_output)
        return out

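Here too, the BERT classifier's self.fc = nn.Linear(config.hidden_size, config.num_classes) is just a Linear layer with no activation function.
To get multi-class probabilities from BERT, apply the following to the model's output out:
F.softmax(out, dim=1)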
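The split between training and inference can be sketched as follows. The BERT encoder is omitted here: a random tensor stands in for bert_out.pooler_output, and the sizes (768 hidden units, 3 classes) are assumptions chosen only for illustration, not values taken from the model above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

hidden_size, num_classes = 768, 3  # assumed sizes, for illustration only

# Stand-in for the classification head; the BERT encoder itself is omitted.
fc = nn.Linear(hidden_size, num_classes)
pooled = torch.randn(4, hidden_size)  # placeholder for bert_out.pooler_output
labels = torch.tensor([0, 2, 1, 1])

# Training: raw logits go to CrossEntropyLoss, which applies log_softmax itself.
logits = fc(pooled)
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()

# Inference: softmax is applied only to read the logits as probabilities.
with torch.no_grad():
    probs = F.softmax(fc(pooled), dim=1)

print(loss.item())
print(probs)
```

Keeping softmax out of forward() means the same model serves both cases: the loss sees logits during training, and the caller decides whether probabilities are needed afterwards.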

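Summary

BERT models have no activation function after the last layer because of the choice of loss function: CrossEntropy already applies log_softmax internally, so the model only needs to output the raw values of the Linear layer.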
