垃圾邮件分类-训练评估
模型我们选择使用朴素贝叶斯算法。
1. 模型训练
训练代码比较简单,代码执行结束之后,会在 data 目录下创建模型文件: 04-邮件分类模型.pth。
from sklearn.naive_bayes import MultinomialNB
import pickle
import joblib
if __name__ == '__main__':
# 加载训练数据
train_data = pickle.load(open('data/03-模型训练数据.pkl', 'rb'))
# 初始化模型
model = MultinomialNB()
model.fit(train_data['emails'], train_data['labels'])
# 模型存储
joblib.dump(model, 'data/04-邮件分类模型.pth')
2. 模型评估
我们将 1万+ 测试集数据送入模型,分别去评估模型的准确率、精度、召回率。
import pickle
import pandas as pd
import joblib
import os
import re
import zhconv
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
import jieba.posseg as psg
# 将当前目录设置为工作目录
os.chdir(os.path.dirname(os.path.abspath(__file__)))
def clean_data(email):
# 1. 去除非中文字符
email = re.sub(r'[^\u4e00-\u9fa5]', '', email)
# 2. 繁体转简体
email = zhconv.convert(email, 'zh-cn')
# 3. 邮件词性筛选
email_pos = psg.cut(email)
allow_pos = ['n', 'nr', 'ns', 'nt', 'v', 'a']
email = []
for word, pos in email_pos:
if pos in allow_pos:
email.append(word)
# 4. 转换成 str 类型
email = ' '.join(email)
return email
# 模型评估
def evaluate():
# 加载数据
test_data = pd.read_csv('data/01-原始邮件数据-测试集.csv')
# 特征提取器
vocab = pickle.load(open('data/03-模型训练特征.pkl', 'rb'))
transfer = CountVectorizer(vocabulary=vocab)
# 加载模型
model = joblib.load('data/04-邮件分类模型.pth')
# 测试集评估
y_pred = []
for email in test_data['emails'].to_numpy():
# 数据清洗
email = clean_data(email)
# 特征提取
email = transfer.transform([email]).toarray().tolist()
# 模型计算
output = model.predict(email)
# 存储结果
y_pred.append(output[0])
y_true = test_data['labels'].tolist()
# 样本数量
samples = len(test_data)
print('samples:', samples)
# 准确率
accuracy = accuracy_score(y_pred, y_true)
print('accuracy: %.2f' % accuracy)
# 精度
precision = precision_score(y_pred, y_true, pos_label='spam')
print('precision: %.2f' % precision)
# 召回率
recall = recall_score(y_pred, y_true, pos_label='spam')
print('recall: %.2f' % recall)
# 存储评估结果
eval_result = {
'sample_num': samples,
'accuracy': accuracy,
'precision': precision,
'recall': recall
}
pickle.dump(eval_result, open('data/05-模型评估结果.pkl', 'wb'), 3)
程序的输出结果:
samples: 12924
accuracy: 0.96
precision: 0.97
recall: 0.98
3. 模型预测
模型预测就是我们输入文本邮件内容,由模型给出最终的预测结果:spam 表示垃圾邮件,ham 表示非垃圾邮件。我们分别从测试集数据中,随意两封邮件:垃圾和非垃圾邮件来预测其分类。
def predict(email):
# 特征提取器
vocab = pickle.load(open('data/03-模型训练特征.pkl', 'rb'))
transfer = CountVectorizer(max_features=10000, vocabulary=vocab)
# 加载模型
model = joblib.load('data/04-邮件分类模型.pth')
# 数据清洗
email = clean_data(email)
# 特征提取
email = transfer.transform([email]).toarray().tolist()
# 模型计算
output = model.predict(email)
print('预测结果:', output[0])
if __name__ == '__main__':
email1 = '''
Received: from coozo.com ([219.133.254.230])
by spam-gw.ccert.edu.cn (MIMEDefang) with ESMTP id j8L2Zoqi028766
for <li@ccert.edu.cn>; Fri, 23 Sep 2005 13:01:45 +0800 (CST)
Message-ID: <200509211035.j8L2Zoqi028766@spam-gw.ccert.edu.cn>
From: "you" <you@coozo.com>
Subject: =?gb2312?B?us/X9w==?=
To: li@ccert.edu.cn
Content-Type: text/plain;charset="GB2312"
Content-Transfer-Encoding: 8bit
Date: Sun, 23 Oct 2005 23:44:32 +0800
X-Priority: 3
X-Mailer: Microsoft Outlook Express 6.00.2800.1106
您好!
我公司有多余的发票可以向外代开!(国税、地税、运输、广告、海关缴款书)。
如果贵公司(厂)有需要请来电洽谈、咨询!
联系电话: 013510251389 陈先生
谢谢
顺祝商祺!
'''
email2 = '''
Received: from web15010.mail.cnb.yahoo.com (web15010.mail.cnb.yahoo.com [202.165.103.67])
by spam-gw.ccert.edu.cn (MIMEDefang) with ESMTP id j8R8H2V8018468
for <hu@ccert.edu.cn>; Thu, 29 Sep 2005 19:39:41 +0800 (CST)
Received: (qmail 54688 invoked by uid 60001); Thu, 29 Sep 2005 11:50:48 -0000
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws;
s=s1024; d=yahoo.com.cn;
h=Message-ID:Received:Date:From:Subject:To:MIME-Version:Content-Type:Content-Transfer-Encoding;
b=bSU/zJOkkJfDLFBbWnWnTUKDZWedZej7CHwk+68TMJOxc5bWNOV3oFm+Sdj7+BguqbdY8hBnj9by0vLAREwvNsRCI/vWqZokpQhqNS620fenBohJKxF1JDhRipTl6dha0/sPi1Z9L+cjbm98QQkoNFkiZSBiuBy63tmjYznR3JE= ;
Message-ID: <20050927082809.54686.qmail@web15010.mail.cnb.yahoo.com>
Received: from [61.150.43.113] by web15010.mail.cnb.yahoo.com via HTTP; Thu, 29 Sep 2005 19:50:48 CST
Date: Thu, 29 Sep 2005 19:50:48 +0800 (CST)
From: liang ming <yang@yahoo.com.cn>
Subject: =?gb2312?B?UmU6ILOzvNzKscTQxfPT0cDPt62z9tLUx7C1xMrCx+k=?=
To: hu@ccert.edu.cn
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="0-1710224003-1127809689=:53686"
Content-Transfer-Encoding: 8bit
我怎么觉得是你在翻..
标 题: 吵架时男朋友老翻出以前的事情
我觉得吵完了和好了就过去了,他却总是在下一次吵架的时候提起。是不是心胸不够宽
阔?老这样下去伤心死了。经常是吵完了我哭他不理我,后来太晚了他就搂着我拍拍我然
后天亮了我们都要去上班。昨天他说他想要的太多了,得到的太少了。我说我从来不觉得
我付出的少,他就质问我付出了什么。我为了他离开了以前的男朋友,办好了去日本的签
证而没去,离开了大连在这里辛苦的生活。远离了一些朋友,工资没怎么涨,每天忍受着
一个半小时的公交车,饭费房租都是2倍而房子却不是精装修也没有家电,忍受着电梯和
楼下汽车的噪音。听他那么说真伤心,觉得自己的爱在消减,好担心会不爱了。
--
'''
predict(email1)
predict(email2)
程序输出结果:
预测结果: spam
预测结果: ham