Chinese Sentiment Analysis with WordNet

#NLP

1. Analysis

Chinese sentiment analysis could be done with the Cilin thesaurus (《同义词词林》), which has one top-level category (class G) for psychological activities, but compared with WordNet it is still too coarse. The approach taken here is therefore nltk + wordnet:

  1. Chinese word segmentation: jieba

  2. Chinese-to-English mapping: the Chinese Open Wordnet, downloadable from
    http://compling.hss.ntu.edu.sg/cow/

  3. Sentiment scoring: WordNet's SentiWordNet component (a quick check is shown right after this list)

  4. Stop words: the list from the page below, plus common punctuation marks
    http://blog.csdn.net/u010533386/article/details/51458591
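
As a quick sanity check of the SentiWordNet step above, the snippet below (not part of the pipeline itself) looks up an arbitrary English synset and prints its positive/negative/objective scores; 'good.a.01' is only an illustrative synset name.

```python
# Requires the 'wordnet' and 'sentiwordnet' NLTK corpora (see the setup note after the script).
from nltk.corpus import sentiwordnet as swn

s = swn.senti_synset('good.a.01')   # an arbitrary example synset
print("pos=%s neg=%s obj=%s" % (s.pos_score(), s.neg_score(), s.obj_score()))
```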

2. Code

# encoding=utf-8
import sys
import codecs

import jieba
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn

# Python 2 workaround so that non-ASCII text can be printed without UnicodeEncodeError.
reload(sys)
sys.setdefaultencoding('utf8')


def doSeg(filename):
    # Segment the input file with jieba and drop stop words and whitespace tokens.
    f = open(filename, 'r')
    file_list = f.read()
    f.close()

    seg_list = jieba.cut(file_list)

    stopwords = []
    for word in open("./stop_words.txt", "r"):
        stopwords.append(word.strip())

    ll = []
    for seg in seg_list:
        if (seg.encode("utf-8") not in stopwords
                and seg != ' ' and seg != '' and seg != "\n" and seg != "\n\n"):
            ll.append(seg)
    return ll


def loadWordNet():
    # Load the Chinese Open Wordnet mapping into a set of (synset_id, lemma) pairs.
    f = codecs.open("./cow-not-full.txt", "rb", "utf-8")
    known = set()
    for l in f:
        if l.startswith('#') or not l.strip():
            continue
        row = l.strip().split("\t")
        if len(row) == 3:
            (synset, lemma, status) = row
        elif len(row) == 2:
            (synset, lemma) = row
            status = 'Y'
        else:
            print "illformed line: ", l.strip()
            continue
        if status in ['Y', 'O']:
            known.add((synset.strip(), lemma.strip()))
    return known


def findWordNet(known, key):
    # Return every synset id whose Chinese lemma equals the given word.
    return [kk[0] for kk in known if kk[1] == key]


def id2ss(ID):
    # Turn an id such as u'02084071-n' into an NLTK Synset object.
    # Newer NLTK versions also expose this as wn.synset_from_pos_and_offset().
    return wn._synset_from_pos_and_offset(str(ID[-1:]), int(ID[:8]))


def getSenti(word):
    # Look up the SentiWordNet entry for a WordNet synset.
    return swn.senti_synset(word.name())


if __name__ == '__main__':
    known = loadWordNet()
    words = doSeg(sys.argv[1])

    n = 0
    p = 0
    for word in words:
        ll = findWordNet(known, word)
        if len(ll) != 0:
            # A word may map to several synsets; average their positive/negative scores.
            n1 = 0.0
            p1 = 0.0
            for wid in ll:
                desc = id2ss(wid)
                swninfo = getSenti(desc)
                p1 = p1 + swninfo.pos_score()
                n1 = n1 + swninfo.neg_score()
            if p1 != 0.0 or n1 != 0.0:
                print word, '-> n ', (n1 / len(ll)), ", p ", (p1 / len(ll))
            p = p + p1 / len(ll)
            n = n + n1 / len(ll)
    print "n", n, ", p", p

3. Open Issues

  1. Words from jieba and the Chinese wordnet do not match one-to-one
    Although jieba can load a user-defined dictionary, some of the words it produces have no corresponding sense in the wordnet, e.g. "太后" (empress dowager) and "童子" (young boy), as well as compounds such as "很早以前" (long ago) and "黄山" (Mount Huangshan). Most of these are nouns and need further "learning".
    The interim workaround is to treat them as proper nouns (a small coverage check is sketched after this list).

  2. Polysemy and synonymy
    Whether for sentiment analysis or semantic analysis, in Chinese or in English, the mapping between words and senses has to be resolved.
    The interim solution: collect all senses of a word and average their sentiment scores. In addition, jieba can tag part of speech, which can serve as a further hint (see the POS-filtering sketch after this list).

  3. Semantics
    Semantics is the most fundamental problem. On one hand the sentence structure needs to be analysed; on the other hand the result also depends on the content. Long articles in particular often use devices such as "criticise first, praise later" (先抑后扬) and contrastive analysis, which makes the overall sentiment hard to judge.
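
For the coverage issue in the first item, the two helpers from the script above can be reused to list segmented words that have no lemma in the Chinese Open Wordnet mapping; this is an illustrative sketch only, and "input.txt" is a placeholder file name.

```python
# Sketch: list segmented tokens that have no lemma entry in the loaded COW mapping.
words = doSeg("input.txt")          # placeholder input file
known = loadWordNet()
lemmas = set(lemma for (synset_id, lemma) in known)
oov = [w for w in words if w not in lemmas]
print(", ".join(oov))
```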
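
For the POS hint in the second item, one possible use (sketched below, not taken from the original post) is to assume a rough mapping from jieba tag prefixes n/v/a/d to WordNet's n/v/a/r, keep only the synset ids from the COW mapping whose suffix matches the tagged part of speech, and fall back to all senses when nothing matches.

```python
# Sketch: use jieba POS tags to restrict which WordNet senses get averaged.
import jieba.posseg as pseg

# Rough jieba-tag-prefix -> WordNet POS mapping (an assumption, not exhaustive).
POS_MAP = {'n': 'n', 'v': 'v', 'a': 'a', 'd': 'r'}

def filterByPos(synset_ids, jieba_flag):
    wn_pos = POS_MAP.get(jieba_flag[:1])
    if wn_pos is None:
        return synset_ids                 # unknown tag: keep every sense
    kept = [sid for sid in synset_ids if sid.endswith('-' + wn_pos)]
    return kept or synset_ids             # fall back if nothing matches

for token, flag in pseg.cut(u"这部电影非常精彩"):
    print("%s %s" % (token, flag))        # e.g. "精彩 a"
```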

4. References

  1. Learning lexical scales: WordNet and SentiWordNet
    http://compprag.christopherpotts.net/wordnet.html

  2. SentiWordNet Interface
    http://www.nltk.org/howto/sentiwordnet.html