NLP Text Vectorization (with Python Code)

Posted by 520jefferson on 2023-01-18, from Beijing


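Author: Ctrl CV. Originally published on Zhihu: https://zhuanlan.zhihu.com/p/597088538

Human language is highly ambiguous: a single sentence may carry several meanings or metaphors, and computers cannot yet truly understand the meaning of language or text. The prevailing approach today is therefore to convert speech and text into vectors first, and then analyze those vectors or model them with deep learning.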

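Contents:
I. Common text vectorization methods
(1) One-hot word vectors
(2) Bag-of-words model (BOW)
(3) Term frequency-inverse document frequency (TF-IDF)
(4) N-gram model
(5) Word-vector model: Word2vec
(6) Document-vector model: Doc2vec
(7) GloVe model
II. TensorFlow embedding visualization tool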

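I. Common text vectorization methods

(1) One-hot word vectors

Also called one-hot encoding. Each word is represented as a vector of n elements in which exactly one element is 1 and all the others are 0. Different words have their 1 at different positions, and n is the number of distinct words in the whole corpus.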

# Import the Keras vocabulary mapper, Tokenizer
from tensorflow.keras.preprocessing.text import Tokenizer

# Assume vocab holds every distinct word in the corpus
# (a list rather than a set, so the iteration order is reproducible)
vocab = ['我', '愛', '北京', '天安門', '升國旗']
# Instantiate a vocabulary mapper
t = Tokenizer(num_words=None, char_level=False)
# Fit the mapper on the existing text data
t.fit_on_texts(vocab)

for token in vocab:
    zero_list = [0] * len(vocab)
    # texts_to_sequences maps each word to a natural number starting from 1.
    # It returns a nested list such as [[2]], so [0][0] extracts the number;
    # subtracting 1 turns it into a 0-based list index.
    token_index = t.texts_to_sequences([token])[0][0] - 1
    zero_list[token_index] = 1
    print(token, 'one-hot encoding:', zero_list)

Drawbacks of one-hot encoding: it completely severs the relationships between words, and on a large corpus every vector becomes very long and consumes a lot of memory.
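To make the first drawback concrete, here is a minimal sketch (the toy vocabulary and the helper names `one_hot` and `cosine` are illustrative assumptions, not part of the original code). It shows that any two different one-hot vectors are orthogonal, so their cosine similarity is always 0 no matter how related the words are.

```python
import numpy as np

vocab = ["i", "love", "beijing", "tiananmen"]  # hypothetical toy vocabulary
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return the one-hot vector of `word` over the toy vocabulary."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[index[word]] = 1
    return vec


def cosine(a, b):
    """Cosine similarity of two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print(cosine(one_hot("love"), one_hot("love")))     # 1.0: the same word
print(cosine(one_hot("love"), one_hot("beijing")))  # 0.0: any two different words
```

The vector length also equals the vocabulary size, which is why memory use grows with the corpus.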

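(2) Bag-of-words model (BOW)

A bag of words collects the vocabulary of an article and counts how many times each word occurs; the top-ranked words are then used to guess what the article is about.

The procedure is:

Tokenization: split the article into individual words and collect them into a vocabulary or dictionary. English text is usually split on whitespace or periods; Chinese needs a dedicated tool such as jieba.
Preprocessing: lemmatize the words and convert them to lowercase, so that variants of the same word are not counted separately.
Stop-word removal: forms of "be", auxiliary verbs, prepositions, articles and similar words that carry no specific meaning are called stop words. They occur in large numbers and must be removed, otherwise they dominate the counts.
Word-frequency counting: count how many times each word occurs in the article and sort from high to low.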

# coding=utf-8
import collections

stop_words = ['\n', 'or', 'are', 'they', 'i', 'some', 'by', '—',
              'even', 'the', 'to', 'a', 'and', 'of', 'in', 'on', 'for',
              'that', 'with', 'is', 'as', 'could', 'its', 'this', 'other',
              'an', 'have', 'more', 'at', 'don’t', 'can', 'only', 'most']

maxlen = 0  # length (in words) of the longest line seen so far
word_freqs = collections.Counter()
with open('../data/NLP_data/news.txt', 'r', encoding='utf8') as f:
    for line in f:
        words = line.lower().split(' ')
        if len(words) > maxlen:
            maxlen = len(words)

        # Word-frequency counting: skip stop words, count everything else
        for word in words:
            if word not in stop_words:
                word_freqs[word] += 1

# The 20 most frequent words
print(word_freqs.most_common(20))

# Equivalent: sort the dictionary items by value
# a1 = sorted(word_freqs.items(), key=lambda x: x[1], reverse=True)
# print(a1[:20])
'''
[('stores', 15), ('convenience', 14), ('korean', 6), ('these', 6), ('one', 6), ('it’s', 6), ('from', 5), ('my', 5), ('you', 5), ('their', 5), ('just', 5), ('has', 5), ('new', 4), ('do', 4), ('also', 4), ('which', 4), ('find', 4), ('would', 4), ('like', 4), ('up', 4)]
'''
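The script above only ranks words by frequency. As a minimal sketch of the vectorization step itself, the following turns each document into a fixed-length count vector over a shared vocabulary. The two toy documents, the short stop-word set, and the helper name `bow_vector` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy documents, already lowercased
docs = [
    "the store sells korean snacks",
    "the new store sells snacks and snacks",
]
stop_words = {"the", "and"}  # illustrative stop-word list

# Tokenize on whitespace and drop stop words
tokenized = [[w for w in d.split() if w not in stop_words] for d in docs]

# Shared vocabulary: one vector position per distinct word, sorted for stability
vocabulary = sorted({w for doc in tokenized for w in doc})


def bow_vector(tokens):
    """Count vector of `tokens` over the shared vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]


vectors = [bow_vector(doc) for doc in tokenized]
print(vocabulary)  # ['korean', 'new', 'sells', 'snacks', 'store']
for v in vectors:
    print(v)       # [1, 0, 1, 1, 1] then [0, 1, 1, 2, 1]
```

Every document gets a vector of the same length, so documents can be compared position by position.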

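(3) Term frequency-inverse document frequency (TF-IDF)

BOW is very simple and works reasonably well, but it has a weakness: some words are not stop words and occur often in an article, yet matter little to the text as a whole. Words such as "only" and "most" do not help much in guessing what an article is about. TF-IDF is an improved algorithm that gives lower scores to words that occur across many documents: if "only" appears in every document, its TF-IDF score will be very low.

Step 1: compute the term frequency (TF).

Because articles differ in length, the term frequency is normalized so that different articles can be compared:

TF = (number of times the term occurs in the article) / (total number of terms in the article)

or

TF = (number of times the term occurs in the article) / (number of occurrences of the most frequent term in the article)

Step 2: compute the inverse document frequency (IDF).

This requires a corpus that models the environment in which the language is used:

IDF = log( (total number of documents in the corpus) / (number of documents containing the term + 1) )

The more common a term is, the larger the denominator and the smaller the inverse document frequency, which moves toward 0. The 1 is added to the denominator so that it is never 0 (which would happen if no document contained the term). log means taking the logarithm of the resulting value.

Step 3: compute TF-IDF.

TF-IDF = TF × IDF

TF-IDF is proportional to the number of times a term occurs in the document and inversely proportional to how often it occurs across the whole language. The algorithm for automatic keyword extraction is therefore straightforward: compute the TF-IDF value of every term in the document, sort in descending order, and take the top few terms.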
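The three steps can be followed with a small worked calculation. This is only a sketch under stated assumptions: TF is taken as the term count divided by the article length, IDF as the logarithm of the corpus size over the number of documents containing the term plus 1, the logarithm is base 10 (the text does not fix a base), and all four input numbers are made up for illustration.

```python
import math

# Hypothetical numbers for one term in one article
term_count = 20          # occurrences of the term in the article
total_terms = 1000       # total number of terms in the article
n_documents = 1_000_000  # documents in the corpus
docs_with_term = 999     # documents that contain the term

# Step 1: normalized term frequency
tf = term_count / total_terms

# Step 2: inverse document frequency, with +1 in the denominator
idf = math.log10(n_documents / (docs_with_term + 1))

# Step 3: TF-IDF is the product of the two
tf_idf = tf * idf

print(f"TF = {tf:.4f}, IDF = {idf:.4f}, TF-IDF = {tf_idf:.4f}")
```

With these numbers the term appears in one document out of a thousand, so IDF is about 3 and the score is about 0.06. A term found in almost every document would have an IDF near 0 and a score near 0 regardless of its TF.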

# coding=utf-8
# Matching question-answer pairs with TF-IDF
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third document.',
    'Is this the first document?'
]

# Bag-of-words counts
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement
word = vectorizer.get_feature_names_out()
print('Vocabulary:', word)

print(x.toarray())

# TF-IDF transformation
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(x)
print(np.around(tfidf.toarray(), 4))

# Compare the last sentence with every other sentence
print(cosine_similarity(tfidf[-1], tfidf[:-1], dense_output=False))

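Note that scikit-learn, with its default settings, computes TF-IDF with a slightly different formula:

IDF = ln( (1 + total number of documents) / (1 + number of documents containing the term) ) + 1

Both the numerator and the denominator are increased by 1 for smoothing, and each document's TF-IDF vector is then normalized by dividing it by the square root of the sum of its squared entries (the L2 norm).

Complete code for a manual TF-IDF implementation: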

# coding=utf-8
import math

corpus = [
    'what is the weather like today',
    'what is for dinner tonight',
    'this is a question worth pondering',
    'it is a beautiful day today'
]
# Tokenize the corpus
words = []
for i in corpus:
    words.append(i.split())


# Count term frequencies: one {word: count} dictionary per document
def count_words(word_list):
    wordcount = []
    for i in word_list:
        count = {}
        for j in i:
            count[j] = count.get(j, 0) + 1
        wordcount.append(count)
    return wordcount


wordcount = count_words(words)

print(wordcount)


# TF (word is the term being scored; word_list is the {word: count}
# dictionary of the document that contains it)
def tf(word, word_list):
    return word_list.get(word) / sum(word_list.values())


# Number of sentences that contain the word
def count_sentence(word, wordcount):
    return sum(1 for i in wordcount if i.get(word))


# IDF
def idf(word, wordcount):
    # Unsmoothed variant (math.log is the natural logarithm, base e):
    # return math.log(len(wordcount) / (count_sentence(word, wordcount) + 1))
    # Smoothed variant: 1 added to numerator and denominator, plus 1 overall
    return math.log((1 + len(wordcount)) / (count_sentence(word, wordcount) + 1)) + 1


# TF-IDF
def tfidf(word, word_list, wordcount):
    return tf(word, word_list) * idf(word, wordcount)


for p, i in enumerate(wordcount, start=1):
    print('part:{}'.format(p))
    tf_idfs = 0
    for j in i:
        print('word: {} ---- TF-IDF:{}'.format(j, tfidf(j, i, wordcount)))
        # Accumulate the squared scores for normalization
        tf_idfs += tfidf(j, i, wordcount) ** 2

    # L2 norm of the document's TF-IDF vector
    tf_idfs = tf_idfs ** 0.5
    print(tf_idfs)

    for j in i:
        print('normalized: word: {} ---- TF-IDF:{}'.format(j, tfidf(j, i, wordcount) / tf_idfs))

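'''
Output of the unsmoothed IDF variant, log(N / (df + 1)), before normalization.
"is" occurs in all four sentences, so its IDF is log(4 / 5) < 0 and its score is negative.
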
part:1
word: what ---- TF-IDF:0.04794701207529681
word: is ---- TF-IDF:-0.03719059188570162
word: the ---- TF-IDF:0.11552453009332421
word: weather ---- TF-IDF:0.11552453009332421
word: like ---- TF-IDF:0.11552453009332421
word: today ---- TF-IDF:0.04794701207529681
part:2
word: what ---- TF-IDF:0.05753641449035617
word: is ---- TF-IDF:-0.044628710262841945
word: for ---- TF-IDF:0.13862943611198905
word: dinner ---- TF-IDF:0.13862943611198905
word: tonight ---- TF-IDF:0.13862943611198905
part:3
word: this ---- TF-IDF:0.11552453009332421
word: is ---- TF-IDF:-0.03719059188570162
word: a ---- TF-IDF:0.04794701207529681
word: question ---- TF-IDF:0.11552453009332421
word: worth ---- TF-IDF:0.11552453009332421
word: pondering ---- TF-IDF:0.11552453009332421
part:4
word: it ---- TF-IDF:0.11552453009332421
word: is ---- TF-IDF:-0.03719059188570162
word: a ---- TF-IDF:0.04794701207529681
word: beautiful ---- TF-IDF:0.11552453009332421
word: day ---- TF-IDF:0.11552453009332421
word: today ---- TF-IDF:0.04794701207529681

'''
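One detail is easy to miss: the manual code divides TF by the document length, whereas a library that uses raw counts as TF (as scikit-learn's documentation describes for its defaults) does not. The sketch below, which reuses the four sentences from the manual example and introduces the hypothetical helpers `smoothed_idf` and `l2_normalize`, suggests why the two still agree after normalization: dividing every entry of a vector by the same constant does not change its L2-normalized form.

```python
import math

corpus = [
    "what is the weather like today",
    "what is for dinner tonight",
    "this is a question worth pondering",
    "it is a beautiful day today",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)


def smoothed_idf(term):
    """ln((1 + N) / (1 + df)) + 1, the smoothed IDF."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1


def l2_normalize(weights):
    """Divide every weight by the L2 norm of the whole vector."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


doc = docs[0]
terms = sorted(set(doc))

# Variant A: TF is the raw count
raw = {t: doc.count(t) * smoothed_idf(t) for t in terms}
# Variant B: TF is the count divided by the document length
scaled = {t: doc.count(t) / len(doc) * smoothed_idf(t) for t in terms}

norm_raw = l2_normalize(raw)
norm_scaled = l2_normalize(scaled)
for t in terms:
    print(f"{t:8s} {norm_raw[t]:.4f} {norm_scaled[t]:.4f}")
```

The two printed columns match for every term, and "is", which occurs in every sentence, receives the lowest weight.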

(4) N-gram model

Given a text sequence, the features formed by n adjacent co-occurring words or characters are its n-gram features. The most common are bi-gram and tri-gram features, for n = 2 and n = 3 respectively.

# n is usually 2 or 3 for n-grams; 3 is used here as an example
ngram_range = 3


def create_ngram_set(input_list):
    '''
    description: extract all n-gram features from a list of integers
    :param input_list: input list of integers, i.e. a sentence after
                       vocabulary mapping; each value lies in [1, 25000]
    :return: set of n-gram features

    eg (with ngram_range = 3):
    # >>> create_ngram_set([1, 4, 9, 4, 1, 4])
    {(1, 4, 9), (4, 9, 4), (9, 4, 1), (4, 1, 4)}
    '''
    return set(zip(*[input_list[i:] for i in range(ngram_range)]))


if __name__ == '__main__':
    input_list = [1, 3, 2, 1, 5, 3]
    res = create_ngram_set(input_list)
    print(res)
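The same zip-based idea works directly on word tokens. The following is a minimal sketch with a hypothetical helper `ngrams` that takes n as a parameter and returns a list, so that order and repeated n-grams are preserved (the set-based version above discards both).

```python
def ngrams(tokens, n):
    """Return the n-grams of `tokens` as a list of tuples, in order."""
    return list(zip(*[tokens[i:] for i in range(n)]))


tokens = "i love natural language processing".split()
print(ngrams(tokens, 2))  # 4 bi-grams, starting with ('i', 'love')
print(ngrams(tokens, 3))  # 3 tri-grams, ending with ('natural', 'language', 'processing')
```

A sequence of k tokens yields k - n + 1 n-grams, and none at all when n is larger than k.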

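(5) Word-vector model: Word2vec

BOW and TF-IDF only look at how many times words occur in a document and ignore the fact that language has context. To capture contextual relationships, a research team at Google proposed Word2vec, which represents each word by its context and converts it into a vector. This is word embedding. Unlike TF-IDF, whose output is a sparse vector, the output of a word embedding is a dense vector space.

Word vectors can be trained in two ways: CBOW, which predicts a word from its surrounding context, and Skip-gram, which predicts the surrounding context from the word.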
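The difference between the two Word2vec training schemes, CBOW and Skip-gram, is easiest to see in the training examples each one draws from a sentence. The sketch below only generates those examples and does not train anything; the helper names `skipgram_pairs` and `cbow_pairs` and the window size are illustrative assumptions.

```python
def skipgram_pairs(tokens, window):
    """Skip-gram: (center, context) pairs, the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window):
    """CBOW: (context_words, center) pairs, the neighbors together predict the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs


tokens = ["cat", "say", "meow"]
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

Skip-gram produces one example per (word, neighbor) pair, while CBOW produces one example per word with all of its neighbors as input.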

# coding=utf-8
import gzip
import gensim

from gensim.test.utils import common_texts
# vector_size: dimensionality of the word vectors; window: number of context words on each side
# min_count: minimum number of occurrences for a word to be kept; workers: number of threads
model_simple = gensim.models.Word2Vec(sentences=common_texts, window=1,
                                      min_count=1, workers=4)
# train() returns the number of effective words and the total number of words processed
print(model_simple.train([['hello', 'world', 'michael']], total_examples=1, epochs=2))

sentences = [['cat', 'say', 'meow'], ['dog', 'say', 'woof']]

model_simple = gensim.models.Word2Vec(min_count=1)
model_simple.build_vocab(sentences)  # build the vocabulary
print(model_simple.train(sentences, total_examples=model_simple.corpus_count,
                         epochs=model_simple.epochs))


# Load the OpinRank corpus: reviews of cars and hotels
data_file = '../nlp-in-practice-master/word2vec/reviews_data.txt.gz'

# Peek at the first line
with gzip.open(data_file, 'rb') as f:
    for i, line in enumerate(f):
        print(line)
        break


# Read the OpinRank corpus and preprocess it
def read_input(input_file):
    with gzip.open(input_file, 'rb') as f:
        for line in f:
            # Preprocessing: lowercase and tokenize
            yield gensim.utils.simple_preprocess(line)


# Load and tokenize the OpinRank corpus
documents = list(read_input(data_file))
# print(documents)

print(len(documents))

# Train the Word2Vec model (about 10 minutes).
# Passing documents to the constructor already trains the model;
# the explicit train() call then continues training for 10 more epochs.
model = gensim.models.Word2Vec(documents,
                               vector_size=150, window=10,
                               min_count=2, workers=10)
print(model.train(documents, total_examples=len(documents), epochs=10))


# Words similar to "dirty"
w1 = 'dirty'
print(model.wv.most_similar(positive=w1))
# positive: words to be similar to


# Words similar to "polite"
w1 = ['polite']
print(model.wv.most_similar(positive=w1, topn=6))
# topn: list only the top n results


# Words similar to "france"
w1 = ['france']
print(model.wv.most_similar(positive=w1, topn=6))
# topn: list only the top n results


# Words similar to "bed", "sheet" and "pillow" and dissimilar to "couch"
w1 = ['bed', 'sheet', 'pillow']
w2 = ['couch']
print(model.wv.most_similar(positive=w1, negative=w2, topn=10))
# negative: words to be dissimilar to

# Similarity between two words
print(model.wv.similarity(w1='dirty', w2='smelly'))
print(model.wv.similarity(w1='dirty', w2='dirty'))

print(model.wv.similarity(w1='dirty', w2='clean'))

# Pick the word that does not belong with the others
print(model.wv.doesnt_match(['cat', 'dog', 'france']))

# Keyword extraction
# (gensim.summarization was removed in gensim 4.0, so this part is left commented out)
# https:///gensim_3.8.3/summarization/keywords.html
# from gensim.summarization import keywords


# # Test text
# text = '''Challenges in natural language processing frequently involve
# speech recognition, natural language understanding, natural language
# generation (frequently from formal, machine-readable logical forms),
# connecting language and machine perception, dialog systems, or some
# combination thereof.'''

# Keyword extraction
# print(''.join(keywords(text)))

(6) Document-vector model: Doc2vec

The Doc2vec model was inspired by Word2Vec. The word vectors learned by Word2Vec carry word meaning, and Doc2vec builds on the same structure, so it overcomes the bag-of-words model's lack of semantics. Each sentence in the training set is one training sample. Like Word2Vec, Doc2vec has two training modes: the Distributed Memory Model of Paragraph Vectors (PV-DM), which is analogous to CBOW in Word2Vec, and the Distributed Bag of Words version of Paragraph Vector (PV-DBOW), which is analogous to Skip-gram.
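As a rough illustration of the analogy between PV-DM and CBOW and between PV-DBOW and Skip-gram, the sketch below lists the kind of training examples each mode uses for one tagged sentence. It is a simplification under stated assumptions (a symmetric context window, no sampling, hypothetical helper names) and not how any particular library implements the models.

```python
def pv_dm_examples(doc_id, tokens, window):
    """PV-DM: the document tag plus the surrounding words predict the center word."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append(((doc_id, context), target))
    return examples


def pv_dbow_examples(doc_id, tokens):
    """PV-DBOW: the document tag alone predicts each word of the document."""
    return [(doc_id, target) for target in tokens]


tokens = ["find", "allergen", "information"]
print(pv_dm_examples(0, tokens, window=1))
print(pv_dbow_examples(0, tokens))
```

In gensim, the `dm` argument of `Doc2Vec` selects the mode: `dm=1` is PV-DM and `dm=0` is PV-DBOW.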

# coding=utf-8
import string

import numpy as np
import nltk
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.metrics.pairwise import cosine_similarity

# Read the FAQ corpus, one question per line
with open('../data/FAQ/starbucks_faq.txt', 'r', encoding='utf8') as f:
    corpus = f.readlines()

print(corpus)

MAX_WORDS_A_LINE = 30

print(string.punctuation)

# Requires the NLTK stop-word list: nltk.download('stopwords')
stopword_list = set(nltk.corpus.stopwords.words('english')
                    + list(string.punctuation) + ['\n'])


# Tokenizer
# Note: simple_preprocess's max_len is the maximum length of a single token
# in characters, not the number of words per line
def tokenize(text, stopwords, max_len=MAX_WORDS_A_LINE):
    return [token for token in gensim.utils.simple_preprocess(text, max_len=max_len)
            if token not in stopwords]


# Tokenize every line
document_tokens = []  # cleaned tokens
for line in corpus:
    document_tokens.append(tokenize(line, stopword_list))

# Convert to gensim's TaggedDocument format
tagged_corpus = [TaggedDocument(doc, [i]) for i, doc in
                 enumerate(document_tokens)]

# Train the Doc2Vec model (passing the corpus to the constructor already
# trains it; the explicit train() call continues training)
model_d2v = Doc2Vec(tagged_corpus, vector_size=MAX_WORDS_A_LINE, epochs=200)
model_d2v.train(tagged_corpus, total_examples=model_d2v.corpus_count,
                epochs=model_d2v.epochs)

# Infer a vector for every question
questions = []
for i in range(len(document_tokens)):
    questions.append(model_d2v.infer_vector(document_tokens[i]))
questions = np.array(questions)
# print(questions.shape)

# Test query
# text = 'find allergen information'
# text = 'mobile pay'
text = 'verification code'
filtered_tokens = tokenize(text, stopword_list)
# print(filtered_tokens)

# Compare the query with every question
similarity = cosine_similarity(model_d2v.infer_vector(
    filtered_tokens).reshape(1, -1), questions, dense_output=False)

# Top 10
top_n = np.argsort(np.array(similarity[0]))[::-1][:10]
print(f'Top 10 indices: {top_n}\n')
for i in top_n:
    print(round(similarity[0][i], 4), corpus[i].rstrip('\n'))

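(7) GloVe model

GloVe is another word-embedding model, proposed at Stanford University. Its authors argue that Word2vec does not take the global probability distribution into account: it only samples words within a sliding window and so does not capture information from the whole text. They therefore introduced a word co-occurrence matrix that accounts for the probability of words appearing together, addressing both Word2vec's purely local view and the sparse vector space of BOW.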
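The co-occurrence matrix at the heart of this idea can be sketched in a few lines. The toy corpus, the window size of 1, and the plain unweighted counting are assumptions made for illustration; the sketch only builds the counts and does not fit any word vectors to them.

```python
import numpy as np

# Hypothetical toy corpus, already tokenized
corpus = [
    ["cat", "say", "meow"],
    ["dog", "say", "woof"],
]
window = 1  # illustrative context window size

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# cooc[i, j] = number of times word j appears within `window` words of word i,
# accumulated over the whole corpus rather than one window at a time
cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    for i, word in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[index[word], index[sent[j]]] += 1

print(vocab)
print(cooc)
```

Because the counts are summed over every sentence, the row for "say" records that it co-occurs with "cat", "dog", "meow" and "woof", which is information no single window contains.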

# coding=utf-8
# Load the required packages
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import euclidean
from sklearn.manifold import TSNE

# Load the GloVe word-vector file glove.6B.300d.txt
'''
https://github.com/stanfordnlp/GloVe
'''
embeddings_dict = {}
with open('../data/glove/glove.6B.300d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], 'float32')
        embeddings_dict[word] = vector

# Look up an arbitrary word (love) to get its GloVe vector
# print(embeddings_dict['love'])

# Number of words
# print(len(embeddings_dict.keys()))


# Rank all words by Euclidean distance to the given embedding
def find_closest_embeddings(embedding):
    return sorted(embeddings_dict.keys(),
                  key=lambda word: euclidean(embeddings_dict[word], embedding))


# The 9 words closest to "king" (index 0 is "king" itself)
words = find_closest_embeddings(embeddings_dict['king'])[1:10]
print(words)

# Alternatively, pick 100 arbitrary words
# words = list(embeddings_dict.keys())[100:200]
# print(words)

# Reduce to two features with t-SNE
# (perplexity must be smaller than the number of samples, 9 here)
tsne = TSNE(n_components=2, perplexity=5)
vectors = np.array([embeddings_dict[word] for word in words])
Y = tsne.fit_transform(vectors)

# Scatter plot to inspect word similarity
plt.figure(figsize=(12, 8))
plt.axis('off')
plt.scatter(Y[:, 0], Y[:, 1])
for label, x, y in zip(words, Y[:, 0], Y[:, 1]):
    plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')

plt.show()

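II. TensorFlow embedding visualization tool

https://projector./

After you type a word into the search field on the right, the system shows candidate words.
Select one of the candidates and the system shows similar words; the vectors are reduced in dimension with one of several algorithms, such as UMAP, t-SNE, or PCA.
Choose 'Isolate 33 points' to show the 33 nearest words.
Switch between embedding models: Word2Vec All, Word2Vec 10k, and so on.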
