用时间积累能力,创造服务产品

2024-03-30

陈东博客

用jieba分词库,对文章进行分词几词频统计代码如下

import docx import jieba.posseg import time import os from collections import Counter

Define the base directory

base_dir = r”F:\20231220\词性分析\20240330处理文档”

Define input and output file paths using os.path.join

input_file_path = os.path.join(base_dir, “音频版05遇到贵人的科学方法.docx”) output_file_path = os.path.join(base_dir, “音频版05遇到贵人的科学方法jieba.docx”)

读取输入docx文档

input_doc = docx.Document(input_file_path)

定义存储标注文本的列表

text = []

存储每段标注文本

tagged_text = “”

词频统计字典

adj_counts = Counter() adverb_counts = Counter() noun_counts = Counter() prep_counts = Counter() verb_counts = Counter() conj_counts = Counter()

遍历每段文本进行处理

for para in input_doc.paragraphs:

    ##### 使用jieba分词并标注词性
words = jieba.posseg.cut(para.text)

##### 遍历分词结果 Update Counters based on POS Tags:
for word, flag in words:

##### 标点符号,直接添加词本身 if flag == ‘x’: tagged_text += f”{word}”

    else:
        tagged_text += f"{word} /{flag}"

##### 根据词性统计词频
    if flag.startswith("a"):
         adj_counts[word] += 1
    elif flag.startswith("d"):
        adverb_counts[word] += 1
    elif flag.startswith("n"):
        noun_counts[word] += 1
    elif flag.startswith("p"):
        prep_counts[word] += 1
    elif flag.startswith("v"):
        verb_counts[word] += 1
    elif flag.startswith("c"):
        conj_counts[word] += 1
将标注文本添加到列表

text.append(tagged_text) tagged_text = “”

输出到docx文档

output_doc = docx.Document()

Output Tagged Text to Document:

for para in text: output_doc.add_paragraph(para)

Sorting Word Counts for Different Parts of Speech:

’’’ lambda x: x[1] is an anonymous (lambda) function in Python.

x is the input argument (a tuple in this case). x[1] extracts the second element of the tuple

’’’ sorted_adj = sorted(adj_counts.items(), key=lambda x: x[1], reverse=True) sorted_adverb = sorted(adj_counts.items(), key=lambda x: x[1], reverse=True) sorted_nouns = sorted(noun_counts.items(), key=lambda x: x[1], reverse=True) sorted_preps = sorted(prep_counts.items(), key=lambda x: x[1], reverse=True) sorted_verb = sorted(prep_counts.items(), key=lambda x: x[1], reverse=True) sorted_conj = sorted(conj_counts.items(), key=lambda x: x[1], reverse=True)

””” 所有介词的词频,而不限制top 10,可以直接遍历整个排序后的介词词频列表,不指定切片: output_doc.add_paragraph(“\n介词词频:”) for prep, count in sorted_preps[10]: output_doc.add_paragraph(f”{prep}:{count}”) 直接遍历 sorted_preps 这个已经按词频排序的列表,就可以输出所有的介词和词频,不再限制前10个。

”””

’’’ 如果想要限制输出词频最小值,可以添加if判断: output_doc.add_paragraph(“\n介词词频:”) for prep, count in sorted_preps: if count >= 5: output_doc.add_paragraph(f”{prep}:{count}”) ‘’’

Sort and Output Top 10 Nouns:
Sort and Output Adjectives, Prepositions, and Verbs:
output_doc.add_paragraph(“\n名词词频:”)
for noun, count in sorted_nouns:
output_doc.add_paragraph(f”{noun}: {count}”)

#####

output_doc.add_paragraph(“\nadj:”)
for adj, count in sorted_adj:
output_doc.add_paragraph(f”{adj}: {count}”)

#####

output_doc.add_paragraph(“\npreps:”)
for prep, count in sorted_preps:
output_doc.add_paragraph(f”{prep}: {count}”)

#####

output_doc.add_paragraph(“\nverb:”)
for verb, count in sorted_verb:
output_doc.add_paragraph(f”{verb}: {count}”)
Sort and Output Adjectives, Prepositions, and Verbs:
通过先把词和词频拼接成字符串,最后一次性输出字符串,实现了词频统计结果的连在一起的效果。

output_doc.add_paragraph(“\n1名词词频:”) noun_str = “” for noun, count in sorted_nouns:

在拼接字符串时,在词和词频之间添加了逗号,使得各词频统计结果之间用逗号分隔开。
noun_str += f"{noun} {count} ," output_doc.add_paragraph(noun_str)
输出形容词词频

output_doc.add_paragraph(“\n2形容词:”) adj_str = “” for adj, count in sorted_adj: adj_str += f”{adj} {count} ,” output_doc.add_paragraph(adj_str)

输出介词词频

output_doc.add_paragraph(“\n3介词:”) prep_str = “” for prep, count in sorted_preps: prep_str += f”{prep} {count} ,” output_doc.add_paragraph(prep_str)

输出动词词频

output_doc.add_paragraph(“\n4动词:”) verb_str = “” for verb, count in sorted_verb: verb_str += f”{verb} {count} ,” output_doc.add_paragraph(verb_str)

output_doc.add_paragraph(“\n5conjunction”) conj_str = “” for conj, count in sorted_conj: conj_str += f”{conj} {count} , “ output_doc.add_paragraph(conj_str)

output_doc.add_paragraph(“\n6adverb”) adverb_str = “” for adverb, count in sorted_adverb: adverb_str += f”{adverb} {count} , “ output_doc.add_paragrraph(adverb_str)

Save Output Document:

output_doc.save(output_file_path)

End the timer
End Timer and Display Execution Time:

end_time = time.time() time_taken = end_time - start_time

print(f”输出文件已保存到:{output_file_path}”) print(time_taken)