Python fuzzy matching strings in list performance

Question

I'm checking whether there are similar results (fuzzy matches) across 4 columns of the same dataframe, and I have the following code as an example. When I apply it to the real dataset of 40,000 rows x 4 columns, it seems to run forever. The issue is that the code is too slow. For example, if I limit the dataset to 10 users, it takes 8 minutes to compute, while for 20 users it takes 19 minutes. Is there anything I am missing? I do not know why this takes so long. I expect to have all results in 2 hours at most. Any hint or help would be greatly appreciated.

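from fuzzywuzzy import process

dataframecolumn = ["apple", "tb"]
compare = ["adfad", "apple", "asple", "tab"]

# For each value, get the best-scoring candidates from `compare`
# as (candidate, score) tuples, highest score first.
Ratios = [process.extract(x, compare) for x in dataframecolumn]

result = list()
for ratio in Ratios:
    for match in ratio:
        # Skip exact matches (score 100) and keep the first near match.
        if match[1] != 100:
            result.append(match)
            break
print(result)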

Output: [('asple', 80), ('tab', 80)]

Recommended answer

Major speed improvements come from writing vectorized operations and avoiding explicit loops.

Import the necessary packages

from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np

Create a dataframe from the first list

dataframecolumn = pd.DataFrame(["apple","tb"])
dataframecolumn.columns = ['Match']

Create a dataframe from the second list

compare = pd.DataFrame(["adfad","apple","asple","tab"])
compare.columns = ['compare']

Merge: build the Cartesian product of the two dataframes by introducing a shared key, then drop exact matches

dataframecolumn['Key'] = 1
compare['Key'] = 1
combined_dataframe = dataframecolumn.merge(compare,on="Key",how="left")
combined_dataframe = combined_dataframe[~(combined_dataframe.Match==combined_dataframe.compare)]
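The helper `Key` column is one way to get a Cartesian product. As a minimal sketch, assuming pandas 1.2 or later, `merge(..., how="cross")` should produce the same pairs without it. The names `left`, `right`, and `pairs` are illustrative and do not come from the answer.

```python
import pandas as pd

# Same sample data as the answer, under illustrative names.
left = pd.DataFrame({"Match": ["apple", "tb"]})
right = pd.DataFrame({"compare": ["adfad", "apple", "asple", "tab"]})

# how="cross" pairs every row of `left` with every row of `right`,
# so no temporary key column is needed (assumes pandas >= 1.2).
pairs = left.merge(right, how="cross")

# Drop exact self-matches, as the answer does.
pairs = pairs[pairs["Match"] != pairs["compare"]].reset_index(drop=True)
print(pairs)
```

Either way, the product has len(left) * len(right) rows before filtering, so it grows quickly with the input size.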

Vectorization

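def partial_match(x, y):
    # Despite the name, this uses fuzz.ratio (full-string similarity, 0-100),
    # not fuzz.partial_ratio.
    return fuzz.ratio(x, y)

# Wrap the scalar scorer so it can be called element-wise on two columns.
partial_match_vector = np.vectorize(partial_match)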
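One caveat worth knowing: the NumPy documentation describes `np.vectorize` as a convenience wrapper whose implementation is essentially a for loop, so most of the gain here likely comes from the restructured pipeline rather than from the wrapper itself. The sketch below shows the same element-wise pattern with `otypes` set, which fixes the output dtype and lets the wrapper accept empty inputs. To stay dependency-free it swaps in a stand-in scorer, `ratio`, built on the standard library's `difflib`; that substitution is an assumption for illustration and may not match `fuzz.ratio` on every input.

```python
import numpy as np
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in scorer (assumption): 0-100 similarity via difflib."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# otypes declares the result dtype up front, so NumPy does not need to
# call the function on a first element to infer it.
ratio_vec = np.vectorize(ratio, otypes=[int])

left = np.array(["apple", "tb"])
right = np.array(["asple", "tab"])
scores = ratio_vec(left, right)  # element-wise: left[i] against right[i]

# With otypes set, empty inputs return an empty array instead of raising.
empty_scores = ratio_vec(np.array([], dtype=str), np.array([], dtype=str))
print(scores, empty_scores.shape)
```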

Apply the vectorized function and keep only the rows at or above a score threshold

combined_dataframe['score']=partial_match_vector(combined_dataframe['Match'],combined_dataframe['compare'])
combined_dataframe = combined_dataframe[combined_dataframe.score>=80]
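On a dataset the size of the one in the question, the full Cartesian product may be too large to hold in memory at once. One possible workaround, sketched here under stated assumptions, is to process the left-hand values in chunks and keep only the rows that pass the threshold. The function `fuzzy_pairs`, its `chunk_size` parameter, and the `difflib`-based stand-in scorer `ratio` are all hypothetical names introduced for this illustration; they are not part of the answer or of fuzzywuzzy.

```python
import pandas as pd
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in scorer (assumption): 0-100 similarity via difflib."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def fuzzy_pairs(left, right, threshold=80, chunk_size=1000):
    """Score left['Match'] against right['compare'] one chunk at a time."""
    kept = []
    for start in range(0, len(left), chunk_size):
        block = left.iloc[start:start + chunk_size]
        # Cartesian product of this chunk only (assumes pandas >= 1.2).
        pairs = block.merge(right, how="cross")
        pairs = pairs[pairs["Match"] != pairs["compare"]]
        scores = [ratio(a, b) for a, b in zip(pairs["Match"], pairs["compare"])]
        pairs = pairs.assign(score=scores)
        # Only rows at or above the threshold survive the chunk.
        kept.append(pairs[pairs["score"] >= threshold])
    if not kept:
        return pd.DataFrame(columns=["Match", "compare", "score"])
    return pd.concat(kept, ignore_index=True)


left = pd.DataFrame({"Match": ["apple", "tb"]})
right = pd.DataFrame({"compare": ["adfad", "apple", "asple", "tab"]})
result = fuzzy_pairs(left, right, threshold=80, chunk_size=1)
print(result)
```

The total number of comparisons is unchanged; chunking only bounds how many pairs exist in memory at any one time.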

Result

+-------+-----+---------+-------+
| Match | Key | compare | score |
+-------+-----+---------+-------+
| apple | 1   | asple   | 80    |
| tb    | 1   | tab     | 80    |
+-------+-----+---------+-------+
