pandas dataframe: duplicates based on column and time range

Problem description

I have a pandas dataframe (heavily simplified here) which looks like this:

df

    datetime             user   type   msg
0  2012-11-11 15:41:08   u1     txt    hello world
1  2012-11-11 15:41:11   u2     txt    hello world
2  2012-11-21 17:00:08   u3     txt    hello world
3  2012-11-22 18:08:35   u4     txt      hello you
4  2012-11-22 18:08:37   u5     txt      hello you

What I would like to do now is get all the duplicate messages whose timestamps are within 3 seconds of each other. The desired output would be:

   datetime              user   type   msg
0  2012-11-11 15:41:08   u1     txt    hello world
1  2012-11-11 15:41:11   u2     txt    hello world
3  2012-11-22 18:08:35   u4     txt      hello you
4  2012-11-22 18:08:37   u5     txt      hello you

The third row (index 2) is left out: its text is the same as in rows one and two, but its timestamp is not within 3 seconds of theirs.

I tried passing the columns datetime and msg as the subset for the duplicated() method, but it returns an empty dataframe because the timestamps are not identical:

mask = df.duplicated(subset=['datetime', 'msg'], keep=False)

print(df[mask])
Empty DataFrame
Columns: [datetime, user, type, msg, MD5]
Index: []
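For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample frame from the table above (the real frame apparently has more columns, such as MD5, which are left out here). The names exact_mask and msg_only_mask are only illustrative.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-11-11 15:41:08",
        "2012-11-11 15:41:11",
        "2012-11-21 17:00:08",
        "2012-11-22 18:08:35",
        "2012-11-22 18:08:37",
    ]),
    "user": ["u1", "u2", "u3", "u4", "u5"],
    "type": ["txt"] * 5,
    "msg": ["hello world", "hello world", "hello world",
            "hello you", "hello you"],
})

# duplicated() compares the subset columns for exact equality, so
# timestamps that are merely close to each other never match.
exact_mask = df.duplicated(subset=["datetime", "msg"], keep=False)

# Matching on msg alone ignores time entirely and flags every row here.
msg_only_mask = df.duplicated(subset=["msg"], keep=False)

print(df[exact_mask])     # empty
print(df[msg_only_mask])  # all five rows
```

So neither variant of duplicated() expresses "same msg, timestamps close together" on its own.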

Is there a way to define a range for my "datetime" parameter? To illustrate, something like:

mask = df.duplicated(subset=['datetime_between_3_seconds', 'msg'], keep=False)

Any help here would, as always, be very much appreciated.

Recommended answer

This piece of code gives the expected output:

df[df.groupby("msg")["datetime"].diff().dt.total_seconds().fillna(0) <= 3]

I grouped the dataframe on the "msg" column, selected the "datetime" column, and applied the built-in diff function, which computes the difference between each value and the previous one within the same group. The first row of each group has no previous value, so its missing difference is filled with zero. Finally, only the rows whose difference is at most 3 seconds are selected.
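To make the intermediate step visible, the sketch below prints the per-group gaps for the sample data from the question. The variable names gaps and gap_seconds are only illustrative. It also shows why the choice of accessor matters: .dt.seconds returns just the seconds component of a timedelta (days are dropped), whereas .dt.total_seconds() returns the whole duration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-11-11 15:41:08",
        "2012-11-11 15:41:11",
        "2012-11-21 17:00:08",
        "2012-11-22 18:08:35",
        "2012-11-22 18:08:37",
    ]),
    "user": ["u1", "u2", "u3", "u4", "u5"],
    "type": ["txt"] * 5,
    "msg": ["hello world", "hello world", "hello world",
            "hello you", "hello you"],
})

# Gap to the previous row with the same msg; the first row of each
# group has no predecessor and gets NaT.
gaps = df.groupby("msg")["datetime"].diff()

# Whole duration in seconds (NaT becomes NaN).
gap_seconds = gaps.dt.total_seconds()
print(gap_seconds.tolist())
# [nan, 3.0, 868737.0, nan, 2.0]

# The seconds *component* only: the 10 days in row 2 are not counted.
print(gaps.dt.seconds[2])   # 4737.0
```

With .dt.seconds, a gap of, say, one day and two seconds would read as 2 and pass a "<= 3" test, so total_seconds() is the safer choice for this kind of filter.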

Before using the above code, make sure that your dataframe is sorted on datetime in ascending order.
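One caveat with looking only at the previous row: the first row of every group is always kept, even for a message that occurs once or whose next occurrence is far away. A possible variation, offered as a sketch and not as the answer's own method, sorts the frame itself and checks the gap to both the previous and the next row in the group. The helper name near_duplicates and its parameters are made up for this illustration.

```python
import pandas as pd

def near_duplicates(frame, key="msg", time_col="datetime", window="3s"):
    """Return rows that share `key` with a neighbouring row within `window`.

    Hypothetical helper, written for this example only.
    """
    ordered = frame.sort_values(time_col)
    limit = pd.Timedelta(window)
    by_key = ordered.groupby(key)[time_col]
    # Gap to the previous and to the next row with the same key.
    gap_prev = by_key.diff()
    gap_next = by_key.diff(-1).abs()
    # Comparisons against NaT are False, so rows without a close
    # neighbour on either side are dropped.
    mask = (gap_prev <= limit) | (gap_next <= limit)
    return ordered[mask].sort_index()

# Sample data copied from the question.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-11-11 15:41:08",
        "2012-11-11 15:41:11",
        "2012-11-21 17:00:08",
        "2012-11-22 18:08:35",
        "2012-11-22 18:08:37",
    ]),
    "user": ["u1", "u2", "u3", "u4", "u5"],
    "type": ["txt"] * 5,
    "msg": ["hello world", "hello world", "hello world",
            "hello you", "hello you"],
})

print(near_duplicates(df).index.tolist())
# [0, 1, 3, 4]
```

Note that this compares adjacent rows only, so a long chain of messages each 3 seconds apart is kept as a whole, even though its first and last rows are further apart than the window.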
