apache spark - PySpark Create Multi Indexed Paired RDD with function
A little while ago I asked a question about organizing and structuring an RDD with multiple keys; see PySpark suggestion on how to organize RDD for the background.
Each object in my current RDD contains a start_time, end_time, id, and position.
I want to group the objects by id and time: two or more objects belong in the same group if they have the same id or overlapping times.
The logic for finding an overlap is pretty easy:
    if x1.start_time > x2.start_time and x1.start_time < x2.end_time
    if x2.start_time > x1.start_time and x2.start_time < x1.end_time
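Written out as a small Python helper (a sketch, assuming each record exposes start_time and end_time attributes), the check might look like:

    def overlaps(x1, x2):
        # True when the two records' time ranges intersect.
        return (x2.start_time < x1.start_time < x2.end_time or
                x1.start_time < x2.start_time < x1.end_time)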
I don't quite see how to go about creating a paired RDD with this logic, though.
Any suggestions are appreciated, thank you!
I think the simplest way is to join on id and then filter the result (assuming there aren't many records with the same id). I'd start by mapping the RDDs to (id, record) pairs and then doing a join.
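A rough sketch of that approach (the Record fields are hypothetical, taken from the question; the overlap test is the same one sketched above):

    from collections import namedtuple
    from pyspark import SparkContext

    Record = namedtuple("Record", ["id", "start_time", "end_time", "position"])

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize([
        Record("a", 0, 10, 1),
        Record("a", 5, 15, 2),
        Record("b", 20, 30, 3),
    ])

    # Key every record by id and self-join, so any two records
    # sharing an id end up in the same candidate pair.
    keyed = rdd.map(lambda r: (r.id, r))
    candidate_pairs = keyed.join(keyed)

    # Drop self-pairs and keep only pairs whose times actually overlap.
    # Note each unordered pair shows up twice (once in each order).
    def pair_overlaps(kv):
        _, (x1, x2) = kv
        return x1 != x2 and (x2.start_time < x1.start_time < x2.end_time or
                             x1.start_time < x2.start_time < x1.end_time)

    overlapping = candidate_pairs.filter(pair_overlaps)
    print(overlapping.collect())

The self-join keeps everything with the same id on one machine per key, so the overlap filter runs locally; that's why it only stays cheap if no single id has a very large number of records.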