apache spark - PySpark: Create a Multi-Indexed Paired RDD with a function


A little while ago I asked a question about organizing and structuring an RDD with multiple keys: see pyspark suggestion on how to organize RDD.

Each object in the current RDD contains a start_time, end_time, id, and position.

I want to group the objects by id and time: two or more objects belong together if they have the same id or overlapping times.

The logic for finding an overlap is pretty easy:

if x1.start_time > x2.start_time and x1.start_time < x2.end_time, or
if x2.start_time > x1.start_time and x2.start_time < x1.end_time
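As a sketch, that overlap test can be written as a small helper. The `Event` named tuple is hypothetical, built only from the fields the question mentions:

```python
from collections import namedtuple

# Hypothetical record shape, using the fields named in the question.
Event = namedtuple("Event", ["start_time", "end_time", "id", "position"])

def overlaps(x1, x2):
    """True if the two time ranges overlap, per the condition in the question.

    Note: the strict inequalities mean two events with identical start times
    do not count as overlapping; use <= if that case should match.
    """
    return (x2.start_time < x1.start_time < x2.end_time or
            x1.start_time < x2.start_time < x1.end_time)
```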

I don't quite see how to go about creating a paired RDD with this logic, though.

Any suggestions are appreciated, thank you!

I think the simplest way is to join on id and filter the result (assuming there aren't too many records with the same id). I'd start by mapping the RDDs to (id, record) pairs and then doing a join.
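A minimal sketch of that approach, shown here with plain-Python stand-ins for the RDD operations so it runs without a Spark cluster. In PySpark the same shape would be `rdd_a.map(lambda r: (r.id, r)).join(rdd_b.map(lambda r: (r.id, r))).filter(...)`. The `Event` tuple and the `overlaps` predicate are assumptions taken from the fields and condition in the question:

```python
from collections import defaultdict, namedtuple

Event = namedtuple("Event", ["start_time", "end_time", "id", "position"])

def overlaps(x1, x2):
    return (x2.start_time < x1.start_time < x2.end_time or
            x1.start_time < x2.start_time < x1.end_time)

def paired_by_id_and_overlap(records_a, records_b):
    """Plain-Python model of the suggested PySpark pipeline:

        a = rdd_a.map(lambda r: (r.id, r))
        b = rdd_b.map(lambda r: (r.id, r))
        a.join(b).filter(lambda kv: overlaps(kv[1][0], kv[1][1]))
    """
    by_id = defaultdict(list)
    for r in records_b:
        by_id[r.id].append(r)          # the (id, record) pairing
    out = []
    for r1 in records_a:
        for r2 in by_id[r1.id]:        # join on id
            if overlaps(r1, r2):       # filter out non-overlapping pairs
                out.append((r1.id, (r1, r2)))
    return out
```

The join keeps only pairs sharing an id, and the filter then drops joined pairs whose times don't overlap, which matches the "join on id and filter the result" suggestion above.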

