python - Suggestions for processing a huge amount of data on a single-node system -
I have to load a huge amount of data (a Stack Overflow dump), process it, and generate files for a machine learning application.
The operations are:
- Parse the XML files and load them into a database (NoSQL or SQL), using a SAX parser (see the sketch after this list).
- Run group-by (or aggregate) operations on the database (sketched further below).
- Generate CSV files. CSV generation is on-demand and must not take more than 2 minutes for 10,000 records (a sketch of this step closes the post).
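For the parsing step, a minimal SAX sketch might look like the following, assuming the usual dump layout where each record is a self-closing `<row ...>` element whose fields are XML attributes (the `handle_record` callback is a placeholder, not real code from my project):

```python
# Stream a Stack Overflow dump file such as Posts.xml with SAX, so memory
# use stays flat regardless of file size.
import xml.sax

class RowHandler(xml.sax.ContentHandler):
    def __init__(self, callback):
        super().__init__()
        self.callback = callback  # called once per parsed record

    def startElement(self, name, attrs):
        if name == "row":
            # attrs behaves like a mapping of attribute name -> string value
            self.callback(dict(attrs))

def handle_record(record):
    # Placeholder: insert into the database here, ideally in batches.
    pass

xml.sax.parse("Posts.xml", RowHandler(handle_record))
```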
My constraints: the language must be Python, and everything should run on a dual-core CPU (no distributed system).
I tried MySQL first, and processing 21 million records took days (I created a hell of a lot of indexes and aggregate operations). Now I use MongoDB, and it still takes a lot of time.
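To make the group-by step concrete, this is roughly the kind of MongoDB aggregation I mean; the database, collection, and field names below are hypothetical, not taken from my actual schema:

```python
# Hypothetical group-by with pymongo: count posts per tag, assuming a
# "posts" collection with a "Tags" array field.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
posts = client["sodump"]["posts"]

pipeline = [
    {"$unwind": "$Tags"},                            # one document per tag
    {"$group": {"_id": "$Tags", "n": {"$sum": 1}}},  # group-by with a count
    {"$sort": {"n": -1}},
]
for doc in posts.aggregate(pipeline, allowDiskUse=True):
    print(doc["_id"], doc["n"])
```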
Can anyone suggest a technology or approach that would make this faster (say, 4-5 hours total)?
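For reference, a minimal sketch of what I mean by on-demand CSV generation: stream query results straight into `csv.writer` instead of materializing them in memory first. The query and field names here are illustrative only:

```python
import csv

def export_csv(cursor, path, fields):
    """Write records from an iterable of dicts to a CSV file, lazily."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(fields)                      # header row
        for record in cursor:                        # lazy iteration
            writer.writerow(record.get(k, "") for k in fields)

# Example usage with the hypothetical pymongo collection above:
# export_csv(posts.find({}, {"Id": 1, "Score": 1}), "out.csv", ["Id", "Score"])
```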