mongodb - Optimize loops with big datasets in Python
It's my first time going big with Python and I need help.
I have a MongoDB collection (or a Python dict) with the following structure:
{ "_id": { "$oid" : "521b1fabc36b440cbe3a6009" }, "country": "brazil", "id": "96371952", "latitude": -23.815124482000001649, "longitude": -45.532670811999999216, "name": "coffee", "users": [ { "id": 277659258, "photos": [ { "created_time": 1376857433, "photo_id": "525440696606428630_277659258", }, { "created_time": 1377483144, "photo_id": "530689541585769912_10733844", } ], "username": "foo" }, { "id": 232745390, "photos": [ { "created_time": 1369422344, "photo_id": "463070647967686017_232745390", } ], "username": "bar" } ] }
Now I want to create 2 files: one with summaries of the locations, and the other with the weight of each connection between them. My loop works for small datasets and looks like the following:
    import csv

    # a is the dataset; db is an already-open pymongo database handle
    data = db.collection.find()
    a = [i for i in data]

    # here go the connections between locations
    edges = csv.writer(open("edges.csv", "wb"))
    # and here the location data
    nodes = csv.writer(open("nodes.csv", "wb"))

    for i in a:
        # find the users that match
        for q in a:
            if i['_id'] != q['_id'] and q.get('users'):
                weight = 0
                for user_i in i['users']:
                    for user_q in q['users']:
                        if user_i['id'] == user_q['id']:
                            weight += 1
                if weight > 0:
                    edges.writerow([i['id'], q['id'], weight])

        # find the number of photos
        photos_number = 0
        for p in i['users']:
            photos_number += len(p['photos'])

        nodes.writerow([
            i['id'],
            i['name'],
            i['latitude'],
            i['longitude'],
            len(i['users']),
            photos_number,
        ])
The scaling problem: I have 20,000 locations, each location might have 2,000 users, and each user might have around 10 photos.
Is there a more efficient way to write the above loops? Maybe multithreading, a JIT, or more indexes? Because if I run the above in a single thread it can touch 20,000^2 * 2,000 * 10 combinations...
So how can I handle the above problem more efficiently? Thanks.
The micro-optimizations suggested by @yuchenxie and @paulmcguire aren't your main problem: you're looping over 20,000 x 20,000 = 400,000,000 pairs of entries, and then have an inner loop of 2,000 x 2,000 user pairs. That's going to be slow.
Luckily, the inner loop can be made much faster by pre-caching sets of the user ids in i['users'] and replacing the inner loop with a simple set intersection. That changes an O(num_users^2) operation happening in the Python interpreter into an O(num_users) operation happening in C, which should help. (I just timed it with lists of integers of size 2,000; on my computer, it went from 156 ms the way you're doing it to 41 µs this way, for a ~4,000x speedup.) A rough sketch of that comparison is below.
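For illustration, here's a rough, self-contained way to reproduce that timing comparison; the 2,000-element id lists are made up and the exact numbers will vary by machine:

    import random
    import timeit

    # Two lists of 2,000 ids with some overlap, standing in for i['users'] and q['users'].
    ids_a = random.sample(range(10000), 2000)
    ids_b = random.sample(range(10000), 2000)

    def nested_loop_weight():
        # O(num_users^2) comparisons in the Python interpreter
        weight = 0
        for x in ids_a:
            for y in ids_b:
                if x == y:
                    weight += 1
        return weight

    set_a, set_b = set(ids_a), set(ids_b)

    def set_intersection_weight():
        # O(num_users) set intersection implemented in C
        return len(set_a & set_b)

    print(timeit.timeit(nested_loop_weight, number=1))                  # on the order of 100s of ms
    print(timeit.timeit(set_intersection_weight, number=1000) / 1000)   # on the order of tens of µs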
You can also cut off half the work of the main loop over pairs of locations by noticing that the relationship is symmetric: there's no point in doing both i = a[1], q = a[2] and i = a[2], q = a[1] (see the itertools.combinations sketch below).
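itertools.combinations is the easy way to get that: it yields each unordered pair exactly once, so the mirrored pair never comes up. A tiny illustration (the letters just stand in for locations):

    from itertools import combinations

    # Each unordered pair appears exactly once; ('b', 'a') and friends never show up.
    print(list(combinations(['a', 'b', 'c'], 2)))
    # [('a', 'b'), ('a', 'c'), ('b', 'c')]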
Taking these and @paulmcguire's suggestions into account, along with some other stylistic changes, your code becomes (caveat: untested code ahead):
    import csv
    from itertools import combinations, izip

    data = db.collection.find()
    a = list(data)
    user_ids = [{user['id'] for user in i['users']} if 'users' in i else set()
                for i in a]

    with open("edges.csv", "wb") as f:
        edges = csv.writer(f)
        for (i, i_ids), (q, q_ids) in combinations(izip(a, user_ids), 2):
            weight = len(i_ids & q_ids)
            if weight > 0:
                edges.writerow([i['id'], q['id'], weight])
                edges.writerow([q['id'], i['id'], weight])

    with open("nodes.csv", "wb") as f:
        nodes = csv.writer(f)
        for i in a:
            nodes.writerow([
                i['id'],
                i['name'],
                i['latitude'],
                i['longitude'],
                len(i['users']),
                sum(len(p['photos']) for p in i['users']),  # total number of photos
            ])
Hopefully that should be enough of a speedup. If not, it's possible that @yuchenxie's suggestion will help, though I'm doubtful because the stdlib/OS is already pretty good at buffering that kind of thing. (You might play with the buffering settings on the file objects, as sketched below.)
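If you do want to experiment with that, a minimal sketch (assuming the same edges file as above) is to pass an explicit buffer size as the third argument to open(); the 1 MB value here is an arbitrary choice and may not make a measurable difference:

    import csv

    # The third argument to open() is the buffer size in bytes.
    with open("edges.csv", "wb", 1024 * 1024) as f:
        edges = csv.writer(f)
        edges.writerow(["id_1", "id_2", "weight"])  # illustrative row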
Otherwise, it may come down to trying to get the core loops out of Python (into Cython or handwritten C), or giving PyPy a shot. I'm doubtful that'll get you huge speedups at this point, though.
You may also be able to push the hard weight calculations into Mongo, which might be smarter about that; I've never really used it, so I don't know.