caching - Algorithm for Minimizing Number of Large Loads


I have an algorithm that needs to compute pairwise distances between large objects (where the distance is a true metric: dist(a,b) == dist(b,a), among the other metric requirements). I have around a thousand objects, so I need to calculate about 500,000 distances. There are a few issues:

  1. All of these 1000 objects are serialized, sitting in individual files on disk.
  2. Reading them from disk is a huge operation compared to the relatively trivial distance calculation.
  3. I can't have all of them in memory at the same time without swapping and thrashing. I can fit about 500 in memory at a time.
  4. This means that at some point I will need to re-read an object into memory after having read and released it at some earlier point.

So, given that reading from disk is the bottleneck, and I can't read in more than half of the objects at a time, can anyone think of an algorithm for reading and releasing these objects that minimizes the total number of reads?

I considered reading in the first half, doing all of its pairwise distances, releasing all of that memory, then reading in the second half and doing all of its pairwise distances. At that point, I would still need the distances between objects in the first half and objects in the second, and I'm not sure what to do. I also considered a cache that, when full, randomly chooses a resident object to evict and reads in the next one, but I feel there has to be a better option. I considered LRU, but that can lead to pathological behavior where an object required for the next calculation was evicted on the last one.
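To see how much the split-in-halves idea leaves uncovered, here is a quick count (a sketch assuming 1000 points split into two halves of 500): the two within-half passes cover just under half of all pairs, and every cross-half pair remains.

```python
n = 1000
half = n // 2

# Pairs covered by computing each half's distances separately.
within = 2 * (half * (half - 1) // 2)
# Pairs with one point in each half: not covered by either pass.
cross = half * half

print(within, cross, n * (n - 1) // 2)
```

With these numbers, `within` is 249,500 and `cross` is 250,000 of the 499,500 total pairs, so the missing cross-half distances are actually the majority of the work.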

All in all, I'm kinda stuck. Any advice?

Load the points in the first half and compute their pairwise distances. Keeping the first half in memory, load the points in the second half one by one, computing each one's distances to the first half. Then release the first half, load the entire second half, and compute its pairwise distances. That's 1.5 reads per point, which is optimal in the following model.
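The three-phase schedule above can be sketched as follows (a minimal illustration, assuming a `load(i)` callable that reads point i from disk; the real distance computation is elided and only the pairs to be computed are recorded):

```python
from itertools import combinations

def schedule(n, load):
    """Enumerate all pairs of n points using ~1.5 loads per point.

    Phase 2 briefly holds one extra point alongside the resident half,
    so memory must fit n//2 + 1 points (or keep one fewer resident).
    """
    half = n // 2
    pairs = []
    # Phase 1: load the first half; compute its pairwise distances.
    first = {i: load(i) for i in range(half)}
    pairs += combinations(range(half), 2)
    # Phase 2: stream the second half one point at a time,
    # computing its distances to everything resident, then releasing it.
    for j in range(half, n):
        load(j)
        pairs += [(i, j) for i in range(half)]
    # Phase 3: drop the first half; load the entire second half
    # and compute its pairwise distances.
    first.clear()
    second = {j: load(j) for j in range(half, n)}
    pairs += combinations(range(half, n), 2)
    return pairs
```

Counting calls to `load` with n = 1000 gives 500 + 500 + 500 = 1500 reads, and the returned pairs cover all 499,500 distinct pairs exactly once.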

The model: a (syntactically correct) program for this task is a sequence of instructions of the form load(i, j), meaning load point i into memory location j. A program is correct if, for every pair of points, there exists an instruction after which both points are in memory.

It is clear that, in a correct program, every point is loaded at least once. Assuming 1000 points and 500 locations, it follows that there are at least 500 evictions in favor of points being loaded for the first time. Assume without loss of generality that no point is ever duplicated in memory. After a point p is evicted in favor of a point q being loaded for the first time, it is necessary to reload p in order for the distance between p and q to be computed. Therefore, there are at least 500 reloads and hence at least 1500 loads.
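The counting argument can be checked with the concrete numbers from the question (1000 points, room for 500 in memory):

```python
n, m = 1000, 500            # points, memory slots
first_time_loads = n        # every point must be loaded at least once
evictions = n - m           # first-time loads beyond capacity each force an eviction
reloads = evictions         # each evicted point must be reloaded to meet its evictor
lower_bound = first_time_loads + reloads

print(lower_bound, lower_bound / n)
```

This gives a lower bound of 1500 loads, i.e. 1.5 loads per point, exactly matching what the three-phase schedule achieves.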

