php - How can I efficiently identify a natural division point between two sets of numbers?
I have 2 datasets (A & B). Each has 1000 numbers.
99% of the time: A < 5 <= B.
However, 1% of the time: B < 5 < A.
If the division point is unknown (call it X), how can one determine X given the dataset?
Obviously max(A) and min(B) are misleading. Also, I'd prefer not to loop through the entire range (or between min(B) and max(A)), guessing and identifying the most probable division point.
Sample dataset:

    A 1
    A 1
    A 1
    A 2
    B 2 <--anomaly
    A 3
    A 3
    A 3
    A 4
    A 5 <--anomaly
    B 5 <--division, or `X`
    B 5
    B 5
    B 5
    A 6 <--anomaly
    B 7
    B 8
    B 8
    B 8
    B 9
    B 9
    B 10
    B 10
Assume another pair of datasets exists (C & D). How can I find the point where C becomes D, after allowing for a threshold of anomalies?
What would you recommend?
Here's a rough "guessing" strategy. I'd like to do the same without the "guessing" loop.
    $maxprobable = 0;
    $pointofdivision = 0;
    for ($i = min($b); $i <= max($a); $i++) {
        // score $i as a candidate division point
        $countbelow = below($i, $a); // assume function returns count of $a items below $i
        $countabove = above($i, $b); // assume function returns count of $b items above $i
        $probbelow = $countbelow / count($a);
        $probabove = $countabove / count($b);
        if (($probbelow + $probabove) > $maxprobable) {
            $maxprobable = $probbelow + $probabove;
            $pointofdivision = $i;
        }
    }
    echo $pointofdivision;
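The loop above relies on two helpers, below() and above(), that the post leaves undefined. A minimal sketch of what they might look like follows. It assumes PHP 7.4+ (for arrow functions) and assumes that above() counts values at or above the point, so that the pair matches A < X <= B. The arrays are the sample dataset from the question.

```php
<?php
// Assumed helper: number of items in $values strictly below $point.
function below($point, array $values): int
{
    return count(array_filter($values, fn($v) => $v < $point));
}

// Assumed helper: number of items in $values at or above $point
// (an assumption chosen to match A < X <= B).
function above($point, array $values): int
{
    return count(array_filter($values, fn($v) => $v >= $point));
}

// Sample dataset from the question, split by label.
$a = [1, 1, 1, 2, 3, 3, 3, 4, 5, 6];
$b = [2, 5, 5, 5, 5, 7, 8, 8, 8, 9, 9, 10, 10];

echo below(5, $a), "\n"; // prints 8
echo above(5, $b), "\n"; // prints 12
```

With these definitions, each pass of the loop scans both arrays in full, which is the cost the question is trying to avoid.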
This is a well-known problem in statistics and machine learning: given a number of labeled datapoints, determine the likeliest label for a new datapoint. In the 1D case it boils down to determining a threshold value X and saying "anything below X has label A" and "anything above X has label B."
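For the 1D case, one way to pick the threshold without guessing over a range is to sort the labeled values once and sweep through them, scoring each distinct value as a candidate. The sketch below is only an illustration, not a method taken from the post: findDivisionPoint() is a hypothetical name, the score is the same P(a < X) + P(b >= X) sum used in the question's loop, and both arrays are assumed to be non-empty.

```php
<?php
/**
 * Return the value X that maximizes P(a < X) + P(b >= X), found in a
 * single sweep over the sorted, labeled values.
 * findDivisionPoint() is a hypothetical name used for illustration.
 */
function findDivisionPoint(array $a, array $b)
{
    // Tag every value with its label and sort the merged list by value.
    $points = [];
    foreach ($a as $v) { $points[] = [$v, 'a']; }
    foreach ($b as $v) { $points[] = [$v, 'b']; }
    usort($points, fn($p, $q) => $p[0] <=> $q[0]);

    $countA = count($a);
    $countB = count($b);
    $aBelow = 0; // items of $a strictly below the current candidate
    $bBelow = 0; // items of $b strictly below the current candidate
    $bestScore = -1.0;
    $bestPoint = null;

    $n = count($points);
    for ($i = 0; $i < $n; ) {
        $candidate = $points[$i][0];
        // Score the candidate before counting the points equal to it.
        $score = $aBelow / $countA + ($countB - $bBelow) / $countB;
        if ($score > $bestScore) {
            $bestScore = $score;
            $bestPoint = $candidate;
        }
        // Advance past every point that shares this value.
        while ($i < $n && $points[$i][0] == $candidate) {
            if ($points[$i][1] === 'a') { $aBelow++; } else { $bBelow++; }
            $i++;
        }
    }
    return $bestPoint;
}

// Sample dataset from the question, split by label.
$a = [1, 1, 1, 2, 3, 3, 3, 4, 5, 6];
$b = [2, 5, 5, 5, 5, 7, 8, 8, 8, 9, 9, 10, 10];
echo findDivisionPoint($a, $b), "\n"; // prints 5
```

The sort dominates the cost at O(n log n), and only values that actually occur in the data are tested as candidates.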
There are many algorithms: you could use, for example, logistic regression, neural networks, or support vector machines. The choice of algorithm depends on what can be assumed about the data and on the tools and libraries you have available; for example, an SVM is apparently tricky to implement yourself.
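Of the algorithms mentioned, logistic regression is probably the simplest to hand-roll for one dimension. The following is a rough sketch using plain batch gradient descent, not a tuned implementation: logisticThreshold(), the learning rate, and the step count are all assumptions chosen for illustration, and the two sets are assumed to differ enough that the fitted slope is non-zero.

```php
<?php
/**
 * Fit a 1D logistic regression by batch gradient descent and return the
 * value where the predicted probability of label "b" crosses 0.5.
 * logisticThreshold() and its default parameters are hypothetical.
 */
function logisticThreshold(array $a, array $b, float $rate = 0.1, int $steps = 5000): float
{
    // Label $a as 0 and $b as 1; centre the inputs to help convergence.
    $xs = array_merge($a, $b);
    $ys = array_merge(array_fill(0, count($a), 0), array_fill(0, count($b), 1));
    $n = count($xs);
    $mean = array_sum($xs) / $n;

    $w = 0.0; // slope
    $c = 0.0; // intercept for the centred inputs
    for ($step = 0; $step < $steps; $step++) {
        $gradW = 0.0;
        $gradC = 0.0;
        foreach ($xs as $i => $x) {
            $z = $w * ($x - $mean) + $c;
            $p = 1.0 / (1.0 + exp(-$z)); // predicted P(label = b)
            $gradW += ($p - $ys[$i]) * ($x - $mean);
            $gradC += ($p - $ys[$i]);
        }
        // Step against the averaged gradient of the log loss.
        $w -= $rate * $gradW / $n;
        $c -= $rate * $gradC / $n;
    }
    // The probability is 0.5 where w * (x - mean) + c = 0.
    return $mean - $c / $w;
}

// Sample dataset from the question, split by label.
$a = [1, 1, 1, 2, 3, 3, 3, 4, 5, 6];
$b = [2, 5, 5, 5, 5, 7, 8, 8, 8, 9, 9, 10, 10];
echo logisticThreshold($a, $b), "\n";
```

Unlike a counting approach, this returns a real-valued threshold that need not coincide with any value in the data, and every point (anomalies included) influences the fit.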
If you are told how the data is generated, or if it comes from a known statistical distribution, there might be a shortcut solution that's less complex but still adequate.
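As one illustration of such a shortcut: if both sets were known to be roughly normally distributed with similar spread and similar size, the likeliest boundary would sit halfway between the two means. That is a strong assumption the question does not state, and midpointOfMeans() is a hypothetical name, so treat this only as a sketch of the idea.

```php
<?php
// Hypothetical shortcut: assumes both sets are roughly normal with similar
// variance and similar size, so the boundary is halfway between the means.
function midpointOfMeans(array $a, array $b): float
{
    $meanA = array_sum($a) / count($a);
    $meanB = array_sum($b) / count($b);
    return ($meanA + $meanB) / 2;
}

// Sample dataset from the question, split by label.
$a = [1, 1, 1, 2, 3, 3, 3, 4, 5, 6];
$b = [2, 5, 5, 5, 5, 7, 8, 8, 8, 9, 9, 10, 10];
echo midpointOfMeans($a, $b), "\n";
```

For the sample data the means are 2.9 and 7, giving a midpoint of 4.95. Means are sensitive to anomalies, so this is only adequate when the distributional assumption actually holds.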