python - How to parse the following multi-dimensional xml-like data into a dataframe -


i have explored xml based api work-related things, comes warehouse data. ideally want analysis in python pandas.

aggregate(aggregate_dimension_value_list=[ dateaggregatedimensionvalue(value=datetime.datetime(2013, 8, 28, 19, 30, tzinfo= utc )) , none,  stringaggregatedimensionvalue(value=u'virtually_labeled_case') ], quantity=127) ,    aggregate(aggregate_dimension_value_list=[ dateaggregatedimensionvalue(value=datetime.datetime(2013, 8, 28, 19, 30, tzinfo= utc )) ,  stringaggregatedimensionvalue(value=u'pptransmergenoncon') ,  stringaggregatedimensionvalue(value=u'prime_bin_random_stow') ], quantity=15)     aggregate(aggregate_dimension_value_list=[ dateaggregatedimensionvalue(value=datetime.datetime(2013, 8, 27, 21, 0, tzinfo= utc )) ,  stringaggregatedimensionvalue(value=u'pptransfra1') ,  stringaggregatedimensionvalue(value=u'prime_bin_random_stow') ], quantity=8) ,   

the data looks above stream, after did find , replace in vim (i know can script in python). how best weird-format pandas? ideally want datetime, string aggregatedimension value, , quantity. there lot of none, in parse-needed data. in dataframe it'll easy analysis, i'm bit stumped here (and feel lot n00b).

edit: here unregexed , un-replaced data , want parse. isn't xml xml doesn't work.

[<dateaggregatedimensionvalue(value=datetime.datetime(2013, 8, 26, 20, 30, tzinfo=<utc >))>, <stringaggregatedimensionvalue(value=u'pptranscgn1')>, < stringaggregatedimensionvalue(value=u'prime_bin_random_stow')>], quantity=992)>, < stringaggregatedimensionvalue(value=u'pptranslej1')>, <stringaggregatedimensionvalue( value=u'prime_bin_random_stow')>], quantity=945)>, <aggregate( aggregate_dimension_value_list=[<dateaggregatedimensionvalue(value=datetime.datetime(2013 , 8, 23, 19, 30, tzinfo=<utc>))>, none, <stringaggregatedimensionvalue(value=u'tote')>],  quantity=87)>, <aggregate(aggregate_dimension_value_list=[<dateaggregatedimensionvalue( value=datetime.datetime(2013, 8, 27, 17, 30, tzinfo=<utc>))>, < stringaggregatedimensionvalue(value=u'pptransmuc3')>, <stringaggregatedimensionvalue( value=u'tote')>], quantity=14)>, <aggregate(aggregate_dimension_value_list=[< dateaggregatedimensionvalue(value=datetime.datetime(2013, 8, 27, 20, 30, tzinfo=<utc >))>, <stringaggregatedimensionvalue(value=u'pptranseuk5')>, < stringaggregatedimensionvalue(value=u'prime_bin_random_stow')>], quantity=339)>, < aggregate(aggregate_dimension_value_list=[<dateaggregatedimensionvalue(value=datetime. datetime(2013, 8, 26, 20, 30, tzinfo=<utc>))>, <stringaggregatedimensionvalue(value=u 'pptranscgn1')>, <stringaggregatedimensionvalue(value=u'tote')>], quantity=1731)>, < aggregate(aggregate_dimension_value_list=[<dateaggregatedimensionvalue(value=datetime. datetime(2013, 8, 26, 19, 30, tzinfo=<utc>))>, <stringaggregatedimensionvalue(value=u 'pptranseuk5')>, quantity=444)>, <aggregate(aggregate_dimension_value_list=[< dateaggregatedimensionvalue(value=datetime.datetime(2013, 8, 26, 19, 30, tzinfo=<utc >))>, <stringaggregatedimensionvalue(value=u'pptranseuk5')>, < stringaggregatedimensionvalue(value=u'tote')>], quantity=28)>, <aggregate( aggregate_dimension_value_list=[<dateaggregatedimensionvalue(value=datetime.datetime(2013 , 8, 28, 19, 30, tzinfo=<utc>))>, <stringaggregatedimensionvalue(value=u'pptransory1')>,  <stringaggregatedimensionvalue(value=u'prime_bin_random_stow')>], quantity=69)>, < aggregate(aggregate_dimension_value_list=<aggregate(aggregate_dimension_value_list=[< dateaggregatedimensionvalue(value=datetime.datetime(2013, 8, 26, 19, 30, tzinfo=<utc >))>, <stringaggregatedimensionvalue(value=u'pptransmad4')>, < stringaggregatedimensionvalue(value=u'prime_bin_random_stow')>], quantity=47)>, < aggregate(aggregate_dimension_value_list=[<dateaggregatedimensionvalue(value=datetime. datetime(2013, 8, 26, 21, 0, tzinfo=<utc>))>, none, none], quantity=78)> 

if prefer more along lines of parser, here pyparsing stab @ problem:

from pyparsing import suppress,quotedstring,word,alphas,nums,alphanums,keyword,optional import datetime  # define utc timezone sake of eval if hasattr(datetime,"timezone"):     utc = datetime.timezone(datetime.timedelta(0),"utc") else:     utc = none  _ = suppress evaltokens = lambda s,l,t: eval(''.join(t))  timevalue = 'datetime.datetime' + quotedstring('(', endquotechar=')', unquoteresults=false) timevalue.setparseaction(evaltokens)  strvalue = 'u' + quotedstring("'", unquoteresults=false) strvalue.setparseaction(evaltokens)  nonevalue = keyword("none").setparseaction(lambda s,l,t: [none]) intvalue = word(nums).setparseaction(lambda s,l,t: int(t[0]))  comma = optional(_(","))  valuedexpr = lambda expr: (word(alphas) + "(" + "value" + "=" + expr + ")").setparseaction(lambda t: t[4])  lineexpr = (_("aggregate(aggregate_dimension_value_list=[") +             valuedexpr(timevalue)("timestamp") + comma +             (nonevalue | valuedexpr(strvalue))("s1") + comma +             (nonevalue | valuedexpr(strvalue))("s2") + comma +         "]" + comma +         "quantity=" + intvalue("qty")) 

use lineexpr.searchstring pull data out of each aggregate:

for data in lineexpr.searchstring(sample):     print data.dump()     print data.qty     print 

giving:

[datetime.datetime(2013, 8, 28, 19, 30), none, u'virtually_labeled_case', ']', 'quantity=', 127] - qty: 127 - s1: none - s2: virtually_labeled_case - timestamp: 2013-08-28 19:30:00 127  [datetime.datetime(2013, 8, 28, 19, 30), u'pptransmergenoncon', u'prime_bin_random_stow', ']', 'quantity=', 15] - qty: 15 - s1: pptransmergenoncon - s2: prime_bin_random_stow - timestamp: 2013-08-28 19:30:00 15  [datetime.datetime(2013, 8, 27, 21, 0), u'pptransfra1', u'prime_bin_random_stow', ']', 'quantity=', 8] - qty: 8 - s1: pptransfra1 - s2: prime_bin_random_stow - timestamp: 2013-08-27 21:00:00 8 

dump() show named results values available - note how quantity attribute can accessed directly using data.qty. setup definition of results name "qty" in "quantity=" + intvalue("qty"). timestamp, s1, , s2 can accessed similarly. (there still little evaling in this, cleaning left exercise reader.)

edit:

here modified pyparsing parser, process original raw xml-like stuff. changes pretty minor:

from pyparsing import suppress,quotedstring,word,alphas,nums,alphanums,keyword,optional, ungroup import datetime  # define utc timezone sake of eval if hasattr(datetime,"timezone"):     utc = datetime.timezone(datetime.timedelta(0),"utc") else:     utc = none  _ = suppress evaltokens = lambda s,l,t: eval(''.join(t))  timevalue = 'datetime.datetime' + quotedstring('(', endquotechar=')', unquoteresults=false) replutc = lambda s,l,t: ''.join(t).replace("< utc>","utc").replace("<utc >","utc").replace("<utc>","utc") timevalue.setparseaction(replutc, evaltokens)  strvalue = 'u' + quotedstring("'", unquoteresults=false) strvalue.setparseaction(evaltokens)  nonevalue = keyword("none").setparseaction(lambda s,l,t: [none]) intvalue = word(nums).setparseaction(lambda s,l,t: int(t[0]))  comma = optional(_(",")) lt,gt,lpar,rpar,lbrack,rbrack = map(suppress,"<>()[]")  #~ valuedexpr = lambda expr: (word(alphas) + "(" + "value" + "=" + expr + ")").setparseaction(lambda t: t[4]) valuedexpr = lambda expr: ungroup(lt + (word(alphas) + "(" + "value" + "=" + expr("value") + ")" + gt).setparseaction(lambda t: t.value))  #~ lineexpr = (_("aggregate(aggregate_dimension_value_list=[") +             #~ valuedexpr(timevalue)("timestamp") + comma +             #~ (nonevalue | valuedexpr(strvalue))("s1") + comma +             #~ (nonevalue | valuedexpr(strvalue))("s2") + comma +         #~ "]" + comma +         #~ "quantity=" + intvalue("qty"))  lineexpr = (lt + "aggregate" + lpar + "aggregate_dimension_value_list" + "=" + lbrack +             valuedexpr(timevalue)("timestamp") + comma +             (nonevalue | valuedexpr(strvalue))("s1") + comma +             (nonevalue | valuedexpr(strvalue))("s2") +          rbrack + comma +         "quantity=" + intvalue("qty") + rpar + gt) 

from pasted text (some of malformed), gives:

['aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 20, 30), u'pptranscgn1', u'prime_bin_random_stow', 'quantity=', 992] - qty: 992 - s1: pptranscgn1 - s2: prime_bin_random_stow - timestamp: 2013-08-26 20:30:00 992  ['aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 23, 19, 30), none, u'tote', 'quantity=', 87] - qty: 87 - s1: none - s2: tote - timestamp: 2013-08-23 19:30:00 87  ['aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 27, 17, 30), u'pptransmuc3', u'tote', 'quantity=', 14] - qty: 14 - s1: pptransmuc3 - s2: tote - timestamp: 2013-08-27 17:30:00 14  ['aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 27, 20, 30), u'pptranseuk5', u'prime_bin_random_stow', 'quantity=', 339] - qty: 339 - s1: pptranseuk5 - s2: prime_bin_random_stow - timestamp: 2013-08-27 20:30:00 339  ['aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 20, 30), u'pptranscgn1', u'tote', 'quantity=', 1731] - qty: 1731 - s1: pptranscgn1 - s2: tote - timestamp: 2013-08-26 20:30:00 1731  ['aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 19, 30), u'pptranseuk5', u'tote', 'quantity=', 28] - qty: 28 - s1: pptranseuk5 - s2: tote - timestamp: 2013-08-26 19:30:00 28  ['aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 28, 19, 30), u'pptransory1', u'prime_bin_random_stow', 'quantity=', 69] - qty: 69 - s1: pptransory1 - s2: prime_bin_random_stow - timestamp: 2013-08-28 19:30:00 69  ['aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 19, 30), u'pptransmad4', u'prime_bin_random_stow', 'quantity=', 47] - qty: 47 - s1: pptransmad4 - s2: prime_bin_random_stow - timestamp: 2013-08-26 19:30:00 47  ['aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 21, 0), none, none, 'quantity=', 78] - qty: 78 - s1: none - s2: none - timestamp: 2013-08-26 21:00:00 78 

Comments

Popular posts from this blog

java - activate/deactivate sonar maven plugin by profile? -

python - TypeError: can only concatenate tuple (not "float") to tuple -

java - What is the difference between String. and String.this. ? -