python - How to parse the following multi-dimensional xml-like data into a dataframe -
i have explored xml based api work-related things, comes warehouse data. ideally want analysis in python pandas.
aggregate(aggregate_dimension_value_list=[ dateaggregatedimensionvalue(value=datetime.datetime(2013, 8, 28, 19, 30, tzinfo= utc )) , none, stringaggregatedimensionvalue(value=u'virtually_labeled_case') ], quantity=127) , aggregate(aggregate_dimension_value_list=[ dateaggregatedimensionvalue(value=datetime.datetime(2013, 8, 28, 19, 30, tzinfo= utc )) , stringaggregatedimensionvalue(value=u'pptransmergenoncon') , stringaggregatedimensionvalue(value=u'prime_bin_random_stow') ], quantity=15) aggregate(aggregate_dimension_value_list=[ dateaggregatedimensionvalue(value=datetime.datetime(2013, 8, 27, 21, 0, tzinfo= utc )) , stringaggregatedimensionvalue(value=u'pptransfra1') , stringaggregatedimensionvalue(value=u'prime_bin_random_stow') ], quantity=8) ,
the data looks above stream, after did find , replace in vim (i know can script in python). how best weird-format pandas? ideally want datetime, string aggregatedimension value, , quantity. there lot of none, in parse-needed data. in dataframe it'll easy analysis, i'm bit stumped here (and feel lot n00b).
edit: here unregexed , un-replaced data , want parse. isn't xml xml doesn't work.
[<dateaggregatedimensionvalue(value=datetime.datetime(2013, 8, 26, 20, 30, tzinfo=<utc >))>, <stringaggregatedimensionvalue(value=u'pptranscgn1')>, < stringaggregatedimensionvalue(value=u'prime_bin_random_stow')>], quantity=992)>, < stringaggregatedimensionvalue(value=u'pptranslej1')>, <stringaggregatedimensionvalue( value=u'prime_bin_random_stow')>], quantity=945)>, <aggregate( aggregate_dimension_value_list=[<dateaggregatedimensionvalue(value=datetime.datetime(2013 , 8, 23, 19, 30, tzinfo=<utc>))>, none, <stringaggregatedimensionvalue(value=u'tote')>], quantity=87)>, <aggregate(aggregate_dimension_value_list=[<dateaggregatedimensionvalue( value=datetime.datetime(2013, 8, 27, 17, 30, tzinfo=<utc>))>, < stringaggregatedimensionvalue(value=u'pptransmuc3')>, <stringaggregatedimensionvalue( value=u'tote')>], quantity=14)>, <aggregate(aggregate_dimension_value_list=[< dateaggregatedimensionvalue(value=datetime.datetime(2013, 8, 27, 20, 30, tzinfo=<utc >))>, <stringaggregatedimensionvalue(value=u'pptranseuk5')>, < stringaggregatedimensionvalue(value=u'prime_bin_random_stow')>], quantity=339)>, < aggregate(aggregate_dimension_value_list=[<dateaggregatedimensionvalue(value=datetime. datetime(2013, 8, 26, 20, 30, tzinfo=<utc>))>, <stringaggregatedimensionvalue(value=u 'pptranscgn1')>, <stringaggregatedimensionvalue(value=u'tote')>], quantity=1731)>, < aggregate(aggregate_dimension_value_list=[<dateaggregatedimensionvalue(value=datetime. datetime(2013, 8, 26, 19, 30, tzinfo=<utc>))>, <stringaggregatedimensionvalue(value=u 'pptranseuk5')>, quantity=444)>, <aggregate(aggregate_dimension_value_list=[< dateaggregatedimensionvalue(value=datetime.datetime(2013, 8, 26, 19, 30, tzinfo=<utc >))>, <stringaggregatedimensionvalue(value=u'pptranseuk5')>, < stringaggregatedimensionvalue(value=u'tote')>], quantity=28)>, <aggregate( aggregate_dimension_value_list=[<dateaggregatedimensionvalue(value=datetime.datetime(2013 , 8, 28, 19, 30, tzinfo=<utc>))>, <stringaggregatedimensionvalue(value=u'pptransory1')>, <stringaggregatedimensionvalue(value=u'prime_bin_random_stow')>], quantity=69)>, < aggregate(aggregate_dimension_value_list=<aggregate(aggregate_dimension_value_list=[< dateaggregatedimensionvalue(value=datetime.datetime(2013, 8, 26, 19, 30, tzinfo=<utc >))>, <stringaggregatedimensionvalue(value=u'pptransmad4')>, < stringaggregatedimensionvalue(value=u'prime_bin_random_stow')>], quantity=47)>, < aggregate(aggregate_dimension_value_list=[<dateaggregatedimensionvalue(value=datetime. datetime(2013, 8, 26, 21, 0, tzinfo=<utc>))>, none, none], quantity=78)>
if prefer more along lines of parser, here pyparsing stab @ problem:
from pyparsing import suppress,quotedstring,word,alphas,nums,alphanums,keyword,optional import datetime # define utc timezone sake of eval if hasattr(datetime,"timezone"): utc = datetime.timezone(datetime.timedelta(0),"utc") else: utc = none _ = suppress evaltokens = lambda s,l,t: eval(''.join(t)) timevalue = 'datetime.datetime' + quotedstring('(', endquotechar=')', unquoteresults=false) timevalue.setparseaction(evaltokens) strvalue = 'u' + quotedstring("'", unquoteresults=false) strvalue.setparseaction(evaltokens) nonevalue = keyword("none").setparseaction(lambda s,l,t: [none]) intvalue = word(nums).setparseaction(lambda s,l,t: int(t[0])) comma = optional(_(",")) valuedexpr = lambda expr: (word(alphas) + "(" + "value" + "=" + expr + ")").setparseaction(lambda t: t[4]) lineexpr = (_("aggregate(aggregate_dimension_value_list=[") + valuedexpr(timevalue)("timestamp") + comma + (nonevalue | valuedexpr(strvalue))("s1") + comma + (nonevalue | valuedexpr(strvalue))("s2") + comma + "]" + comma + "quantity=" + intvalue("qty"))
use lineexpr.searchstring
pull data out of each aggregate:
for data in lineexpr.searchstring(sample): print data.dump() print data.qty print
giving:
[datetime.datetime(2013, 8, 28, 19, 30), none, u'virtually_labeled_case', ']', 'quantity=', 127] - qty: 127 - s1: none - s2: virtually_labeled_case - timestamp: 2013-08-28 19:30:00 127 [datetime.datetime(2013, 8, 28, 19, 30), u'pptransmergenoncon', u'prime_bin_random_stow', ']', 'quantity=', 15] - qty: 15 - s1: pptransmergenoncon - s2: prime_bin_random_stow - timestamp: 2013-08-28 19:30:00 15 [datetime.datetime(2013, 8, 27, 21, 0), u'pptransfra1', u'prime_bin_random_stow', ']', 'quantity=', 8] - qty: 8 - s1: pptransfra1 - s2: prime_bin_random_stow - timestamp: 2013-08-27 21:00:00 8
dump()
show named results values available - note how quantity attribute can accessed directly using data.qty
. setup definition of results name "qty" in "quantity=" + intvalue("qty")
. timestamp
, s1
, , s2
can accessed similarly. (there still little eval
ing in this, cleaning left exercise reader.)
edit:
here modified pyparsing parser, process original raw xml-like stuff. changes pretty minor:
from pyparsing import suppress,quotedstring,word,alphas,nums,alphanums,keyword,optional, ungroup import datetime # define utc timezone sake of eval if hasattr(datetime,"timezone"): utc = datetime.timezone(datetime.timedelta(0),"utc") else: utc = none _ = suppress evaltokens = lambda s,l,t: eval(''.join(t)) timevalue = 'datetime.datetime' + quotedstring('(', endquotechar=')', unquoteresults=false) replutc = lambda s,l,t: ''.join(t).replace("< utc>","utc").replace("<utc >","utc").replace("<utc>","utc") timevalue.setparseaction(replutc, evaltokens) strvalue = 'u' + quotedstring("'", unquoteresults=false) strvalue.setparseaction(evaltokens) nonevalue = keyword("none").setparseaction(lambda s,l,t: [none]) intvalue = word(nums).setparseaction(lambda s,l,t: int(t[0])) comma = optional(_(",")) lt,gt,lpar,rpar,lbrack,rbrack = map(suppress,"<>()[]") #~ valuedexpr = lambda expr: (word(alphas) + "(" + "value" + "=" + expr + ")").setparseaction(lambda t: t[4]) valuedexpr = lambda expr: ungroup(lt + (word(alphas) + "(" + "value" + "=" + expr("value") + ")" + gt).setparseaction(lambda t: t.value)) #~ lineexpr = (_("aggregate(aggregate_dimension_value_list=[") + #~ valuedexpr(timevalue)("timestamp") + comma + #~ (nonevalue | valuedexpr(strvalue))("s1") + comma + #~ (nonevalue | valuedexpr(strvalue))("s2") + comma + #~ "]" + comma + #~ "quantity=" + intvalue("qty")) lineexpr = (lt + "aggregate" + lpar + "aggregate_dimension_value_list" + "=" + lbrack + valuedexpr(timevalue)("timestamp") + comma + (nonevalue | valuedexpr(strvalue))("s1") + comma + (nonevalue | valuedexpr(strvalue))("s2") + rbrack + comma + "quantity=" + intvalue("qty") + rpar + gt)
from pasted text (some of malformed), gives:
['aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 20, 30), u'pptranscgn1', u'prime_bin_random_stow', 'quantity=', 992] - qty: 992 - s1: pptranscgn1 - s2: prime_bin_random_stow - timestamp: 2013-08-26 20:30:00 992 ['aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 23, 19, 30), none, u'tote', 'quantity=', 87] - qty: 87 - s1: none - s2: tote - timestamp: 2013-08-23 19:30:00 87 ['aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 27, 17, 30), u'pptransmuc3', u'tote', 'quantity=', 14] - qty: 14 - s1: pptransmuc3 - s2: tote - timestamp: 2013-08-27 17:30:00 14 ['aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 27, 20, 30), u'pptranseuk5', u'prime_bin_random_stow', 'quantity=', 339] - qty: 339 - s1: pptranseuk5 - s2: prime_bin_random_stow - timestamp: 2013-08-27 20:30:00 339 ['aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 20, 30), u'pptranscgn1', u'tote', 'quantity=', 1731] - qty: 1731 - s1: pptranscgn1 - s2: tote - timestamp: 2013-08-26 20:30:00 1731 ['aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 19, 30), u'pptranseuk5', u'tote', 'quantity=', 28] - qty: 28 - s1: pptranseuk5 - s2: tote - timestamp: 2013-08-26 19:30:00 28 ['aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 28, 19, 30), u'pptransory1', u'prime_bin_random_stow', 'quantity=', 69] - qty: 69 - s1: pptransory1 - s2: prime_bin_random_stow - timestamp: 2013-08-28 19:30:00 69 ['aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 19, 30), u'pptransmad4', u'prime_bin_random_stow', 'quantity=', 47] - qty: 47 - s1: pptransmad4 - s2: prime_bin_random_stow - timestamp: 2013-08-26 19:30:00 47 ['aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 21, 0), none, none, 'quantity=', 78] - qty: 78 - s1: none - s2: none - timestamp: 2013-08-26 21:00:00 78
Comments
Post a Comment