How to extract XML from a log file to parse in Python
I have a log file containing XML envelopes (two types of XML structure: request and response). I need to parse this file, extract the XML documents, and put them into two arrays of strings (the first array for requests, the second for responses), so that I can parse them later.
Any ideas how I can accomplish this in Python?
Snippet of the log file to be parsed:
2014-10-31 12:27:33,600 info recharger_mtelemedia2channel [mbpa.module.mgw.mtelemedia.mtbilling.mtsender][] sending bill request
2014-10-31 12:27:33,601 info recharger_mtelemedia2channel [mbpa.module.mgw.mtelemedia.mtbilling.mtsender][] <?xml version="1.0" encoding="utf-8"?> <request xmlns="xxx" xmlns:xsi="http://www.w3.org/2001/xmlschema-instance" xmlns:xsd="http://www.w3.org/2001/xmlschema"> <transactionheader> <username>xxx</username> <password>xxx</password> <time>31/10/2014 12:27:33</time> <clientreferencenumber>123</clientreferencenumber> <numberrequests>3</numberrequests> <information>description</information> <postbackurl>http://localhost/status</postbackurl> </transactionheader> <transactiondetails> <items> <item id="1" client="xxx1" keyword="test"/> <item id="2" client="xxx2" keyword="test"/> <item id="3" client="xxx3" keyword="test"/> </items> </transactiondetails> </request>
2014-10-31 12:27:34,487 info recharger_mtelemedia2channel [mbpa.module.mgw.mtelemedia.mtbilling.mtsender][] response code 200 bill request
2014-10-31 12:27:34,489 info recharger_mtelemedia2channel [mbpa.module.mgw.mtelemedia.mtbilling.mtsender][] <?xml version="1.0" encoding="utf-8"?> <response xmlns="xxx" xmlns:xsd="http://www.w3.org/2001/xmlschema" xmlns:xsi="http://www.w3.org/2001/xmlschema-instance"> <serverreferencenumber>xxx123xxx</serverreferencenumber> <clientreferencenumber>123</clientreferencenumber> <information>queued processing</information> <status>ok</status> </response>
Many thanks for any reply!
Regards, Robert
As both @paco and @lord_gestalter suggested, you can use xml.etree and replace the non-XML elements in your file, like this:
    # use re to strip out the non-XML parts of the log
    import re
    # use the xml module as the parser
    import xml.etree.ElementTree as ET

    # read the file and store it in the string 's'
    with open('yourfilehere', 'r') as f:
        s = f.read()

    # keep only the <...> part of each line, then remove the <?xml ...?>
    # declarations, because the file consists of multiple XML documents
    s = re.sub(r'<\?xml.*?>', '', ''.join(re.findall(r'<.*>', s)))

    # wrap s in a single root element
    s = '<root>' + s + '</root>'

    # parse s with ElementTree
    tree = ET.fromstring(s)

    tree
    <Element 'root' at 0x7f2ab877e190>
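Once everything sits under one root, the requests and responses still have to be separated. The following is only a sketch of one way to do that, assuming each envelope ends up as a direct child of the wrapper element. `SAMPLE_LOG` and `split_by_root_tag` are hypothetical names made up for this illustration, and the sample log is a shortened stand-in for the real file.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical two-entry log in the same shape as the snippet in the question.
SAMPLE_LOG = (
    '2014-10-31 12:27:33,601 info [sender][] <?xml version="1.0" encoding="utf-8"?>'
    '<request xmlns="xxx"><clientreferencenumber>123</clientreferencenumber></request>\n'
    '2014-10-31 12:27:34,489 info [sender][] <?xml version="1.0" encoding="utf-8"?>'
    '<response xmlns="xxx"><status>ok</status></response>\n'
)


def split_by_root_tag(log_text):
    """Return (request_elements, response_elements) found in log_text."""
    # Keep only the <...> part of each line and drop the XML declarations.
    xml_only = ''.join(re.findall(r'<.*>', log_text))
    xml_only = re.sub(r'<\?xml.*?\?>', '', xml_only)
    root = ET.fromstring('<root>' + xml_only + '</root>')

    request_elements, response_elements = [], []
    for child in root:
        # ElementTree reports namespaced tags as '{namespace}localname'.
        local_name = child.tag.rsplit('}', 1)[-1]
        if local_name == 'request':
            request_elements.append(child)
        elif local_name == 'response':
            response_elements.append(child)
    return request_elements, response_elements


request_elements, response_elements = split_by_root_tag(SAMPLE_LOG)
print(len(request_elements), len(response_elements))
```

Because the envelopes declare a default namespace (`xmlns="xxx"`), child lookups need the namespace too, for example `request_elements[0].find('{xxx}clientreferencenumber')`.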
If you don't care about the XML parser and just want the 'request' and 'response' strings, use re.search:
    import re

    with open('yourfilehere', 'r') as f:
        s = f.read()

    # put the request and response strings into the lists 'req' and 'res'
    # (you will need a better pattern than this re.search if the file
    # contains multiple requests or responses)
    req = [re.search(r'<request.*\/request>', s).group()]
    res = [re.search(r'<response.*\/response>', s).group()]

    req
    ['<request xmlns="xxx" xmlns:xsi="http://www.w3.org/2001/xmlschema-instance" xmlns:xsd="http://www.w3.org/2001/xmlschema"><transactionheader><username>xxx</username><password>xxx</password><time>31/10/2014 12:27:33</time><clientreferencenumber>123</clientreferencenumber><numberrequests>3</numberrequests><information>description</information><postbackurl>http://localhost/status</postbackurl></transactionheader><transactiondetails><items><item id="1" client="xxx1" keyword="test"/><item id="2" client="xxx2" keyword="test"/><item id="3" client="xxx3" keyword="test"/></items></transactiondetails></request>']
    res
    ['<response xmlns="xxx" xmlns:xsd="http://www.w3.org/2001/xmlschema" xmlns:xsi="http://www.w3.org/2001/xmlschema-instance"><serverreferencenumber>xxx123xxx</serverreferencenumber><clientreferencenumber>123</clientreferencenumber><information>queued processing</information><status>ok</status></response>']
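For a log with several requests and responses, one possible refinement is `re.findall` with a non-greedy pattern, so that each envelope is matched on its own. This is a rough sketch rather than a tested solution for the real log: `SAMPLE_LOG` and `extract_envelopes` are hypothetical names, and it assumes a `<request>` or `<response>` element is never nested inside another element of the same name.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical log text with two requests and one response.
SAMPLE_LOG = (
    '12:27:33 info sending bill request\n'
    '12:27:33 info <?xml version="1.0"?><request xmlns="xxx"><id>1</id></request>\n'
    '12:27:34 info <?xml version="1.0"?><response xmlns="xxx"><status>ok</status></response>\n'
    '12:27:35 info <?xml version="1.0"?><request xmlns="xxx"><id>2</id></request>\n'
)

# '.*?' is non-greedy, so each match ends at the first closing tag.
# re.DOTALL lets '.' cross line breaks in case an envelope spans several lines.
REQUEST_RE = re.compile(r'<request\b.*?</request>', re.DOTALL)
RESPONSE_RE = re.compile(r'<response\b.*?</response>', re.DOTALL)


def extract_envelopes(log_text):
    """Return (requests, responses) as two lists of XML strings."""
    return REQUEST_RE.findall(log_text), RESPONSE_RE.findall(log_text)


req, res = extract_envelopes(SAMPLE_LOG)
print(len(req), len(res))

# Each string can be parsed later on its own, for example:
first_request = ET.fromstring(req[0])
print(first_request.tag)
```

The XML declaration is left out of each match, so the strings can be handed straight to `ET.fromstring` when they need to be parsed.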
python xml logging xml-parsing