PHP: xml_parser "Mismatched tag" error when parsing HTML (auto-closing tags such as <img>)?

I want to parse HTML using PHP. I used xml_parser for it, but it can't cope with auto-closing tags such as <img>.

For example, the following HTML snippet produces a "Mismatched tag" error when the parser reaches the closing tag </a>:

<a> <img src="url"><br> </a>

Obviously, the reason is that xml_parser doesn't know that tags such as <img> and <br> do not need to be closed (they are self-closing automatically).

I know I could rewrite the HTML as <img src="url"/><br/> to make the parser happy. However, I want the parser to process the HTML correctly as it is, because the variant above is valid HTML.
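The difference between the two variants can be checked in isolation. The following is a minimal sketch, assuming the xml extension is loaded; the helper name parsesAsXml is hypothetical and not part of the script further down.

```php
<?php
// Hypothetical helper: returns true if $markup is accepted by PHP's XML parser.
function parsesAsXml(string $markup): bool
{
    $parser = xml_parser_create();
    // Third argument true: this chunk is the complete document.
    // xml_parse() returns 1 on success and 0 on failure.
    return xml_parse($parser, $markup, true) === 1;
}

// HTML-style void tags: <img> and <br> are never closed, so </a> mismatches.
var_dump(parsesAsXml('<a> <img src="url"><br> </a>'));   // bool(false)

// XML-style self-closing tags: well-formed, so the parser accepts it.
var_dump(parsesAsXml('<a> <img src="url"/><br/> </a>')); // bool(true)
```

Only the return value is checked here, since the numeric error code may differ depending on the underlying XML library.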

So I either need to tell the parser, within onopeningtag, whether the tag is auto-closing. Is that possible somehow? The alternative would be to give the parser a list of self-closing tag names. However, I didn't find a function for that. It might be the case that HTML simply isn't supported by this parser.

An acceptable solution might also be to disable the tag mismatch check altogether (or to implement an HTML-compatible version of it myself).

However, there may be an HTML-specific version in PHP that I have overlooked. Or are there any suggestions for other simple parser implementations I could use?

Here's what I have so far:

<?php
// command line parsing...
$file = $argv[1];

// tag handler functions
function onopeningtag($parser, $name, $attrs) {
    echo "open: $name\n";
}

function onclosingtag($parser, $name) {
    echo "close: $name\n";
}

function oncontent($parser, $text) {
    echo "text (len:".strlen($text).")\n";
}

// parser...
$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "onopeningtag", "onclosingtag");
xml_set_character_data_handler($xml_parser, "oncontent");

if (!($fp = fopen($file, "r")))
    die("could not open file '$file'.\n");

// feed the file to the parser in 4 KB chunks
while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("xml error: %s at line %d\n",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
    }
}
fclose($fp);
xml_parser_free($xml_parser);
?>

You want to parse HTML with an XML parser, which is prone to cause headaches. XML is far stricter than HTML, and you'll run into problems like this one. If your HTML is not huge (say tens of MBs) but rather a normal web page, you can use DOM: http://php.net/manual/en/book.dom.php.

$dom = new DOMDocument();
$dom->loadHTML($html);
$lists = $dom->getElementsByTagName('ul');
// bla bla bla
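To get output similar to the open/close/text handlers in the question, the DOM tree can be walked recursively. This is only a sketch under assumptions: the function names collectEvents and htmlEvents are hypothetical, the dom extension must be loaded, and whether whitespace-only text nodes show up depends on the libxml2 version in use.

```php
<?php
// Hypothetical helper: walks a DOM subtree depth-first and records
// "open", "close" and "text" events, mimicking the SAX-style handlers.
function collectEvents(DOMNode $node, array &$events): void
{
    if ($node instanceof DOMElement) {
        $events[] = "open: " . $node->nodeName;
        foreach ($node->childNodes as $child) {
            collectEvents($child, $events);
        }
        // Void elements such as <img> simply have no children here.
        $events[] = "close: " . $node->nodeName;
    } elseif ($node instanceof DOMText) {
        $events[] = "text (len:" . strlen($node->nodeValue) . ")";
    }
}

// Hypothetical helper: parses $html leniently and returns the events
// for every <a> element found in it.
function htmlEvents(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $events = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        collectEvents($a, $events);
    }
    return $events;
}

print_r(htmlEvents('<a> <img src="url"><br> </a>'));
```

Because loadHTML() uses an HTML parser, <img> and <br> end up as childless elements inside <a> and no mismatch error is raised. If an explicit "is this tag auto-closing" flag is needed, checking $node->hasChildNodes() or comparing against a hand-maintained list of void element names would be one possible approach.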

My suggestion is to look for a specialised library for HTML parsing. Here are some suggestions:

https://github.com/symfony/domcrawler
http://simplehtmldom.sourceforge.net/
https://code.google.com/p/ganon/

May the force be with you!

php html parsing xml-parsing html-parsing
