PHP: xml_parser "Mismatched tag" error when parsing HTML (auto-closing tags such as <img>)?
I want to parse HTML using PHP. I used xml_parser for it, but it can't cope with auto-closing tags such as <img>.
For example, the following HTML snippet produces a 'Mismatched tag' error when the parser reaches the closing tag </a>:
<a> <img src="url"><br> </a>
Obviously, the reason is that xml_parser doesn't know that tags such as <img> and <br> do not need to be closed (they are self-closing automatically).
I know I could rewrite the HTML as <img src="url"/><br/> to make the parser happy. However, I want the parser to process the HTML correctly as it is, since the unclosed variant above is valid HTML.
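To make the difference concrete, here is a minimal sketch that feeds both variants to the XML parser. The helper name xmlParseError is hypothetical (it is not part of PHP), and the exact error text depends on the underlying XML library.

```php
<?php
// Hypothetical helper (not a PHP built-in): returns null when the markup
// parses as XML, otherwise the parser's error message.
function xmlParseError(string $markup): ?string
{
    $parser = xml_parser_create();
    // Third argument true: this is the final (and only) chunk of data.
    $ok = xml_parse($parser, $markup, true);
    return $ok ? null : xml_error_string(xml_get_error_code($parser));
}

$htmlStyle = '<a> <img src="url"><br> </a>';    // valid HTML, not well-formed XML
$xmlStyle  = '<a> <img src="url"/><br/> </a>';  // well-formed XML

// The HTML-style snippet is rejected because <img> and <br> are never closed.
var_dump(xmlParseError($htmlStyle) !== null);
// The XML-style snippet parses without an error.
var_dump(xmlParseError($xmlStyle) === null);
```

Both checks should come out true: the parser only accepts the variant in which every element is explicitly closed.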
So I either need to tell the parser, within onopeningtag, whether a tag is auto-closing. Is that possible somehow? Alternatively, I could give the parser a list of self-closing tag names. However, I didn't find a function for that. It might be the case that HTML simply isn't supported by this parser.
An acceptable solution might also be to disable the tag mismatch check altogether (or to implement an HTML-compatible version of it myself).
However, perhaps there is an HTML-specific version in PHP that I overlooked. Any suggestions for other simple parser implementations I could use?
Here's what I have so far:
<?php
// command line parsing...
$file = $argv[1];

// tag handler functions
function onopeningtag($parser, $name, $attrs) {
    echo "open: $name\n";
}

function onclosingtag($parser, $name) {
    echo "close: $name\n";
}

function oncontent($parser, $text) {
    echo "text (len:".strlen($text).")\n";
}

// parser...
$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "onopeningtag", "onclosingtag");
xml_set_character_data_handler($xml_parser, "oncontent");

if (!($fp = fopen($file, "r")))
    die("could not open file '$file'.\n");

while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("xml error: %s @ line %d\n",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
    }
}

fclose($fp);
xml_parser_free($xml_parser);
?>
You want to parse HTML with an XML parser, which is prone to cause headaches. XML is far stricter than HTML, and you'll run into problems like this one. If the HTML is not huge (tens of MBs) but rather a normal web page, you can use DOM: http://php.net/manual/en/book.dom.php.
$dom = new DOMDocument();
$dom->loadHTML($html);
$lists = $dom->getElementsByTagName('ul');
// ...and so on
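Applied to the snippet from the question, the DOM approach could look roughly like the sketch below. It walks the parsed tree and records the same kind of open/close/text events that the question's handlers print. The walk function and the $events array are made up for illustration. Whether whitespace-only text nodes show up depends on the libxml version, so treat the exact event list as an assumption.

```php
<?php
// Hypothetical helper: walks a DOM subtree and records events similar to
// the question's onopeningtag / onclosingtag / oncontent handlers.
function walk(DOMNode $node, array &$events): void
{
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement) {
            $events[] = "open: {$child->nodeName}";
            walk($child, $events);
            $events[] = "close: {$child->nodeName}";
        } elseif ($child instanceof DOMText) {
            $events[] = 'text (len:' . strlen($child->data) . ')';
        }
    }
}

$html = '<a> <img src="url"><br> </a>';

// Collect HTML parse warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$events = [];
$links = $dom->getElementsByTagName('a');
if ($links->length > 0) {
    $events[] = 'open: a';
    walk($links->item(0), $events);
    $events[] = 'close: a';
}

echo implode("\n", $events), "\n";
```

The HTML parser knows that <img> and <br> are void elements, so they end up as childless nodes inside <a> and no mismatch error is raised.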
My suggestion is to look for a specialised library for HTML parsing. Here are some suggestions:
https://github.com/symfony/domcrawler
http://simplehtmldom.sourceforge.net/
https://code.google.com/p/ganon/
May the force be with you!