INVALID HTML
Not every web publisher pays much attention to the validity of their HTML code. Although most browsers can digest broken markup, mistakes in a web page may cause errors during web scraping and prevent you from getting relevant results.
To test web scrapers against invalid markup, we suggest scraping this page, which contains the following markup mistakes:
- Unescaped characters (& and > instead of &amp; and &gt;)
- Non-HTML tags (<nonHTML>)
- Unclosed tags (<span<span/>)
- Unmatched quotes (<a href="scrapetools.com'>)
- Missed spaces (<a id="test"href="scrapetools.com">)
- Invalid tag nesting (<div><span></div></span>)
- The charset specified in the META tag or HTTP header does not match the real document encoding
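As a rough illustration of tolerant parsing, here is a minimal sketch using Python's standard-library html.parser, which reports tags and text as it meets them and does not require tags to be balanced. The class name TextCollector, the helper extract_text and the three sample snippets are assumptions made for this example; they are modelled on the mistakes listed above, not copied from the test page. Handling of the unmatched-quote and missed-space cases varies more between parsers, so they are left out here.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers text nodes and ignores tag balance."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text, whatever tags surround it.
        self.chunks.append(data)


def extract_text(markup):
    """Return all text found in the markup as a single string."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at end of input
    return "".join(parser.chunks)


# Sample snippets (assumed, modelled on the mistake list above).
samples = [
    "<p>2>1 & 1<2</p>",                     # unescaped characters
    "<nonHTML>nonHTML</nonHTML>",           # non-HTML tag
    "<div><span>bad nesting</div></span>",  # invalid tag nesting
]

for snippet in samples:
    print(repr(extract_text(snippet)))
```

The text may arrive in several pieces (for example around a stray "<"), which is why the sketch joins the chunks instead of relying on a single callback.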
After scraping the invalid HTML presented below, the scraper should output the following values:
- 2>1 & 1<2
- nonHTML
- unclosed
- scrapetools.com
- millepah.com
- bad nesting
- проверка (windows-1251) wrong meta
- проверка (utf-8) wrong header
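The last two values depend on ignoring a wrong charset declaration. One possible heuristic, sketched below, is to try strict UTF-8 first, because text in a single-byte encoding such as windows-1251 rarely happens to be valid UTF-8, and only then fall back to the declared charset and a default. The function name decode_page, its parameters and the choice of windows-1251 as the default fallback are assumptions for this example, not a description of how any particular scraper works.

```python
from typing import Optional


def decode_page(raw: bytes,
                declared: Optional[str] = None,
                fallback: str = "windows-1251") -> str:
    """Decode page bytes, preferring strict UTF-8 over the declared charset."""
    # Step 1: strict UTF-8. Non-ASCII text in a single-byte encoding
    # almost always fails this check, so success is a strong signal.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # Step 2: the declared charset (META tag or HTTP header), then the fallback.
    for name in (declared, fallback):
        if not name:
            continue
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers charset names Python does not know.
            continue

    # Step 3: last resort, keep going with replacement characters.
    return raw.decode("utf-8", errors="replace")


word = "проверка"
# windows-1251 bytes mislabelled as UTF-8
print(decode_page(word.encode("windows-1251"), declared="utf-8"))
# UTF-8 bytes mislabelled as windows-1251
print(decode_page(word.encode("utf-8"), declared="windows-1251"))
```

This is only a heuristic: very short single-byte text can occasionally pass the UTF-8 check by accident, and pages in other legacy encodings would need a different fallback or a statistical detector.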
Here is the invalid HTML itself:
2>1 & 1<2
nonHTML
unclosed
bad nesting
проверка (windows-1251) wrong meta
проверка (utf-8) wrong header