Open source HTML Parser of Java and Python

Apache Tika - a content analysis toolkit

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
scrape and parse HTML from a URL, file, or string
find and extract data, using DOM traversal or CSS selectors
manipulate the HTML elements, attributes, and text
clean user-submitted content against a safe white-list, to prevent XSS attacks
output tidy HTML

jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree.

HtmlCleaner is open-source HTML parser written in Java. HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring the order to tags, attributes and ordinary text. For the given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create Document Object Model. However, user may provide custom tag and rule set for tag filtering and balancing.

Scrapy at a glance

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Even though Scrapy was originally designed for screen scraping (more precisely, web scraping), it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.

See also:

Tools for web scraping

Comparison of HTML parsers

Popular posts from this blog

How to delete / clear queue of PowerMTA