Wednesday, May 11, 2005

Java + XQuery + JTidy to parse HTML!

Occasionally I've written little programs to scrape webpages for useful information. In the past I've used various open-source libraries with varying degrees of satisfaction. My favorite approach so far has been to use Ruby and its HTML parsing libraries. Today I found an article published by IBM developerWorks that describes another approach that looks really cool: "Java theory and practice: Screen-scraping with XQuery". The approach is to first use JTidy to clean up the badly structured HTML and return the document as XML (4 lines of code), then use an XML tool like Saxon XQuery to query that XML and reformat the data. The article has several examples that are under 10 lines of code each and very readable. Very nice solution!
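
To give a feel for the second step, here is a minimal sketch of my own, not code from the article. It assumes the page has already been cleaned up into well-formed XML (the job JTidy does in the article), and it swaps in the XPath engine that ships with the JDK in place of Saxon XQuery so that it runs with no extra jars. The class name, the sample markup, and the extract helper are all made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ScrapeSketch {
    // Hypothetical sample: stands in for the well-formed XML that a
    // tidy-up step would produce from a messy HTML page.
    static final String TIDIED =
        "<html><body><table>"
        + "<tr><td class=\"city\">Oslo</td><td>12</td></tr>"
        + "<tr><td class=\"city\">Lima</td><td>19</td></tr>"
        + "</table></body></html>";

    // Parse the XML into a DOM and return the text of every node
    // matched by the XPath expression (JDK XPath, not Saxon XQuery).
    static List<String> extract(String xml, String xpathExpr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
            .newXPath()
            .evaluate(xpathExpr, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Pull the text of every table cell marked class="city".
        System.out.println(extract(TIDIED, "//td[@class='city']"));
    }
}
```

Running main prints [Oslo, Lima]. A real XQuery processor adds the ability to reshape the matched data into new XML in the same query, which plain XPath cannot do.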
