HandsomeSoup is the library I wish I had when I started parsing HTML in Haskell.
It is built on top of HXT and adds a few functions that make it easier to work with HTML.
Most importantly, it adds CSS selectors to HXT. The goal of HandsomeSoup is to be a complete CSS2 parser for HXT (it is very close to this right now).
cabal install HandsomeSoup
Nokogiri, the HTML parser for Ruby, has an example showing how to scrape Google search results. This is easy in HandsomeSoup:
    import Text.XML.HXT.Core
    import Text.HandsomeSoup

    main = do
        doc <- fromUrl "http://www.google.com/search?q=egon+schiele"
        links <- runX $ doc >>> css "h3.r a" ! "href"
        mapM_ putStrLn links
What can HandsomeSoup do for you?
Easily parse an online page using fromUrl:
    doc <- fromUrl "http://example.com"
Or a local page using parseHtml:
    contents <- readFile [filename]
    doc <- parseHtml contents
Easily extract elements using css.
Here are some valid selectors:
    doc >>> css "a"
    doc >>> css "*"
    doc >>> css "a#link1"
    doc >>> css "a.foo"
    doc >>> css "p > a"
    doc >>> css "#container h1"
    doc >>> css "img[width]"
    doc >>> css "img[width=400]"
    doc >>> css "a[class~=bar]"
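To try the selectors without a network connection, here is a minimal sketch that runs several of them against an inline page and prints how many elements each one matches. The sample markup and the helper name countMatches are made up for this example. The document is built with HXT's readString (HTML parsing on, warnings off) rather than parseHtml, so the sketch does not depend on how parseHtml is bound in your installed version. The selector is composed left to right with >>>, as in the Google example above.

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup

-- Hypothetical sample page, invented for this sketch.
sampleHtml :: String
sampleHtml = unlines
  [ "<html><body>"
  , "<div id=\"container\"><h1>Title</h1></div>"
  , "<p><a id=\"link1\" class=\"foo bar\" href=\"/one\">One</a></p>"
  , "<img src=\"a.png\" width=\"400\">"
  , "</body></html>"
  ]

-- Hypothetical helper: count the elements a selector matches in an HTML string.
countMatches :: String -> String -> IO Int
countMatches selector html = do
  -- Build the document arrow with HXT's own HTML parser.
  let doc = readString [withParseHTML yes, withWarnings no] html
  -- css yields one result per matching element.
  matches <- runX $ doc >>> css selector
  return (length matches)

main :: IO ()
main = mapM_ report
  [ "a", "a#link1", "a.foo", "p > a"
  , "#container h1", "img[width]", "img[width=400]", "a[class~=bar]" ]
  where
    -- Print each selector next to its match count.
    report sel = do
      n <- countMatches sel sampleHtml
      putStrLn (sel ++ ": " ++ show n)
```

Swapping sampleHtml for the contents of a real file is a quick way to check a selector before using it in a scraper.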
Easily get attributes using (!):
    doc >>> css "img" ! "src"
    doc >>> css "a" ! "href"
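Because css returns an ordinary HXT arrow, it can also be combined with HXT's own functions when an attribute alone is not enough. The sketch below pairs each link's href with its text, using getAttrValue, getChildren and getText from HXT. The helper name linkPairs and the sample markup are made up for this example, and the document is again built with HXT's readString.

```haskell
import Control.Arrow ((&&&))
import Text.XML.HXT.Core
import Text.HandsomeSoup

-- Hypothetical helper: (href, link text) for every <a> element in an HTML string.
linkPairs :: String -> IO [(String, String)]
linkPairs html = do
  let doc = readString [withParseHTML yes, withWarnings no] html
  -- For each matched <a>, read the href attribute and the text of its child nodes.
  runX $ doc >>> css "a" >>> (getAttrValue "href" &&& (getChildren >>> getText))

main :: IO ()
main = do
  -- Sample markup invented for this sketch.
  let page = "<html><body><p><a href=\"/one\">One</a> <a href=\"/two\">Two</a></p></body></html>"
  pairs <- linkPairs page
  mapM_ (\(href, label) -> putStrLn (label ++ " -> " ++ href)) pairs
```

When only the attribute is needed, css "a" ! "href" stays the shorter choice.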