Friday, March 06, 2015

Spidering and Parsing

I needed to pull product reviews from the English-language version of a site and put them in Excel files for translation to Dutch. Rather than doing things one by one, I decided to code the work, which resulted in this quick note on spidering and parsing in PHP.

For my purposes, PHPCrawl "just worked". I'll post again if I find something better, but so far I have had no need to look further.

Rather than regexing the page contents, I wanted to query the HTML. There are an AMAZING number of options, and the first two I tried "just broke" on the rather complicated pages I'm trying to parse (one was Simple HTML DOM; I don't remember the other).
Simple HTML DOM also had the additional disadvantage of being dead slow.

I am now working with PHP's built-in DOMDocument class, based on the comments on this excellent Stack Overflow post. So far, so good.
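
For reference, here is a minimal sketch of what loading messy markup into DOMDocument can look like. The HTML string and the "review" class name are made up for illustration (they are not from the site I was parsing), and suppressing libxml warnings is just one way to cope with invalid markup.

```php
<?php
// Hypothetical markup standing in for a fetched product page.
// Note the unclosed sidebar <div>: real pages are rarely valid HTML.
$html = <<<HTML
<html><body>
<div class="review"><p class="text">Great product, works as described.</p></div>
<div class="review"><p class="text">Stopped working after a week.</p></div>
<div class="sidebar"><p>Unrelated content</p>
</body></html>
HTML;

// Collect parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Walk all <div> elements and keep the text of those marked as reviews.
$reviews = [];
foreach ($doc->getElementsByTagName('div') as $div) {
    if ($div->getAttribute('class') === 'review') {
        $reviews[] = trim($div->textContent);
    }
}

print_r($reviews);
```

Filtering by tag name and attribute like this gets clumsy on deeply nested pages, which is where XPath queries come in.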

Update: This article by Ersin Kandemir was helpful as is, but additionally sent me to the XPath Helper Chrome extension (by Adam Sadovsky), which was also a big help (hint: not only does it show you XPath expressions, it also lets you test your own expressions on the current page). Thanks guys!
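
Once an XPath expression works in the browser, it can usually be dropped into PHP's DOMXPath class more or less as is. A rough sketch, again with invented markup and class names ("review", "author", "text") rather than anything from the real site:

```php
<?php
// Hypothetical review markup; the structure and class names are assumptions.
$html = '<div id="reviews">'
      . '<div class="review"><span class="author">Ann</span><p class="text">Solid build quality.</p></div>'
      . '<div class="review"><span class="author">Bob</span><p class="text">Too expensive.</p></div>'
      . '</div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// One row per review: [author, text], ready to be written out for translation.
$rows = [];
foreach ($xpath->query('//div[@class="review"]') as $review) {
    // A leading "." makes the query relative to the current review node.
    $author = $xpath->query('./span[@class="author"]', $review)->item(0);
    $text   = $xpath->query('./p[@class="text"]', $review)->item(0);
    $rows[] = [$author->textContent, $text->textContent];
}

print_r($rows);
```

Passing the review node as the second argument to query() keeps each author paired with the right text, which is easy to get wrong when querying the whole document twice.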