Pull Information from Websites with PHP


Screen Scrape Data from a Website with cURL

Lots of popular websites let you export their data as XML. If you want to pull your favorite pictures from Flickr automatically, you can just use an RSS feed. But what happens when you want to pull data from a website that doesn’t export it in XML or another predictable format? If you have cURL installed and know a bit of PHP, you can set up a script that screen scrapes the HTML to extract the data.
 
Screen scraping starts with downloading the contents of the page containing the data into either a string in memory or a file. Regular expressions are then used to extract the relevant data from that string or file. You can scrape almost any website for data; the script in this post scrapes the Metacritic DVD review page and displays the information in a nice table. The script comes from PHP Hacks, a really informative book by Jack D. Herrington.
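
The two steps described above (download into a string, then extract with a regular expression) can be sketched roughly as follows. This is a minimal illustration and not the script from the book: the function names fetch_page and extract_title are made up for this example, the page title is just a convenient thing to extract, and the code assumes PHP 7.1 or later with the cURL extension loaded.

```php
<?php
// Step 1: download the page into a string with cURL.
// fetch_page is a hypothetical helper name, not part of any library.
function fetch_page(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    $html = curl_exec($ch);                         // false on failure

    return $html === false ? null : $html;
}

// Step 2: pull the relevant piece out of the string with a regular expression.
// Here the "relevant piece" is simply the contents of the <title> tag.
function extract_title(string $html): ?string
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim(html_entity_decode($m[1]));
    }
    return null;
}

// Usage (needs network access, so it is left commented out):
// $html = fetch_page('https://www.example.com/');
// echo $html === null ? "download failed\n" : extract_title($html) . "\n";
```

Keeping the download and the extraction in separate functions makes it possible to try the regular expression against a saved copy of the page without hitting the site each time.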
 
Before you download and test the script, make sure you have cURL installed with your PHP. If you are unsure, run phpinfo and search for cURL. In other words, open Notepad, create a file called phpinfo.php whose only code is a call to phpinfo(), and upload it to your server.
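
A minimal sketch of such a file is shown here. The shortest version is just the phpinfo() call; the curl_is_available helper is a hypothetical addition for this example that gives a quick yes/no answer when the file is run from the command line instead of through the web server.

```php
<?php
// phpinfo.php: check whether the cURL extension is available.

// Hypothetical helper: true when the extension is loaded and curl_init() exists.
function curl_is_available(): bool
{
    return extension_loaded('curl') && function_exists('curl_init');
}

if (PHP_SAPI === 'cli') {
    // On the command line, print a short answer.
    echo curl_is_available() ? "cURL is available\n" : "cURL is missing\n";
} else {
    // Through the web server, show the full configuration page;
    // search it for a section titled "curl".
    phpinfo();
}
```

Since the page exposes details of the server configuration, it is sensible to delete the file once you have your answer.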

 
Run the file from your server and it will show your entire PHP configuration. If cURL is listed, you can use its functionality; if not, install or enable it.
 
Now, in order to scrape information from a website, the sections you want to extract need to be clearly delimited by specific tags. Use View Source to see what the code of the website you want to scrape looks like, and find the tag that contains the data you want. The data I want to extract from Metacritic is contained within a div tag named “sortbyname1”, and the script below pulls that data from the external website.

 
Download it
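
The original listing is not reproduced on this page, so here is a rough sketch of the extraction step only. Several things are assumptions: that “sortbyname1” is the div's id attribute, that the titles inside it are plain links, and the sample HTML itself, which is made up to stand in for the downloaded page. The function names are hypothetical, and the code assumes PHP 7.1 or later.

```php
<?php
// Return the inner HTML of the first <div> with the given id, or null if absent.
// Caveat: the lazy match stops at the first </div>, so a div that contains
// nested divs would be cut short.
function extract_div(string $html, string $id): ?string
{
    $pattern = '/<div\b[^>]*\sid=["\']' . preg_quote($id, '/') . '["\'][^>]*>(.*?)<\/div>/is';
    return preg_match($pattern, $html, $m) ? $m[1] : null;
}

// Collect the visible text of every link inside an HTML fragment.
function extract_link_texts(string $fragment): array
{
    preg_match_all('/<a\b[^>]*>(.*?)<\/a>/is', $fragment, $matches);
    return array_map(function ($text) {
        return trim(strip_tags($text));
    }, $matches[1]);
}

// Render the titles as a one-column HTML table.
function render_table(array $titles): string
{
    $rows = '';
    foreach ($titles as $title) {
        $rows .= '<tr><td>' . htmlspecialchars($title) . "</td></tr>\n";
    }
    return "<table>\n" . $rows . "</table>\n";
}

// Made-up sample standing in for the page downloaded with cURL.
$sample = '<div id="other">skip me</div>'
        . '<div id="sortbyname1"><a href="/a">Movie A</a> <a href="/b">Movie B</a></div>';

$block = extract_div($sample, 'sortbyname1');
echo $block === null
    ? "div not found\n"
    : render_table(extract_link_texts($block));
```

In a real run, $sample would be replaced by the string returned from curl_exec() with CURLOPT_RETURNTRANSFER enabled, and the link pattern would need adjusting to whatever markup View Source actually shows.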
 
Screen scraping has its downsides. It will break when the website being scraped changes its format (for example, changes the name of the div tag you want to scrape). It will also break when the target site is unresponsive to web requests. Finally, it may be illegal for you to scrape a website, so make sure you have permission to use the data before you screen scrape. If you have any problems with this script, let me know.
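
The first two failure modes can at least be detected instead of silently producing an empty table. The sketch below is one possible approach, not part of the original script: fetch_with_limits and diagnose_scrape are hypothetical names, the timeout values are arbitrary, and the code assumes PHP 7.1 or later with cURL loaded.

```php
<?php
// Download a page with timeouts so an unresponsive site cannot hang the script.
// Returns [body or null, HTTP status code, cURL error message].
function fetch_with_limits(string $url, int $timeoutSeconds = 10): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body as a string
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);        // seconds allowed for connecting
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds); // seconds allowed for the whole request
    $body   = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $error  = curl_error($ch);

    return [$body === false ? null : $body, $status, $error];
}

// Describe what went wrong with a scrape, or return null when everything looks fine.
// $marker is a piece of text that must be present, such as 'id="sortbyname1"'.
function diagnose_scrape(?string $body, int $status, string $marker): ?string
{
    if ($body === null) {
        return 'request failed (site unreachable or timed out)';
    }
    if ($status < 200 || $status >= 300) {
        return "unexpected HTTP status $status";
    }
    if (strpos($body, $marker) === false) {
        return "marker '$marker' not found; the page layout may have changed";
    }
    return null;
}

// Usage (needs network access, so it is left commented out):
// [$body, $status, $error] = fetch_with_limits('https://www.example.com/');
// $problem = diagnose_scrape($body, $status, 'id="sortbyname1"');
// echo $problem === null ? "page looks scrapeable\n" : "problem: $problem $error\n";
```

Checking for the marker before running the extraction regex separates "the site is down" from "the site changed its markup", which are fixed in different ways.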


Comments

Nice tip and instructions. I look forward to checking it out.

Awesome, let me know how it went.

[…] There are many ways you can do it with PHP, one excellent way of scraping websites with php is documented on TechRoam but there is even a simpler way. Let’s say you want to grab the contents of the Wikipedia page about screen scraping and you only want the first paragraph. The four lines below will do the job. If you encounter problems you can use comments below for discussion. Enjoy! […]
