Pull Information from Websites with PHP
Screen Scrape Data from a Website with cURL
Many popular websites let you export their data as XML. If you want to pull your favorite pictures from Flickr automatically, you can just use an RSS feed. But what happens when you want to pull data from a website that doesn’t export it in XML or another predictable format? If you have cURL installed and know a bit of PHP, you can set up a script that screen scrapes the HTML to extract the data.
Screen scraping starts with downloading the contents of the page containing the data into either a string in memory or a file. Regular expressions are then used to extract the relevant data from the string or file. You can scrape almost any website for data. The script discussed here scrapes the Metacritic DVD review page and displays the information in a table. It is taken from PHP Hacks by Jack D. Herrington, a really informative book.
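The two steps described above, download and then extract, can be sketched roughly as follows. This is a minimal illustration and not the script from the book: the function names fetch_page and extract_headings are made up for this example, and the choice of h2 headings as the target is just an assumption to have something to match.

```php
<?php
// Step 1: download a page into a string with cURL.
// Returns null if the request fails. (Hypothetical helper.)
function fetch_page(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // give up after 15 seconds
    $html = curl_exec($ch);
    return $html === false ? null : $html;
}

// Step 2: use a regular expression to pull data out of the string.
// Here: the text of every <h2> element. (Hypothetical helper.)
function extract_headings(string $html): array
{
    // "s" lets "." span newlines, "i" ignores tag case,
    // and ".*?" stops at the first closing tag.
    preg_match_all('/<h2[^>]*>(.*?)<\/h2>/is', $html, $matches);
    // Drop any inline markup and surrounding whitespace.
    return array_map('trim', array_map('strip_tags', $matches[1]));
}
```

You would pass the string returned by fetch_page to extract_headings. Keeping the download and the parsing in separate functions means the regular expression can be tried out on a saved copy of the page without hitting the site each time.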
Before you download and test the script, make sure cURL is installed with your PHP. If you are unsure, run phpinfo() and search the output for cURL. In other words, open a text editor such as Notepad, create a file called phpinfo.php containing a single call to phpinfo(), and upload it to your server.
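A file like that can be as short as this. It is a minimal sketch of phpinfo.php; remember to delete it from the server once you are done, since it exposes your configuration to anyone who finds it.

```php
<?php
// phpinfo.php: prints the complete PHP configuration,
// including one section per loaded extension.
phpinfo();
```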
Run the file from your server and it will show your entire PHP configuration. If the output has a curl section, the extension is installed and you can use its functions; if not, install it.
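If you would rather not read through the whole configuration page, the same check can be done in code. This is a small sketch using PHP's built-in extension_loaded and function_exists; the wrapper name curl_is_available is invented for this example.

```php
<?php
// Report whether the cURL extension is loaded in this PHP build.
// (Hypothetical helper name.)
function curl_is_available(): bool
{
    return extension_loaded('curl') && function_exists('curl_init');
}

echo curl_is_available()
    ? "cURL is available\n"
    : "cURL is not installed\n";
```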
To scrape information from a website, the sections you want to extract need to be clearly delimited by specific tags. Use View Source to look at the code of the page you want to scrape and find the tag that contains the data you are after. The data I want to extract from Metacritic is contained within a div tag named “sortbyname1”, and the script pulls that data from the external website.
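As a rough illustration of that extraction step, here is a sketch that isolates one div and turns its links into table rows. It is not the script from PHP Hacks. It assumes that “sortbyname1” is the div's id attribute, that the div contains no nested div elements, and that the entries inside it are plain links; the function names extract_div and links_to_table are hypothetical.

```php
<?php
// Return the inner HTML of the first <div> with the given id, or null.
// Assumption: no nested <div> inside it, because the lazy match
// stops at the first closing </div>.
function extract_div(string $html, string $id): ?string
{
    $pattern = '/<div[^>]*\bid\s*=\s*["\']' . preg_quote($id, '/')
             . '["\'][^>]*>(.*?)<\/div>/is';
    return preg_match($pattern, $html, $m) ? $m[1] : null;
}

// Turn every link in an HTML fragment into a two-column table row:
// link text, then link target.
function links_to_table(string $fragment): string
{
    preg_match_all(
        '/<a[^>]*href\s*=\s*["\']([^"\']*)["\'][^>]*>(.*?)<\/a>/is',
        $fragment,
        $rows,
        PREG_SET_ORDER
    );
    $out = "<table>\n";
    foreach ($rows as $row) {
        // Escape both values so scraped text cannot inject markup.
        $title = htmlspecialchars(trim(strip_tags($row[2])));
        $url   = htmlspecialchars($row[1]);
        $out  .= "<tr><td>$title</td><td>$url</td></tr>\n";
    }
    return $out . "</table>\n";
}
```

To use it, pass the downloaded page source to extract_div with 'sortbyname1' as the id, check that the result is not null, and echo the output of links_to_table. If the real markup nests div elements or uses something other than links, the patterns would need adjusting, and an HTML parser such as PHP's DOMDocument may be a sturdier choice than regular expressions.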











