Web scraping done right using PHP

Gathering information is easier than ever these days thanks to web services and APIs. But they don't cover everything, and that is why scraping is still a must. It is not easy, given that there are many ways to do it and that no two web pages are built the same. Today I am going to show you the right way to scrape and extract information from almost any website.

The first thing to do when writing a website scraper is to analyse a number of web pages, looking for a pattern. I do this using Google Chrome's developer tools. Then I write a piece of code that follows this pattern and extracts the data. It may seem simple, but it is not. I have tried many approaches, such as parsing the web page as a DOM document and traversing it with XPath, but I couldn't do much with it; it is just not that efficient. So I moved on to regex. It gave me more options and broader limits, but it is not structure aware: it gets hard to handle once the markup becomes complicated. The best way I have found so far is jQuery-like CSS selectors! Yes, I am still talking about PHP. I was looking for new options when I found the PQLite project. I won't talk any longer; instead, I will move on to the coding part!
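Before moving on, here is a small illustration of why regex is "not structure aware" while a DOM query is. It is only a sketch: the HTML snippet is made up for the comparison (it is not a real IMDb page), and it uses PHP's built-in DOMDocument and DOMXPath classes rather than PQLite.

```php
<?php
// Hypothetical markup, invented for this illustration.
$html = '<div id="sidebar"><img src="ad.jpg"></div>'
      . '<div class="image"><img src="poster.jpg"></div>';

// Regex: grabs the first <img> it meets, with no idea which div it sits in.
preg_match('/<img[^>]+src="([^"]+)"/', $html, $m);
$regexResult = $m[1];

// DOM + XPath: structure aware, but the query is wordier than a CSS selector.
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ hides warnings that real-world HTML often triggers
$xpath = new DOMXPath($doc);
$node = $xpath->query('//div[@class="image"]//img')->item(0);
$xpathResult = $node->getAttribute('src');

echo $regexResult, "\n"; // ad.jpg
echo $xpathResult, "\n"; // poster.jpg
```

The regex happily returns the sidebar image, because it only sees text. The XPath query finds the right one, and a CSS selector such as ".image img" says the same thing in fewer characters.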

Let's say we want to get the poster image of any given movie, given its IMDb ID. So we open this page and analyse it:

http://www.imdb.com/title/tt1821694/

Then we inspect the poster image: right-click on the poster and choose "Inspect element".

[Screenshot: inspector]

The selected tag is our poster image. Let's analyse the hierarchy in which our element is found: a div with the class "image", inside a td with the id "img_primary", inside another div with the id "title-overview-widget". This should be enough to clearly and uniquely identify the poster image. If you are already familiar with jQuery, you know that class names are prefixed with a dot, while ids are prefixed with a hash sign (#). Let's open the JavaScript console in the browser and type the following:

$("#title-overview-widget #img_primary .image img")

Hurray! This line of code gives us the element that holds our poster image, so now we are sure that we have the right jQuery selector. Let's move to the PHP part. Go ahead and download the PQLite project, create a new PHP file on your (local) server and type the following:

<?php
$imdb_id = "tt1821694";
$imdb_url = 'http://www.imdb.com/title/' . $imdb_id . '/';
include "PQLite.php";
// Download the page and hand its HTML to PQLite
$pq = new PQLite(file_get_contents($imdb_url));
// Same selector we tested in the browser console
echo $pq->find("#title-overview-widget #img_primary .image img");
?>

This code echoes out the element that holds the poster image. It gives:

TagArray Object [you may find methods associated with it at http://pqlite.com/docs/]

Something is wrong, right? This is because we tried to echo out a list of HTML tags; at least that is how PQLite treats the result of find(). We are sure that the selector matches a single element, and besides, we only want its src attribute. So we replace the last line of code with the following:

echo $pq->find("#title-overview-widget #img_primary .image img")->get(0)->getAttr("src");

That should echo out a URL like this:

http://ia.media-imdb.com/images/M/MV5BMjI2ODQ4ODY3Nl5BMl5BanBnXkFtZTcwNTc2NzE1OQ@@._V1_SX214_.jpg
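To make this reusable for other movies, it may help to validate the ID before putting it into the URL and to check that the download actually succeeded, since file_get_contents() returns false on failure. The following is a minimal sketch, not part of PQLite: the function names imdbTitleUrl and fetchPage are hypothetical, the "tt followed by digits" rule is an assumption based on the example ID, and the 10-second timeout is an arbitrary choice.

```php
<?php
// Hypothetical helpers; neither is provided by PQLite.

// Build the title page URL, rejecting anything that does not look like an IMDb ID.
// Assumption: IDs are "tt" followed by digits, as in the example above.
function imdbTitleUrl(string $imdbId): string
{
    if (!preg_match('/^tt\d+\z/', $imdbId)) {
        throw new InvalidArgumentException("Unexpected IMDb ID: $imdbId");
    }
    return 'http://www.imdb.com/title/' . $imdbId . '/';
}

// Fetch a page, turning file_get_contents()'s false return into an exception.
function fetchPage(string $url): string
{
    $context = stream_context_create([
        'http' => ['timeout' => 10], // seconds; an arbitrary choice
    ]);
    $html = @file_get_contents($url, false, $context); // @ hides the warning; we check below
    if ($html === false) {
        throw new RuntimeException("Could not fetch $url");
    }
    return $html;
}

echo imdbTitleUrl('tt1821694'), "\n"; // http://www.imdb.com/title/tt1821694/
```

With these in place, the PQLite object from the script above could be created with new PQLite(fetchPage(imdbTitleUrl($imdb_id))), and a bad ID or a failed download would stop the script with a clear message instead of passing false to the parser.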

Yes, we have done it, right and easy! Go ahead and change $imdb_id to any other movie ID and watch how these few lines work on any IMDb title page that follows the same layout. Of course you can go further with this and build many more useful examples. You may share your work with us. Follow me for more tricks and tutorials!
