Web Scraping with PHP

Web scraping is an interesting thing to do. There is a lot of data on the web, and many interesting things can be done with it once it is scraped and organized in more meaningful ways. There are many ways of scraping data, and you can choose whichever is best for what you are trying to do. From the simplest approach, manually copying and pasting, to more complex ones such as automatic link following and computer-simulated human interaction, web scraping is useful, interesting and fun.

Imagine you want to study plane ticket prices and how they fluctuate over time. You could just bookmark the site, visit it every day, copy the price and paste it into a spreadsheet. This is OK as long as you only need one price, and you could set some kind of reminder to make sure you don’t forget to do it. But what happens when you want the price for the same ticket from different travel agencies, or even different airlines? Let’s say you want to compare 100 different agencies. Now copy and paste doesn’t seem like such a good idea.

A few days ago I had to gather information on car dealers from a site that allowed you to find car dealers near you based on your zip code. I had to run all the US zip codes, and get all the information into a database. That definitely didn’t sound like something I wanted to do with copy and paste, so I did what programmers do best: let the computer do the work.

I think many programmers will agree when I say that programmers are inherently lazy. I’m not talking about programmers spending all their time as couch potatoes, because in fact many programmers work day and night. What I mean is that we like to do as little work as possible to accomplish a task. That is why we program in the first place: we want to automate tasks so that we don’t have to bother ourselves with repetitive work.

So, how do you program a web scraper? There are many ways to do it, but the basic idea is always the same: fetch a resource from the net (usually a web page), search its code for the data that is relevant to you, and save that data somewhere. That is all you really need to know to start scraping data. So, let’s build a simple web scraper in PHP.

Before continuing, I’d like to mention that ready-made scraping solutions already exist. There is software for scraping data, and there are libraries written for many languages that specialize in data scraping. However, for the sake of learning, we are going to code our scraper here by hand.

What do we need?
We need PHP, and a way to interact with the DOM. If you read my previous entry where I talk about PHP and the DOM, you know of a few options to do that. We will also need an idea of what we want to scrape. Let’s start with something simple. We will scrape the box office information from IMDB. We will only do it this one time, but in a real-life situation you may want to set up a cron job to scrape the data daily, weekly or at any other interval.

The first thing we need to know is the URL of the page from which we want to scrape the data. In this case it is http://www.imdb.com/chart/
Then we need to know how to find the data in which we are interested. For that, since we are going to be using the DOM, you can just look at the source code of the page. You will notice that the information that we want to scrape is in a tr element with a class name of either chart_odd_row or chart_even_row. That is what we will use to identify the information.

Now, let the fun begin.

Create a file called box_office_scraper.php. I placed it inside a directory called IMDB. Open the file in your favorite editor, and get ready to type.

First, let’s get the document from the web:


$url = "http://www.imdb.com/chart/";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$document = curl_exec($curl);

echo $document;

We declare a variable to hold the url of the document we want to fetch.
Then we initialize curl, passing in the url.
We set the CURLOPT_RETURNTRANSFER option so that curl returns a string containing the document rather than printing it.
Then we execute curl, and save the returned string in a variable.
Finally we echo the contents of the variable just to make sure we got everything right. You should now see the same thing you would see if you visited the URL directly. The echo is only there for verification, so you can delete that line now.
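The snippet above assumes the request always succeeds. As a minimal sketch of how a failed request could be handled, the function below wraps the same curl calls and checks the return value of curl_exec. The function name fetch_document, the redirect option and the timeout values are my own assumptions for illustration, not part of the original script.

```php
<?php
// Hypothetical helper (not part of the original script): fetch a URL and
// return its body as a string, or null when the request fails.
function fetch_document(string $url): ?string
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // follow redirects, if any
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 5);    // assumed value: seconds to wait for a connection
    curl_setopt($curl, CURLOPT_TIMEOUT, 15);          // assumed value: seconds for the whole request

    $body = curl_exec($curl);
    if ($body === false) {
        // curl_error() describes what went wrong with the last operation.
        error_log('Fetch failed: ' . curl_error($curl));
        return null;
    }

    return $body;
}
```

You would then call it as $document = fetch_document($url); and stop early if the result is null, instead of handing an empty value to the DOM parser.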

Now that we have the document, it is time to search for the data we want, but first we need to create a DOM representation of the document. Continue editing your file:


$dom_rep = new DOMDocument;
$dom_rep->loadHTML($document);

If you reload your page now, you should no longer see the page that you were seeing before (provided you deleted the echo). Depending on your PHP configuration, you may, however, see a bunch of warnings. That is because the document is malformed. Usually I like to have all errors and warnings visible, so I can get rid of them by fixing whatever is causing them, but in this case the warnings are only polluting my page, so I prefer to turn them off. If you want to get rid of the warnings, just add this line at the top of the file, right after the opening PHP tag:


error_reporting(E_ERROR);

This tells PHP to report only fatal run-time errors (E_ERROR) and to hide warnings and notices, so we can still see when something goes seriously wrong.
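If you would rather not change the error reporting level for the whole script, another option is to ask libxml to collect its parse errors internally while the document is loaded. This is only a sketch of that alternative: the sample markup below is made up so the snippet runs without a network request, and the variable names $previous and $parse_errors are my own.

```php
<?php
// Made-up, slightly malformed markup standing in for a downloaded page.
$document = '<html><body><table>'
          . '<tr class="chart_odd_row"><td>1<td>Example</tr>'
          . '</table><bogus></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom_rep = new DOMDocument;
$dom_rep->loadHTML($document);

// Look at (or ignore) what the parser complained about, then clear the buffer.
$parse_errors = libxml_get_errors();
libxml_clear_errors();

// Restore the previous setting so the rest of the script is unaffected.
libxml_use_internal_errors($previous);

echo count($parse_errors) . " parse problem(s) ignored\n";
```

The advantage of this approach is that warnings from the rest of your script stay visible, and only the noise from the malformed HTML is silenced.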

Now that we have a DOM representation of the document, we can start working with it. Since we know the data we want is in tr elements, we can just grab them all and see if they have the class names we are looking for. Unfortunately, DOMDocument has no method for getting elements by class name, so we will have to do the filtering ourselves.


$all_trs = $dom_rep->getElementsByTagName('tr');
$trs_we_want = array();
foreach ($all_trs as $tr) {
  $class_name = $tr->getAttribute('class');
  if (preg_match("/chart_(even|odd)_row/", $class_name)) {
    $trs_we_want[] = $tr;
  } 
}

We wrote a simple loop, but we could have written a more robust function. In this case the loop is enough.
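For reference, here is one way such a function might look. It is only a sketch: the name get_elements_by_class_name and its parameters are hypothetical, and the inline HTML is a made-up sample so the snippet runs on its own. Unlike the regular expression in the loop, it compares whole class names, so a class that merely contains chart_odd_row as a substring is not matched.

```php
<?php
// Hypothetical helper: return the elements with the given tag name whose
// class attribute contains $class_name as a whole, space-separated token.
function get_elements_by_class_name(DOMDocument $dom, string $tag_name, string $class_name): array
{
    $matches = array();
    foreach ($dom->getElementsByTagName($tag_name) as $element) {
        // Split "foo  bar baz" into array('foo', 'bar', 'baz').
        $classes = preg_split('/\s+/', trim($element->getAttribute('class')));
        if (in_array($class_name, $classes, true)) {
            $matches[] = $element;
        }
    }
    return $matches;
}

// Small made-up sample so the sketch runs without a network request.
$html = '<table>'
      . '<tr class="chart_odd_row"><td>1</td></tr>'
      . '<tr class="chart_even_row highlight"><td>2</td></tr>'
      . '<tr class="not_chart_odd_row_at_all"><td>3</td></tr>'
      . '</table>';

$dom_rep = new DOMDocument;
$dom_rep->loadHTML($html);

// Note: this lists all odd rows first and then all even rows,
// not the rows in document order.
$trs_we_want = array_merge(
    get_elements_by_class_name($dom_rep, 'tr', 'chart_odd_row'),
    get_elements_by_class_name($dom_rep, 'tr', 'chart_even_row')
);

echo count($trs_we_want) . "\n"; // prints 2
```

Because the merged array is not in document order, the simple loop is still the better fit when the order of the rows matters, as it does for a ranking.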

Now that we have all the elements we need, we can proceed to get the data. One thing to notice is that we will get 30 tr elements, but we are only interested in the first 10. We get 30 because we also get the ones from the other two tables in the page.

Let’s loop over our elements up to the 10th and get the data:


for ($i = 0; $i < 10; $i++) {
  $the_tds = $trs_we_want[$i]->getElementsByTagName('td');
  $the_tds_arr = array();

  foreach ($the_tds as $td) {
    $the_tds_arr[] = $td;
  }

  $movie_title = $the_tds_arr[2]->nodeValue;
  $rank = $the_tds_arr[0]->nodeValue;
  $weekend = $the_tds_arr[3]->nodeValue;
  $gross = $the_tds_arr[4]->nodeValue;
  $weeks = $the_tds_arr[5]->nodeValue;
  echo "<div>";
  echo "<h2>$movie_title</h2>";
  echo "Rank: $rank<br />";
  echo "Weekend: $weekend<br />";
  echo "Gross: $gross<br />";
  echo "Weeks: $weeks<br />";
  echo "</div>";
}

As you can see, we are just looping and picking out the data that we want. We created the $the_tds_arr array so that we can access the cells in $the_tds by index like a regular array. We could have used more of the DOM, but the idea here was to keep it as simple as possible. In this example we are only printing the info on screen, but in a real-life situation you may want to save it to a file, a database, a spreadsheet, or some other kind of back end that you have.
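As an illustration of the "save it to a file" option, here is a minimal sketch that writes rows to a CSV file, which any spreadsheet program can open. The $movies array holds placeholder values, not real box office data, and the file name box_office.csv and its location in the system temp directory are assumptions; in the scraper you would fill the array inside the loop instead of echoing.

```php
<?php
// Placeholder rows; in the scraper these values would come from the loop.
$movies = array(
    array('rank' => '1', 'title' => 'Example Movie', 'weekend' => '$10M', 'gross' => '$25M', 'weeks' => '2'),
    array('rank' => '2', 'title' => 'Another Example', 'weekend' => '$8M', 'gross' => '$40M', 'weeks' => '4'),
);

// Assumed output location; change it to wherever you keep your data.
$path = sys_get_temp_dir() . '/box_office.csv';

$handle = fopen($path, 'w');
if ($handle === false) {
    exit("Could not open $path for writing\n");
}

// Header row first, then one line per movie.
// Separator, enclosure and escape character are passed explicitly.
fputcsv($handle, array('rank', 'title', 'weekend', 'gross', 'weeks'), ',', '"', '\\');
foreach ($movies as $movie) {
    fputcsv($handle, array_values($movie), ',', '"', '\\');
}

fclose($handle);
```

If the script runs from a cron job, opening the file with mode 'a' instead of 'w' would append each run to the same file rather than overwrite it.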

Here is all the code:


<?php
error_reporting(E_ERROR);

$url = "http://www.imdb.com/chart/";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$document = curl_exec($curl);

//echo $document;

$dom_rep = new DOMDocument;
$dom_rep->loadHTML($document);

$all_trs = $dom_rep->getElementsByTagName('tr');
$trs_we_want = array();
foreach ($all_trs as $tr) {
  $class_name = $tr->getAttribute('class');
  if (preg_match("/chart_(even|odd)_row/", $class_name)) {
    $trs_we_want[] = $tr;
  }
}

for ($i = 0; $i < 10; $i++) {
  $the_tds = $trs_we_want[$i]->getElementsByTagName('td');
  $the_tds_arr = array();

  foreach ($the_tds as $td) {
    $the_tds_arr[] = $td;
  }

  $movie_title = $the_tds_arr[2]->nodeValue;
  $rank = $the_tds_arr[0]->nodeValue;
  $weekend = $the_tds_arr[3]->nodeValue;
  $gross = $the_tds_arr[4]->nodeValue;
  $weeks = $the_tds_arr[5]->nodeValue;
  echo "<div>";
  echo "<h2>$movie_title</h2>";
  echo "Rank: $rank<br />";
  echo "Weekend: $weekend<br />";
  echo "Gross: $gross<br />";
  echo "Weeks: $weeks<br />";
  echo "</div>";
}

In the next post we will see how the DOM libraries we talked about in the past can make scraping easier.