Basic Webpage Scraping with PHP
I've been using this technique for a while now, but I only recently heard it called "scraping". Apparently, people use it as a "black hat" technique to earn money online, although I have no idea how (if you do, your comments are welcome). So this tutorial won't teach you any black hat method for earning money, but it will show you how to scrape a web page using PHP.
In this tutorial, I will use fopen to read the web page's content. For this method to work, the allow_url_fopen option must be enabled in php.ini. It has been enabled on every web host I've used so far, and it is also enabled by default when you install PHP.
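If you are not sure how your host is configured, a quick check along these lines may help. It is only a sketch: url_fopen_enabled() is a made-up helper name, and it simply reads the setting with ini_get().

```php
<?php
// Returns true when the allow_url_fopen setting is switched on.
// url_fopen_enabled() is a hypothetical helper name used only for this example.
function url_fopen_enabled(): bool
{
    // ini_get() returns the setting as a string (for example "1" or "Off"),
    // or false if the option does not exist.
    $value = ini_get('allow_url_fopen');

    // FILTER_VALIDATE_BOOLEAN treats "1", "on", "yes" and "true" as true.
    return filter_var($value, FILTER_VALIDATE_BOOLEAN);
}

if (!url_fopen_enabled()) {
    echo "allow_url_fopen is disabled; fopen() cannot open http:// URLs.\n";
}
```

The option can only be changed in php.ini or the server configuration, not from inside the script, so on shared hosting you may need to ask your provider.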
A program designed to scrape a web page (or even a whole website) simply goes to the site's URL and reads its HTML code. Your program can do lots of things with this HTML code. Here are some ideas:
- Get all the links found in the page.
- Visit an online store and gather all the information about its products.
- Go to a Google search results page and record the current rankings for a certain keyword.
- And many more.
Personally, I once wrote a crawler/scraper that visits a forum and gets the number of online users at a specific time interval. The scraper we will create in this tutorial uses the same concept. Now that you know what a scraper does, let's get our hands dirty and start coding. What we're going to do here is go to Google's home page and get the URLs of all the links on that page. This may be a lot simpler than you think.
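The fetch step can be sketched roughly as follows. This is a reconstruction in the style of the PHP Manual's remote-read example, not necessarily the exact listing: read_all() is a made-up helper name, and the 8192-byte chunk size and the error check are assumptions.

```php
<?php
// Reads everything from $url (anything fopen() accepts) and returns it as a string.
// read_all() is a hypothetical helper name; the 8192-byte chunk size is an assumption.
function read_all(string $url): string
{
    // "rb" opens the stream for reading in binary mode.
    $handle = fopen($url, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Could not open $url");
    }

    $contents = '';
    // Keep reading chunks until the end of the stream is reached.
    while (!feof($handle)) {
        $chunk = fread($handle, 8192);
        if ($chunk === false) {
            break; // stop on a read error instead of looping forever
        }
        $contents .= $chunk;
    }
    fclose($handle);

    return $contents;
}

// Usage for this tutorial (needs allow_url_fopen and network access):
// $contents = read_all('http://www.google.com/');
```

Remote streams deliver data in packets, so a single fread() call may return less than the whole page. That is why the read sits in a loop.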
The sample code above was taken from the PHP Manual's fopen page (replacing "example.com" with "google.com"). It visits the URL "http://www.google.com/", gets the HTML code of that page, and stores it in the string variable $contents.
Now that we have the HTML code of the target web page, the next thing we need to do is get all the URLs in its links. We can do this with a regular expression. Below is the call that gathers all the URLs in the page and stores them in an array.
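# The original pattern did not survive on this page; this one is an assumed stand-in that captures each link's href value.
preg_match_all('/<a\s[^>]*href=["\']?([^"\'\s>]+)/i', $contents, $matches);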
I will not explain this function in detail, for that is beyond the scope of this tutorial. Basically, it gets all the URLs in the string variable $contents and stores them in the array variable $matches.
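To get a feel for what ends up in the array, here is a small sketch that runs on a hard-coded piece of HTML instead of a live page. The helper name extract_link_urls(), the sample markup, and the pattern itself are all assumptions made for illustration. A regular expression like this is a rough tool and will miss unusual markup.

```php
<?php
// Returns the href values found in the <a> tags of $html.
// extract_link_urls() and the pattern are assumptions made for illustration.
function extract_link_urls(string $html): array
{
    // Match "<a", some whitespace, any other attributes, then href= with an
    // optional quote. The parentheses capture the URL up to the next quote,
    // whitespace, or ">".
    $pattern = '/<a\s[^>]*href=["\']?([^"\'\s>]+)/i';
    preg_match_all($pattern, $html, $matches);

    // $matches[0] holds the full matched text;
    // $matches[1] holds the first capture group, which is the URL.
    return $matches[1];
}

$sample = '<p><a href="http://example.com/a">A</a> <a class="x" href=\'/b\'>B</a></p>';
print_r(extract_link_urls($sample));
```

With the sample markup above, the returned array holds the two values http://example.com/a and /b, in that order.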
Now that we have scraped Google's home page and collected all the URLs contained in its links, let's display those URLs in the browser. All we need to do is iterate through the captured URLs in $matches[1] and "echo" every item. The complete code is listed below.
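<?php

# Use fopen to scrape Google's homepage and
# store its HTML code in the $contents variable.
$handle = fopen("http://www.google.com/", "rb");
if ($handle === false) {
    die("Could not open the page.");
}
$contents = '';
while (!feof($handle)) {
    $contents .= fread($handle, 8192);
}
fclose($handle);

# Get all URLs in links from the HTML code
# and store them in an array, $matches.
# (The original pattern was lost; this one is an assumed stand-in.)
preg_match_all('/<a\s[^>]*href=["\']?([^"\'\s>]+)/i', $contents, $matches);

# Display all these URLs, one per line.
# (The line-break markup is assumed; htmlspecialchars keeps the output safe.)
foreach ($matches[1] as $url) {
    echo htmlspecialchars($url) . "<br />\n";
}

?>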
Posted on: May 14, 2008 Filed under: PHP
Comments (5)
PHP Screen Scraping
May 26th, 2008 at 7:24 pm
Hey nice php screen scraping tutorial I’ve had some up for a while with little more advanced stuff as well.
Will
May 26th, 2008 at 7:45 pm
Very useful, thank you.
Tony
June 15th, 2008 at 1:47 am
This works great! Now to learn Regular Expressions bwahahah…….
sirap
June 22nd, 2008 at 5:23 am
i don’t whats the use of it i still beginners about php.. i want to learn about it more.
pjaxon
December 10th, 2009 at 12:11 am
Re: "I don’t have any idea how someone can use this to earn some money online". There are a number of ways that come to mind.
One business involves scraping target sites and processing the information with algorithms that rate public opinion. They generate reports based on all sorts of criteria in their analysis engine and then they sell that info to clients that are interested in web trends. Obviously a lot more than just scraping data, but their inventory and product are all based on what they scrape.