Question : Scraping data from another website's HTML using PHP

Hi There,

I'm trying to scrape some data from an HTML table on the website of a local radio station. They have a recently played songs list and I'd like to do some analytics on that data.

The page I'm trying to retrieve the data from is available here:

http://www.channel103.com/music/index.php?qty=100

Fortunately, the table is generated automatically and the number of songs it displays is taken from the qty value in the URL, so I have a potentially limitless dataset to work with (although I've specified 100 songs as an example).

I'd eventually like to end up with the data from that table in an array or a MySQL database (I want the Time Played, Song and Artist information for every entry). However, I'm unsure how to go about getting that information (I'm new to PHP programming, but I understand most core programming concepts at least to a basic level).

I've played around with regular expressions and have managed to write a script that lists the currently playing song and artist. However, I've come to a standstill now and can't work out where to go next. I've had a look around on the net and here on EE, and XPath seems to be a common route for similar problems, but I'm struggling to get to grips with it.

Here is the PHP Code I've written so far (massively confused by the output I'm getting!):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
	<title>Tom's 103 Analysis</title>
	<link href="style.css" rel="stylesheet" type="text/css" />
</head>

<body>

<?php 

/* 	Author: 	Tom Hacquoil
	Date: 		25th August 2010      */


/* PART 1: Get currently playing song and artist. */

	# Put the contents of the source of the destination website into a 'content' variable.
	$content = file_get_contents('http://www.channel103.com/music/index.php?qty=50');
	
	# Using regular expressions, scan the file and put the matched data into the 'data' array.
	preg_match('#<div><span>now playing &ndash; </span><a href="http://www.channel103.com/music/index.php">(.*)</a><span>(.*)</span></div>#', $content, $data);
	
	# Assign the contents of the 'data' array to two variables, song and artist.
	$song = $data[1];
	$artist = $data[2];
	
	# Print the content of those variables.
	echo "<strong>Song:</strong> $song - <strong>Artist:</strong> $artist\n";
	
	echo "<br /><br />";
	
	
/* PART 2: Get a list of all recently played songs. */

	# Put the contents of the source of the destination website into a 'content' variable.
	$content = file_get_contents('http://www.channel103.com/music/index.php?qty=20333');
	
	# Using regular expressions, scan the file and put the matched data into the 'data' array.
	preg_match('#<tr class="tabletextRow1"><td>(.*)</td>#', $content, $data);
	
	# Print first entity of the array (for testing).
	echo $data[1];
	
	echo "<br /><br /><br />";
	
	# Print the entire array. (For testing).
	print_r($data);
		

?>

</body>

</html>

Answer : Scraping data from another website's HTML using PHP

Tom,
I am not an expert on regex, but you should be using preg_match_all, which collects every match into the array, whereas preg_match stops after the first match. The attached code will print out the artist and song title. If you adjust the regex you can extract only the data you want; as it stands, the array is laid out as [0] the full match (time, artist, song), [1] time, [2] artist, [3] song.

You could even simplify this regex and use substr on the [0] entries of the array to extract the info you want.

(I reduced the number of extracted items to 10 so I would not get a bonkers amount of information)
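<?php

/* PART 2: Get a list of all recently played songs. */

	# Put the contents of the source of the destination website into a 'content' variable.
	$content = file_get_contents('http://www.channel103.com/music/index.php?qty=10');
	
	# Each table row spans several lines; capture its three cells (time, artist, song).
	$pattern = '#<tr class="tabletextRow.">\r\n<td>(.*)</td>\r\n<td>(.*)</td>\r\n<td>(.*)#';
	
	# preg_match_all fills $data with every match and returns the number of matches found.
	$found = preg_match_all($pattern, $content, $data);
	
	//var_dump($data);
	
	# Loop over the rows actually matched rather than a hard-coded count.
	for ($i = 0; $i < $found; $i++)
	{
	    echo "<br /><br />". $data[2][$i].' '.$data[3][$i];
	}
	
?>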
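
If you want the rows as an array of records (ready to loop over or insert into MySQL), one possible variation is to pass the PREG_SET_ORDER flag so that each match comes back as its own row. The sketch below is only an illustration: the $content string is made-up sample markup standing in for the real page (whose HTML may differ), the column order follows the time/artist/song layout described above, and the $rows variable and its array keys are names invented for this example.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page (assumption).
$content = "<tr class=\"tabletextRow1\">\r\n<td>10:15</td>\r\n<td>Artist A</td>\r\n<td>Song A</td>\r\n</tr>\r\n"
         . "<tr class=\"tabletextRow2\">\r\n<td>10:11</td>\r\n<td>Artist B</td>\r\n<td>Song B</td>\r\n</tr>\r\n";

// Non-greedy groups stop at the first </td>; \r?\n accepts both line-ending styles.
$pattern = '#<tr class="tabletextRow\d">\r?\n<td>(.*?)</td>\r?\n<td>(.*?)</td>\r?\n<td>(.*?)</td>#';

// With PREG_SET_ORDER each element of $matches is one table row:
// [0] is the full match, [1] to [3] are the three captured cells.
$count = preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

// Build one associative array per row.
$rows = array();
foreach ($matches as $m) {
    $rows[] = array('time' => $m[1], 'artist' => $m[2], 'song' => $m[3]);
}

// Print the collected rows, one per line.
foreach ($rows as $row) {
    echo $row['time'] . ' ' . $row['artist'] . ' ' . $row['song'] . "\n";
}
```

To use it against the live page, replace the sample string with the file_get_contents call from the code above and check the pattern against the page's actual source.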