PHP Socket Implementation and Webpage Downloader
This post reveals two handy PHP constructs I’ve been using for a while now: a Socket class and a webpage-downloading function.
The files:
- Socket.php (highlighted; plaintext)
- webpage.php (highlighted; plaintext)
What good is a full-on Socket class and function, you may ask, when you can simply run file_get_contents with allow_url_fopen turned on? Well, consider that you want to send specific headers to the server, to POST data for example. Also, there are other uses for a Socket class than webpage downloading; I use this Socket implementation in a PHP-based IRC bot that I run (by the name of ecode on the Freenode network). You can use the socket to interact with many different servers.
Socket
The Socket class essentially encapsulates PHP’s socket_* functions, making it straightforward to open a TCP connection to a server. The socket is given a timeout period in microseconds. I’m not 100% sure of the exact semantics, but as far as I can tell a read gives up after roughly that long if nothing has arrived, so the application can poll for content in a sort of “busy wait” and do other processing while waiting for data to come in through the pipe.
To use the socket, include the Socket.php file and create a variable to hold a new Socket.
require_once("Socket.php");
$s = new Socket(1000);
The Socket automatically sets up the outgoing TCP socket for use. The parameter passed to the constructor (1000 in this case) is the timeout in microseconds; it can be omitted if you wish.
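For the curious, here is one way a microsecond timeout can be applied with PHP’s socket_* functions. This is only a sketch and not the code in Socket.php: the helper names usecToTimeval and createTimedSocket are made up for illustration, and using the SO_RCVTIMEO option is an assumption about how such a timeout might be enforced.

```php
<?php
// Hypothetical helpers for illustration; not taken from Socket.php.

// Split a microsecond count into the sec/usec pair socket_set_option() expects.
function usecToTimeval($usec) {
    return array(
        "sec"  => (int) floor($usec / 1000000),
        "usec" => $usec % 1000000,
    );
}

// Create a TCP socket whose reads give up after $usec microseconds
// (assumed approach: a receive timeout set through SO_RCVTIMEO).
function createTimedSocket($usec) {
    $sock = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
    if ($sock === false) {
        return false;
    }
    socket_set_option($sock, SOL_SOCKET, SO_RCVTIMEO, usecToTimeval($usec));
    return $sock;
}

// 1000 microseconds is one millisecond: zero whole seconds, 1000 usec.
$tv = usecToTimeval(1000);
```

With a value this small, a read that finds no data returns almost immediately, which would explain the polling behaviour described above.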
Once created, the socket can connect to any service through the connect() function. The function takes two arguments: host and port. Both are optional to allow reuse of the connection: if called without parameters, the Socket will establish a new connection using the host and port from the last connection. The host is the address you would ping to see if the server is up, for example google.com or 74.125.67.100; the Socket doesn’t care which, as long as it’s a server it can connect to. The port is resolved in order: if you specify a port number, that will be used; if not, the Socket attempts to extract a port number from the host (for example, if you used google.com:80); failing that, it uses the last port number specified (default of 80).
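The fallback order for host and port can be sketched as a small standalone function. The name resolveEndpoint and its parameters are hypothetical, and the real connect() may well differ in the details; the sketch also assumes plain hostnames or IPv4 addresses, not IPv6 literals.

```php
<?php
// Hypothetical helper mirroring the host/port fallback order described above.
function resolveEndpoint($host = null, $port = null, $lastHost = null, $lastPort = 80) {
    if ($host === null) {
        // No host given: reuse the previous connection's host.
        $host = $lastHost;
    }
    // "google.com:80" style: split a trailing :port off the host.
    if ($host !== null && preg_match('/^(.+):(\d+)$/', $host, $m)) {
        $host = $m[1];
        if ($port === null) {
            $port = (int) $m[2]; // only used when no explicit port was passed
        }
    }
    if ($port === null) {
        // Nothing explicit and nothing embedded: fall back to the last port.
        $port = $lastPort;
    }
    return array($host, $port);
}

$explicit = resolveEndpoint("google.com", 8080);
$embedded = resolveEndpoint("google.com:81");
$fallback = resolveEndpoint("google.com");
```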
Once connected, you can use the send() and get() functions to — you guessed it — send() and get() data. The send function takes a string parameter, the message to be sent, sends the message through socket_write, and returns the value returned by socket_write (the number of bytes successfully sent through the socket). The get function takes an array parameter to specify fetching options: line (boolean) and length (integer). If line is set to true, then get will fetch one character at a time until a newline (\n) is encountered and return the line along with any trailing carriage returns (\r) and newlines (\n). If length is specified as greater than 0, get will fetch one character at a time until the specified number of characters have been read. If neither is specified, the function will read and return the next 512 characters. Of course, all cases will terminate if the end of data has been reached.
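To make the three fetching modes concrete, here is a rough stand-in for get() that reads from any one-character-at-a-time source. The names fetchFrom and stringReader are hypothetical, the string-backed reader exists only so the sketch runs without a network connection, and letting line take precedence when both options are given is my assumption.

```php
<?php
// Hypothetical stand-in for get(): $readChar returns one character,
// or "" once the end of the data has been reached.
function fetchFrom($readChar, $opts = array()) {
    $line   = !empty($opts["line"]);
    $length = isset($opts["length"]) ? (int) $opts["length"] : 0;
    if (!$line && $length <= 0) {
        $length = 512; // neither option given: read the next 512 characters
    }
    $out = "";
    while (true) {
        $c = call_user_func($readChar);
        if ($c === "" || $c === false) {
            break; // end of data
        }
        $out .= $c;
        if ($line && $c === "\n") {
            break; // line mode: stop after the newline, keeping any \r\n
        }
        if (!$line && strlen($out) >= $length) {
            break; // length mode: enough characters have been read
        }
    }
    return $out;
}

// A string-backed reader that hands out one character per call.
function stringReader($data) {
    $pos = 0;
    return function () use ($data, &$pos) {
        return $pos < strlen($data) ? $data[$pos++] : "";
    };
}

$read  = stringReader("HTTP/1.1 200 OK\r\nServer: demo\r\n\r\n");
$first = fetchFrom($read, array("line" => true));  // the status line, with \r\n
$next  = fetchFrom($read, array("length" => 6));   // the next six characters
```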
As a quick example, this portion uses the previously created socket to download the HTTP response headers from google.com’s homepage:
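if ($s->connect("www.google.com", 80)) {
    $s->setVerbose(false);
    // send() appears to add the line ending itself (hence the empty send below)
    $s->send("GET / HTTP/1.1");
    $s->send("Host: www.google.com");
    $s->send(""); // blank line ends the request headers
    $response = "";
    // Read line by line until the blank line that ends the response headers
    while (strpos($response, "\r\n\r\n") === false) {
        $line = $s->get(array("line" => true));
        if ($line === "" || $line === false) {
            break; // end of data (assumed empty return): don't loop forever
        }
        $response .= $line;
    }
}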
The Socket supports logging and verbosity functions. The setLog function takes a boolean (true/false for logging) and a string (filename to log to). The setVerbose function takes only a boolean (true/false — default false — for printing out extra information to the console).
Finally, when you’re finished with the connection, you can call the disconnect() function to cleanly sever the connection.
webpage()
The webpage function takes three parameters: a url, a headers array, and a “headers only” flag. The url is simple enough; this is the full url that you want to fetch, including both the host name and the file path (and, optionally, a port number after the hostname). The headers array goes two ways; you can pass in headers to send and you’ll get back any headers received. The headers only flag specifies whether or not to fetch the content of the page; if not, it saves the time of downloading that content, and leaves you with only the response headers.
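Splitting a url into the host, port, and path that a socket-based fetch needs can be done with PHP’s built-in parse_url. The helper below is a sketch under my own naming (splitUrl is hypothetical) and is not necessarily how webpage.php does it.

```php
<?php
// Hypothetical helper: split a URL into host, port, and request path.
function splitUrl($url) {
    $parts = parse_url($url);
    if ($parts === false || !isset($parts["host"])) {
        return false; // not a usable absolute URL
    }
    $path = isset($parts["path"]) ? $parts["path"] : "/";
    if (isset($parts["query"])) {
        $path .= "?" . $parts["query"]; // keep the query string in the request
    }
    return array(
        "host" => $parts["host"],
        "port" => isset($parts["port"]) ? $parts["port"] : 80,
        "path" => $path,
    );
}

$target = splitUrl("http://example.com:8080/feed.php?page=2");
```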
The headers parameter is two-way, as I mentioned. You can pass in an associative array of header-name => header-value pairs, which will be formatted properly and sent to the server along with the HTTP request.
$h = array(
    "User-Agent" => "PHP-webscraper/1.0 php/5",
    "Authorization" => "Basic ".base64_encode("username:password")
);
$page = webpage("http://twitter.com/statuses/mentions.json", $h);
The above sends two headers to the Twitter API, User-Agent for identification and Authorization to log in, fetching any tweets that mention the user name “username” (using the password “password”). The response will be in JSON and will require extra processing, but this will get you the data. For more information on HTTP headers, check out the HTTP 1.1 RFC (RFC 2616), section 14.
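As for what “formatted properly” means on the wire: each header becomes a Name: value line ending in \r\n, and an empty line closes the header block. Here is a minimal sketch, assuming a plain GET request; buildRequest is a hypothetical name, not a function from webpage.php.

```php
<?php
// Hypothetical helper: turn name => value pairs into a raw HTTP/1.1 GET request.
function buildRequest($host, $path, $headers = array()) {
    $lines = array("GET $path HTTP/1.1", "Host: $host");
    foreach ($headers as $name => $value) {
        $lines[] = "$name: $value";
    }
    // Each line ends in CRLF; the extra CRLF is the empty line that
    // terminates the header block.
    return implode("\r\n", $lines) . "\r\n\r\n";
}

$request = buildRequest("example.com", "/", array(
    "User-Agent"    => "PHP-webscraper/1.0 php/5",
    "Authorization" => "Basic " . base64_encode("username:password"),
));
```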
Once the page is received from the server, the headers array you passed in will be overwritten by the server’s response headers, which you can then examine with external code. With this method, you can also pass in a simple empty array, which will then be populated with the response headers.
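Going the other way, a raw response can be split into a status line, a header array, and a body. This is only a sketch of the idea with a hypothetical parseResponse helper; note that it keeps just the last value when a header repeats (Set-Cookie, for instance), which may or may not match what webpage() does.

```php
<?php
// Hypothetical helper: split a raw HTTP response into status, headers, and body.
function parseResponse($raw) {
    // Headers and body are separated by the first empty line.
    $parts  = explode("\r\n\r\n", $raw, 2);
    $lines  = explode("\r\n", $parts[0]);
    $status = array_shift($lines); // e.g. "HTTP/1.1 200 OK"
    $headers = array();
    foreach ($lines as $line) {
        $pair = explode(":", $line, 2);
        if (count($pair) === 2) {
            $headers[trim($pair[0])] = trim($pair[1]);
        }
    }
    return array(
        "status"  => $status,
        "headers" => $headers,
        "body"    => isset($parts[1]) ? $parts[1] : "",
    );
}

$parsed = parseResponse(
    "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nServer: demo\r\n\r\n<html></html>"
);
```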
The third parameter to the webpage function is simple enough. Pass true if you want to receive only the server’s response headers, cutting off the response once those have been received; in that case the function returns an empty string. Pass false (or no third parameter) if you wish to receive the entire page, the contents of which will be returned as a string.
With these two files (and, of course, the PHP constructs they define), you can download any webpage you would have access to with your browser, provided you are able to use sockets on your server. Need to scrape some data from a page? Download that page and apply some regexes to the content. HTTP Auth required? No problem, send an extra header with your creds. The function could also be modified to POST data to a page, or even download and save images and other media; however, those functions will be left as an exercise for the reader.
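As a head start on the POST exercise, here is roughly what a form-encoded POST request looks like when built by hand. Treat it as a sketch only: buildPostRequest is a hypothetical helper, and the Connection: close header is my own addition so the server closes the socket when it is done.

```php
<?php
// Hypothetical helper: build a raw form-encoded HTTP/1.1 POST request.
function buildPostRequest($host, $path, $fields, $headers = array()) {
    // Encode the fields as name=value pairs joined by "&".
    $body = http_build_query($fields, "", "&");
    $headers["Content-Type"]   = "application/x-www-form-urlencoded";
    $headers["Content-Length"] = strlen($body); // the server needs the body size
    $headers["Connection"]     = "close";
    $lines = array("POST $path HTTP/1.1", "Host: $host");
    foreach ($headers as $name => $value) {
        $lines[] = "$name: $value";
    }
    // Header block, empty line, then the encoded body.
    return implode("\r\n", $lines) . "\r\n\r\n" . $body;
}

$post = buildPostRequest("example.com", "/login.php", array(
    "user" => "me",
    "note" => "a b",
));
```

The resulting string could then be passed to the Socket’s send() in place of the GET request.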