How many users does DIGG have?
When John Graham-Cumming asked “How Many Users Does Digg Have?”, there were a few things he couldn’t tell you, because his data consisted of randomly self-sampled users. With the power of two PHP scripts, we can pull large amounts of user data and query it ourselves. Our first question: how has Digg grown over time?

A graph of 187,054 digg users, randomly plotted against when they joined
This doesn’t tell us much about how many Digg users there actually are, or how active they are, so I plotted a histogram of the number of times these roughly 187,000 users’ profiles had been viewed. The answer, unsurprisingly, is not very often in most cases:

83% of users had fewer than 50 profile views
And what about users who are active? How many people are digging stories every day? The answer is very few. I drew a random subsample of 29,225 users from the previous sample and used the Digg API to query for each user’s last digg. It turns out 31% (9,125) had never dugg anything! After I removed those, here is the histogram I got:

About 15% of Digg users dugg a story in the last week
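The bucketing behind these activity figures can be sketched as follows. This is only an illustration: the function name `activity_summary`, the reference date, and the four-user sample are made up, and only the 9,125 of 29,225 figure comes from the post.

```python
from datetime import datetime, timedelta

def activity_summary(last_diggs, now, window_days=7):
    """Summarize activity from a list of last-digg times (None = never dugg)."""
    total = len(last_diggs)
    never = sum(1 for t in last_diggs if t is None)
    cutoff = now - timedelta(days=window_days)
    recent = sum(1 for t in last_diggs if t is not None and t >= cutoff)
    diggers = total - never
    return {
        "never_share": never / total,
        "recent_share_of_all": recent / total,
        "recent_share_of_diggers": recent / diggers if diggers else 0.0,
    }

# Figure from the post: 9,125 of the 29,225 sampled users had never dugg.
never_share = 9125 / 29225

# Made-up sample: one user who never dugg, three with different last-digg ages.
now = datetime(2007, 11, 1)  # hypothetical reference date
sample = [
    None,
    now - timedelta(days=1),
    now - timedelta(days=30),
    now - timedelta(days=400),
]
summary = activity_summary(sample, now)
print(f"never dugg in the post's sample: {never_share:.1%}")
print(summary)
```

Reporting the recent share both against all users and against users who have dugg at least once makes it explicit which denominator a percentage refers to.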
Concluding thoughts
Digg boasts an official tally of 2.2M users, but at most 20% of them can be considered real, active users. That would bring their user count down to 440,000, far fewer than a popular web 2.0 boom child would like to boast about, and it significantly hurts the $300M valuation (about $700 per active user) that they keep trying to get.
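The arithmetic behind that estimate is easy to check. A minimal sketch, using only the numbers quoted above (the variable names are mine):

```python
official_users = 2_200_000   # Digg's official tally
active_percent = 20          # upper bound on real, active users
valuation = 300_000_000      # dollars

# Integer arithmetic avoids floating-point noise in the user count.
active_users = official_users * active_percent // 100
per_active_user = valuation / active_users
per_registered_user = valuation / official_users

print(f"active users:            {active_users:,}")
print(f"value per active user:   ${per_active_user:,.0f}")
print(f"value per official user: ${per_registered_user:,.0f}")
```

The per-user figure of roughly $700 only holds against the 440,000 active users; against the official tally it is closer to $136.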
Code Appendix
The {digg user, time joined, digg id, profile page views} information was gathered by the following script:
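<?php
// Sample random pages of Digg users and print one CSV row per user.
error_reporting(E_ALL);
ini_set('user_agent', 'My-Application/2.5');
ini_set('include_path', '.:/usr/share/pear');
// Load the PEAR Services_Digg classes before unserializing any response.
require_once 'Services/Digg.php';
require_once 'Services/Digg/Response/php.php';

$base = 'http://services.digg.com/users/?appkey=http://example.com&type=php';

// A request with count=0 is enough to learn the total number of users.
$data  = unserialize(file_get_contents($base . '&count=0'));
$total = $data->total;
echo "There are $total total users\n";
echo "ID,Number,Name,Date,Views\n";

for ($i = 0; $i < 1000; $i++) {
    // Pick a random page of 100 users.
    $offset = rand(0, $total - 100);
    $raw    = @file_get_contents($base . '&count=100&offset=' . $offset);
    if ($raw === false) {
        continue; // request failed; try another offset
    }
    $data = unserialize($raw);
    $j    = 0;
    foreach ($data->users as $user) {
        // Position in the full user list; advance it even if the profile fetch fails.
        $number = $offset + $j++;
        // Scrape the numeric user id from the profile page.
        $page = @file_get_contents('http://digg.com/users/' . $user->name . '/');
        if (!$page || !preg_match('/id="userid" value="(\d+)"/i', $page, $matches)) {
            continue;
        }
        echo $matches[1] . ',';
        echo $number . ',';
        echo $user->name . ',';
        echo $user->registered . ',';
        echo $user->profileviews . "\n";
    }
}
?>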
Ruby vs PHP Performance Revisited
Ignoring any of Hongli Lai’s actual code, I reran the PHP, Ruby, C++, Perl, and Python mergesort benchmarks he gave, and came up with substantially different results. Here are the versions of the programming languages I am using for the test:
- PHP - PHP 5.1.6 (cli) (built: Sep 18 2007 09:07:28)
- Ruby - ruby 1.8.5 (2007-09-24 patchlevel 114) [x86_64-linux]
- Perl - This is perl, v5.8.8 built for x86_64-linux-thread-multi
- Python - Python 2.4.4 (#1, Oct 23 2006, 13:58:18)
- C++ - gcc version 4.1.2 20070626 (Red Hat 4.1.2-13)
- Java - Java(TM) SE Runtime Environment (build 1.6.0_10-ea-b10)
You’ll notice I’m adding Java into the mix for fun. Here are the results, over 10 runs, on the Intel dual-core 1.80GHz machine with 2GB of RAM that currently runs this website:

Lang     Average (s)   Min (s)   Max (s)
PHP      8.8325        8.637     9.303
Ruby     7.2896        7.143     7.729
Perl     4.3231        4.262     4.428
Python   3.3465        3.289     3.417
C++      0.5638        0.53      0.609
Java     0.4062        0.262     0.551
There are a couple of important conclusions to note here that are significantly different from Hongli Lai’s:
- PHP is 21% slower than Ruby, not 41% as in his benchmark
- Python is 29% faster than Perl, not 17% as in his benchmark
- Java runs this 39% faster than C++, and 2100% faster than PHP
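These percentages follow from the average run times in the table, reading “X% faster” as the ratio of the slower average to the faster one, minus one. A small sketch of that calculation (the helper name `percent_faster` is mine):

```python
# Average run times from the results table above.
averages = {
    "PHP": 8.8325,
    "Ruby": 7.2896,
    "Perl": 4.3231,
    "Python": 3.3465,
    "C++": 0.5638,
    "Java": 0.4062,
}

def percent_faster(fast, slow):
    """How much longer `slow` takes than `fast`, as a percentage."""
    return (averages[slow] / averages[fast] - 1) * 100

for fast, slow in [("Ruby", "PHP"), ("Python", "Perl"), ("Java", "C++"), ("Java", "PHP")]:
    print(f"{fast} vs {slow}: {percent_faster(fast, slow):.0f}%")
```

By this reading the Java versus PHP figure comes out near 2074%, which the list rounds to 2100%.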
So, PHP is slower than Ruby, but not quite as slow as Hongli Lai would have you believe. Python is the fastest scripting language in this benchmark, while Java is the fastest language all around, and is incredibly, incredibly fast. Maybe all of our code should start using Java!
* NOTE: I am ignoring the obvious deficiencies of this micro-benchmark and just trying to reproduce it. What I’ve found is that there are significant discrepancies between Hongli Lai’s run of the tests and my own, probably owing to slightly different versions of the components involved. Also, if I make some trivial optimizations to the loops in the PHP script, I can get it to run faster than every other scripting language, in about 2.4s. Then again, just calling sort() is faster by another two orders of magnitude… but still only half as fast as Java’s built-in sort… and two orders of magnitude slower than Perl’s built-in.
Benchmarking Wordpress with Apache Bench
A lot of people talk about Wordpress performance, and how to get a webserver to perform as efficiently as possible. However, without a quantifiable methodology for testing website performance, you can’t actually talk about it. ApacheBench (ab) is the solution to the problem of measuring website performance. What is ApacheBench? The man page provides a suitable answer:
ab - Apache HTTP server benchmarking tool
ab is a tool for benchmarking your Apache Hypertext Transfer Protocol (HTTP) server. It is designed to give you an impression of how your current Apache installation performs. This especially shows you how many requests per second your Apache installation is capable of serving.
If you have installed apache or apache-devel, you should be able to invoke ab simply by typing it on the command line. For example, to benchmark my own site here, I would write:
[root ~]# ab -n 10000 -c 100 http://elliottback.com/wp/
This says “make 10,000 requests in total for /wp/ on host elliottback.com over HTTP, keeping 100 requests in flight at a time.” The result of this is the following report:
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking elliottback.com (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Finished 10000 requests

Server Software: Apache/2.2.6
Server Hostname: elliottback.com
Server Port: 80

Document Path: /wp/
Document Length: 34331 bytes

Concurrency Level: 100
Time taken for tests: 13.596345 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 346230000 bytes
HTML transferred: 343310000 bytes
Requests per second: 735.49 [#/sec] (mean)
Time per request: 135.963 [ms] (mean)
Time per request: 1.360 [ms] (mean, across all concurrent requests)
Transfer rate: 24868.08 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0     0    1.6      0    20
Processing:     8   134   12.7    132   190
Waiting:        4   134   12.7    132   190
Total:         16   134   12.1    132   190

Percentage of the requests served within a certain time (ms)
50% 132
66% 134
75% 136
80% 137
90% 145
95% 160
98% 175
99% 179
100% 190 (longest request)
According to these numbers, my dual-core server can handle about 735 requests per second at a concurrency of 100, fulfilling a typical request in about 135ms. That’s pretty fast, probably because I know the secrets of Wordpress Optimization. If you make every layer as fast as it can be, and cache heavily, you too can see lightning-fast Wordpress installations!
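The headline figures in the report are tied together by simple arithmetic on the request count, the concurrency level, and the total time. The sketch below recomputes them from the values printed above; the variable names are mine, not ab’s.

```python
requests = 10_000            # -n: total number of requests
concurrency = 100            # -c: requests in flight at a time
total_seconds = 13.596345    # "Time taken for tests"

# Throughput: completed requests divided by wall-clock time.
requests_per_second = requests / total_seconds

# Mean latency seen by one client slot: each of the 100 slots
# handled requests / concurrency requests back to back.
ms_per_request = concurrency * total_seconds * 1000 / requests

# Mean time per request across all concurrent slots: the inverse of throughput.
ms_per_request_overall = total_seconds * 1000 / requests

print(f"{requests_per_second:.2f} requests/sec")
print(f"{ms_per_request:.3f} ms per request (mean)")
print(f"{ms_per_request_overall:.3f} ms per request (mean, across all concurrent requests)")
```

The two “Time per request” lines therefore differ by exactly the concurrency factor of 100.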
Mark Cuban’s P2P Ideas Suck
In a three-part rant about peer-to-peer technologies (1, 2, 3), Mark Cuban demands that peer-to-peer technologies “die a quick death” in order to “speed up [his own] internet connection.” He suggests that “Google Video is a far better solution for audio and video distribution than any P2P solution” and that cable companies “charge for upstream bandwidth usage.”
Guess what–I already get charged for all the bandwidth I use, either up or down. When Verizon strings a fiberoptic cable to my home, I’m getting a certain amount of fixed capacity into the greater internet at large. If I want to trade a little upstream capacity for greater downstream capacity, that’s my call! Have you ever noticed that downloading over http is typically slow because there are 100s of clients and 1 host? If I download the same information over bittorrent, I can sustain 12Mbs because everyone is a server–including me. Distributed protocols, such as the ones powering Amazon Dynamo or bittorrent, are more efficient, cost effective, and fault tolerant than single-server models.
Reactions around the blogosphere indicate that Mark Cuban’s thoughts on P2P are nonsensical rubbish. Mashable calls him “a guy who does not understand how P2P works, and yet he wants it shut down.” Ars Technica notes that “if users who are currently saturating their connections with BitTorrent start saturating their connections with Google Video content, the end result is more or less the same.” And a slashdotter comments, “Just imagine how fast the internet would be if there were no content to view. After P2Ps gone, get rid of all these freeloading websites, emails, etc. and it will be blisteringly fast.”
My guess is that billionaire Mark Cuban has a slow, shared cable internet connection at home, the modern equivalent of a party line. This might lead him to confuse his own slow internet connection with a greater systemic problem. What he should be complaining about is why Verizon hasn’t strung fiber in his area yet.
Bloglines Outage
Trying to read my feeds I get some nice 500 errors from Bloglines:
Internal Server Error
The server encountered an internal error or misconfiguration and was unable to complete your request.
Please contact the server administrator, webmaster@bloglines.com and inform them of the time the error occurred, and anything you might have done that may have caused the error.
More information about this error may be available in the server error log.
Apache/2.2.5-dev (Unix) mod_ssl/2.2.5-dev OpenSSL/0.9.7a Server at www.bloglines.com Port 80
This kind of error is interesting because while Bloglines’ home page is up and working, their service is not, and that is very hard for external monitoring tools like Pingdom to detect without the cooperation of the web service. If a standard for open web 2.0 services is ever created, it should include an interface for querying which parts of a service are up and which are down. It could be as simple as a ping, or as complex as a list of components and statuses. Just fire off a request to api.example.com/ping and get back “up” or “down.” You could use api.example.com/uptime for information about uptime and api.example.com/status for more detailed information.
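As a rough sketch of what such an interface might look like, here is a tiny request router for the three paths mentioned above. Everything beyond the paths themselves is an assumption: the component names, the JSON bodies for /uptime and /status, and the choice to answer /ping with “down” whenever any component is down.

```python
import json

def handle(path, components, uptime_seconds):
    """Return (http_status, body) for a hypothetical status API."""
    if path == "/ping":
        # One word answer: "up" only if every component is up.
        all_up = all(state == "up" for state in components.values())
        return 200, ("up" if all_up else "down")
    if path == "/uptime":
        return 200, json.dumps({"uptime_seconds": uptime_seconds})
    if path == "/status":
        # Detailed view: one entry per component.
        return 200, json.dumps(components, sort_keys=True)
    return 404, "not found"

# Hypothetical state matching the outage described above:
# the home page works, the feed reader does not.
components = {"homepage": "up", "feed_reader": "down"}

ping_code, ping_body = handle("/ping", components, 3600)
status_code, status_body = handle("/status", components, 3600)
print(ping_body)
print(status_body)
```

A monitoring tool that only fetched the home page would see a healthy site here, while a single request to /ping would report the outage.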
