Elliott C. Back: Technology FTW!

Social Networking Uptime

Posted in Facebook, Google, Microsoft, Uptime by Elliott Back on February 26th, 2008.

My favorite blog in the world has a revealing post about the year-to-date downtime of various social networks. Not a single one achieves the famous “five nines” uptime SLA (although Amazon’s S3 service offers a three nines 99.9% uptime guarantee).

social-networking-uptime.png
Yahoo 360 isn’t a real social network, but it had great uptime

The best of these is Yahoo 360, with 99.9938% uptime over the last two months, followed by MySpace (99.969%), Facebook (99.8822%), LinkedIn (99.7024%), and finally Windows Live (99.4482%). MySpace was only down for 25m, while MSN Live Spaces had an embarrassing 7hrs 25m of downtime.
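To see how those percentages turn into minutes, here is a minimal sketch of the arithmetic. It assumes "year to date" means the 56 full days from January 1 up to the post date (February 26, 2008); the function name is my own, not anything from Pingdom.

```python
from datetime import date


def downtime_minutes(uptime_percent: float, period_days: float) -> float:
    """Minutes of downtime implied by an uptime percentage over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


# Assumption: the measurement window is January 1 up to February 26, 2008.
PERIOD_DAYS = (date(2008, 2, 26) - date(2008, 1, 1)).days  # 56 days

# Uptime figures as quoted in the post.
UPTIMES = {
    "Yahoo 360": 99.9938,
    "MySpace": 99.969,
    "Facebook": 99.8822,
    "LinkedIn": 99.7024,
    "Windows Live": 99.4482,
}

for name, pct in UPTIMES.items():
    minutes = downtime_minutes(pct, PERIOD_DAYS)
    print(f"{name:<13} {pct:8.4f}%  about {minutes:6.1f} min down")
```

Under that 56-day assumption, MySpace comes out at roughly 25 minutes and Windows Live at roughly 445 minutes (about 7hrs 25m), which lines up with the figures quoted above.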

Online Apple Store Status (Realtime)

Posted in Apple, Uptime by Elliott Back on February 24th, 2008.

This is cool. Uptime monitoring company Pingdom has released a tool and widget for monitoring the Apple store:

Now you don’t have to wait for blogs to post about when the Apple store is down; you can just check their awesome service. Apple is interesting in that they take their online store offline to update it with new products; in that regard, it’s cutely old-fashioned. This year, changes in the Apple store brought us the new iPhones, the MacBook Air, and new Shuffle prices.

Bloglines Outage

Posted in Scalability, Uptime by Elliott Back on August 4th, 2007.

Trying to read my feeds I get some nice 500 errors from Bloglines:

Internal Server Error

The server encountered an internal error or misconfiguration and was unable to complete your request.

Please contact the server administrator, webmaster@bloglines.com and inform them of the time the error occurred, and anything you might have done that may have caused the error.

More information about this error may be available in the server error log.
Apache/2.2.5-dev (Unix) mod_ssl/2.2.5-dev OpenSSL/0.9.7a Server at www.bloglines.com Port 80

This kind of error is interesting because while Bloglines’ home page is up and working, their service is not, and that’s something very hard for monitoring tools like Pingdom to detect without the cooperation of the web service. If there’s ever a standard created for open web 2.0 services, it should include an interface for querying which parts of a service are up and which are down. It could be as simple as a ping, or as complex as a list of components and statuses. Just fire off a request to api.example.com/ping and get back “up” or “down.” You could use api.example.com/uptime for information about uptime and api.example.com/status for more detailed information.
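Here is a rough sketch of what such an interface could look like, using only Python's standard library. Everything in it is hypothetical: the component names, the hard-coded health table, and the response format are my own illustration, not any existing standard. The /uptime endpoint is left out because it would need historical data.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical component health table; a real service would run actual checks.
COMPONENTS = {"homepage": True, "feed-reader": False, "api": True}


def respond(path: str, components: dict) -> tuple:
    """Map a request path to (HTTP status, content type, body)."""
    if path == "/ping":
        # The whole service counts as "up" only if every component is up.
        return 200, "text/plain", "up" if all(components.values()) else "down"
    if path == "/status":
        detail = {name: "up" if ok else "down" for name, ok in components.items()}
        return 200, "application/json", json.dumps(detail)
    return 404, "text/plain", "not found"


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, content_type, body = respond(self.path, COMPONENTS)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Run the status endpoint locally until interrupted."""
    HTTPServer(("127.0.0.1", port), StatusHandler).serve_forever()
```

Calling serve() and requesting /ping would return "down" with the table above, because the feed reader is marked as failing even though the homepage is fine. That is exactly the Bloglines situation an outside monitor can't see on its own.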

AT&T iPhone EDGE Down

Posted in Apple, Cellphone, Uptime, iPhone by Elliott Back on July 3rd, 2007.

AT&T is getting overwhelmed! They just can’t keep their data network online, according to reports from Howard Forums:

Reports coming in on MacRumors that EDGE is down on iPhones… I can confirm, I haven’t been able to get any EDGE data all morning. I’m in Tucson, AZ. Users from Philadelphia, LA, Sacramento, and Washington have all reported the same thing. iPhone users - can you connect and get data via EDGE? (Simple test is to turn WiFi off, and open the weather widget - does it refresh?).

Another post suggests that there’s been almost half a day of downtime:

I have the cingular 8525, and both the EDGE/3G connectivity was down since morning today(07/02) in the san francisco bay area. I had called tech support and he made me do all sorts of things like take the battery out and put it back blah, before telling me that he had to make sure I haven’t done something to the settings, and that there is a “nationwide edge outage”. anyhoo, it seems to be back up now (about 4:00pm on the west coast).

When people are saying things like “Edge is down in Honolulu Hawaii since 9:45 am Hawaii this morning.” you have to wonder if AT&T is the right partner for Apple. I say not.

Dreamhost Sucks At Hosting

Posted in Hosting, Performance, Scalability, Uptime by Elliott Back 4 days, 18 hours ago.

I’ve concluded that Dreamhost sucks phenomenally at hosting websites that generate any kind of traffic. Sure, their $9.99 a month plans with massive savings coupons are enticing, but if you knew what you were getting yourself into, you’d stay away. Dreamhost sucks like you’d want to suck on a knife covered in chocolate–which isn’t very much.

hellhost.jpg

Others tell the tale better than I can:

There are also two unbelievable sites which actually claim that Dreamhost doesn’t suck. Well, if you read the above articles, you’d understand that Nightmarehost is really a bad dream.

So far not a single blog has explained at a high technical level why Dreamhost can’t handle their customers. I’ve seen some vague hand-waving about overselling, but no one actually has numbers to back it up. Sure, when someone tells me it takes them 10 minutes to go from SSH login prompt to terminal I believe them, but it’s not good enough. We’re making serious accusations about quality of service; we had better be able to back it up. We need hard data.

I have a shared hosting account on one of their machines, sepulveda.dreamhost.com [205.196.222.24]. The ping is fairly responsive, but not exceptional. They get their bandwidth directly from Level3, so it’s good bandwidth:

%ping -n 100 sepulveda.dreamhost.com
Minimum = 77ms, Maximum = 116ms, Average = 87ms

Unfortunately, there are about 1200 users on my machine. I’ve seen industry guidelines that recommend far, far fewer than that, anywhere from a quarter to a tenth as many for shared hosting services:

[sepulveda]$ cat /etc/passwd | wc -l
1199
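One caveat on that number: wc -l counts every line in /etc/passwd, system accounts included, so the customer count is a little lower than 1199. A minimal sketch of separating the two, assuming regular accounts start at UID 1000 (a common convention, but it varies by distribution, and I don't know what Dreamhost uses). The sample data is made up.

```python
def count_accounts(passwd_text: str, min_uid: int = 1000) -> tuple:
    """Return (total entries, entries with UID >= min_uid) for passwd-format text."""
    total = 0
    regular = 0
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # passwd format: name:password:UID:GID:GECOS:home:shell
        fields = line.split(":")
        if len(fields) < 7 or not fields[2].isdigit():
            continue  # skip malformed lines
        total += 1
        if int(fields[2]) >= min_uid:
            regular += 1
    return total, regular


# Made-up sample, not taken from sepulveda.
SAMPLE = """\
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
alice:x:1000:1000:Alice:/home/alice:/bin/bash
bob:x:1001:1001:Bob:/home/bob:/bin/bash
"""

total, regular = count_accounts(SAMPLE)
print(f"{total} entries, {regular} regular accounts")
```

On a real machine you would pass in the contents of /etc/passwd instead of the sample string. Either way, the handful of system accounts doesn't change the picture much when the total is around 1200.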

The machine itself appears to be a single dual-core Opteron with 4GB of RAM, which isn’t hefty by any means. It should be dual dual-core and have 16GB of RAM to be at all useful. Besides, RAM is cheap–if they did put in more RAM maybe they could realistically handle 1-2k users per machine! Here’s the proc info:

/proc/meminfo:

        total:    used:    free:  shared: buffers:  cached:
Mem:  4172861440 3993358336 179503104        0 24576000 2158034944
Swap: 6465036288 326541312 6138494976
MemTotal:      4075060 kB
MemFree:        175296 kB
MemShared:           0 kB
Buffers:         24000 kB
Cached:        2033020 kB
SwapCached:      74436 kB
Active:         810260 kB
Inactive:      1321512 kB
HighTotal:     3211200 kB
HighFree:        34404 kB
LowTotal:       863860 kB
LowFree:        140892 kB
SwapTotal:     6313512 kB
SwapFree:      5994624 kB

/proc/cpuinfo:

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 15
model		: 35
model name	: Dual Core AMD Opteron(tm) Processor 175
stepping	: 2
cpu MHz		: 2194.592
cache size	: 1024 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall mmxext lm 3dnowext 3dnow pni
bogomips	: 4377.80

processor	: 1
vendor_id	: AuthenticAMD
cpu family	: 15
model		: 35
model name	: Dual Core AMD Opteron(tm) Processor 175
stepping	: 2
cpu MHz		: 2194.592
cache size	: 1024 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall mmxext lm 3dnowext 3dnow pni
bogomips	: 4377.80

It’s funny that a guy actually monitored his site with Pingdom for ten days. He writes in a comment on the Dreamhost status blog:

Sorry, guys, but your service is simply terrible.
Today, there were just 83.11% of uptime (3h23min offline until now!) - data obtained from Pingdom.com.
Since 4/9, there were just one day with 100% uptime.
In the sum of last ten days, I just had more than 9 hours of downtime (while my other servers had no more than 15 minutes).
Every day I see server problems in my server.
That is unacceptable!!!

Dreamhost and I have been having conversations for a while now about a site which gets 1-2k visitors and transfers about 51GB of static content a day. I thought you might be interested in reading them. On 4/30/2007, I received this email from Brian S. about my site:

Connections to your domain ( static.imgfly.com ) crashed the shared apache service several times this morning. A connection limit has been placed on your site. Being on a shared server means you need to share the resources with other customers. Due to the heavy volume of traffic, other domains on the same service were not able to load. Once the traffic to your site has taperd off, we will gladly remove the connection limit. Please read the appropriate section of our Terms of Service and let us know if you have any questions.

dreamhost.com/tos.html

My only thought was dismal woe. If they don’t know how to configure Apache with the right connection and threading settings so it won’t crash, all is lost. Also, some of my DNS settings (I forget which exactly) were mangled in their blocking process, so I sent them a snarky email:

Unfortunately, you did far more than place a connection limit. You also edited my dns settings without my consent, which is strictly against any kind of ethical hosting policy. I had a CNAME on feifei.us pointing * to the domain so that subdomains would work; now it has mysteriously disappeared.

My account is advertised at having ~2.9TB of bandwidth a month; I pointed 50GB/day of static hosted content at my dreamhost account, which is about half of what I’m allotted, and you screw up my domain settings? You can’t handle half of what you promise?

I want to know exactly what kind of connection limit you’ve placed on my site, at a high technical level. I want to know why you can’t deliver even half of the bandwidth allotted to my account. This isn’t a high-level of bandwidth, it’s just a constant level of about 50GB a day, and it’s not even dynamic content, just static files.

At that time I was frantically re-routing and re-setting up the site, because all the DNS settings were lost, even ones pointing offsite and not at Dreamhost servers. They replied quite fairly to the email, I have to give them credit:

The CNAME was not removed as part of the connection limit. I’m sorry that it may have disappeared, but this was not as a result of the limit put in place and was instead likely a bug in the system. You’re free to put it back if you’d like, but let me know if you don’t notice the wildcard working and I’ll get it set up for you again. I’m sorry that you believe we removed it, but we did not. The domain you mentioned was not one that was touched.

Bandwidth and connections are two separate issues. While you can definitely use all of the bandwidth that we offer, the number of connections per second and concurrent connections that your site was receiving was causing the Apache service to crash. If we didn’t put the limit in place you wouldn’t be able to use any of the bandwidth we provide as your site would have continued to crash the service. This would prevent yours and other customer’s websites from appearing online.

Unfortunately you’ve since removed hosting for this so I can’t provide you any sort of information about the limit that was put in place. I’ve explained the bandwidth issue above. Please don’t hesitate to write back if you need help with anything else.

Of course, I don’t buy the bandwidth explanation. They would have to allow me enough connections to actually constantly use roughly 10Mbps for me to use it up. With 1200 customers on their server, even with a gigabit ethernet card they couldn’t fulfill their contracts if everyone used all their bandwidth. Some overselling is, of course, necessary and acceptable. Dreamhost goes a little overboard. The next day I noticed the new arrangement of sites I had set up wasn’t working well, so I sent them a quick checkup email:

There’s some serious issues with this site; it took 8 connection tries to even connect to the server.

The reply I got back was shocking, even to me (emphasis my own):

Your domain imgfly-static.feifei.us is again causing serious problems on the sepulveda webserver. The number of requests coming in is almost instantly crashing the apache instance. This is unacceptable, and since instead of working with us you simply moved your problematic site to another domain I’m going to ask you now to stop running whatever you’re currently hosting on imgfly-static.feifei.us immediately, and to not start anything like it on any user, domain, or server of ours, ever again.

I’ve disabled imgfly-static.feifei.us to preserve the stability of the server, please do not re-enable it.

I only moved my domains around because (a) I wanted them that way in the first place, and (b) the DNS disappeared at some point. I had thought their connection limit was account-wide, but apparently my fixing the DNS also ruined their limiting. Today, even though they suck, I sent them a nice, long, clearly written, conciliatory email. I really do want to get my 3TB of bandwidth out of them, and if it takes some sucking up, so be it:

I feel like we’re misunderstanding each other–I’m not trying to subvert your sepulveda cluster, and I am trying to work with you. Last time there was an issue, either my CNAME or A record for the subdomain somehow got lost in the dns, so I added a wildcard CNAME to feifei.us. I was under the impression that connection limits were placed on the account to prevent it from affecting sepulveda, but whatever change I made must have invalidated that.

I’ve just arrived home from work and I’ve pointed the stream of traffic you can’t handle elsewhere until we can work out the configuration. I’ve enabled the subdomain again, but there won’t be anything running on it until I get the ok.

Let me explain the software solution I’m running. The domain ImgFly.com is hosted on a dedicated server I run offsite. Requests for actual image content are forwarded to imgfly-static.feifei.us, which serves them as static content if they have been cached by that server, otherwise makes an attempt to fetch the resource remotely from amazon S3 and cache it to disk. Approximately 100 photos are uploaded an hour, and 2.1 GB of data downloaded. This isn’t much. It’s mostly serving random static content to consumers.

When I SSH into sepulveda, I see some indications that the problems you’re seeing aren’t my fault:

[sepulveda]$ uptime
16:40:39 up 26 days, 2:48, 5 users, load average: 10.67, 10.77, 9.30

Sepulveda is a dual-core opteron, but those load averages are still pretty high. When I SSH in I can barely get a single command to run. Clearly the server is overloaded to the point where all it can do is serve an extremely limited amount of information. I’m willing to work with you to best manage your server resources, but you need to let me know exactly what you can handle.

Here are some solutions that come to mind:

1) Move me (or just feifei.us) to a reasonably loaded server
2) Configure apache so it doesn’t crash, or, give me an .htaccess file with reasonable limiting I can use
3) Run feifei.us without mod_rewrite or the cache script, just on a completely static filesystem
4) Tell me exactly how many connections you are able to handle, and I will send you only that many

I’m sure you’ll have some good ideas as well. It’s disappointing to be offered 3TB of bandwidth a month which is an unmetered constant rate of 10 Mbs, but be unable to fully utilize it.
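For anyone who wants to check the bandwidth numbers in these emails, here is a small sketch of the arithmetic. It assumes decimal units (1 GB = 10^9 bytes), a 30-day month, and, for the last figure, that every one of the roughly 1200 accounts on the box has the same 3TB plan. That last part is my guess, not something Dreamhost has confirmed.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_mbps(gigabytes: float, days: float) -> float:
    """Average rate in megabits per second for a transfer volume over a period.

    Uses decimal units: 1 GB = 10**9 bytes, 1 Mbit = 10**6 bits.
    """
    bits = gigabytes * 1e9 * 8
    return bits / (days * SECONDS_PER_DAY) / 1e6


plan_mbps = sustained_mbps(3000, 30)   # advertised 3TB per month
actual_mbps = sustained_mbps(50, 1)    # the 50GB/day I pointed at them

# Assumption: all ~1200 accounts on the machine share the same plan.
ACCOUNTS = 1200
everyone_gbps = plan_mbps * ACCOUNTS / 1000

print(f"3TB/month  is about {plan_mbps:.1f} Mbps sustained")
print(f"50GB/day   is about {actual_mbps:.1f} Mbps sustained")
print(f"{ACCOUNTS} full plans would need about {everyone_gbps:.1f} Gbps")
```

With decimal units the plan works out to a bit over 9 Mbps; counting a terabyte as 2^40 bytes pushes it just past 10 Mbps, which is where the round 10 figure comes from. Either way, 50GB a day is about half the plan, and 1200 accounts all using their full allotment would need roughly eleven times what a single gigabit card can carry.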

I’m sure this post will get lots of comments… if any of you know someone willing to host 2+ TB of bandwidth for < $100 a month, let me know. I’m almost to the point of paying for another dedicated server to manage this.

Update:

I guess telling Dreamhost that I’d turn off the site which was causing them problems and work with them to figure out a better way to host it was a bad idea, because I received this lovely email a few moments ago:

Hello,

I’ve disabled your account for failure to comply with my request. This
is a permanent account closure, I’ve refunded your last payment.

James

So my DNS and whatever miscellaneous files are there (I think my dad’s site!) are currently being held hostage. Considering they’re my property, unless I hear from Dreamhost in the next three hours, I will be calling my lawyers tomorrow. I want blood now.

Update 2:

At 4/04/2007 5:19 PM EST I received a lovely email from Dreamhost, explaining that they weren’t doing anything to address the fact that my domains and data are being held hostage:

I have gone ahead and forwarded this to James, he will get back to you as soon as he can. Please wait for his reply.

I have forwarded the to XX so that they can update your incident report with this info, They will get back to you with any information that they may have if it is necessary, if so please wait for their reply.

It is now 24 hours since my first of three requests in writing for them to release my data and domains to me. If you know someone at Dreamhost who’s friendly and sane enough to let me pick up my things and leave, you should let them know about this. Otherwise, this is going to be a whole lot more painful tomorrow.

Update 3:

Some kind soul submitted this story to Digg. Yay! Maybe if it gets enough attention Dreamhost will give me my domains and access back.

Update 4:

Hello Digg crowd. Maybe I wasn’t clear about things. This site isn’t running on a Dreamhost machine. Hellllllll no. It would be down right now if it was. My Dreamhost account, which contains only domains, my dad’s low volume blog, and maybe one other blog, is currently disabled. The rest of it’s on a dedicated server I run from Cari.net, a company I’ve never had any problems with.

Update 5:

Now they’re telling me to wait. This should take a tech all of 2 minutes to resolve. Just enable the account, wait for me to tell you I’m done with it, and then close it permanently:

It means that we have passed your message to the Tech that is responsible for disabling your site. While we understand your urgency to get this issue resolved, you will need to wait until the Rep is able to follow-up with you regarding the issue.

Update 6:

The Co-founder of Advection .NET emailed me and offered to help out. Really nice guy. If you need serious big time content-distribution and bandwidth, you should check out Advection .NET Global Media Hosting Network.

Update 7:

Ah, the sweetness of resolution. It’s a very long day and a half later, but I’m now almost again in possession of my files, databases, and domains. For some reason, the Dreamhost abuse team decided to zip up my files themselves. They probably didn’t trust me to download my files and be off again:

I’m sorry about the delay in this responce, it was in part due to the time needed to prepare all of your data.

I’ve tarred up your files and placed them at abuse.dreamhost.com/misc/user-content/xxx/ the login is xxx and the password is that which you used to login to the Webpanel. Your data has been split into three files, and will need to be concatenated before you can unzip it. Also there are your databases, they were spread between three database servers, so are in three different files. I will be removing the files after one week.

Here are the authorization codes for the domains you have registered with us. Once you’ve initiated a transfer out you can contact us again and we’ll approve the transfer.

If they had added a line to their two line email about terminating my account saying “We will provide your domains and files for transfer in __some timeframe__, please wait for our email” they could have averted this post.

Final Update:

I’ve finally gotten an email from someone in the know, a level 2 support manager. Hurray for moving up. His email is very nice:

I’m terribly sorry about the recent events that have transpired, it looks like we disabled you a little overzealously. I apologize for that. We’ve re-activated your account, however, it’s my understanding that you’re moving your domains off of our servers, which is completely understandable. If you’d still like to do that, we’ll be more than happy to refund that payment James offered.

As for the domain, imgfly.com, I’ve moved it over to your account, so you should be able to transfer. To save you time, I’ve included your auth code for transfer, should you need it. Now, if you do decide to stay, that’s great, however, we will need to discuss the stability issues you were having, however, I’ll save that, if the issue pops up if you decide to stay.

We do appreciate your willingness to work out the issue, and I apologize for the misunderstanding, and the overzealous nature on James’ part. If you’d like to discuss any other matters, please let me know at xxx@dreamhost.com, and I’ll be happy to speak with you.

I’m going to send him an email back about using imgfly-static.feifei.us as a cache node, because I really would like to use all my bandwidth up. Everything else is moving off, however.
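For the curious, the cache node idea from my earlier email (serve the file from disk if it's cached, otherwise fetch it from S3 and cache it) boils down to something like the sketch below. This is not the actual mod_rewrite and cache script setup, just an illustration in Python; fetch_remote is a hypothetical stand-in for whatever pulls the object from S3.

```python
import os
import tempfile
from typing import Callable, Optional


def serve_image(key: str, cache_dir: str,
                fetch_remote: Callable[[str], Optional[bytes]]) -> Optional[bytes]:
    """Return image bytes from the disk cache, fetching and caching on a miss."""
    # Keep only the final path component so a key cannot escape cache_dir.
    name = os.path.basename(key)
    if name in ("", ".", ".."):
        return None
    path = os.path.join(cache_dir, name)

    # Cache hit: serve straight from disk.
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return f.read()

    # Cache miss: ask the origin (hypothetical S3 fetcher), then store the result.
    data = fetch_remote(name)
    if data is None:
        return None
    os.makedirs(cache_dir, exist_ok=True)
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
    os.replace(tmp_path, path)  # rename into place so readers never see a partial file
    return data


# Tiny demo with a fake origin that records how often it is called.
origin_calls = []


def fake_origin(name: str) -> Optional[bytes]:
    origin_calls.append(name)
    return b"fake image bytes"


with tempfile.TemporaryDirectory() as demo_dir:
    serve_image("photo.jpg", demo_dir, fake_origin)  # miss: goes to the origin
    serve_image("photo.jpg", demo_dir, fake_origin)  # hit: served from disk
    print(f"origin was called {len(origin_calls)} time(s)")
```

The point is that after the first request for a photo, the node only does static file reads, which is about the cheapest work you can ask a web server to do.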

Post-final Update:

I had thought that was the end of it, but I noticed this comment allegedly from Dreamhost’s Co-Founder and CTO, which includes the false statement:

3. He ignored our requests to not re-enable his website and did it anyway.

I tried to leave a comment explaining that I disabled my website per instructions, then put up an index page–which was somehow misconstrued by Dreamhost as ignoring their instructions. However, it’s been a few days and my comment has not been approved. Later comments have. I guess Dreamhost feels that they should have the right to slander me on their forum. I don’t feel the same way. If any Dreamhosters want to comment here, be my guest.

New News on Dreamhost

Recently 3,500 FTP accounts got hacked, including those of several high-profile websites like Cameron Moll’s. Still, there’s nothing on their official blog about this, as if they want to cover it all up.
