
Where the healthcare.gov website went wrong

NEW YORK

Of all the terrible websites I've seen, healthcare.gov ranks somewhere in the middle. It has been difficult if not impossible to sign up, and customer service has been inadequate. But healthcare.gov's failures are not uncommon -- just exceptionally high-profile. And the problems plaguing healthcare.gov aren't due to a unique coding failure or a unique government failure.

Healthcare.gov's biggest problems are most likely not in the front-end code of the site's Web pages, but in the back-end, server-side code that handles - or doesn't handle - the registration process, which no one can see.

The site's front end (the actual Web pages and bits of script) doesn't look too bad, but it is not coping well with whatever scaling issues the back end (account storage, database lookups, etc.) is having. I tried to sign up for the federal marketplace six days after rollout. The site claimed to be working, but after I started the registration process, I sat on a "Please Wait" page for 10 minutes before being redirected to an error page:

"Sorry, we can't find that page on HealthCare.gov."

Except that wasn't the problem, since the error message immediately below read:

"Error from: https%3A//www.healthcare.gov/oberr.cgi%3Fstatus%253D500%2520errmsg%253DErrEngineDown%23signUpStepOne."

To translate, that's an Oracle database complaining that it can't do a signup because its "engine" server is down. So you can see Web pages with text and pictures, but the actual meat-and-potatoes account signup "engine" of the site was offline.
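For readers who want to see how that string unpacks, here is a minimal sketch using Python's standard library. It assumes the trailing period in the quoted message is sentence punctuation rather than part of the URL. The string turns out to be percent-encoded twice in places, so it takes two decoding passes to reach readable text.

```python
from urllib.parse import unquote

# The error string as quoted above, minus the trailing period
# (assumed to be punctuation, not part of the URL).
raw = (
    "https%3A//www.healthcare.gov/oberr.cgi%3Fstatus%253D500"
    "%2520errmsg%253DErrEngineDown%23signUpStepOne"
)

# First pass: %3A -> ":", %3F -> "?", %23 -> "#", and %25 -> "%",
# which leaves a second layer of encoding inside the query string.
once = unquote(raw)

# Second pass: %3D -> "=", %20 -> " ".
twice = unquote(once)

print(once)
print(twice)
```

The second line printed shows a status of 500 and an errmsg of ErrEngineDown, which is the part a visitor would actually need explained.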

A good site would have translated that error into a more helpful error message, such as "The system is temporarily down," or "President Obama personally apologizes to you for this engine failure." But it didn't.
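That kind of translation takes very little code. The sketch below is hypothetical and is not how healthcare.gov is built: the function name and the fallback wording are my own, and only the ErrEngineDown code comes from the error quoted above. The idea is simply that internal error codes get mapped to plain language before anything reaches the visitor.

```python
# Hypothetical lookup table. Only "ErrEngineDown" appears in the real
# error message; the wording of each reply is an assumption.
FRIENDLY_MESSAGES = {
    "ErrEngineDown": "The system is temporarily down. Please try again later.",
}

# Shown for any back-end error code the front end does not recognize.
DEFAULT_MESSAGE = "Something went wrong on our end. Please try again later."


def friendly_message(errmsg: str) -> str:
    """Translate an internal error code into text a visitor can act on."""
    return FRIENDLY_MESSAGES.get(errmsg, DEFAULT_MESSAGE)


print(friendly_message("ErrEngineDown"))
```

The fallback matters as much as the table: it is what a visitor sees when the back end returns a code that no one on the front-end team was told about.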

This failure points to the fundamental cause of the larger failure, which is the end-to-end process. That is, the front-end static website and the back-end servers (and possibly some dynamic components of the Web pages) were developed by two different contractors. Coordination between them appears to have been nonexistent.

So we had (at least) two sets of contracted developers, apparently in isolation from each other, working on two pieces of a system that had to run together perfectly. Anyone in software engineering will tell you that cross-group coordination is one of the hardest things to get right, and also one of the most crucial, because while programmers are great at testing their own code, testing that their code works with everybody else's code is much more difficult.

Look at it another way: Even if scale testing is done, that involves seeing what happens when a site is overrun. The poor, confusing error handling indicates that there was no ownership of the end-to-end experience - no one tasked with making sure everything worked together and at full capacity, not just in isolated tests. No end-to-end ownership means that questions like, "What is the user experience if the back-end gets overloaded or has such-and-such an error?" are never asked, because they cannot be answered by either group in isolation.

Likewise, the bugs around username and password standards - for example, the fact that the username required a number but the user interface didn't tell the user about it - are not problems of scale. They're problems of poor cross-group communication. I'd bet that plenty of people knew what was going to happen when the site rolled out, but none of them were in a position to mitigate the damage.
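One common way to avoid that kind of mismatch is to keep the rules in a single place that both sides read from. The following is a sketch under my own assumptions, not a description of the actual site: the names are hypothetical, and the only rule taken from the reported bug is that a username must contain a number. The same list produces the hints shown on the form and the checks run on submission, so the two cannot drift apart.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Rule:
    hint: str                      # text shown to the user on the form
    check: Callable[[str], bool]   # validation applied on submission


# Hypothetical shared rule list; the "must contain a number" rule is the
# one users reportedly were never told about.
USERNAME_RULES: List[Rule] = [
    Rule("Must contain at least one number.",
         lambda s: any(c.isdigit() for c in s)),
]


def hints(rules: List[Rule]) -> List[str]:
    """What the front end displays next to the field."""
    return [r.hint for r in rules]


def violations(value: str, rules: List[Rule]) -> List[str]:
    """What the back end rejects, phrased with the same hints."""
    return [r.hint for r in rules if not r.check(value)]


print(hints(USERNAME_RULES))
print(violations("jsmith", USERNAME_RULES))
```

Sharing a list like this only works if someone owns it across both teams, which is the point of the paragraph above.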

Each group got its piece "working" in isolation and prayed that when they hooked them together, things would be okay. When they didn't, it was too late. It is entirely possible that the back-end developer is primarily at fault here, but no one will care because they just see that the whole thing doesn't work. As you learn early on in software development, there is no partial credit in programming.

Bugs can be fixed. Systems can even be rearchitected remarkably quickly. So nothing currently the matter with healthcare.gov is fatal. But the ability to fix it will be affected by organizational and communication structures. People are no doubt scrambling to get healthcare.gov into some semblance of working condition.

The fastest way would be to appoint a person with impeccable engineering and site delivery credentials to a government position. Give this person wide authority to assign work and reshuffle people across the entire project and all contractors, and keep that person's schedule clean. If you found the right person, things would come together quickly.

We live in a world that embodies the paradox of George W. Bush's responsibility society (a.k.a. the "ownership society"), where authority and accountability are increasingly separated. Power flows upward while responsibility flows downward, which is why you couldn't pay me to work as a government contractor. It'd be like going back to Microsoft.

David Auerbach is a writer and software engineer based in New York. © 2013 Slate
