Now That’s What You Call A “Point Failure”

It was already over by the time I noticed, but this couldn’t have been good:

Research In Motion (RIM), the company that makes BlackBerry, has experienced a worldwide infrastructure problem. RIM identified a core switch failure and noted that their fail-over to a redundant core switch failed to perform properly. There has been a cascading effect in message queues backing up worldwide as a result of the failure. RIM states that all messages will eventually be delivered as the system catches up on the backlog. In fact, RIM now states that most overseas customers are returning to normal operations. However, there continues to be a backlog of messages to be delivered in the North American markets. RIM continues to state that they are unable to provide an ETA as to when the backlog will finish processing. Thus, these issues continue to have a sporadic impact on various City of Houston users of the BlackBerry services.

One switch fails and the whole world’s Blackberry messages back up? I sincerely hope our military and civilian leadership is ready for this node to either fail or be taken over by the Chinese if/when a war occurs.


5 Responses to Now That’s What You Call A “Point Failure”

  1. I think I’d call that a “cascading failure”.

    • Ubu Roi says:

      I thought you had to have multiple failures (well, more than two) to call it a cascade failure. Here we had only two — main switch failed, backup didn’t pick up the load. Is there something more to this that I’m missing?

  2. Overload is a form of failure. One classic kind of cascading failure is where one part of a distributed system goes down, and its load gets shifted to other units which are then overloaded. That can manifest in various ways depending on the system design.

    The other units can shut down, leading to catastrophic failure of the whole system. (Example: the 2003 NE power failure.) Or the system can start to oscillate as it tries and fails to cope with the increased load. (An obscure reference: that happened in 1987 to the Arpanet just after USS Stark was hit by a missile.) Or the system as a whole can simply grind to a constipated stop, which is what seems to have happened in this case. All of those are examples of cascading failure.

    And it is always ultimately an indication that the system doesn’t have sufficient overcapacity.
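    The failure mode described above can be sketched as a toy queueing model. This is a minimal illustration only, not RIM’s actual architecture: assume a hypothetical primary unit with ample capacity, a backup that takes over when the primary dies, and a steady arrival rate of messages. If the backup has less capacity than the arrival rate, the backlog grows without bound instead of draining — exactly the “constipated stop” case.

    ```python
    # Toy model: a primary unit fails at a fixed tick and its load shifts
    # to a backup. If the backup lacks overcapacity, the queue backs up.
    def simulate(arrival_rate, primary_capacity, backup_capacity,
                 ticks, failure_tick=10):
        """Return the message backlog at each tick."""
        backlog = 0
        history = []
        for t in range(ticks):
            capacity = primary_capacity if t < failure_tick else backup_capacity
            # Each tick: new messages arrive, up to `capacity` are processed.
            backlog = max(0, backlog + arrival_rate - capacity)
            history.append(backlog)
        return history

    # With enough overcapacity, the backup absorbs the shifted load...
    ok = simulate(arrival_rate=100, primary_capacity=120,
                  backup_capacity=110, ticks=30)
    # ...but an undersized backup means the backlog grows every tick.
    bad = simulate(arrival_rate=100, primary_capacity=120,
                   backup_capacity=80, ticks=30)
    ```

    The point of the sketch is Steven’s last sentence: whether the system recovers or collapses depends entirely on how much overcapacity the surviving units have relative to the load dumped on them.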

    • Ubu Roi says:

      I think my quibble is with the definition of the term “units” in relation to BlackBerry’s switches. Based on the way they phrased it, I’m thinking Unit A (primary), Unit B (backup). The problem was that Unit B never kicked in properly; i.e., one failure. Of course, it’s not really “a” unit (singular), but a huge agglomeration of hardware and software (ok, maybe a room full), and they probably oversimplified the explanation. As customers, all we’d really want to know is “what went wrong?” and “when will it be fixed?”

      (Well, curmudgeonly skinflints will also want to know “How much are you going to refund on my bill for the inconvenience?” Here’s a hint: Zero.)

      My concept of cascade was really along the lines of the old adage:
      For want of a nail the shoe was lost,
      For want of a shoe the horse was lost,
      For want of a horse the rider was lost,
      For want of a rider the dispatches were lost,
      For want of dispatches the battle was lost.

  3. Sorry, not the Arpanet. It happened to the MILNET in 1987. I worked for BBN at the time, and the problem was that there was a hell of a lot of traffic going between North America and Europe. The MILNET only had three links crossing the ocean, and where most of its links were 56 kilobaud, those three were only 9600 baud. With the increased traffic, the distributed system balancing mechanisms started to ring, and it made the whole MILNET go into spasms. BBN ran the NOC, and the operators were pretty frazzled that day.
