What IBM’s Watson Says to Storage Systems Developers, Blog by John Webster

By John Webster, Tuesday, February 15th 2011

Tags: analytics, Big Data, IBM, Jeopardy, Watson

IBM’s Watson debuted for a national prime time TV audience last night on CBS’ Jeopardy. Well, to be accurate, his avatar glowed behind his center-stage podium. He did, however, have a real button to push when he was ready to tee up a Jeopardy-formatted question-as-answer. The button was activated by a specially designed application running within his offstage IBM POWER7 server cluster, complete with IBM Scale-Out NAS (SONAS) storage.

From my perspective, Watson was truly amazing during the first 15 minutes of the show, giving responses and choosing the next question category with blinding speed. Human contestants Brad Rutter and Ken Jennings stood silently and looked on as Watson racked up his winnings. And then Watson seemed to stall a bit toward the end. He actually gave the same wrong answer as one of the contestants. During the second 15-minute segment, Brad caught up and Ken dug himself out of a hole. Hmmm. Did Watson decide against humiliating his human creators?

We’ll probably never know, so let’s focus on things we do know. From a storage perspective, much is being made of the massive volumes of data Watson feeds on and his ability to calculate the probability of a “right” answer from a list of several potential winners in three seconds or less. Watson’s ability to parse big data combinations and permutations in real time leads to IBM’s planned extension of Watson’s underlying technology into big data analytics.

All well and good. But here’s what I find most interesting about what IBM has done in response to the Grand Challenge that motivated Watson’s creators. We know, from Tony Pearson’s blog, that the foundation of Watson’s data storage system is a modified IBM SONAS cluster with a total of 21.6TB of raw capacity. But Pearson also reveals another very significant and, to me, surprising data point: “When Watson is booted up, the 15TB of total RAM are loaded up, and thereafter the DeepQA processing is all done from memory. According to IBM Research, the actual size of the data (analyzed and indexed text, knowledge bases, etc.) used for candidate answer generation and evidence evaluation is under 1 Terabyte.”

What Pearson just said is that the data set Watson actually uses to reach his push-the-button decision would fit on a 1TB drive. So much for big data?

For me, Watson speaks eloquently to what I think of as the big data conundrum. Yes, the new business analytics systems I’ve been writing about are fed massive amounts of data from multiple sources, and yes, big data represents a huge opportunity for storage vendors. Then Watson steps up and says, “Time out, guys. I only need a terabyte.”

Watson knows that, at any given time, only a tiny fraction of the data he has at his disposal is actually relevant to the problem he’s solving. What he and his creators have learned to do, after playing many mock Jeopardy games back in the lab, is develop an incredibly precise and compact data set that fits in Watson’s memory. In fact, Watson’s memory can easily handle multiple copies and versions of the data set.
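To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It uses only the figures Pearson quotes (21.6TB of raw disk, 15TB of RAM, a data set under 1TB). Treating the data set as exactly 1TB and ignoring everything else that occupies memory are my own simplifying assumptions, and the function name is purely illustrative.

```python
# Figures quoted from Tony Pearson's blog (decimal terabytes assumed).
RAW_DISK_TB = 21.6   # raw capacity of the modified SONAS cluster
TOTAL_RAM_TB = 15.0  # total RAM loaded at boot
DATA_SET_TB = 1.0    # assumption: upper bound, since the data is "under 1 Terabyte"


def copies_that_fit(ram_tb: float, data_set_tb: float) -> int:
    """Whole copies of the data set that fit in RAM, ignoring all other memory use."""
    return int(ram_tb // data_set_tb)


copies = copies_that_fit(TOTAL_RAM_TB, DATA_SET_TB)
ram_share = DATA_SET_TB / TOTAL_RAM_TB   # share of RAM used by one copy
disk_share = DATA_SET_TB / RAW_DISK_TB   # share of raw disk used by one copy

print(f"Copies of the data set that fit in RAM: {copies}")
print(f"One copy as a share of RAM: {ram_share:.1%}")
print(f"One copy as a share of raw disk: {disk_share:.1%}")
```

Because the real data set is smaller than 1TB, these are conservative figures: the actual number of copies that could fit is at least this many, before accounting for whatever else Watson keeps in memory.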

What Watson reveals, I think, to storage professionals and vendors alike is not just the need for massive amounts of storage both at the data ingest and archival stages, but also the need to develop what I think of as a relevance engine. The question Watson poses to storage system developers is this: Can you feed me only the relevant data? Yes, storage system cache is a kind of relevance engine, but a primitive one at best when compared with what Watson has achieved.