When I received the email notice from the TDWI Dallas Chapter about an upcoming Big Data event, I was interested. The meeting was at 8:00 a.m. on a Friday and the traffic wouldn’t be ideal, but it sounded like a good opportunity.
Why would it be a good opportunity? Bill Inmon was in town!
Yes, Bill Inmon was right here in Dallas, and I wanted to hear what he had to say. Bill, the self-proclaimed father of data warehousing and now a vocal leader on textual data, seemed likely to be both interesting and relevant. The purpose of his presentation was to give us his take on Big Data. In Bill’s opinion, the Big Data technology push is not providing the return on investment he believes it should, based on what most organizations are currently doing.
He provided an overview of Big Data and then broke it down into two categories: repetitive data and non-repetitive data. Examples of repetitive data are call records, machine data, clickstream data, and so on. Non-repetitive data includes emails, healthcare records, warranty claims, corporate contracts, call center interchanges, and so on. He believes the real business value in Big Data lies within non-repetitive data.
Someone asked a question: what about structured data vs. unstructured data? He explained that the definitions vary depending on who you ask. However, raw Big Data doesn’t always arrive in the familiar form that standard database management practices dictate. There may not be obvious keys, indexes, and so on.
He used the example of doctors’ notes. You have text, which is non-repetitive data. That text has to be disambiguated, or translated into a standard database format. How does that happen? You give the text context. Why? Suppose a doctor writes “HA” in a note. HA could mean a variety of different things depending on the doctor’s specialty and background: a family practitioner could write HA for headache, while a cardiologist could write HA for heart attack.
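To make the “HA” example concrete, here is a minimal Python sketch of context-driven disambiguation. It is only an illustration of the idea, not how Bill’s textual ETL product works: the `ABBREVIATIONS` table, the `disambiguate` function, and the specialty labels are all hypothetical names made up for this post.

```python
import re

# Hypothetical lookup table: (abbreviation, specialty) -> expanded meaning.
# The two entries mirror the example from the talk; a real table would be
# far larger and would need more context than the specialty alone.
ABBREVIATIONS = {
    ("HA", "family practice"): "headache",
    ("HA", "cardiology"): "heart attack",
}


def disambiguate(note, specialty):
    """Expand known abbreviations in a note, using the specialty as context."""

    def expand(match):
        token = match.group(0)
        # Leave the token untouched when the context gives no known meaning.
        return ABBREVIATIONS.get((token, specialty), token)

    # Match whole upper-case tokens of two or more letters, such as "HA".
    return re.sub(r"\b[A-Z]{2,}\b", expand, note)


if __name__ == "__main__":
    note = "Patient reports HA since Monday."
    print(disambiguate(note, "family practice"))
    print(disambiguate(note, "cardiology"))
```

The same note comes out as “headache” for the family practitioner and “heart attack” for the cardiologist, and an abbreviation with no known meaning in that context is left as written rather than guessed at.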
He believes non-repetitive data is the pot of gold at the end of the Big Data rainbow. He proclaims it’s where the real, true business value lives. So does he have a beef with Big Data? He feels it’s too focused on repetitive data. It’s not easy to disambiguate raw non-repetitive data, but given context, it provides far more insight than repetitive data. His fear is that a focus on repetitive data alone will lead to disappointing results for some organizations that are rushing to add Big Data technology.
Bill is currently the president of Forest Rim Technology (www.forestrimtech.com), which is described on its home page as follows:
Forest Rim Technology is a pioneer in technology called textual disambiguation. Textual disambiguation (or “textual ETL”) is technology that reads narrative text and edits that text into a standard data base format so that analytical processing can be done on the text. Textual disambiguation is useful wherever there is narrative information that needs to be analyzed by a computer such as in health care records, corporate contracts, call center dialogue, warranty claims, marketing research, and many more places…
Textual ETL operates on a standard Microsoft platform and reads and operates on all forms of electronic text. Textual ETL produces out in any standard form of data base including Oracle, Teradata, DB2, SQL Server, Hadoop, Netezza, and many more. Textual ETL reads text in any form of electronic text.
This probably explains some of his strong position that non-repetitive data is what delivers Big Data value, but he is probably right for many organizations looking for Big Data “gold.”