IBM BigInsights Has Potential If It Lives Up To Its Promise

http://insights.iolap.com/author/psukumar/

Prakash Sukumar is a Principal Consultant at iOLAP, Inc., and specializes in Big Data Architecture. He has had many years of experience architecting Big Data Platforms and Data Warehouses. He has worked in various roles including architect, leading teams, administrator, analyst and developer. Prakash has special interest in emerging technologies and is always looking for new and promising methods and technologies that help businesses perform better.


IBM released Hadoop-based InfoSphere BigInsights in May 2013. Hadoop-based commercial distributions were already available from other vendors such as Cloudera, Hortonworks and MapR, so it was interesting to learn how IBM stacks up against them in the Big Data landscape.

I had the opportunity to get hands-on with the InfoSphere BigInsights Big Data ecosystem at an IBM boot camp the week of October 7. My initial impression is that IBM’s technology competes strongly with others in the industry, probably more so for customers who have already invested in other IBM technologies such as PureData System for Analytics, DB2 and DataStage. InfoSphere BigInsights complements other IBM products and integrates very well with them.

InfoSphere BigInsights – A Closer Look

Listed below are some interesting key features that make IBM stand out from the competition.

Adaptive MapReduce / Workflow Management

Adaptive MapReduce is IBM’s term for executing smaller MapReduce tasks quickly through low-latency scheduling, instead of having them wait in the regular queue behind long-running MapReduce tasks.

IBM claims that the processing time of Adaptive MapReduce tasks is reduced through the use of native C/C++ code rather than Java. A further gain comes from how certain MapReduce tasks are executed: mappers can decide at runtime to take on more work (until it doesn’t make sense anymore).

In this way, workflow management is improved by speeding up the class of jobs that process small files.
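
To make the idea of a mapper that takes on more work at runtime more concrete, here is a minimal Python sketch. It is not IBM's implementation (which, according to IBM, is native C/C++). The names adaptive_mapper, split_queue and budget_records are hypothetical, and the stopping rule (a simple record budget) is an assumption standing in for "until it doesn't make sense anymore".

```python
from collections import Counter, deque

def adaptive_mapper(split_queue, budget_records):
    """Word-count mapper that keeps pulling small splits from a shared queue.

    Hypothetical illustration: instead of starting one task per split, a
    running mapper takes another split while it is under its record budget.
    """
    counts = Counter()
    processed = 0
    splits_taken = 0
    while split_queue and processed < budget_records:
        split = split_queue.popleft()   # take on one more small split
        splits_taken += 1
        for line in split:
            counts.update(line.split())
            processed += 1
    return counts, splits_taken

# Six small splits (think: six small files), two lines each.
splits = deque([["a b", "b c"] for _ in range(6)])
first_counts, first_taken = adaptive_mapper(splits, budget_records=8)
second_counts, second_taken = adaptive_mapper(splits, budget_records=8)
```

In this sketch two mapper invocations cover six small splits, so the per-task startup cost is paid twice rather than six times. That is the kind of saving the small-file case depends on.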

GPFS/FPO High Availability

As part of the BigInsights platform, IBM provides out-of-the-box High Availability with seamless, automatic and transparent failover for the HDFS (Hadoop Distributed File System) NameNode and the JobTracker, thereby eliminating administrative intervention and reducing cluster downtime.

With GPFS/FPO (General Parallel File System/File Placement Optimizer) support, you get an enterprise-grade Portable Operating System Interface (POSIX) compliant file system that enhances how data is accessed and stored in InfoSphere BigInsights and removes the single point of failure. It also has a snapshot capability at the operating system level.

Text Analytics

Text Analytics is used to analyze unstructured and semi-structured textual data accurately. IBM claims its text analytics returns correct answers twice as often and runs 10x faster than the alternatives currently available in the market.

Here are some key features of the text analytics module.

Parses text and detects meaning with annotators
Understands the context in which the text is analyzed
Hundreds of pre-built annotators for names, addresses, phone numbers, among others
Out-of-the-box support for multiple languages
Distills structured information from unstructured text
Sentiment analysis and consumer behavior
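
As a rough illustration of what an annotator does (distilling structured records from free text), here is a small Python sketch. It is not how BigInsights annotators are written. The Annotation type, the annotate_phones function and the deliberately simple phone-number pattern are all assumptions made for this example.

```python
import re
from typing import List, NamedTuple

class Annotation(NamedTuple):
    label: str   # what kind of entity was found
    text: str    # the matched text
    start: int   # character offset where the match begins
    end: int     # character offset just past the match

# Deliberately simple pattern for numbers shaped like 555-123-4567.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def annotate_phones(document: str) -> List[Annotation]:
    """Return one structured annotation per phone number in the text."""
    return [
        Annotation("phone", m.group(), m.start(), m.end())
        for m in PHONE.finditer(document)
    ]

doc = "Call 555-123-4567 or 555-987-6543 after 5pm."
annotations = annotate_phones(doc)
```

Each annotation carries a label and character offsets, so downstream steps can treat the matches as rows of structured data rather than as raw text.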

BigSheets

BigSheets is a powerful, Excel-like tool for exploring, manipulating, transforming and visualizing data. It is intended primarily for analysts and requires no prior programming experience.

Behind the scenes, BigSheets runs Pig scripts and MapReduce jobs to process data on the underlying Hadoop cluster. Users can apply joins, filters, unions and various other transformations to data from multiple sources. Results can also be displayed as charts; BigSheets currently supports line, column, bar and pie charts. BigSheets can source data from files (JSON, delimited), all major RDBMSs (via JDBC) and Hive.
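
For readers less familiar with these operations, the following Python sketch shows a union, a filter and a join on tiny in-memory tables. It only illustrates the operations themselves. BigSheets performs them through its spreadsheet interface on the cluster, and all table and column names here are made up.

```python
# Two "sheets" with the same columns, plus a lookup table.
orders = [
    {"customer_id": 1, "amount": 120},
    {"customer_id": 2, "amount": 40},
]
more_orders = [{"customer_id": 1, "amount": 75}]
customers = {1: "Acme", 2: "Globex"}   # customer_id -> customer name

# Union: append the rows of a second source with the same columns.
all_orders = orders + more_orders

# Filter: keep only the rows that satisfy a condition.
large_orders = [row for row in all_orders if row["amount"] >= 50]

# Join: enrich each row with the customer name looked up by key.
joined = [
    {**row, "customer": customers[row["customer_id"]]}
    for row in large_orders
]
```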

Enterprise Integration

DataStage provides integration with a broad range of sources.

Connector integrates BigInsights and the underlying HDFS file system
Leverages clustered architecture
The stage mirrors the existing Sequential File Stage, providing similar functionality
Automated MapReduce job generation

Integration with an RDBMS

BigInsights uses Database Import to load data from an RDBMS into a file on HDFS
Uses Database Export to write data from files to a table in an RDBMS
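
The round trip described above (table to delimited file, then file back to table) can be sketched with Python's standard library. This is only an analogy under stated assumptions: an in-memory SQLite database stands in for the RDBMS, an io.StringIO buffer stands in for a file on HDFS, and the table names are invented. It does not use the BigInsights Database Import or Export applications.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for the RDBMS
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, "east"), (2, "west")])

# "Import": table rows -> delimited file (StringIO stands in for an HDFS file).
hdfs_file = io.StringIO()
rows = conn.execute("SELECT id, region FROM sales ORDER BY id")
csv.writer(hdfs_file).writerows(rows)

# "Export": delimited file -> table in the RDBMS.
conn.execute("CREATE TABLE sales_copy (id INTEGER, region TEXT)")
hdfs_file.seek(0)
conn.executemany("INSERT INTO sales_copy VALUES (?, ?)", csv.reader(hdfs_file))
copied = conn.execute("SELECT id, region FROM sales_copy ORDER BY id").fetchall()
```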

Integration with Cognos Business Intelligence

Business analysts get easy access to unstructured data, with exposure to the key conclusions found in large volumes of text
By using the Hive JDBC driver, Cognos Business Intelligence can incorporate data from InfoSphere BigInsights into business intelligence analysis and reports
Generates HiveQL to query the BigInsights file system
Metadata from Hive Catalog can be imported into Cognos Framework Manager
Users can now use a BI modeler to create Cognos reports, dashboards, and workspaces while using the InfoSphere BigInsights MapReduce capabilities

Security

BigInsights provides a robust integrated security framework.

Single sign on/one-step login where applicable
Supports security at group/user/document and even field/column level in a data explorer module
In addition to regular Linux/UNIX access controls for users and groups at the document/file level, it also supports LDAP and Active Directory
Supports both early binding (metadata level) and late binding (query time) for access control list (ACL) checking
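
The difference between early and late binding can be sketched in a few lines of Python. This reflects one common reading of the two terms and is an assumption, not a description of the BigInsights internals: early binding filters on ACLs stored with the indexed metadata, while late binding asks the source system for each hit at query time. All names and data are hypothetical.

```python
# Indexed documents; each carries the ACL captured when it was indexed.
documents = {
    "doc1": {"text": "quarterly results", "acl": {"finance"}},
    "doc2": {"text": "quarterly roadmap", "acl": {"engineering"}},
}

def search_early_binding(term, user_groups):
    """Early binding: filter using the ACLs stored with the metadata."""
    return [
        doc_id for doc_id, d in documents.items()
        if term in d["text"] and d["acl"] & user_groups
    ]

def search_late_binding(term, user, can_read):
    """Late binding: match first, then check each hit at query time."""
    hits = [doc_id for doc_id, d in documents.items() if term in d["text"]]
    return [doc_id for doc_id in hits if can_read(user, doc_id)]

# Hypothetical live permission check against the source system.
current_permissions = {("alice", "doc2")}

def can_read(user, doc_id):
    return (user, doc_id) in current_permissions

early = search_early_binding("quarterly", {"finance"})
late = search_late_binding("quarterly", "alice", can_read)
```

In this toy example the two approaches disagree because the stored ACLs and the live permissions have drifted apart. That is the usual trade-off: early binding is cheaper per query, while late binding reflects current permissions.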

BigSQL

BigSQL is a software layer that enables users to create tables and query data in BigInsights using familiar standard SQL statements.

Big SQL Architecture

The BigSQL query engine supports joins, unions, grouping, common table expressions, and other familiar SQL expressions. BigSQL can also read data directly from relational database management systems.

Depending on the query, BigSQL can either use Hadoop’s MapReduce framework to process query tasks in parallel or execute the query locally within the BigSQL server on a single node, whichever is more appropriate. For instance, a query on small tables would incur unnecessary overhead if it launched parallel MapReduce jobs on the Hadoop cluster, so BigSQL can run such a query on a single node instead.
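
A decision of this kind can be pictured with a very small Python sketch. How BigSQL actually chooses is not described here, so the size-based rule, the function name and the 64 MB cutoff below are all assumptions made purely for illustration.

```python
# Hypothetical cutoff for illustration only; not an IBM default.
LOCAL_THRESHOLD_BYTES = 64 * 1024 * 1024

def choose_execution_mode(table_sizes_bytes):
    """Pick local execution for small inputs, MapReduce for large ones."""
    total = sum(table_sizes_bytes)
    return "local" if total <= LOCAL_THRESHOLD_BYTES else "mapreduce"

# A 5 MB lookup table versus a join over 40 GB and 2 GB tables.
small = choose_execution_mode([5 * 1024 * 1024])
large = choose_execution_mode([40 * 1024**3, 2 * 1024**3])
```

The point of such a rule is that the fixed cost of launching distributed jobs only pays off once the input is large enough to benefit from parallelism.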

My impression during the week was that the above features and functions are impressive. It will be interesting to see if the technology delivers as promised in the real world. We will all be watching.
