IBM released the Hadoop-based InfoSphere BigInsights in May 2013. Hadoop-based commercial distributions already exist from other vendors such as Cloudera, Hortonworks and MapR, so it was interesting to learn how IBM stacks up against them in the Big Data landscape.
I learned more about this when I had the opportunity to get hands-on with the InfoSphere BigInsights Big Data ecosystem at an IBM boot camp the week of October 7. My initial impression is that IBM’s technology competes strongly with others in the industry, probably more so for customers who have already invested in other IBM technologies such as PureData System for Analytics, DB2 and DataStage. The new InfoSphere BigInsights system complements other IBM products and integrates with them very well.
InfoSphere BigInsights – A Closer Look
Listed below are some interesting key features that make IBM stand out from the competition.
Adaptive MapReduce / Workflow Management
Adaptive MapReduce is IBM’s term for running smaller MapReduce tasks quickly through low-latency scheduling, instead of having them wait in the regular queue behind long-running MapReduce tasks.
IBM claims that the processing time of Adaptive MapReduce tasks is reduced because it uses native C/C++ rather than Java. The way certain MapReduce tasks are executed helps further: mappers can decide at runtime to take on more work (until it no longer makes sense to do so).
Workflow management is thus achieved by speeding up the class of jobs that process small files.
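To make the idea of mappers taking on more work at runtime concrete, here is a minimal single-process sketch in Python. It is not IBM's implementation: the function `run_adaptive`, the parameters `num_mappers` and `budget_per_mapper`, and the sample splits are all hypothetical. It only illustrates why pulling several small splits per mapper means fewer task start-ups than launching one task per split.

```python
from collections import deque


def run_adaptive(splits, map_fn, num_mappers, budget_per_mapper):
    """Hypothetical sketch: each mapper keeps pulling splits from a shared
    queue until the queue is empty or its own budget is used up."""
    queue = deque(splits)
    results = []
    launches = 0
    while queue:
        for _ in range(num_mappers):
            if not queue:
                break
            launches += 1  # one start-up cost per mapper, not per split
            taken = 0
            # The mapper decides at runtime whether to take on more work.
            while queue and taken < budget_per_mapper:
                results.extend(map_fn(queue.popleft()))
                taken += 1
    return results, launches


def word_lengths(split):
    """Toy map function: emit (word, length) pairs for one small split."""
    return [(word, len(word)) for word in split.split()]


splits = ["big data", "map reduce", "small files", "low latency", "hadoop"]
results, launches = run_adaptive(
    splits, word_lengths, num_mappers=2, budget_per_mapper=3
)
print(f"{len(splits)} splits handled with {launches} task launches")
```

With one task per split, the five splits above would need five launches; here two mappers cover them. The budget stands in for the point at which taking on more work no longer makes sense.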
GPFS/FPO High Availability
As part of the BigInsights platform, IBM provides out-of-the-box High Availability with seamless, automatic and transparent failover for the HDFS (Hadoop Distributed File System) NameNode and JobTracker, thereby eliminating administrative intervention and reducing downtime for the cluster.
With GPFS/FPO (General Parallel File System/File Placement Optimizer) support, you get an enterprise-grade Portable Operating System Interface (POSIX) compliant file system that enhances how data is accessed and stored in InfoSphere BigInsights and removes the single point of failure. It also has a snapshot capability at the operating system level.
Text Analytics is used to accurately analyze unstructured and semi-structured textual data. IBM claims its text analytics returns correct answers twice as often and runs 10x faster than the alternatives currently available in the market.
Here are some key features of the text analytics module.
- Parses text and detects meaning with annotators
- Understands the context in which the text is analyzed
- Hundreds of pre-built annotators for names, addresses, phone numbers, among others
- Out-of-the-box international support for multiple languages
- Distills structured information from unstructured text
- Sentiment analysis and consumer behavior
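As a rough illustration of what an annotator does, the sketch below pulls structured records out of free text with regular expressions. It is a toy written in plain Python, not the BigInsights text analytics module or its rule language; the patterns `PHONE` and `EMAIL` and the function `annotate` are assumptions made for this example.

```python
import re

# Hypothetical annotators: each pairs a label with a pattern.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")


def annotate(text):
    """Return one structured record per match, ordered by position in the text."""
    annotations = []
    for label, pattern in (("phone", PHONE), ("email", EMAIL)):
        for match in pattern.finditer(text):
            annotations.append(
                {
                    "type": label,
                    "value": match.group(),
                    "start": match.start(),
                    "end": match.end(),
                }
            )
    return sorted(annotations, key=lambda a: a["start"])


records = annotate("Call Ann at 555-123-4567 or write to ann@example.com today.")
for record in records:
    print(record["type"], record["value"])
```

A production annotator library would also handle context and language, which simple patterns like these do not.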
BigSheets is a powerful, Excel-like platform for exploring, manipulating, transforming and representing data. It is primarily intended for analysts and requires no prior programming experience.
Behind the scenes, BigSheets runs Pig and MapReduce scripts to process data on the underlying Hadoop cluster. Users can apply joins, filters, unions and various other transformations to data from multiple sources. Final results can also be displayed as charts; BigSheets currently supports line, column, bar and pie charts. BigSheets can source data from files (JSON, delimited), all major RDBMSs (via JDBC) and Hive.
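The three transformations named above can be sketched in a few lines of plain Python over lists of rows. This is only an in-memory illustration of what a union, a filter and a join do; the sample sheets and the helper functions `union`, `filter_rows` and `join` are hypothetical and say nothing about the Pig scripts BigSheets actually generates.

```python
# Hypothetical sample sheets, each a list of rows.
customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders_web = [{"customer_id": 1, "amount": 40}, {"customer_id": 2, "amount": 5}]
orders_store = [{"customer_id": 1, "amount": 15}]


def union(*sheets):
    """Stack the rows of several sheets into one."""
    return [row for sheet in sheets for row in sheet]


def filter_rows(sheet, predicate):
    """Keep only the rows for which the predicate holds."""
    return [row for row in sheet if predicate(row)]


def join(left, right, left_key, right_key):
    """Inner join: pair each left row with every right row sharing its key."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


all_orders = union(orders_web, orders_store)
large_orders = filter_rows(all_orders, lambda row: row["amount"] >= 10)
report = join(large_orders, customers, "customer_id", "id")
for row in report:
    print(row["name"], row["amount"])
```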
DataStage provides integration with a broad range of sources.
- Connector integrates BigInsights and the underlying HDFS file system
- Leverages clustered architecture
- The stage mirrors the existing Sequential File Stage, providing similar functionality
- Automated MapReduce job generation
Integration with an RDBMS
- BigInsights uses Database Import to load data from an RDBMS into a file on HDFS
- Uses Database Export to write data from files to a table in an RDBMS
Integration with Cognos Business Intelligence
- Users can easily access unstructured data, which gives business analysts exposure to the key conclusions found in large volumes of text
- By using the Hive JDBC driver, Cognos Business Intelligence can incorporate data from InfoSphere BigInsights into business intelligence analysis and reports
- Generates Hive QL to query the BigInsights File System
- Metadata from Hive Catalog can be imported into Cognos Framework Manager
- Users can now use a BI modeler to create Cognos reports, dashboards, and workspaces while using the InfoSphere BigInsights MapReduce capabilities
BigInsights provides a robust integrated security framework.
- Single sign on/one-step login where applicable
- Supports security at group/user/document and even field/column level in a data explorer module
- In addition to regular Linux/UNIX-level controls that determine access for users and groups at the document/file level, it also supports LDAP and Active Directory
- Supports both early binding (metadata level) and late binding (query time) for ACL (access control list) checking
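The difference between early and late binding for ACL checks can be shown with a small Python sketch. It uses the general search-security meaning of the two terms and is not taken from the BigInsights documentation; the document set, the group names, and the functions `search_early` and `search_late` are hypothetical.

```python
# Hypothetical index: each document carries the ACL captured at indexing time.
documents = [
    {"id": "d1", "text": "quarterly revenue", "acl": {"finance"}},
    {"id": "d2", "text": "revenue forecast", "acl": {"finance", "sales"}},
    {"id": "d3", "text": "holiday schedule", "acl": {"everyone"}},
]


def search_early(query, user_groups):
    """Early binding: filter on the ACL stored with the indexed metadata."""
    return [
        d["id"]
        for d in documents
        if query in d["text"] and d["acl"] & user_groups
    ]


def search_late(query, user_groups, current_acl):
    """Late binding: match first, then check each hit against the live ACL
    source at query time."""
    hits = [d for d in documents if query in d["text"]]
    return [d["id"] for d in hits if current_acl(d["id"]) & user_groups]


# Assume d2 was restricted to finance after the index was built.
live_acl = {"d1": {"finance"}, "d2": {"finance"}, "d3": {"everyone"}}

early = search_early("revenue", {"sales"})
late = search_late("revenue", {"sales"}, lambda doc_id: live_acl[doc_id])
print("early binding:", early)
print("late binding:", late)
```

In this sketch the early check still returns d2 to a sales user because it relies on the stale ACL in the index, while the late check consults the current ACL and returns nothing. The trade-off is that late binding costs an extra lookup per hit.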
Big SQL is a software layer that enables users to create tables and query data in BigInsights using familiar, standard SQL statements.
Big SQL Architecture
The Big SQL query engine supports joins, unions, grouping, common table expressions, and other familiar SQL expressions. Big SQL can also read data directly from relational DBMSs.
Depending on the query, Big SQL can either use Hadoop’s MapReduce framework to process query tasks in parallel or execute the query locally within the Big SQL server on a single node, whichever is more appropriate. For instance, a query on a small table would incur unnecessary overhead if it launched parallel MapReduce jobs across the Hadoop cluster, so Big SQL can run it on a single node instead.
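The reasoning behind that choice can be sketched as a simple cost comparison: a MapReduce job pays a fixed start-up overhead but spreads the scan across nodes, while local execution has no overhead but uses one node. The Python below is an illustration only. The function names and every number in it (node count, scan rate, start-up time) are made-up assumptions, and it does not describe how Big SQL actually makes the decision.

```python
def estimate_seconds(rows, mode, nodes=10, rows_per_second=1_000_000,
                     job_startup_seconds=20):
    """Hypothetical cost model for scanning `rows` rows in the given mode."""
    if mode == "local":
        # Single node, no job start-up overhead.
        return rows / rows_per_second
    # MapReduce: fixed start-up overhead, scan spread over all nodes.
    return job_startup_seconds + rows / (rows_per_second * nodes)


def choose_mode(rows, **model):
    """Pick whichever execution mode the cost model estimates as cheaper."""
    return min(
        ("local", "mapreduce"),
        key=lambda mode: estimate_seconds(rows, mode, **model),
    )


small_table = choose_mode(50_000)
large_table = choose_mode(500_000_000)
print("50 thousand rows ->", small_table)
print("500 million rows ->", large_table)
```

Under these assumed numbers, the small table is cheaper to scan locally because the job start-up time dominates, while the large table benefits from parallel execution.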
My impression during the week was that the above features and functions are impressive. It will be interesting to see if the technology delivers as promised in the real world. We will all be watching.