Introduction
Every five to six years, a technology wave comes along, and if you can catch it, it will take you a long way. Throughout my career, I’ve ridden several of these waves. MPP data warehouses brought us incredible speed for analytics and a few headaches for data integration. In-memory analytics is taking disk latency out of the equation. Hadoop-based technologies are opening up new solutions every day for storage and compute workloads, while our source systems keep generating data of varying velocity, volume, and variety.
As a traditional ETL developer, I would usually try to figure out the best way to acquire, cleanse, and store this data in an optimal format for analytics…usually a data warehouse. Depending on the business need, the number of sources, and the complexity, this approach is long and quite labor-intensive. Source systems create new data faster than we can consume it in traditional models. Hence, we see many organizations adopting a Data Lake approach: we focus simply on optimizing the acquisition and storage of any data source and worry about consumption later.
While data federation has been around for years, traditional technologies typically dealt with federating a relational source with a tabular, single-file extract. Today, we’re asking federation to handle relational stores, APIs, HDFS, JSON, Avro, logs, and unstructured text. It’s a tough task, but I was impressed with SAP HANA’s approach to and implementation of data federation.
This post is not about SAP HANA as a whole, but rather focuses on its data federation capabilities. I will explain the basics, best practices, and a few tips and tricks I came across while working with data federation in HANA.
Smart Data Access
SAP’s data federation capability is built into the HANA database and is known as Smart Data Access (SDA). SDA eliminates the need to replicate data into SAP HANA; instead, it lets you query remote sources directly from HANA. SAP calls this ability to weave a network of data sources the in-memory data fabric. SDA lets you create virtual tables, which point to tables on remote sources, and write SQL queries in HANA that operate on those virtual tables. The HANA query processor optimizes these queries, executes only the relevant parts on the remote database, returns the results to HANA, and completes the remaining processing there.
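For example, once a remote source is registered, creating and querying a virtual table looks like the sketch below (the source, schema, and table names are illustrative; for Oracle sources the database part of the path is the literal "<NULL>"):

```sql
-- Create a virtual table pointing at a table in a remote Oracle source
CREATE VIRTUAL TABLE "MYSCHEMA"."VT_ORDERS"
  AT "ORACLE_SRC"."<NULL>"."SALES"."ORDERS";

-- Query it like any local HANA table; HANA pushes the relevant
-- filtering and aggregation down to the remote database
SELECT REGION, SUM(AMOUNT) AS TOTAL_AMOUNT
FROM "MYSCHEMA"."VT_ORDERS"
WHERE ORDER_DATE >= '2015-01-01'
GROUP BY REGION;
```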
Supported Connectivity
SDA was first introduced in SPS06. The feature has matured over several releases, and it supports connectivity to ASE, Teradata, IQ, HANA, Oracle, SQL Server, Netezza, DB2, MaxDB, and Hadoop. Setting up a remote source for the first time involves just a few one-time steps.
Any of the relational databases can be set up by installing ODBC drivers and the vendor’s RDBMS drivers on the UNIX server where HANA is installed. Once the drivers are in place, create the remote source using HANA studio. Refer to the SAP administration guide for supported driver versions.
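Remote sources can also be created with SQL. A minimal sketch for an Oracle source, assuming an ODBC DSN named OracleDSN has already been configured on the server (names and credentials are illustrative):

```sql
CREATE REMOTE SOURCE "ORACLE_SRC" ADAPTER "odbc"
  CONFIGURATION FILE 'property_orcl.ini'
  CONFIGURATION 'DSN=OracleDSN'
  WITH CREDENTIAL TYPE 'PASSWORD'
  USING 'user=scott;password=tiger';
```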
There are a few different ways to set up a Hadoop remote source. The most common is to install ODBC drivers and Hive/Spark drivers on the UNIX server where HANA is installed, then create the remote source using HANA studio. Other options include connecting via archive files/virtual UDFs from HANA studio and via the Spark controller on Hadoop.
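As a sketch, the SQL equivalent for a Hive connection uses the hiveodbc adapter (the DSN and credentials are illustrative):

```sql
CREATE REMOTE SOURCE "HIVE_SRC" ADAPTER "hiveodbc"
  CONFIGURATION 'DSN=HIVE'
  WITH CREDENTIAL TYPE 'PASSWORD'
  USING 'user=hive;password=hive';
```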
SDA Best Practices
Sometimes it is difficult to determine the optimal way to federate data, especially when dealing with Hadoop sources. We recommend a divide-and-conquer approach: let your remote sources process data, and query them from HANA as needed. For example, push heavy data processing down to your Hadoop system, taking advantage of commodity hardware and its cheaper processing power. Likewise, keep data in your cheaper storage options, bringing only the data that satisfies your analytical needs into HANA via SDA.
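As an illustration, you might let Hive aggregate billions of detail rows and materialize only the summary in HANA (the table names here are hypothetical):

```sql
-- Materialize only the aggregate needed for analytics in HANA;
-- the raw detail stays on the Hadoop cluster
CREATE COLUMN TABLE "MYSCHEMA"."SALES_DAILY" AS (
  SELECT SALE_DATE, REGION, SUM(AMOUNT) AS TOTAL_AMOUNT
  FROM "MYSCHEMA"."VT_SALES_DETAIL"  -- virtual table over a Hive table
  GROUP BY SALE_DATE, REGION
);
```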
Keep in mind that SDA submits the query to the remote server, so performance depends on how powerful the remote source is. That may or may not be adequate for your use case, and you might choose to replicate the data into HANA instead.
Leveraging Statistics – HANA can collect statistics on remote data sources. These statistics help the query optimizer decide how to join tables, including remote tables, and in which order to join them. There are two types of statistics you can enable: the histogram type saves only counts, while the simple type saves information such as count, count distinct, min, and max. Depending on your needs, enable either type to improve performance.
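A minimal sketch of both options, reusing the illustrative VT_ORDERS virtual table from earlier (the statistics object names are also illustrative):

```sql
-- Simple statistics: count, count distinct, min, max per column
CREATE STATISTICS "STAT_VT_ORDERS_SIMPLE"
  ON "MYSCHEMA"."VT_ORDERS" (REGION, ORDER_DATE)
  TYPE SIMPLE;

-- Histogram statistics on a single column
CREATE STATISTICS "STAT_VT_ORDERS_HIST"
  ON "MYSCHEMA"."VT_ORDERS" (ORDER_DATE)
  TYPE HISTOGRAM;
```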
Querying Hadoop – When federating data from Hadoop, there are a few tips and tricks we can use for better performance:
- Remote caching capabilities – Frequently executed queries against the Hadoop system should be cached. HANA provides remote caching for Hadoop sources: it saves the results of frequently executed queries into a separate table for faster access, avoiding a MapReduce job on the Hadoop system every time the same query is issued from HANA (see the HANA-side sketch after this list).
- Using ORC files – Use the ORC file format for every Hive table. ORC is a columnar table storage format that optimizes speed through techniques like predicate push-down and compression. You may run into issues when querying tables with a billion-plus records via SDA; storing them as ORC resolves this.
- Use of vectorization – Vectorized query execution improves the performance of operations like scans, aggregations, filters, and joins by processing batches of 1,024 rows at once instead of a single row at a time.
- Cost-based query optimization – Cost-based optimization performs further optimizations based on query cost, which can change how joins are ordered, which join type is used, the degree of parallelism, and more (see the Hive-side sketch after this list).
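On the HANA side, remote caching is enabled once by an administrator and then requested per query with a hint. A sketch, assuming a release where the feature is available (the configuration section and parameter name are my assumption; check the administration guide for your version):

```sql
-- One-time setup (assumed parameter location): enable the remote cache
ALTER SYSTEM ALTER CONFIGURATION ('indexserver.ini', 'SYSTEM')
  SET ('smart_data_access', 'enable_remote_cache') = 'true'
  WITH RECONFIGURE;

-- Ask HANA to cache and reuse the remote result for this query
SELECT REGION, COUNT(*) AS EVENTS
FROM "MYSCHEMA"."VT_CLICKSTREAM"
GROUP BY REGION
WITH HINT (USE_REMOTE_CACHE);
```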
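On the Hive side, ORC storage, vectorization, and the cost-based optimizer are plain Hive settings that benefit any query SDA pushes down. A sketch (table names are illustrative, and exact property names can vary by Hive version):

```sql
-- Store the table as ORC to get predicate push-down and compression
CREATE TABLE clickstream_orc STORED AS ORC
AS SELECT * FROM clickstream;

-- Enable vectorized execution (batches of 1,024 rows)
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;

-- Enable the cost-based optimizer and let it use table statistics
SET hive.cbo.enable = true;
SET hive.compute.query.using.stats = true;
SET hive.stats.fetch.column.stats = true;
ANALYZE TABLE clickstream_orc COMPUTE STATISTICS FOR COLUMNS;
```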
Smart Data Access: Capabilities and Limitations
- Capabilities
- You may create attribute views, analytic views, and calculation views, and leverage HANA’s tools and capabilities on remote tables, just as if they were local HANA tables. This lets you execute logic in HANA’s calculation engine, which can perform better than plain SQL execution on the remote database.
- In the latest versions of HANA, SDA allows users to insert, update, and delete data on remote sources, and it also works with certain data types like BLOB and CLOB; neither was supported in the initial release (see the sketch at the end of this section).
- Limitations
- When querying Hadoop, HANA is limited to the capabilities of Hive.
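As a sketch of the write-back capability mentioned above, assuming the remote source permits writes and reusing the illustrative VT_ORDERS virtual table from earlier:

```sql
-- DML issued in HANA is pushed down to the remote source
INSERT INTO "MYSCHEMA"."VT_ORDERS" (ORDER_ID, REGION, AMOUNT, ORDER_DATE)
VALUES (1001, 'EMEA', 250.00, '2015-06-01');

UPDATE "MYSCHEMA"."VT_ORDERS"
SET AMOUNT = 300.00
WHERE ORDER_ID = 1001;

DELETE FROM "MYSCHEMA"."VT_ORDERS"
WHERE ORDER_ID = 1001;
```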