Introduction
Netezza is a data warehousing appliance that uses an Asymmetric Massively Parallel Processing (AMPP) architecture. The Netezza architecture is driven by two fundamental principles: process close to the data source, and do not move data unless absolutely necessary. The latter principle is implemented primarily through commodity Field Programmable Gate Arrays (FPGAs). The FPGA plays the pivotal role of filtering out data as early as possible, removing I/O bottlenecks and freeing up valuable downstream components such as memory and processors. Zone Maps are internal data structures of Netezza that enable the FPGAs to filter out data. In simple terms, a Zone Map is a persistent table maintained by Netezza that contains information about tables created by users: it stores the minimum and maximum values of columns for each page that stores a table's data. Unlike traditional data structures such as indexes, which help determine where exactly you should look for data, Zone Maps help determine where not to look for data. I find this concept very fascinating and analogous to a strategy for answering a popular question that all of us have faced at some point in time: ‘What do you want to be in this world when you grow up?’ I believe the easiest way to start answering this question is to answer the question ‘What do you not want to be in this world?’
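To make the page-skipping idea concrete, here is a purely conceptual sketch. The table ZONE_MAP_FACT_AR_LEDGER and its columns are hypothetical illustrations of what a Zone Map logically contains, not actual Netezza internals:

-- Logically, a Zone Map is a per-page min/max summary of a column:
--   page_id | min_invoice_key | max_invoice_key
-- A scan for INVOICE_KEY = 151854 only needs the pages where the value
-- could possibly occur; every other page is skipped without being read.
SELECT page_id
FROM ZONE_MAP_FACT_AR_LEDGER   -- hypothetical name, for illustration only
WHERE 151854 BETWEEN min_invoice_key AND max_invoice_key;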
This article addresses two questions that I was curious about:
- How can I know if Zone Maps are used in my query?
- Do I need a completely denormalized data model to use Zone Maps effectively?
Utilization of Zone Maps
To answer these questions, I considered a simple data model consisting of a fact table (FACT_AR_LEDGER) and a dimension table (INVOICE). The primary key of the dimension table (INVOICE_KEY) exists as a foreign key in the fact table. BILLING_CUSTOMER_KEY and INVOICE_STATUS are columns present in the dimension table that do not exist in the fact table. These tables were not ordered in any particular fashion and were distributed in round-robin fashion. Statistics were generated on both tables prior to executing the test cases below.
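For reference, a minimal sketch of how such a setup might be created follows. The column lists are abbreviated and hypothetical, but DISTRIBUTE ON RANDOM (round-robin distribution) and GENERATE STATISTICS are the standard Netezza mechanisms:

CREATE TABLE INVOICE (
    INVOICE_KEY          BIGINT NOT NULL,    -- primary key of the dimension
    BILLING_CUSTOMER_KEY BIGINT,
    INVOICE_STATUS       VARCHAR(20)
    -- other dimension columns omitted
) DISTRIBUTE ON RANDOM;                      -- round-robin distribution

CREATE TABLE FACT_AR_LEDGER (
    INVOICE_KEY BIGINT NOT NULL              -- foreign key to INVOICE
    -- other fact columns omitted
) DISTRIBUTE ON RANDOM;

GENERATE STATISTICS ON INVOICE;              -- refresh optimizer statistics
GENERATE STATISTICS ON FACT_AR_LEDGER;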
Query 1 – Retrieve 1000 rows from the fact table with a WHERE clause on a column present in the fact table
SELECT INVOICE_STATUS FROM FACT_AR_LEDGER a JOIN INVOICE b ON a.INVOICE_KEY=b.INVOICE_KEY WHERE a.INVOICE_KEY=151854 LIMIT 1000;
Query 2 – Retrieve 1000 rows from the fact table with a WHERE clause on a column not present in the fact table
SELECT INVOICE_STATUS FROM FACT_AR_LEDGER a JOIN INVOICE b ON a.INVOICE_KEY=b.INVOICE_KEY WHERE b.BILLING_CUSTOMER_KEY=6364340 LIMIT 1000;
Note: In Query 2, the billing customer (BILLING_CUSTOMER_KEY=6364340) is linked to only one invoice (INVOICE_KEY=151854). Hence, the target record counts for both queries are the same. Both queries have also been limited to the first 1000 records.
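The plan excerpts that follow can be reproduced with Netezza's EXPLAIN facility; a minimal example for Query 1, assuming the same tables, is:

EXPLAIN VERBOSE
SELECT INVOICE_STATUS
FROM FACT_AR_LEDGER a
JOIN INVOICE b ON a.INVOICE_KEY = b.INVOICE_KEY
WHERE a.INVOICE_KEY = 151854
LIMIT 1000;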
A portion of the Explain Plan for Query 1 corresponding to the sequential scan of the fact table is shown below.
Node 2. [SPU Sequential Scan table "FACT_AR_LEDGER" as "A" {}]
  -- Estimated Rows = 446058, Width = 59, Cost = 0.0 .. 41673.0, Conf = 90.0 [BT: MaxPages=32557 TotalPages=714725] (JIT-Stats)
  Restrictions: (A.INVOICE_KEY = 151854)
  Projections: 1:A.INVOICE_STATUS 2:A.INVOICE_KEY
  Cardinality: A.INVOICE_KEY 1 (JIT)
As per the Explain Plan, the optimizer utilizes Zone Map information through JIT statistics, which is reflected in the MaxPages count (32557) being significantly lower than the TotalPages count (714725).
A portion of the Explain Plan for Query 2 corresponding to the sequential scan of the fact table is shown below.
Node 2. [SPU Sequential Scan table "FACT_AR_LEDGER" as "A" {}]
  -- Estimated Rows = 766644778, Width = 59, Cost = 0.0 .. 174918.1, Conf = 90.0 (FACT)
  Projections: 1:A.INVOICE_STATUS 2:A.INVOICE_KEY
  Cardinality: A.INVOICE_KEY 170.3K (JIT)
As per the Explain Plan of Query 2, there is no restriction on the pages to be scanned, which may suggest that Zone Maps are not utilized for this query. This aligns with the basic logic that when a fact table is scanned, only Zone Maps corresponding to columns present in that table can be utilized. In addition, the estimated cost of the sequential scan of FACT_AR_LEDGER is lower for Query 1 than for Query 2, owing to the use of Zone Maps in the Explain Plan of Query 1.
Before concluding that Zone Maps were not utilized in Query 2, let us take a moment to compare the Execution Plan Files of both these queries.
Plan File – Query 1
-- Object "/nz/data.1.0/plans/3388915/s3388915_1.o" from cache "/nz/data.1.0/cache/287/3o"
-- Object "/nz/data.1.0/plans/3388915/h3388915_501.o" from cache "/nz/data.1.0/cache/143/0o"
-- Snippet 3388915_1 exec 0.006 @ 0% res 0.007 channels 1 hmem 1 mem 16 exp h/s 0%/0%
-- Detail 3388915_1 table 33757/33766.09/33776 zmread 2 zmhits 2 #22 hcpu 0.002 cpu 0.003/0.020 dread 0.003 mem 3
/nz/kit.7.2.0.4-P1/sys/cc/bin/i686-mcpnps-linux-gnu-g++ -march=prescott -mfpmath=sse -include inclall_1.h -fPIC -o /nz/data.1.0/plans/3388915/s3388915_2.o -shared -O2 -fno-strict-aliasing -fexceptions -fsigned-char -Wno-invalid-offsetof -DNZDEBUG=0 -DGENCODE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -DFOR_SPU -I/nz/kit.7.2.0.4-P1/sys/include -I/nz/data.1.0/plans.include/genpch /nz/data.1.0/plans/3388915/s3388915_2.cpp
-- Object "/nz/data.1.0/plans/3388915/s3388915_2.o" compiled dur 0.188 complexity 3733
/nz/kit.7.2.0.4-P1/sys/cc/bin/i686-rhel5-linux-gnu-g++ -march=prescott -mfpmath=sse -include inclall_4.h -O -fno-strict-aliasing -fPIC -shared -Wa,--32 -o /nz/data.1.0/plans/3388915/h3388915_502.o -DNZDEBUG=0 -DGENCODE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -I/nz/kit.7.2.0.4-P1/sys/include -I/nz/data.1.0/plans.include/genpch /nz/data.1.0/plans/3388915/h3388915_502.cpp
-- Object "/nz/data.1.0/plans/3388915/h3388915_502.o" compiled dur 0.155 complexity 1798
-- Snippet 3388915_2 exec 0.958 @ 100% res 1.588 channels 1 hmem 295 mem 17 estrows 2 exp h/s 6%/33% priority
-- Detail 3388915_2 table 33757/33766.09/33776 scan 32402/32487.50/32557 hits 6/6.95/8 zmread 3 zmhits 3 #22 hcpu 0.350 cpu 0.018/0.040 dread 1.220/1.333 fpga 0.593/0.885 mem 6 temp 1
----------------
As shown in the plan file, there is a reduction in the scan counts (32402/32487.50/32557) in comparison with the table counts (33757/33766.09/33776), which supports our preliminary analysis that Zone Maps were used during the execution of this query.
Plan File – Query 2
-- Object "/nz/data.1.0/plans/3580088/s3580088_1.o" from cache "/nz/data.1.0/cache/287/3o"
-- Object "/nz/data.1.0/plans/3580088/h3580088_501.o" from cache "/nz/data.1.0/cache/143/0o"
-- Snippet 3580088_1 exec 0.003 @ 0% res 0.001 channels 1 hmem 1 mem 16 exp h/s 0%/0%
-- Detail 3580088_1 table 33757/33766.09/33776 #22 hcpu 0.001 mem 3
-- Object "/nz/data.1.0/plans/3580088/s3580088_2.o" from cache "/nz/data.1.0/cache/137/6bo"
-- Object "/nz/data.1.0/plans/3580088/h3580088_502.o" from cache "/nz/data.1.0/cache/52/4ao"
-- Snippet 3580088_2 exec 0.950 @ 100% res 1.271 channels 1 hmem 295 mem 17 estrows 1000 exp h/s 6%/33% priority
-- Detail 3580088_2 table 33757/33766.09/33776 scan 32402/32487.50/32557 zmread 1 zmhits 1 #22 hcpu 0.002 cpu 0.057/0.080 dread 1.212/1.323 fpga 0.668/0.992 mem 6 temp 1
As per the plan file, there is the same reduction in the scan counts (32402/32487.50/32557) in comparison with the table counts (33757/33766.09/33776), which proves our preliminary analysis wrong: we had assumed Zone Maps would not be used during the execution of Query 2. Netezza was smart enough to identify the INVOICE_KEY corresponding to the filter (BILLING_CUSTOMER_KEY=6364340) on the dimension table (INVOICE) and utilize this information via Zone Maps on the fact table (FACT_AR_LEDGER). In other words, Zone Maps were utilized via the join between the fact table and the dimension table on the common key, INVOICE_KEY.
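In effect, Netezza behaves as if it had first resolved the dimension filter to the matching join-key values and then applied those values against the fact table's Zone Maps, roughly like this:

-- Step the optimizer performs implicitly: resolve the dimension filter
SELECT INVOICE_KEY FROM INVOICE WHERE BILLING_CUSTOMER_KEY = 6364340;
-- returns 151854, which can then restrict the fact-table scan via Zone Maps,
-- just as the explicit predicate in Query 1 did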
As stated earlier, data was not ordered in any particular fashion for these test cases; nevertheless, Zone Maps are most efficient when a table's data is ordered on the columns used for restriction. Date-based columns that are used frequently in filtering conditions are very suitable for ordering data in tables; in fact, data often arrives from source systems already ordered by date. Moreover, as seen in the test cases above, filters can be applied effectively on any column in the date hierarchy without propagating those columns into the fact table. If multiple columns are identified as prime candidates for filters, the table can be converted to a clustered base table (CBT), which gives multiple organizing columns equal precedence in the ordering of data, as sketched below.
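A minimal sketch of a CBT definition follows; the table and column names are hypothetical, but ORGANIZE ON is the standard Netezza clause for declaring organizing columns, and GROOM physically reorders existing records:

-- Hypothetical fact table organized on two frequently filtered columns
CREATE TABLE FACT_AR_LEDGER_CBT (
    INVOICE_KEY  BIGINT NOT NULL,
    POSTING_DATE DATE
    -- other fact columns omitted
)
DISTRIBUTE ON RANDOM
ORGANIZE ON (INVOICE_KEY, POSTING_DATE);     -- up to four organizing columns

GROOM TABLE FACT_AR_LEDGER_CBT RECORDS ALL;  -- reorders data on the organizing columns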
It is also important to keep in mind that Zone Maps are created only for integer, date, and timestamp columns. Hence, it is ideal to convert character-based natural keys to integer-based surrogate keys in a Netezza environment. Zone Maps are created for character columns only when those columns appear in the ORDER BY clause of a materialized view (and only for the first 8 bytes), as illustrated below.
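A sketch of such a materialized view is shown here; the view and column names are hypothetical, but the single-table SELECT with an ORDER BY clause is the standard Netezza form:

-- Ordering the view on a character column causes a Zone Map to be
-- maintained for the first 8 bytes of that column
CREATE MATERIALIZED VIEW INVOICE_BY_NUMBER_MV AS
    SELECT INVOICE_NUMBER, INVOICE_KEY   -- INVOICE_NUMBER: hypothetical character natural key
    FROM INVOICE
    ORDER BY INVOICE_NUMBER;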
Summary
- Explain Plan information alone is not enough to conclude whether Zone Maps are used in the execution of a query. The statistics in the corresponding Plan Files give a better understanding of the use of Zone Maps in a query.
- When querying a fact table, if the user restricts data based on dimension columns that are eligible for Zone Maps, Netezza utilizes Zone Maps even if those columns are not present in the fact table. So it is not necessary to propagate columns to the fact table solely for the purpose of Zone Maps.
- Netezza uses a cost-based optimizer. The cost of a sequential scan of a fact table when the query is restricted on a column present in the fact table is significantly lower than when the restriction is on a dimension column not present in the fact table; due to the additional join, the cost of the query goes up significantly. Hence, if there are columns in the dimension table that do not change over time and are frequent targets for restrictions/filters, consider adding these columns to the fact table.
- The efficient use of Zone Maps depends on several factors, such as the ordering of data, data types, and up-to-date statistics.