Most analytics solutions on the market today can access many types of data sources, including modern systems like Hadoop. That’s a mandatory requirement these days, because most enterprises (probably all of them) store useful data in all sorts of datastores throughout the organization. Tableau, a popular analytics platform, is one such tool: it can combine data from disparate sources, reducing the amount of data that must be ETL’d around just to satisfy a report request. When combining tables from these different connections, however, the tool itself performs the joins and aggregations, which, depending on the data volume and the query, can place a massive load on the application server or the user’s machine.
When combining disparate data sets, for example customer information from a CRM system in an Oracle RDBMS with IoT or website data in Hadoop, this approach often doesn’t work. There is too much data on the Hadoop side to make extracting and loading it into the RDBMS feasible (or it is very challenging, to say the least). Oftentimes, this Hadoop-first data is a large missing piece of the puzzle for the analytical dataset the business analysts require. Let’s look at our options within Tableau to see how best to approach this challenge.
Those of you who are familiar with Tableau know it can connect to multiple data sources at once, allowing you to use a combined logical “query” to access and return data from any of the data sources. This is a powerful feature in a tool known for its ad hoc reporting capabilities.
But it’s not always the best approach to consolidating and querying disparate data sources. Often, the dataset is too large to perform a logical join. Another option is to colocate the data, either in the RDBMS or in Hadoop. While it makes sense to load the Oracle data into Hadoop, “filling the data lake”, doing so introduces new challenges. Once the data is moved away from the RDBMS, you lose the capabilities of that database and its advanced query language. The Oracle database, for example, just turned 40 years old, while Impala, a top SQL-on-Hadoop engine, recently turned 4. That’s ten times the years of experience, learning from mistakes, and pure advancement in technology. That’s not to say Impala and the other SQL engines on Hadoop are not advanced (they’ve moved quite fast for how young they are, credit to the open source community), but they still have a lot of catching up to do when it comes to advanced SQL features.
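To illustrate the gap, here is a sketch of the kind of Oracle SQL that doesn’t translate directly to Impala. At the time of writing, Impala has no GROUP BY ROLLUP support, so subtotal rows have to be simulated with a UNION ALL; the table and column names below follow the SH_DEMO sample schema used later in this post.

```sql
-- Oracle: one pass produces per-country rows plus a grand-total row.
SELECT country_name, COUNT(*) AS customer_cnt
FROM   sh_demo.customers c
       JOIN sh_demo.countries co ON c.country_id = co.country_id
GROUP  BY ROLLUP (country_name);

-- Impala (no ROLLUP): the grand-total row must be bolted on manually.
SELECT country_name, COUNT(*) AS customer_cnt
FROM   sh_demo.customers c
       JOIN sh_demo.countries co ON c.country_id = co.country_id
GROUP  BY country_name
UNION ALL
SELECT NULL, COUNT(*) FROM sh_demo.customers;
```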
For example, if I want to add an additional data source to the query below, say, to calculate the website hits grouped and rolled up by country and state/province, I must revert out of custom SQL mode and figure out how to make the query work graphically.
SELECT
"COUNTRIES"."COUNTRY_NAME" AS "COUNTRY_NAME",
count(distinct "CUSTOMERS"."CUST_STATE_PROVINCE") STATE_PROVINCE_CNT
FROM "SH_DEMO"."COUNTRIES" "COUNTRIES"
INNER JOIN "SH_DEMO"."CUSTOMERS" "CUSTOMERS"
ON ("COUNTRIES"."COUNTRY_ID" = "CUSTOMERS"."COUNTRY_ID")
GROUP BY ROLLUP ("COUNTRIES"."COUNTRY_NAME")
ORDER BY "COUNTRIES"."COUNTRY_NAME";
<img class="alignnone size-large wp-image-2165" src="https://gluent.com/wp-content/uploads/2017/08/tableau-custom-sql-oracle-hadoop-1024x433.png" alt="Custom SQL against Oracle and Hadoop. Error!" width="1024" height="433" />
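The screenshot above shows the result of trying anyway. As a hypothetical sketch of the attempt (the WEBSITE_DATA table and its CUST_ID join column are assumptions for illustration): WEBSITE_DATA lives in Impala, so referencing it inside custom SQL on the Oracle connection fails, because Tableau sends custom SQL to a single connection and Oracle has no object by that name.

```sql
-- Hypothetical: WEBSITE_DATA is an Impala table, not visible to Oracle,
-- so this custom SQL errors out ("table or view does not exist").
SELECT "COUNTRIES"."COUNTRY_NAME",
       "CUSTOMERS"."CUST_STATE_PROVINCE",
       COUNT(*) AS WEBSITE_HITS
FROM   "SH_DEMO"."COUNTRIES" "COUNTRIES"
       INNER JOIN "SH_DEMO"."CUSTOMERS" "CUSTOMERS"
         ON "COUNTRIES"."COUNTRY_ID" = "CUSTOMERS"."COUNTRY_ID"
       INNER JOIN "WEBSITE_DATA" "WEBSITE_DATA"
         ON "CUSTOMERS"."CUST_ID" = "WEBSITE_DATA"."CUST_ID"
GROUP  BY ROLLUP ("COUNTRIES"."COUNTRY_NAME",
                  "CUSTOMERS"."CUST_STATE_PROVINCE");
```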
I can create the rollups, grouped by country and state/province, in the report designer, but that forces the application to perform the processing rather than the database. As previously noted, this approach has the potential to slow performance and increase memory/CPU consumption on the application side. Not the ideal situation for a business analyst.
Our other option is to virtualize the Impala table, making it accessible from within Oracle. We can do this using the Present functionality of the Gluent Data Platform. No code to write, no ETL streams, no data movement, and no changes to your existing Tableau reports. Presenting a table from a Hadoop data source as an external table in your Oracle database is the quickest way to access Hadoop-first data in real time. Once the data is available in the Impala table, it’s available in Oracle.
After installation and configuration of the Gluent Data Platform, the Gluent Present command is enabled. Run the command from either the Oracle database server or the Hadoop edge node. There are many additional options that can be used, but in this case we’ll keep it simple.
./present -t sh_demo.website_data -vx
The table is created in a hybrid schema, in this case sh_demo_h, on the Oracle database. If I wanted to keep things even simpler, I could create a synonym in my original schema pointing to the “presented” external table. For this example, I’ll just add a new connection to the hybrid schema in Tableau. After adding the data source to my Tableau Connections list, voila! Our Hadoop-based table, WEBSITE_DATA, is now available. I can build the same query I attempted above, and continue to use Oracle’s GROUP BY ROLLUP extension as I had planned originally.
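As a sketch of both options (the synonym statement and the CUST_ID join column are assumptions based on the tables used earlier, not output from Present itself):

```sql
-- Optional: expose the presented external table under the original schema name.
CREATE SYNONYM sh_demo.website_data FOR sh_demo_h.website_data;

-- The earlier query now works as plain Oracle SQL, ROLLUP included;
-- access to WEBSITE_DATA is offloaded to Impala behind the scenes.
SELECT "COUNTRIES"."COUNTRY_NAME",
       "CUSTOMERS"."CUST_STATE_PROVINCE",
       COUNT(*) AS WEBSITE_HITS
FROM   "SH_DEMO"."COUNTRIES" "COUNTRIES"
       INNER JOIN "SH_DEMO"."CUSTOMERS" "CUSTOMERS"
         ON "COUNTRIES"."COUNTRY_ID" = "CUSTOMERS"."COUNTRY_ID"
       INNER JOIN "SH_DEMO_H"."WEBSITE_DATA" "WEBSITE_DATA"
         ON "CUSTOMERS"."CUST_ID" = "WEBSITE_DATA"."CUST_ID"
GROUP  BY ROLLUP ("COUNTRIES"."COUNTRY_NAME",
                  "CUSTOMERS"."CUST_STATE_PROVINCE");
```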
All of this was completed without moving a single row of data or writing a line of ETL code, and without importing millions of records to my local desktop for aggregation in Tableau. When you need to combine data from multiple data sources, especially when one or more tables are very large or advanced RDBMS SQL capabilities are required, have a look at Gluent Present.