What is a Tableau Data Extract?
A Tableau data extract is a compressed snapshot of data stored on disk and loaded into memory as required to render a Tableau viz. The data extracts can be scheduled on the Tableau Server to refresh automatically.
Why use Tableau Data Extracts
1. Performance: Data extracts in general provide increased performance.
2. Reduce load on back-end systems: Replacing live connection to OLTP database or any other database reduces the load on the database.
3. Pre-aggregations: An aggregated dataset is much smaller than the transaction level data extract which improves query performance.
4. Materialization of calculated fields: When you optimize a Tableau extract, all of the calculated fields are pre computed and stored as static values in extract.
Tableau Data Extract Types
Single table extracts: The data is stored in file as a single table structure.
Multiple table extracts: All the tables being used in the data source are stored separately in the extract file, mirroring the database structure. In certain cases, this will result in smaller extract file sizes, faster extract creation and potentially faster queries.
How to choose between Single table and Multiple table extract
- Single table storage (the default storage type) and multiple table storage each have better file size and performance characteristics in different scenarios, so we allow you to choose.
- The storage type affects file size because certain types of joins cause data storage redundancy. If the number of rows after your join is larger than the sum of the rows in your input tables, then your data source is a great candidate for multiple table storage. Joins that are likely to cause data storage redundancy include joins between fact tables and entitlement tables in some row-level security scenarios.
- In addition to file size differences, multiple table storage and single table storage can affect extract creation speed and visualization query speed. For single table storage, your source database will perform the join during extract creation. With multiple table storage, however, Tableau Desktop will perform the join inside Tableau’s data engine during visualization query time. So, multiple table storage extracts may initially be created faster because they only require copying the individual tables, without requiring a join. On the other hand, multiple table storage extracts might be slower during query time because of the join required at that time.
- These performance differences are more noticeable with large amounts of data. If you are working with a large data set, you'll want to experiment with both techniques to determine which gives you the best performance and size benefits.
- If you can’t decide which to use, stick with the default, single table storage, because multiple table storage has some functional limitations—including no incremental refresh and no extract filters.
How to improve extract performance
1. Reduce the amount of data: When you create an extract, use filters to exclude data that you don’t need.
2. Hide unused fields: Hidden fields are not included when create an extract. This makes the extract smaller, which improves performance.
3. Optimize extracts: This results in materializing calculations in your extract. Depending on the complexity of the calculations, can help with query performance.
4. Use extracts for file-based data: In general it’s best practice to import file-based data—text files such as CSV, Microsoft Excel spreadsheets, and Microsoft Access files, as well as statistical files like SPSS, SAS, and R—into Tableau. This makes queries perform much faster and also results in a much smaller file to store the data values.
Full vs. Incremental extract refresh
Full extract refresh: This means that every time you refresh the extract, all of the rows are replaced with data from the original data source.
Incremental extract refresh: Incremental refresh only adds new data rows from the original data source, it does not update any existing data. You need to specify a column in database to be used to identify new rows. If your data is not being update frequently then you can use incremental refresh on a daily basis and weekly or monthly you can do a full refresh.
How to create an extract?
1. After you connect to your data and set up the data source on the Data Source page, in the upper-right corner, select Extract, and then click the Edit link to open the Extract Data dialog box.
2. (Optional) Configure one or more of the following options to tell Tableau how to store, define filters for, and limit the amount of data in your extract:
Decide how the extract data should be stored
You can choose to have Tableau store the data in your extract using one of two structures (schemas): single table (denormalized schema) or multiple tables (normalized schema).
• Single table
Select Single table when you want to limit the amount of data in your extract with additional extract properties like extract filters, aggregation, etc.; or when your data uses pass-through functions (RAWSQL). This is the default structure Tableau uses to store extract data. If you use this option when your extract contains joins, the joins are applied when the extract is created.
• Mutliple tables
Select Multiple tables if your extract is comprised of tables combined with one or more equality joins and meets the Conditions for using the "Multiple tables" option listed below. If you use this option, joins are performed at query time.
Conditions for using the "Multiple tables" option
To store your extract using the "Multiple tables" option, the data in your extract must meet all of the conditions listed below.
- All joins between tables are equality (=) joins
- Data types of the join columns are identical
- No pass-through functions (RAWSQL) used
- No incremental refresh configured
- No extract filters configured
- No top N or sampling configured
When the extract is stored as "Multiple tables," you cannot append data to it.
Note: Both the "Single table" and "Multiple tables" options only affect how the data in your extract is stored. The options do not affect how tables in your extract are displayed on the Data Source page.
- Determine how much data to extract
Click Add to define one or more filters to limit how much data gets extracted based on fields and their values.
Aggregate the data in the extract
Select Aggregate data for visible dimensions to aggregate the measures using their default aggregation. Aggregating the data consolidates rows, can minimize the size of the extract file, and increase performance.
When you choose to aggregate the data, you can also select Roll up dates to a specified date level such as Year, Month, etc. The examples below show how the data will be extracted for each aggregation option you can choose.
• Choose the rows to extract
Select the number of rows you want to extract.
You can extract All rows or the TopN rows. Tableau first applies any filters and aggregation and then extracts the number of rows from the filtered and aggregated results. The number of rows options depend on the type of data source you are extracting from.
Notes:
- Not all data sources support sampling. Therefore, you might not see the Sampling option in the Extract Data dialog box.
- Any fields that you hide first in the Data Source page or on the sheet tab will be excluded from the extract. Click the Hide All Unused Fields button to remove these hidden fields from the extract.
3. When finished, click OK.
4. Click the sheet tab. Clicking the sheet tab initiates the creating of the extract.
5. In the subsequent dialog box, select a location to save the extract, give the extract file a name, and then click Save.
Performance Improvement Checklist
STAR Schema
Star schemas are the most efficient for use in Tableau especially when working with large data sources. Tableau may have performance issues when there are many levels of joins.
Limit Tables
This will improve performance. Too many tables joined with Tableau will affect the performance when using the developer interface. If you have a lot of tables to join, consider creating a view or reporting table first.
Limit Rows
We strongly recommend that the number of rows not exceed 10 million. If you need more, we need to assess the impact to the server and performance.
- LIMIT BASED ON COLUMN VALUE. Some rows may be more useful to have than others. Reduce the amount of data by splitting the data set up into subjects e.g. departments.
- ACTIVE ROWS: Current / Active Rows Only
- LIMIT YEARS. Limiting years will help reduce creation of hyper-extracts.
- ELIMINATE HISTORICAL DATA
Skinny Down - Unused Columns
- Performance is impacted both by number of rows and number of columns.
- ROW LENGTH LIMIT. Tableau has a limit to the length of the rows which is based on the sum of the expected length of each column so it is not easy to assess the number of columns that will cause a hyper-extract build to fail. However, varchar fields even though they may have less actual data length than the field size, it seems Tableau when building the extract will size based on what is expected from the field length.
- NULL VALUES. Eliminate columns that have all null values.
- ANALYTIC VALUE. Eliminate fields that have little analytic value e.g. free-text fields. Work with users to ensure columns included will add value.
- KEYS. Remove Key Fields used in joins.
- PHI/PII. Limit PHI / PII in the hyper-extracts will help with security also, e.g. patient email, phones, addresses except maybe city, state, zip which can be used for geo-coding in Tableau.
Create Summary Views/Tables
- Tables like BillingTransactionFact are extremely large, but contain valuable charge data. Recommend creating summary Fact Table Clusters by department, provider, etc. over time and summing the dollar amounts and counts by Month Year.
Limit Update Frequency
- REDUCE UPDATE FREQUENCY. Extracts can be scheduled to be updated automatically. These updates impact the updates across the Tableau server. Work with the Tableau team to test and assess the impact of any hyper-extract build that will be put into production. Consider what is really needed for update frequency when building the extract.
Security Checklist
Service Area Security
- For APeX data to determine if we need to exclude some service areas (e.g. other hospitals, groups that use the UCSF APeX environment) and the best way to do that e.g. filtered views or hard coded criteria.
PHI/PII Security
- Columns can be excluded from the hyper-extract but not dynamically. If we create non-PHI/PII versions of extracts that will double the creation time. Recommend using views to limit these fields.
Development/Validation Considerations
Consider doing data prep prior to developing a hyper-extract so will not have to do joins and filtering in Tableau which can be cumbersome and slow if there are many joins.
Reporting tables or views can be created to pre-join the data which provides some flexibility in updating a Tableau Hyper-extract as new fields can be added without re-publishing the extract. Views and reporting tables can be a great way to standardize column names and the look and feel of the hyper-extract. There are some items to consider when using reporting tables and views:
View/Reporting Table Development Guidelines
Naming Conventions
Consider amending the field name in the select statement to have a prefix to distinguish lookup column names such as columns named just “code”. The below example adds the word arrival to the lookup column for DayOfWeek to distinguish it from other DayOfWeek fields for discharge date, etc.
Create table mnemonic so that the result column name in Tableau is user friendly e.g. change EViFact to EVis and add it to the column name so in the Tablueau extract the source table can be distinguished. This may be needed if you have join tables and column across the tables with the same name such as “DATE” or “CODE”.
Re-Cast Numbers
Tableau will treat any field that is a number as a measure. If you have identifiers that you do not want to show up as measures, recast or convert them to text in the view or reporting table.
Column Limits
- Reduce columns displayed in the SELECT statement or reporting table. Limit for view is 1024 in SQLServer. Remember, Tableau has a row length limit.
Remove Columns
- Hide Key Fields exception would be key fields that have meaning and would be useful e.g. HAR, DateKey
- Filter years in the view. This will improve performance.
- Filter ServiceArea – needed for Caboodle
- Hide PHI/PII where appropriate
- Filter out historical, deleted, etc. records
- Split Fact tables where appropriate – this may be better done in Tableau so that only one view is created. Tableau has a filter option in the developer tool, but it the table is large it may be better to filter in the view. Too many rows can bring down the Tableau server when building a hyper-extract.
- Sum amounts for summary tables
- Careful using vendor-maintained views as you won’t have control over changes. In Caboodle, don’t use EPIC views as these will be changed automatically with updates and we want to have controlled changes in the hyper-extracts.
Comment Out Not Needed Columns
- Recommend including all fields in the view or table DDL code from the source tables and comment out the ones not used so that they are easily identified for future consideration:
Performance Load Test
Some testing of the final entire extract process will need to be done to ensure other extracts are not negatively impacted by the hyper-extract build. Please work with the Tableau team for this.
Tableau Interface
If you do not use views or reporting tables and do the joins in Tableau, then if you need to join to the same table more than once, Tableau will present the field names.
FAQs
1. How to improve data extract performance?
Keep the data extract size as small as possible by hiding unused columns and aggregating the data to highest level possible. Bring in only what you need.
2. Is there a max limit on number of columns?
There is no upper limit defined by Tableau for data extracts, but performance degrades as number of columns grows.
3. What is incremental refresh?
Incremental refresh adds new rows to the existing data in the extract based on a key. It does not update any of the existing records in the extract.
For any questions or suggestions, please e-mail to [email protected]