Dataset Building Guide for Envision
Guidelines for designing datasets for Envision.
Review these guidelines before you start building datasets so that you understand how to balance information detail against performance efficiency, and can represent the information that is important to you in a way that is both timely and resource-efficient.
Note: This document provides information to help you make design decisions for using Envision. For step-by-step instructions for creating the datasets, see Creating a Dataset.
Supported Envision Versions: 2020.1.1
Table of Contents
- Overview
- Dimensions
- Metrics
Overview
When setting up Envision, datasets are the first consideration. The choices made during dataset creation will impact the size of your project and how efficient it is in showing you the information you need.
When you're collecting and representing data, the volume of data points can get large very quickly. Overhead includes the relay, presentation, and storage of the information. More information is great, but it's important not to burden your system with too much data processing. Consider:
- Dimensions
- Metrics
Dimensions
Dimensions allow you to correlate and compare metrics in a context. Each dimension that you add will multiply the required storage space and the amount of data analyzed for each charting query. Carefully consider the objective of each dataset to choose appropriate dimensions.
In this section:
- Dependent or Independent Dimension?
- Empty Dimension (Implied/Core)
- Time Series Dimension
Dependent or Independent Dimension?
To keep the cost of maintaining a dataset low, it is best to add only the dependent dimensions required to accomplish the use case.
When dimensions are independent of one another, consider splitting them into separate datasets to reduce data requirements.
Example
For example, let's consider a use case where you have two independent objectives. You want to:
- Track sales trends by product feature.
- Track lead source efficacy for each product.
Tracking sales metrics with all dimensions in a single dataset (Product, Feature, Lead Source) would result in the following multiplication of data points for each time interval:
Product # * Feature # * Lead Source #
Instead, you could separate this into two different datasets:
- (Product, Lead Source)
- (Product, Feature)
This results in:
Product # * Feature #
And:
Product # * Lead Source #
Assume you have 1000 products, 100 features, and 10 lead sources.
In the first example, a single dataset, your data points would be 1000 * 100 * 10.
Total: 1,000,000 data points.
In the second example, two separate datasets, your data points would be:
- First dataset: 1000 * 100 (100,000 data points)
- Second dataset: 1000 * 10 (10,000 data points)
Total: 110,000 data points.
When the same information is represented in two separate datasets, the number of data points is 110,000 rather than 1,000,000. That is nearly an order of magnitude less data.
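The arithmetic above can be sketched in a few lines of Python. This is an illustration only, not Envision code; the product, feature, and lead-source counts are the ones assumed in this example.

```python
# Compare data points per time interval: one combined dataset vs. two split datasets.
# Cardinalities are the example values assumed above; substitute your own.
products = 1000       # number of distinct products
features = 100        # number of distinct product features
lead_sources = 10     # number of distinct lead sources

# Single dataset with all three dimensions:
combined = products * features * lead_sources

# Two datasets, each with only the dimensions its use case needs:
split = (products * features) + (products * lead_sources)

print(f"Combined dataset: {combined:,} data points per interval")  # 1,000,000
print(f"Split datasets:   {split:,} data points per interval")     # 110,000
```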
Empty Dimension (Implied/Core)
There is always an empty (implied/core) dimension, which is defined by the event that triggers a metric to be collected. For example, when tracking orders, the event of a customer placing an order triggers the collection of various metrics and dimensions. No explicit dimension is defined for the order or transaction, because the dataset implies that every event is an order/transaction. It is important NOT to explicitly specify an implied dimension, because doing so will cause too much data to be accumulated.
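As a conceptual illustration of the implied dimension (this is not Envision syntax; the structure and field names below are hypothetical), an orders dataset would list only the dimensions you actually slice by and would never list the order itself as a dimension, because every event already represents one order:

```python
# Hypothetical illustration only -- not Envision syntax.

# Good: the event itself is the order, so "order" is never declared as a dimension.
orders_dataset = {
    "dimensions": ["product", "lead_source"],   # contexts you slice and filter by
    "metrics": ["quantity", "order_total"],     # facts accumulated per time interval
}

# Bad: declaring an "order_id" dimension turns every order into its own dimension
# value, so data points grow with the number of orders instead of being aggregated.
orders_dataset_too_granular = {
    "dimensions": ["order_id", "product", "lead_source"],
    "metrics": ["quantity", "order_total"],
}
```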
Time Series Dimension
Everything we track and measure is related to time. A time dimension is assumed/built-in for every dataset. Therefore, you should not define a dimension for your dataset based on time. Doing so will dramatically increase the storage and computing requirements, and it is not necessary. The time an event occurred is aggregated into a time window from the real-time event's timestamp property.
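To illustrate what aggregating a timestamp into a time window means, here is a generic sketch (not Envision internals; the five-minute window size is an arbitrary assumption for the example):

```python
from datetime import datetime, timedelta, timezone

def time_window(event_timestamp: datetime,
                window: timedelta = timedelta(minutes=5)) -> datetime:
    """Floor an event's timestamp to the start of its time window."""
    window_seconds = int(window.total_seconds())
    epoch_seconds = int(event_timestamp.timestamp())
    return datetime.fromtimestamp(epoch_seconds - (epoch_seconds % window_seconds),
                                  tz=timezone.utc)

# An event that occurred at 12:03:47 falls into the 12:00-12:05 window.
event_time = datetime(2020, 6, 1, 12, 3, 47, tzinfo=timezone.utc)
print(time_window(event_time))  # 2020-06-01 12:00:00+00:00
```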
Metrics
Capture as many metrics as you can use. Metrics are cheap facts that get accumulated over time intervals as real-time events are aggregated. You cannot search/filter charts by metrics, but you can compute the following metric values for any dimension over any time interval (see the sketch after this list):
- Min
- Max
- Sum
- Average
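Conceptually, a metric can be thought of as a small set of accumulators that are updated as events arrive during a time interval. The following is a generic sketch of that idea, not Envision's implementation:

```python
class MetricAccumulator:
    """Tracks min, max, sum, and count for one metric over one time interval."""

    def __init__(self) -> None:
        self.count = 0
        self.min = None
        self.max = None
        self.sum = 0.0

    def add(self, value: float) -> None:
        """Fold one event's metric value into the accumulators."""
        self.count += 1
        self.sum += value
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)

    @property
    def average(self) -> float:
        return self.sum / self.count if self.count else 0.0
```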
Occurrence (Request Count)
Every dataset inherently collects the count or occurrence of an event (the empty/implied/core dimension). There is no need to map or define a custom dimension to count the occurrence.
For example, if your dataset is tracking orders, you'll always have access to the number of orders accumulated over a time interval, via the Request Count metric. However, if you want to know the quantity of items in an order, you'll need to define a custom metric with min, max, sum, and/or average accumulators.
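For example (illustrative Python with made-up values, not Envision code), the order count for an interval falls out of the number of events, while the item quantity requires its own accumulated values:

```python
# Item quantities from the order events received during one time interval (example data).
order_quantities = [3, 1, 7, 2, 2]

# Request Count: the number of order events is collected automatically.
order_count = len(order_quantities)          # 5 orders in this interval

# A custom "quantity" metric needs explicit accumulators:
quantity_sum = sum(order_quantities)         # 15 items ordered in total
quantity_min = min(order_quantities)         # 1 (smallest order)
quantity_max = max(order_quantities)         # 7 (largest order)
quantity_avg = quantity_sum / order_count    # 3.0 items per order on average
```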