Data Volume
The amount of data is always in a direct trade-off with performance. Process mining is inherently detail-oriented: the individual events with their unique timestamps are needed to construct the process graphs, but all this detail impacts performance. In general, there are theoretical limits that all process mining tools and all in-memory tools approach.
We make a clear distinction between the performance of the data used for an Application and for the Connector. Although they use the same platform, there are differences in what is acceptable to the users (developers versus end users) and in the type of actions performed. Large amounts of data impact both the Connector and the Application, but all of it can be solved in the Connector.
The performance end users experience is directly related to the data volume, which is determined by the number of rows in the biggest tables. In general, only the number of rows determines the performance end users experience. The number of columns is only a factor when the data is loaded from the database.
Processes with up to about 5,000,000 (5M) cases and up to about 50,000,000 (50M) events per process are ideal. With more cases and events, parsing the data and showing the visualizations will take longer.
The UiPath Process Mining platform will continue to work when larger amounts of data are inserted, but the reaction speed may drop. It is recommended to check the data volume beforehand. If it exceeds the above numbers, consider optimizing or limiting the dataset.
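As a quick sanity check before building an application, you can count the cases and events in the extracted data and compare them against the guideline above. This is a minimal sketch: the record layout (dicts with a `case_id` key) is a hypothetical assumption, while the thresholds come from the numbers above.

```python
# Sketch: check extracted event data against the recommended volume limits.
# The record layout (dicts with a "case_id" key) is an illustrative assumption.

MAX_CASES = 5_000_000    # ~5M cases recommended upper bound
MAX_EVENTS = 50_000_000  # ~50M events recommended upper bound

def check_data_volume(events):
    """Return a list of warnings if the dataset exceeds the guideline."""
    n_events = len(events)
    n_cases = len({e["case_id"] for e in events})
    warnings = []
    if n_cases > MAX_CASES:
        warnings.append(f"{n_cases} cases exceeds the ~5M guideline")
    if n_events > MAX_EVENTS:
        warnings.append(f"{n_events} events exceeds the ~50M guideline")
    return warnings

# Example with a tiny dataset: well within the limits, so no warnings.
sample = [{"case_id": 1}, {"case_id": 1}, {"case_id": 2}]
print(check_data_volume(sample))  # → []
```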
A higher level of detail leads to longer response times, which impacts performance.
The exact trade-off between the amount of data, the level of detail, and the acceptable waiting time needs to be discussed with the end users. Sometimes historical data is very important, but often only the last few years are needed.
The platform compresses the data to keep the size of the *.mvn files to a minimum. This works well for values that are similar. A large number of unique values for an attribute, e.g. event detail, will also impact performance.
There are two main solution directions for dealing with large data volumes:
- optimization;
- data minimization.
Optimization covers the adjustments Superadmins can make to render the dashboards faster, achieved by tailoring the application settings to the specific dataset (see Application Design for more information).
This section describes data minimization: the different techniques you can employ to reduce the data visible to the end user, tailored to the specific business question.
The techniques described here can exist alongside each other or can even be combined to leverage the benefits of multiple techniques. In addition, you may keep an application without data minimization alongside minimized applications because the level of detail might sometimes be required for specific analyses where slower performance is acceptable.
Limiting the number of records that show up in your dataset will not only improve the performance of the application, it will also improve the comprehensibility of the process and, in turn, improve acceptance by the business.
The scoping of the data can be done in the Connector.
One of the options for scoping is to limit the time frame by filtering out dates or periods. For example, you could limit the time frame from ten years to one year, or from one year to one month. See the illustration below.
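Time-frame scoping of this kind can be sketched as a simple date filter in the Connector logic. The field name `timestamp` and the record layout are illustrative assumptions.

```python
# Sketch: scope events to a limited time frame (e.g. the last year).
# The "timestamp" field name is an illustrative assumption.
from datetime import datetime, timedelta

def scope_to_last_days(events, days=365, now=None):
    """Keep only events whose timestamp falls within the last `days` days."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=days)
    return [e for e in events if e["timestamp"] >= cutoff]

events = [
    {"case_id": 1, "timestamp": datetime(2015, 3, 1)},
    {"case_id": 2, "timestamp": datetime(2021, 6, 15)},
]
recent = scope_to_last_days(events, days=365, now=datetime(2021, 12, 31))
print(len(recent))  # → 1
```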
A limited number of activities is advised, especially at the start of any process mining effort. From there, you can build up as expertise ramps up.
Below is a guideline for the range of activities:
| Range (nr. of activities) | Description |
|---|---|
| 5-20 | Preferred range when starting with process mining. A simple process that gives insightful information. |
| 20-50 | Expert range. Expanding with clear variants. |
| 50-100 | Most useful if there are clear variants, meaning somewhat related processes that mainly stand on their own. |
| 100+ | It is advised to split the process up into subprocesses. |
Below are some suggestions for filtering data:
- Unrelated activities: activities that are not directly impacting the process could be filtered out.
- Secondary activities: some activities, e.g. a change activity, can happen anywhere in the process. These significantly blow up the number of variants.
- Minimally occurring events: events that occur only a few times in your dataset could be filtered out.
- Smaller process: only analyze a subprocess.
- Grouping activities: some activities in your dataset may be more like small tasks, which together represent an activity that makes more sense to the business. Grouping them will require some logic in the connector and may result in overlapping activities.
- If possible, within the performance of the Connector, use the Connector to filter out activities. In this way, any changes can be reverted easily, or activities can be added back. Avoid filtering out activities in the data extraction or data loading.
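The filtering and grouping suggestions above can be sketched as a small transformation step in the Connector. The activity names, the set of unrelated activities, and the grouping map are all hypothetical examples.

```python
# Sketch: filter out unrelated activities and group small tasks into one
# business-level activity. All activity names here are hypothetical.

UNRELATED = {"Print document"}               # activities to drop entirely
GROUPS = {
    "Enter header": "Create order",          # small tasks mapped onto
    "Enter line item": "Create order",       # one grouped business activity
}

def minimize_activities(events):
    out = []
    for e in events:
        if e["activity"] in UNRELATED:
            continue                         # unrelated: filter out
        e = dict(e, activity=GROUPS.get(e["activity"], e["activity"]))
        out.append(e)
    return out

events = [{"activity": "Enter header"},
          {"activity": "Enter line item"},
          {"activity": "Print document"},
          {"activity": "Approve order"}]
print([e["activity"] for e in minimize_activities(events)])
# → ['Create order', 'Create order', 'Approve order']
```

Because the mapping lives in the Connector, reverting a change or adding an activity back only means editing the map, as recommended above.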
If there is one case with a lot of events (an outlier), it will impact expressions that calculate aggregates on the event level. The from/to dashboard item filter is affected by this and can be time-consuming to calculate when such outliers exist. It is recommended to filter out these cases in the Connector to take them out of the dataset.
In other instances, the outliers may be the key area to focus on. If your process is going well or you adopt Six Sigma methodologies, you want to focus on the things going wrong. Instead of showing all the cases going right, you only show the cases going wrong.
See the illustration below.
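The outlier removal described above can be sketched as a filter on the number of events per case. The threshold of 500 events per case is an arbitrary illustrative value; the right cut-off depends on your process.

```python
# Sketch: drop outlier cases that have an unusually large number of events.
# The threshold (500 events per case) is an arbitrary illustrative value.
from collections import Counter

def drop_event_outliers(events, max_events_per_case=500):
    counts = Counter(e["case_id"] for e in events)
    keep = {c for c, n in counts.items() if n <= max_events_per_case}
    return [e for e in events if e["case_id"] in keep]

# Case "B" has 10 events, which exceeds the (deliberately low) threshold of 5.
events = [{"case_id": "A"}] * 3 + [{"case_id": "B"}] * 10
print(len(drop_event_outliers(events, max_events_per_case=5)))  # → 3
```

Inverting the `keep` condition gives the opposite scoping mentioned above: showing only the outlier cases that are going wrong.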
In the Connector, you can remove attributes that have a lot of detail. For example, long strings in the Event Detail attribute.
When development is finished, a lot of unused attributes may remain in your dataset. It is recommended to set only the attributes that are used in the output dataset of the Connector to public availability, and to set the availability of all other attributes to private.
Pre-aggregation is a technique employed by many BI tools to gain insight into large data volumes. It involves aggregating data over specific attributes to reduce the number of records in a dataset. In BI, this would typically mean summing the values per supplier, so that there is only one record for each supplier.
See the illustration below.
Process mining requires more configuration, but a starting point is to aggregate only on process variants. For each variant, you would have one case record and a related number of events. This can significantly reduce the data volume.
To show correct results, you would also have to show how many cases each variant represents; for the event end times, you could use the median duration of each event. Aggregating on variants alone might be too coarse, so it is wise to check the most commonly used filters, e.g. a combination of variant, case type, and the month of the case end (to show trends over time).
However, each added attribute multiplies the number of records, so this requires a careful balance between performance and use case.
Pre-aggregation is most applicable for an overview of your process and spotting general trends.
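The variant-based pre-aggregation described above can be sketched as follows. The case structure (a mapping from case id to its ordered activity list) is a simplified, hypothetical representation.

```python
# Sketch: pre-aggregate cases by process variant. Each variant keeps one
# representative record plus a count of how many cases it represents.
from collections import defaultdict

def preaggregate_by_variant(cases):
    """`cases` maps case_id -> ordered list of activities (the variant)."""
    groups = defaultdict(list)
    for case_id, activities in cases.items():
        groups[tuple(activities)].append(case_id)
    # One record per variant, with the number of cases it stands for.
    return [{"variant": list(v), "case_count": len(ids)}
            for v, ids in groups.items()]

cases = {
    1: ["Create", "Approve", "Pay"],
    2: ["Create", "Approve", "Pay"],
    3: ["Create", "Reject"],
}
agg = preaggregate_by_variant(cases)
print(sorted((len(a["variant"]), a["case_count"]) for a in agg))
# → [(2, 1), (3, 2)]
```

Extending the grouping key with case type or month of the case end, as suggested above, only means adding those fields to the tuple key, at the cost of more records.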
Sampling is a technique where you take a percentage of the cases, and their events, within a specific period. For instance, you can set that only 10% of all cases and their events are shown. This way, you still include exceptions and outliers, since each case has a similar chance of showing up in the dataset.
See illustration below.
Cascaded sampling is a technique where the sampling percentage drops over time with a certain percentage. An example of this shows 100% of last week’s data, 90% of two weeks ago, 80% of three weeks ago, and so on.
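Cascaded sampling can be sketched as a per-week sampling rate combined with a deterministic per-case decision. Hashing the case id keeps the sampling stable, so a case is either fully in or fully out together with all its events. The 10%-per-week step is the example rate from the text; the hashing scheme is an illustrative choice.

```python
# Sketch: cascaded sampling, where older cases are sampled at a lower rate.
# Hashing the case id makes the decision deterministic per case.
import hashlib

def sample_rate(weeks_ago, step=0.10):
    """100% for the current week, dropping by `step` per week, floored at 0."""
    return max(0.0, 1.0 - step * weeks_ago)

def keep_case(case_id, weeks_ago):
    # Map the case id to a stable pseudo-random value in [0, 1).
    digest = hashlib.md5(str(case_id).encode()).hexdigest()
    u = int(digest[:8], 16) / 0x100000000
    return u < sample_rate(weeks_ago)

print(round(sample_rate(0), 2), round(sample_rate(3), 2))  # → 1.0 0.7
```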
Data sharding is a variant of the data scoping solution that allows organizations to split the data into multiple datasets, rather than just slicing off one part. This setup requires additional configuration, since the application needs to be split up using modules and multiple smaller datasets need to be exported from the Connector.
With data sharding, the original dataset is divided into multiple shards. The smaller each shard is, the faster it will be. When a user logs in to the application, only the applicable data shard will be loaded.
A typical unit for sharding would be "Company code" or "Department". For example, with 50 company codes, each shard contains one company code and is essentially about 50 times faster than the original dataset.
See the illustration below for an overview of sharding.
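Splitting a dataset into shards by an attribute such as "Company code" can be sketched as a simple group-by in the Connector output step. The field name `company_code` and the record layout are illustrative assumptions.

```python
# Sketch: shard a dataset by an attribute such as "Company code", so each
# shard can be exported and loaded separately. Field names are illustrative.
from collections import defaultdict

def shard_by(records, key):
    """Split records into one shard per distinct value of `key`."""
    shards = defaultdict(list)
    for r in records:
        shards[r[key]].append(r)
    return dict(shards)

records = [{"company_code": "1000", "case_id": 1},
           {"company_code": "2000", "case_id": 2},
           {"company_code": "1000", "case_id": 3}]
shards = shard_by(records, "company_code")
print(sorted((k, len(v)) for k, v in shards.items()))
# → [('1000', 2), ('2000', 1)]
```

Each shard would then be exported as its own dataset, so that only the applicable shard is loaded when a user logs in.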