Document Understanding User Guide
Automation Cloud | Automation Cloud Public Sector | Automation Suite | Standalone
Last updated Oct 17, 2024

Deploying high-performing models

As Machine Learning (ML) models improve in accuracy over time, their resource requirements change as well. For the best performance, it is important that when deploying ML models via AI Center, the skills are appropriately sized with respect to the traffic they need to handle. For the most part, infrastructure is sized with respect to the number of pages per unit of time (minute or hour). A document can have a single page, or multiple pages.

Introduction to ML model performance

When deploying infrastructure via AI Center, keep a few important aspects in mind for optimal performance.

GPU

There is only one type of GPU infrastructure available, selected through the checkbox that enables GPU. Each skill runs on a single virtual machine (VM) or node that has a GPU. In this case, CPU and memory are not relevant, since the skill can use all the available CPU and memory resources on that node. Besides higher throughput, GPU also provides much lower latency per page. Because of this, if latency is critical, it is recommended to use GPU.

CPU

CPU and memory can be fractioned, which means multiple ML Skills can run on the same node. To avoid any disturbance from a neighboring skill, each ML Skill is limited in the amount of memory and CPU it can consume, depending on the selected tier. Higher CPU leads to faster processing of a page, while higher memory allows a larger number of documents to be processed.

Number of replicas

The number of replicas determines the number of containers that are used to serve requests from the ML model. A higher number allows a larger number of documents to be processed in parallel, subject to the limits of that particular tier. The number of replicas is directly tied to the infrastructure type (number of CPUs per replica, or whether a GPU is used), in the sense that both the replica count and the infrastructure size directly affect throughput (pages/minute).

Note: Multiple replicas multiply the throughput.

Number of robots

The number of robots impacts throughput. To get efficient throughput, the number of robots needs to be sized so that it does not overload the ML Skills. This is dependent on the automation itself and should be tested. As a general guideline, use one to three robots as a starting point for each replica the ML Skill has. Depending on the overall process time (excluding the ML Extractor), the number of robots (or the number of replicas) can be higher or lower.
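As a starting point, the one-to-three-robots-per-replica guideline can be expressed as a simple range calculation (a sketch; `robot_range` is an illustrative helper, not a UiPath API):

```python
def robot_range(replicas, min_per_replica=1, max_per_replica=3):
    """Starting-point robot count: one to three robots per ML Skill replica."""
    return replicas * min_per_replica, replicas * max_per_replica

# 4 replicas -> start testing with between 4 and 12 robots.
print(robot_range(4))  # -> (4, 12)
```

From there, adjust the robot count up or down based on the measured process time of the automation.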

Potential issues related to infrastructure sizing

If the infrastructure is not sized correctly, the models can be placed under very high load. In some cases this leads to a backlog of requests, long processing times, or even failures when processing documents.

Insufficient memory

Insufficient memory is most commonly encountered on the lower CPU tiers (0.5 CPU or 1 CPU). Processing a very large payload (one or several large documents) can lead to an out-of-memory exception. This depends on the document size in terms of pages and text density (how much text there is per page). Since the requirements are very specific to each use case, it is not possible to provide exact numbers. You can check the guidelines in the Size the infrastructure correctly section for more detailed information. If you encounter an insufficient memory situation, the general recommendation is to move to the next tier.

Insufficient compute

Insufficient compute refers to both CPU and GPU, although it is more commonly encountered on CPU. When the ML Skill receives too many pages relative to its available capacity, requests can time out (520 and 499 status codes), pile up in a backlog, or even crash the model (503 and 500 status codes). If you encounter an insufficient compute situation, we recommend moving to the next tier, or even to the GPU tier.

Size the infrastructure correctly

General guidelines

This section provides general guidelines on how the models perform at each skill size.

Note: Each model generation (2022.10, 2023.4, or 2023.10) behaves differently with respect to the resources required and the resulting throughput. As models improve in accuracy, they can also demand more resources, which affects performance.
Table 1. 2022.10 Extractor

| Tier | Maximum pages/document | Expected throughput (pages/hour) | AI Units/hour |
| --- | --- | --- | --- |
| 0.5 CPU / 2 GB memory | 25 | 300-600 | 1 |
| 1 CPU / 4 GB memory | 50 | 400-800 | 2 |
| 2 CPU / 8 GB memory | 100 | 600-1000 | 4 |
| 4 CPU / 16 GB memory | 100 | 800-1200 | 8 |
| 6 CPU / 24 GB memory | 100 | 900-1300 | 12 |
| GPU | 200-250 | 1350-1600 | 20 |
Table 2. 2023.4 Extractor

| Tier | Maximum pages/document | Expected throughput (pages/hour) | AI Units/hour |
| --- | --- | --- | --- |
| 0.5 CPU / 2 GB memory | 25 | 40-100 | 1 |
| 1 CPU / 4 GB memory | 50 | 70-140 | 2 |
| 2 CPU / 8 GB memory | 75 | 120-220 | 4 |
| 4 CPU / 16 GB memory | 100 | 200-300 | 8 |
| 6 CPU / 24 GB memory | 100 | 250-400 | 12 |
| GPU | 200-250 | 1400-2200 | 20 |
Table 3. 2023.7 and 2023.10 Extractors

| Tier | Maximum pages/document | Expected throughput (pages/hour) | AI Units/hour |
| --- | --- | --- | --- |
| 0.5 CPU / 2 GB memory | 25 | 60-200 | 1 |
| 1 CPU / 4 GB memory | 50 | 120-240 | 2 |
| 2 CPU / 8 GB memory | 75 | 200-280 | 4 |
| 4 CPU / 16 GB memory | 100 | 250-400 | 8 |
| 6 CPU / 24 GB memory | 100 | 350-500 | 12 |
| GPU | 200-250 | 1000-2000 | 20 |

The expected throughput is expressed per replica, in pages/hour, as a minimum-maximum range that depends on the documents themselves. The ML Skill should be sized for the highest expected throughput (spike), not the average throughput over a day, week, or month.

Note: When sizing the infrastructure, make sure to start from the largest document the skill needs to handle and the expected throughput.

Examples

Example 1

The ML Skill needs to process the following using a 2023.10 Extractor:
  • Documents containing maximum five pages.
  • A maximum spike of 300 pages per hour.

Since the throughput is on the lower side and the document size is small, a GPU is not needed in this example. Two to four replicas of the 0.5 CPU or 1 CPU tier are sufficient.

Example 2

The ML Skill needs to process the following using a 2023.4 Extractor:
  • Documents containing maximum 80 pages.
  • A maximum spike of 900 pages per hour.

For this example, either three replicas of the 4 CPU tier or a single GPU replica is sufficient.

Note: A single replica does not have high availability, so it is always recommended to use at least two replicas for critical production workflows.

Example 3

The ML Skill needs to process the following using a 2023.10 Extractor:
  • Documents containing maximum 50 pages.
  • A maximum spike of 3000 pages per hour.
There are two ways to meet these requirements:
  • Use 3 GPU replicas.
  • Use 12-15 replicas of the 4 CPU or 6 CPU tier.

Both options have high availability because there are more than two replicas for the ML Skill.

Size the infrastructure for one replica

You can check the tables from the General guidelines section for the expected throughput from a single extraction replica, depending on the model version and the tier.

Note: To achieve maximum throughput potential, you must keep the extraction replica under constant load.
To make sure that the replica is consistently busy, the following criteria should be met:
  • Ideally, there should be minimal idle time between the moment the replica sends the response to one request and the moment the replica receives the data for the next request.
  • The replica shouldn't be overloaded. Requests are processed one after the other (serially). This means there is always one active request being processed and a queue of pending requests. If this queue becomes too long, the replica rejects new incoming requests, displaying a 429 (too many requests) HTTP status code.

The key point to remember when sizing the infrastructure for a single replica is to ensure the workload is balanced. The workload should not be so light that the replica remains idle, or so heavy that it starts rejecting tasks.
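When a replica does reject requests with a 429 status code, the caller can back off and retry rather than fail outright. A minimal Python sketch, assuming a hypothetical `call_ml_skill` function that returns an HTTP status code and a result:

```python
import random
import time

def call_with_backoff(call_ml_skill, payload, max_retries=5, base_delay=1.0):
    """Retry an ML Skill request when the replica's queue is full (HTTP 429)."""
    delay = base_delay
    for _ in range(max_retries):
        status, result = call_ml_skill(payload)
        if status != 429:            # accepted, or failed for another reason
            return status, result
        # Queue full: wait with exponential backoff plus jitter, then retry.
        time.sleep(delay + random.uniform(0, delay / 2))
        delay *= 2
    return 429, None                 # still overloaded after all retries
```

If 429 responses persist even with backoff, that is a sign the deployment itself needs more replicas, not just slower callers.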

Determine the needed number of replicas

To determine the number of replicas required, you need to:
  • Identify the busiest relevant time period for the replicas. For example, identify the busiest hour of activity, not the peak one-minute interval or a 12-hour stretch. After you identify this time period, estimate the demand (number of pages or requests) during that time.
  • Divide the estimate by the throughput per replica, described in the Size the infrastructure for one replica section.
  • Add some extra capacity as a safety measure.

Note that sizing for the busiest time period can lead to overprovisioning when demand is significantly lower. To address this, you can manually increase or decrease the size of a deployment depending on demand. This is helpful if, for example, a very busy one-hour interval requires 10 replicas but is followed by 23 hours of low activity where only 2 replicas are needed; a static deployment would be overprovisioned for most of the day.
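The steps above amount to simple arithmetic: divide peak demand by per-replica throughput, add headroom, and round up. A sketch with illustrative numbers (a 1,200-page busiest hour, a tier rated at 250 pages/hour per replica, and a 20% safety margin):

```python
import math

def replicas_needed(peak_pages_per_hour, throughput_per_replica, safety_margin=0.2):
    """Estimate the replica count: peak demand divided by per-replica
    throughput, plus a safety margin, rounded up."""
    raw = peak_pages_per_hour / throughput_per_replica
    return math.ceil(raw * (1 + safety_margin))

# 1,200 pages in the busiest hour, 250 pages/hour per replica, 20% headroom:
print(replicas_needed(1200, 250))  # -> 6
```

Use the lower bound of the tier's throughput range from the tables above if you want a conservative estimate.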

Measure demand and supply: pages or requests

The number of pages and the density of those pages are key factors. The page count is more relevant than the number of requests. However, in practical terms, the requests are easier to count.

Tip: You can use historical data to make this conversion easier. For a particular skill with a recorded history, you can check the metering telemetry to identify the distribution of page counts for requests made to that skill. For example, if a skill received mostly documents with two or three pages in the last month, you can assume the future trend will continue with an average of 2.5 pages per document.
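As a sketch of that page-count conversion in Python, assuming a hypothetical histogram pulled from metering telemetry (page count mapped to request count):

```python
def average_pages_per_request(histogram):
    """Compute the average pages per request from a telemetry histogram.

    `histogram` maps page count -> number of requests with that page count."""
    total_requests = sum(histogram.values())
    total_pages = sum(pages * count for pages, count in histogram.items())
    return total_pages / total_requests

# Last month: 400 two-page and 400 three-page documents.
print(average_pages_per_request({2: 400, 3: 400}))  # -> 2.5
```

Multiplying this average by the expected request rate converts a request count into the page count that the sizing tables use.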

Determine if a deployment is underprovisioned

When you determine if a deployment is underprovisioned, CPU utilization is not relevant, since each replica will maximize CPU/GPU usage while processing a request, regardless of the pending request queue.

The important factor is the end-to-end time, which is the sum of waiting time and actual processing time.

For example, if you picked a tier with a throughput of approximately 900 pages/hour (about 4 seconds/page) and you're sending 5-page documents, processing typically takes about 30 seconds per document.

If the measured end-to-end time is instead around 40 seconds, you can infer a waiting time of around 10 seconds: the request was not picked up instantly because the replica was busy handling pre-existing requests.

Whenever the difference between the actual end-to-end time (the measured time) and the expected end-to-end time (estimated as a function of the tier) is greater than zero, the replica is working non-stop. Furthermore, if the wait time increases as demand increases, the deployment is under sustained stress. Any 429 (too many requests) status codes are also a sign of underprovisioning.

Determine if a deployment is overprovisioned

The previous sections can help you determine whether the deployment is effectively provisioned, but for an accurate analysis we recommend the following steps:
  • Identify the busiest time period. For this example, let's assume it's one hour, or 3,600 seconds.
  • Take the current number of replicas; let's assume it's 10. This results in 36,000 "replica-seconds" (a concept similar to "person-hours").
  • Sum up the total duration of requests (sum of end-to-end times). Let's assume it's 10,000 seconds.
In this example, since 10,000 is lower than 36,000, your current infrastructure provides more capacity than you need. You can reduce the deployment to eight replicas, monitor performance, and if it operates smoothly, reduce to six replicas. Reassess performance with each adjustment.
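The replica-seconds comparison above can be sketched as, with the same illustrative numbers:

```python
def is_overprovisioned(replicas, window_seconds, total_request_seconds):
    """Compare available capacity (replica-seconds) against the summed
    end-to-end duration of all requests in the window."""
    capacity = replicas * window_seconds
    return total_request_seconds < capacity

# 10 replicas over a 3,600-second hour vs. 10,000 seconds of requests:
print(is_overprovisioned(10, 3600, 10_000))  # -> True
```

A large gap between capacity and actual request time is the signal to scale down gradually, reassessing after each step.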
