Document Understanding User Guide
Deploying high performing models
As Machine Learning (ML) models improve in accuracy over time, their resource requirements change as well. For the best performance, it is important that when deploying ML models via AI Center, the skills are appropriately sized with respect to the traffic they need to handle. For the most part, infrastructure is sized with respect to the number of pages per unit of time (minute or hour). A document can have a single page, or multiple pages.
To deploy infrastructure via AI Center, there are a few important aspects to keep in mind for optimal performance.
There is only one type of GPU infrastructure available. This is highlighted by the checkbox to enable GPU. Each skill runs on a single virtual machine (VM) or node that has a GPU. In this case, CPU and memory are not relevant, since the skill can use all the available CPU and memory resources on that node. Besides offering higher throughput, GPU is also much faster per request. Because of this, if latency is critical, it is recommended to use GPU.
CPU and memory can be fractioned, which means multiple ML Skills can run on the same node. To avoid any disturbance from a neighboring skill, each ML Skill is limited in the amount of memory and CPU it can consume, depending on the selected tier. Higher CPU leads to faster processing (for a page), while higher memory leads to a larger number of documents that can be processed.
The number of replicas determines the number of containers that are used to serve requests from the ML model. A higher number leads to a larger number of documents that can be processed in parallel, subject to the limits of that particular tier. The number of replicas is directly tied to the infrastructure type (number of CPUs per replica, or if using a GPU), in the sense that both replicas and infrastructure size can directly affect throughput (pages/minute).
The number of robots impacts throughput. To get efficient throughput, the number of robots needs to be sized so that it does not overload the ML Skills. This is dependent on the automation itself and should be tested. As a general guideline, you can use one to three robots as a starting point for each replica the ML Skill has. Depending on the overall process time (excluding ML Extractor), the number of robots (or the number of replicas) can be higher or lower.
If the infrastructure is not sized correctly, the models can be placed under a very high load. In some cases this can lead to a backlog of requests, long processing time, or even failures when processing documents.
Insufficient memory is most commonly encountered on the lower CPU tiers (0.5 CPU or 1 CPU). If you need to process a very large payload (one or several large documents), it can lead to an out of memory exception. This is related to the document size in terms of pages and text density (how much text there is per page). Since the requirements are very specific to each use case, it is not possible to provide exact numbers. You can check the guidelines in the Size the infrastructure correctly section for more detailed information. If you encounter an insufficient memory situation, the general recommendation is to go to the next tier.
[...] (520 and 499 status codes), backlog, or even lead to the model crashing (503 and 500 status codes). If you encounter an insufficient compute situation, we recommend going to the next tier, or even to the GPU tier.
This section provides general guidelines on how the models perform on each different skill size.
Tier | Maximum pages/document | Expected throughput (pages/hour) | AI Units/hour |
---|---|---|---|
0.5 CPU/2 GB memory | 25 | 300-600 | 1 |
1 CPU/4 GB memory | 50 | 400-800 | 2 |
2 CPU/8 GB memory | 100 | 600-1000 | 4 |
4 CPU/16 GB memory | 100 | 800-1200 | 8 |
6 CPU/24 GB memory | 100 | 900-1300 | 12 |
GPU | 200-250 | 1350-1600 | 20 |
Tier | Maximum pages/document | Expected throughput (pages/hour) | AI Units/hour |
---|---|---|---|
0.5 CPU/2 GB memory | 25 | 40-100 | 1 |
1 CPU/4 GB memory | 50 | 70-140 | 2 |
2 CPU/8 GB memory | 75 | 120-220 | 4 |
4 CPU/16 GB memory | 100 | 200-300 | 8 |
6 CPU/24 GB memory | 100 | 250-400 | 12 |
GPU | 200-250 | 1400-2200 | 20 |
Tier | Maximum pages/document | Expected throughput (pages/hour) | AI Units/hour |
---|---|---|---|
0.5 CPU/2 GB memory | 25 | 60-200 | 1 |
1 CPU/4 GB memory | 50 | 120-240 | 2 |
2 CPU/8 GB memory | 75 | 200-280 | 4 |
4 CPU/16 GB memory | 100 | 250-400 | 8 |
6 CPU/24 GB memory | 100 | 350-500 | 12 |
GPU | 200-250 | 1000-2000 | 20 |
The expected throughput is expressed per replica, in pages/hour, as a range between a minimum and a maximum value, depending on the document itself. The ML Skill should be sized for the highest expected throughput (spike), and not the average throughput in a day, week, or month.
Example 1
- Documents containing a maximum of five pages.
- A maximum spike of 300 pages per hour.
Since the throughput is on the lower side and the document size is small, a GPU is not needed in this example. Two to four replicas of the 0.5 CPU or 1 CPU tier are sufficient.
Example 2
- Documents containing a maximum of 80 pages.
- A maximum spike of 900 pages per hour.
For this example, either three replicas of the 4 CPU tier or a single replica of the GPU tier is sufficient.
Example 3
- Documents containing a maximum of 50 pages.
- A maximum spike of 3000 pages per hour.
There are two ways to meet this requirement:
- Use 3 GPU replicas.
- Use 12-15 replicas of the 4 CPU or 6 CPU tier.
Both options have high availability because there are more than two replicas for the ML Skill.
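The comparison in Example 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tier figures are copied from the last of the three tables above (the tables are not labeled, so which one applies to your model version is for you to confirm), the GPU page limit is taken as the lower end of its 200-250 range, and sizing against the minimum expected throughput is a conservative choice made here, not a documented rule. The names `Tier` and `candidate_deployments` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_pages_per_doc: int
    min_pages_per_hour: int   # lower bound of expected throughput per replica
    max_pages_per_hour: int   # upper bound of expected throughput per replica
    ai_units_per_hour: int


# Values copied from the last table above (assumption: it matches your model).
TIERS = [
    Tier("0.5 CPU/2 GB", 25, 60, 200, 1),
    Tier("1 CPU/4 GB", 50, 120, 240, 2),
    Tier("2 CPU/8 GB", 75, 200, 280, 4),
    Tier("4 CPU/16 GB", 100, 250, 400, 8),
    Tier("6 CPU/24 GB", 100, 350, 500, 12),
    Tier("GPU", 200, 1000, 2000, 20),
]


def candidate_deployments(max_doc_pages, spike_pages_per_hour):
    """List (tier name, replicas, AI Units/hour) for every tier that can
    accept the largest document, sized for the spike at minimum throughput."""
    options = []
    for tier in TIERS:
        if max_doc_pages > tier.max_pages_per_doc:
            continue  # the largest document does not fit this tier
        replicas = math.ceil(spike_pages_per_hour / tier.min_pages_per_hour)
        options.append((tier.name, replicas, replicas * tier.ai_units_per_hour))
    return options


# Example 3: documents of up to 50 pages, spike of 3000 pages/hour.
for name, replicas, units in candidate_deployments(50, 3000):
    print(f"{name}: {replicas} replicas, {units} AI Units/hour")
```

With these inputs the sketch yields 3 replicas for the GPU tier and 12 for the 4 CPU tier, which is in line with the two options listed in Example 3. Treat the results as a starting point for testing rather than a guarantee.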
You can check the tables from the General guidelines section for the expected throughput from a single extraction replica, depending on the model version and the tier.
- Ideally, there should be minimal idle time between the moment the replica sends the response to one request and the moment the replica receives the data for the next request.
- The replica shouldn't be overloaded.
Requests are processed one after the other (serially). This means there is always one active request being processed and a queue of pending requests. If this queue becomes too long, the replica rejects new incoming requests and returns a 429 (too many requests) HTTP status code.
The key point to remember when sizing the infrastructure for a single replica is to ensure the workload is balanced. The workload should not be so light that the replica remains idle, or so heavy that it starts rejecting tasks.
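To make the balance between idle time and rejection concrete, here is a toy simulation of one replica that serves requests serially. It is a simplified model, not a description of the actual ML Skill implementation: the fixed service time, the `max_queue` limit, and the function name `simulate_replica` are all assumptions chosen for illustration, and the real queue limit is not stated in this guide.

```python
from collections import deque


def simulate_replica(arrival_times, service_seconds, max_queue):
    """Simulate one serial replica.

    arrival_times: sorted request arrival times, in seconds.
    service_seconds: assumed fixed processing time per request.
    max_queue: hypothetical number of pending requests tolerated
               before new requests are rejected (429).
    Returns (end_to_end_times, rejected_count).
    """
    finish_times = deque()  # finish times of accepted, unfinished requests
    end_to_end = []
    rejected = 0
    for t in arrival_times:
        # Forget requests that have already completed by time t.
        while finish_times and finish_times[0] <= t:
            finish_times.popleft()
        # One request may be active, plus at most max_queue pending.
        if len(finish_times) >= 1 + max_queue:
            rejected += 1  # would be answered with HTTP 429
            continue
        start = finish_times[-1] if finish_times else t
        finish = start + service_seconds
        finish_times.append(finish)
        end_to_end.append(finish - t)  # waiting time + processing time
    return end_to_end, rejected


# Balanced load: each request arrives after the previous one has finished.
print(simulate_replica([0, 30, 60], service_seconds=20, max_queue=2))
# Burst: four requests at once; waiting grows and the last one is rejected.
print(simulate_replica([0, 0, 0, 0], service_seconds=20, max_queue=2))
```

In the balanced case every request completes in the bare processing time, while in the burst case end-to-end times climb and one request is rejected.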
- Identify the busiest relevant time period for the replicas. For example, you need to identify the busiest hour of activity, not the peak one-minute interval or a 12-hour stretch. After you identify this time period, estimate the demand (number of pages or requests) during that time.
- Divide the estimate by the throughput per replica, described in the Size the infrastructure for one replica section.
- Add some extra capacity as a safety measure.
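The three steps above amount to a short calculation, sketched below. The 20% default headroom and the function name `replicas_needed` are assumptions for illustration; the guide does not prescribe a specific safety margin.

```python
import math


def replicas_needed(peak_demand_pages, pages_per_hour_per_replica, headroom=0.2):
    """Estimate the replica count for the busiest hour.

    peak_demand_pages: pages expected during the busiest hour.
    pages_per_hour_per_replica: throughput of one replica for the chosen tier.
    headroom: extra capacity as a fraction (hypothetical default of 20%).
    """
    if pages_per_hour_per_replica <= 0:
        raise ValueError("throughput per replica must be positive")
    base = peak_demand_pages / pages_per_hour_per_replica
    return math.ceil(base * (1 + headroom))


# 3000 pages in the busiest hour, 1000 pages/hour per replica, 20% headroom.
print(replicas_needed(3000, 1000))
```

Because the result is rounded up, a small headroom can still add a whole replica when the base estimate is already close to an integer.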
Note that using the busiest time period can lead to overprovisioning when demand is significantly lower. To address this, you can manually increase or decrease the size of a deployment, depending on demand. This can be helpful if, for example, there's a very busy one-hour interval requiring 10 replicas, followed by 23 hours of low activity where only 2 replicas are needed. This can result in a considerable length of time where the system is overprovisioned.
The number of pages and the density of those pages are key factors. The page count is more relevant than the number of requests. However, in practical terms, the requests are easier to count.
When you determine if a deployment is underprovisioned, CPU utilization is not relevant, since each replica will maximize CPU/GPU usage while processing a request, regardless of the pending request queue.
The important factor is the end-to-end time, which is the sum of waiting time and actual processing time.
For example, if you picked a tier with a throughput of approximately 900 pages/hour, or about 4 seconds/page, and you're sending 5-page documents, the expected processing time is about 20 seconds per document. If the measured end-to-end time is typically about 30 seconds per document, you can approximate a waiting time of around 10 seconds.
This means the request was not taken by the replica instantly, because the replica was busy handling pre-existing requests.
If the difference between the actual end-to-end time (the measured time) and the expected end-to-end time (estimated as a function of the tier) is greater than zero, it means the replica is working non-stop. Furthermore, if the wait time increases as demand increases, it's clear that the deployment is under sustained stress. Moreover, any 429 status codes (too many requests) are a sign of underprovisioning.
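The wait-time estimate can be written as a small helper. This is a sketch that assumes processing time scales linearly with page count at the tier's throughput; the function name `estimated_wait_seconds` is hypothetical, and 30 seconds is used as the measured end-to-end time for the 5-page example.

```python
def estimated_wait_seconds(pages, measured_end_to_end_seconds,
                           throughput_pages_per_hour):
    """Approximate how long a request waited in the replica's queue.

    Expected processing time is pages * (3600 / throughput); whatever the
    measured end-to-end time exceeds that by is attributed to waiting.
    """
    expected = pages * 3600 / throughput_pages_per_hour
    return max(0.0, measured_end_to_end_seconds - expected)


# 5-page document, 900 pages/hour tier, 30 seconds measured end to end.
print(estimated_wait_seconds(5, 30, 900))
```

A result that stays above zero, or grows as demand grows, points to underprovisioning in the sense described above.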
- Identify the busiest time period. For this example, let's assume it's one hour, or 3,600 seconds.
- Take the current number of replicas. Let's assume it's 10. This results in 36,000 "replica-seconds" (a concept similar to "person-hours").
- Sum up the total duration of requests (sum of end-to-end times). Let's assume it's 10,000 seconds.
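One way to relate these numbers is to divide the total request duration by the available replica-seconds. The list above ends after the third step, so this comparison is an assumption about how the figures are meant to be used, and the function names are hypothetical.

```python
import math


def deployment_utilization(replicas, period_seconds, busy_seconds):
    """Fraction of the available replica-seconds spent on requests."""
    capacity = replicas * period_seconds  # for example, 10 * 3600 = 36,000
    return busy_seconds / capacity


def replicas_at_full_load(period_seconds, busy_seconds):
    """Smallest replica count that could absorb the same load if every
    replica were busy for the whole period (no safety margin included)."""
    return math.ceil(busy_seconds / period_seconds)


# Figures from the steps above: 10 replicas, one hour, 10,000 busy seconds.
print(f"{deployment_utilization(10, 3600, 10_000):.0%}")
print(replicas_at_full_load(3600, 10_000))
```

A low ratio suggests the deployment has spare capacity during its busiest period, although some margin should be kept so that requests do not start queuing.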