- Getting Started
- Network requirements
- Single-node requirements and installation
- Multi-node requirements and installation
- Post-installation
- Accessing AI Center
- Provision an AI Center tenant
- Updating Orchestrator and Identity Server certificates
- Resizing PVC
- Adding a new node to the cluster
- ML packages offline installation
- Configuring the Cluster
- Configuring the FQDN post-installation
- Backing up and Restoring the Cluster
- Using the Monitoring Stack
- Setting up a Kerberos Authentication
- Provisioning a GPU
- Using the configuration file
- Node scheduling
- Migration and Upgrade
- Basic Troubleshooting Guide
AI Center standalone troubleshooting
This section provides troubleshooting information for AI Center in standalone environment.
The sections below are specific for AI Center.
Make sure that you follow the procedure suited to your needs.
input.json
file will expire and the AI Center registration with Identity Server fails. Follow the steps below to recover it.
- Log in to
https://alm.<LB DNS>
using theadmin
username. To obtain the password, run the following command:kubectl -n argocd get secret argocd-admin-password -o jsonpath={.data.password} | base64 -d
kubectl -n argocd get secret argocd-admin-password -o jsonpath={.data.password} | base64 -d - Go to ArgoCD and click on the aicenter tile.
- Click APP DETAILS and go to the Manifest tab.
- In the Manifest tab, click Edit.
- Obtain the new identity token by updating the
accessToken
field in the Manifest tab and click Save.
Synchronization starts automatically and is completed.
curl: (92)
HTTP/2 stream 0 was not closed cleanly: HTTP_1_1_REQUIRED (err 13)
.
If there is an issue with your databases, you can recreate them from scratch directly post-installation.
You can do this by running an SQL command to drop all the DBs and recreate them as follows:
USE [master]
ALTER DATABASE [AutomationSuite_AICenter] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [AutomationSuite_AICenter]
CREATE DATABASE [AutomationSuite_AICenter]
GO
USE [master]
ALTER DATABASE [AutomationSuite_AICenter] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [AutomationSuite_AICenter]
CREATE DATABASE [AutomationSuite_AICenter]
GO
This issue might occur during fabric installation. The installer might fail with below similar error.
appproject.argoproj.io/fabric created
configmap/argocd-cm configured
[INFO] [2021-09-02T09:21:15+0000]: Checking if ArgoCD password was reset, looking for secrets/argocd-admin-password.
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:21:16+0000]: Secret not found, trying to log in with initial password...1/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:21:36+0000]: Secret not found, trying to log in with initial password...2/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:21:56+0000]: Secret not found, trying to log in with initial password...3/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:22:16+0000]: Secret not found, trying to log in with initial password...4/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:22:36+0000]: Secret not found, trying to log in with initial password...5/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:22:56+0000]: Secret not found, trying to log in with initial password...6/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:23:17+0000]: Secret not found, trying to log in with initial password...7/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:23:37+0000]: Secret not found, trying to log in with initial password...8/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:23:57+0000]: Secret not found, trying to log in with initial password...9/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:24:17+0000]: Secret not found, trying to log in with initial password...10/10
[ERROR][2021-09-02T09:24:37+0000]: Failed to log in
appproject.argoproj.io/fabric created
configmap/argocd-cm configured
[INFO] [2021-09-02T09:21:15+0000]: Checking if ArgoCD password was reset, looking for secrets/argocd-admin-password.
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:21:16+0000]: Secret not found, trying to log in with initial password...1/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:21:36+0000]: Secret not found, trying to log in with initial password...2/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:21:56+0000]: Secret not found, trying to log in with initial password...3/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:22:16+0000]: Secret not found, trying to log in with initial password...4/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:22:36+0000]: Secret not found, trying to log in with initial password...5/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:22:56+0000]: Secret not found, trying to log in with initial password...6/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:23:17+0000]: Secret not found, trying to log in with initial password...7/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:23:37+0000]: Secret not found, trying to log in with initial password...8/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:23:57+0000]: Secret not found, trying to log in with initial password...9/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:24:17+0000]: Secret not found, trying to log in with initial password...10/10
[ERROR][2021-09-02T09:24:37+0000]: Failed to log in
Check all the required subdomains and ensure they are configured correctly and are routable as follows:
getent ahosts automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
getent ahosts alm.automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
getent ahosts registry.automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
getent ahosts monitoring.automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
getent ahosts objectstore.automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
getent ahosts automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
getent ahosts alm.automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
getent ahosts registry.automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
getent ahosts monitoring.automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
getent ahosts objectstore.automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
automationsuite.mycompany.com
with you cluster FQDN.
If the above commands/lines do not return a routable IP address, then the subdomain required for AI Center is not configured properly.
This error is encountered when the DNS is not public.
You need to add the Private DNS Zone (for Azure) or route 53 (for AWS).
If above commands return the proper IP address, then follow the below steps.
- Delete the ArgoCD namespace by executing
following command:
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml export PATH=$PATH:/var/lib/rancher/rke2/bin kubectl delete namespace argocd
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml export PATH=$PATH:/var/lib/rancher/rke2/bin kubectl delete namespace argocd - Run the following command to
verify:
kubectl get namespace
kubectl get namespace
There should not be ArgoCD namespace in the output of this command.
For issues related to accessing AI Center, make sure to follow the steps from the following sections:
https://objectstore.${CONFIG_CLUSTER_FQDN}
url once with every browser that you want to use in order to be able to interact with storage.
- Expired Identity Token
- Description
- Recovery steps
- Message: Curl: (92) HTTP/2 Stream 0 Was Not Closed Cleanly: HTTP_1_1_REQUIRED (err 13)
- Description
- Solution
- How to recreate databases
- Installer cannot connect to ArgoCD to check if password was reset
- Description
- Solution 1
- Solution 2
- Issues while accessing AI Center
- Enabling AI Center on the restored cluster
- Enabling AI Center on the restored cluster