- Release Notes
- Getting Started
- Setup and Configuration
- Automation Projects
- Dependencies
- Types of Workflows
- File Comparison
- Automation Best Practices
- Source Control Integration
- Debugging
- The Diagnostic Tool
- Workflow Analyzer
- About Workflow Analyzer
- ST-NMG-001 - Variables Naming Convention
- ST-NMG-002 - Arguments Naming Convention
- ST-NMG-004 - Display Name Duplication
- ST-NMG-005 - Variable Overrides Variable
- ST-NMG-006 - Variable Overrides Argument
- ST-NMG-008 - Variable Length Exceeded
- ST-NMG-009 - Prefix Datatable Variables
- ST-NMG-011 - Prefix Datatable Arguments
- ST-NMG-012 - Argument Default Values
- ST-NMG-016 - Argument Length Exceeded
- ST-DBP-002 - High Arguments Count
- ST-DBP-003 - Empty Catch Block
- ST-DBP-007 - Multiple Flowchart Layers
- ST-DBP-020 - Undefined Output Properties
- ST-DBP-023 - Empty Workflow
- ST-DBP-024 - Persistence Activity Check
- ST-DBP-025 - Variables Serialization Prerequisite
- ST-DBP-026 - Delay Activity Usage
- ST-DBP-027 - Persistence Best Practice
- ST-DBP-028 - Arguments Serialization Prerequisite
- ST-USG-005 - Hardcoded Activity Arguments
- ST-USG-009 - Unused Variables
- ST-USG-010 - Unused Dependencies
- ST-USG-014 - Package Restrictions
- ST-USG-020 - Minimum Log Messages
- ST-USG-024 - Unused Saved for Later
- ST-USG-025 - Saved Value Misuse
- ST-USG-026 - Activity Restrictions
- ST-USG-027 - Required Packages
- ST-USG-028 - Restrict Invoke File Templates
- Variables
- Arguments
- Imported Namespaces
- Recording
- UI Elements
- Control Flow
- Selectors
- Object Repository
- Data Scraping
- About Data Scraping
- Example of Using Data Scraping
- Image and Text Automation
- Automating Citrix Technologies
- RDP Automation
- Salesforce Automation
- SAP Automation
- VMware Horizon Automation
- Logging
- The ScreenScrapeJavaSupport Tool
- The WebDriver Protocol
- Test Suite - Studio
- Extensions
- Troubleshooting
About Data Scraping
Data scraping enables you to extract structured data from your browser, application or document to a database, .csv file or even Excel spreadsheet.
Structured data is a specific kind of information that is highly organized and is presented in a predictable pattern. For example, all Google search results have the same structure: a link at the top, a string of the URL and a description of the web page. This structure enables Studio to easily extract the information, as it always knows where to find it.
The scraping wizard can be opened from the Design tab, by clicking the Data Scraping button.
The main steps of the data scraping wizard are:
-
Select the first and last fields in the web page, document or application that you want to extract data from, so that Studio can deduce the pattern of the information.
Note: Studio automatically detects if you indicated a table cell, and asks you if you want to extract the entire table. If you click Yes, the Extract Wizard displays a preview of the selected table data. -
Customize column headers and choose whether or not to extract URLs.
-
Preview the data, edit the number of maximum results to be extracted and change the order of the columns.
- Optionally click Extract Correlated Data. This enables you to go through the Extract Wizard again, to extract additional info and add it as a new column in the same table.
-
Indicate the Next button in the web page, application or document (if the information you want to extract spans multiple pages).
After you are finished with the wizard, a sequence is generated in Studio.
Data scraping always generates a container (Attach Browser or Attach Window) with a selector for the top-level window and an Extract Structured Data activity with a partial selector, thus ensuring a correct identification of the app to be scraped.
Additionally, the Extract Structured Data activity also comes with an automatically generated XML string (in the ExtractMetadata property) that indicates the data to be extracted.
Lastly, all the scraped information is stored in a DataTable variable, that you can later use to populate a database, a .csv file or an Excel spreadsheet.