Tutorial 2

In our second tutorial we will describe the process of extracting information from the Demo Store web site. This is a more difficult, so-called deep-level extraction, where the data lies "in the depth" of the web site. We will configure the agent to visit all the product pages, scrape the data (product name, price, description, SKU, etc.), and save it to an Excel file.

1 Creating New Project

We create a new project for each new web site we want to extract data from. It is also possible to create several extraction agents inside one project, but it is less convenient. Click the New Project button in the Tool Bar or choose File > New Project.
Create New Project
Enter a project name in the dialog window. Click the Finish button. The new project will appear in the Workspace view in the upper right corner of the window.
Create project in the workspace view

1.1 Starting Page Navigation

Navigate to the starting page from which the agent will start working. To do this, enter the URL of the starting page into the Navigation Bar. The target web site URL is http://www.websundew.com/demo/. Type this address.
Load the starting URL
Press Enter or click the Navigate button. Wait until the navigation process is over. Now we are ready to create the Agent that will capture product information from the Demo Store.
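WebSundew performs this navigation in its embedded browser. If you want to confirm outside the tool that the starting URL is reachable, a minimal Python sketch (using the third-party requests library, which is not part of WebSundew) might look like this:

```python
import requests

# Hypothetical standalone check that the starting page loads;
# WebSundew performs this navigation for you in the browser window.
START_URL = "http://www.websundew.com/demo/"

response = requests.get(START_URL, timeout=30)
response.raise_for_status()  # raises if the page did not load (e.g. HTTP 404/500)
print(f"Loaded {START_URL}: {len(response.text)} bytes")
```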

1.2 Creating New Agent

Click the Agent button in the Tool Bar or choose File > New Agent.
Create New Agent
You will see the Agent Configuration Wizard. Select the Start Up mode. In our case we have only one URL, so choose the first option, Single URL.
Select start up mode
Click Next. Type the Agent's name and click the Finish button. You will see the Agent Editor.
Agent editor

1.3 Configuring the Agent

In the left-hand part of the Agent Editor there is the Agent Diagram. This diagram shows the Agent's states. Init State is the initial state from which the Agent starts working; this state loads the initial web page. Page 1 State reflects the loaded page. To the right there is a Browser Window linked to the state selected in the Diagram Editor.
Agent graph editor

1.4 Second Level Navigation

To collect the data from all of the detail pages we need to visit each of them. We will create a loop that iterates over the links and clicks each one.
Detail links
Click the Deep Crawl button in the Tool Bar.
Deep Crawl
Select Data Iterator Pattern in the dialog window that appears. Click the Finish button. The Iterator Pattern Wizard will appear on the left-hand side of the window. Click the first link in the browser; it will be highlighted in light blue. Click the Add button in the pattern wizard.
Select HTML element on the web page
Click the Find button. Wait until the program finishes looking for patterns. Select the proper result; all the links will be highlighted in blue. Click Next at the bottom of the wizard.
Find data pattern
Enter a pattern name. Click the Finish button. The Loop statement (which iterates over all of the links) will be added to the current state. Inside the Loop you will find a Click statement that leads to the new state. Click the new state; the browser window will show the product details web page.
Loop created
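Conceptually, the Loop statement collects every product link on the listing page and visits each one in turn. The following Python sketch illustrates that idea only; it is not WebSundew's internals, and the CSS selector a.product-name is a hypothetical placeholder for the demo store's real markup:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

START_URL = "http://www.websundew.com/demo/"

soup = BeautifulSoup(requests.get(START_URL, timeout=30).text, "html.parser")

# "a.product-name" is a hypothetical selector; WebSundew infers the real
# pattern from the link you clicked in the Iterator Pattern Wizard.
for link in soup.select("a.product-name"):
    detail_url = urljoin(START_URL, link.get("href", ""))
    print(detail_url)  # the Click statement opens each of these detail pages
```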

1.5 Capturing Data from the Product Detail Page

Click Capture in the Tool Bar. Select Simple Data Pattern. Click the Finish button. The Simple Details Wizard will appear on the left-hand side. Click on the product name; it will be highlighted in light blue. Click Add in the pattern wizard, and a new field will be added. You can change the field name by clicking on it. Repeat the action for the other fields: price, model, and product SKU.
Capture details data
Click the Next button at the bottom of the pattern wizard. Type the pattern name and click Finish. The Capture Block statement will be added to the current state. You can see the captured data in the Preview view.
Capture preview
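A Capture Block is, in effect, a set of field extractors applied to the detail page. As a rough illustration of that logic in Python (every selector below is a hypothetical placeholder, not the demo store's real markup):

```python
from bs4 import BeautifulSoup

def capture_product(html: str) -> dict:
    """Extract the fields configured in the Simple Details Wizard.

    All CSS selectors here are hypothetical placeholders.
    """
    soup = BeautifulSoup(html, "html.parser")

    def field(selector: str) -> str:
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else ""

    return {
        "name": field("h1.product-title"),
        "price": field("span.price"),
        "model": field("span.model"),
        "sku": field("span.sku"),
    }
```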

1.6 Capturing Data from Linked Pages

The product list associated with the Page 1 state is spread across several linked pages. We need to visit all of them. For that purpose we can use a Paginator.
Linked pages
Click the Paginator button in the Tool Bar to create a Paginator, which will enable the Agent to visit all the linked pages and extract data from each of them. Select Simple Next Page Pattern in the Paginator dialog window. Click the Finish button. The Simple Next Page Wizard will appear on the left-hand side. Click the Next page link in the browser window.
Select next node
Click the Next button in the Paginator Wizard. Enter a name for the pattern. Click the Finish button. The Paginator statement will be added to the current state.
Paginator in the diagram editor
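In effect, the Paginator keeps following the Next page link until there is none left. A minimal sketch of that idea in Python, again with a hypothetical a.next selector:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "http://www.websundew.com/demo/"
while url:
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    # ... iterate over the product links and capture each detail page here ...
    next_link = soup.select_one("a.next")  # hypothetical "Next page" selector
    url = urljoin(url, next_link["href"]) if next_link else None  # stop on the last page
```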

1.7 Saving Data

Click Datasource in the Tool Bar to create a Datasource. The Datasource Wizard will appear. Select the format you want. In our case it will be Excel, so select Excel.
Export to the Excel file
Click the Next button. Select the Agent and mark the fields you want to save. Click Next if you want to use the default settings. Enter a Datasource name. Click the Finish button.
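Behind the scenes, an Excel datasource simply writes the captured fields as rows of a worksheet. If you wanted to reproduce that step yourself, a minimal sketch using the openpyxl library (the sample row and file name are made up) would be:

```python
from openpyxl import Workbook

# Made-up sample row standing in for the data the agent captures.
rows = [
    {"name": "Example Product", "price": "$9.99", "model": "EX-1", "sku": "1001"},
]

wb = Workbook()
ws = wb.active
ws.append(["name", "price", "model", "sku"])  # header row
for row in rows:
    ws.append([row["name"], row["price"], row["model"], row["sku"]])
wb.save("products.xlsx")
```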


We have created the Data Extraction Agent. Now we can use it to extract and save the data.

1.8 Running the Agent

Click the Run button in the Tool Bar. Wait until the Agent finishes working. A dialog window will appear.
Agent execution result
You can see the results of the agent's work and the path to the saved file. Select the file name and click Open to view the result.
Extracted data
