Advanced Data Extraction

This tutorial requires WebSundew Standard or higher. A free trial is available for download.

Welcome to the second tutorial. In the first tutorial we created an extraction project that extracts data from a simple e-commerce site. In this tutorial we will create an Agent that visits all product detail pages. The Agent will extract the following fields and save them to an Excel file: product title, price, image and product properties.

We will use one of our demo web sites (the e-commerce demo) to create the Agent. This makes it easy to repeat the steps from this tutorial, because the web site has the same structure as the one shown here.

Demo WebSite

Step 1 - Create New Project

The first step is to create a new project. A project is the place where all components of the data extraction process are stored. Click New Project in the application toolbar.

Create New Project - Toolbar Button

The New Project dialog will appear:

Create New Project

Enter a project name. It is better to use a name related to the target web site, so that the project is easy to find later. Click Ok. The new project will be added to the project workspace.

Project View

Every time you run WebSundew you will be able to access the project in the Project View.

Step 2 - Create Agent

The Agent is one of the main concepts of WebSundew. It automates all the activity a user performs to collect web data, for example navigating between web pages, clicking links, extracting data and storing it in a data storage.

Click New Agent in the application toolbar.

Create New Agent - Toolbar Button

The New Agent dialog will appear:

Create New Agent

Enter the URL of the target web site. For this tutorial we will use our demo e-commerce web site (https://demo.websundew.io/ecommerce/products). Click Ok.

A new Agent will be created and added to the current project, and the agent's editor will open. The created Agent will have two states: Init and State1.

Agent's Editor

Step 3 - Visit Detail Pages

We need to configure the Agent to visit all product pages. The Agent should collect the links to the detail pages, then visit each page in turn and capture the product data. To configure the Agent this way, click Deep Crawler in the application toolbar.

Create Deep Crawler - Toolbar Button

The Deep Crawler dialog will appear:

Create Deep Crawler - Dialog

Select List and click Ok.

The List pattern wizard will start. We need to configure the List pattern to collect all links.

Create List Pattern - Wizard Start

Click on the first product link.

Create List Pattern - Click First Link

Click Add in the wizard. The pattern builder will try to find patterns. The Result pane will contain several results.

Create List Pattern - Add Link and Select Result

We need to select the appropriate result. In our case there are 9 links on the web page, so click on the 9. The collected data will be highlighted on the web page and will be available in the Preview View.

Create List Pattern - Add Link and Select Result

Now click Finish.

A Loop statement will be added to State1. This Loop contains two statements: Capture and Load Page. The Capture statement extracts the link to the detail page, and Load Page loads a new state from the extracted URL.

Agent Structure - Three States

A new state, State2, will be added to the Agent. This state is associated with the product detail page.
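
Conceptually, the Loop in State1 behaves like the following Python sketch. It is only an illustration of the Capture and Load Page idea, not WebSundew's implementation: the HTML snippet, the `product-link` class name, the detail page URLs and the `load_page` helper are all assumptions made for this example, and no network request is made.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

LIST_PAGE_URL = "https://demo.websundew.io/ecommerce/products"

# Hypothetical markup for the list page; the real demo site may differ.
LIST_PAGE_HTML = """
<ul>
  <li><a class="product-link" href="/ecommerce/products/1">Camera</a></li>
  <li><a class="product-link" href="/ecommerce/products/2">Desktop</a></li>
  <li><a class="nav" href="/about">About</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Capture step: collect href values of <a> tags that carry a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes and attrs.get("href"):
            self.links.append(attrs["href"])


def collect_links(html, base_url, css_class="product-link"):
    """Return absolute URLs of all links matching the (assumed) pattern."""
    collector = LinkCollector(css_class)
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.links]


def load_page(url):
    """Load Page step (stand-in): a real agent would fetch the URL here."""
    return f"<h1>Detail page for {url}</h1>"


# The Loop: for each captured link, load the detail page (State2).
visited = []
for url in collect_links(LIST_PAGE_HTML, LIST_PAGE_URL):
    detail_html = load_page(url)
    visited.append(url)
```

The navigation link with class `nav` is skipped because it does not match the pattern, which mirrors choosing the result that covers only the product links.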

Step 4 - Capture Data

Now we can capture the required data on the page. We will capture the product title, price, image and properties.

Notice that the detail page has the following structure: a fixed part with the product image, title and price:

Capture - Fixed Part

And a variable part that contains properties as Name - Value pairs. These properties depend on the product. For example, Digital Camera has a Lens Mount property, but Desktop Computer does not.

Capture - Variable Part

WebSundew uses two special types of data pattern here: the Simple Detail data pattern captures statically located data, and the Linked Tag data pattern captures the Name - Value pairs of the variable part.

4.1 Capture Simple Data

The Simple Detail pattern captures data that is statically located on the web page. To capture such data, click Capture in the application toolbar.

Capture - Toolbar Button

The capture type selector dialog will appear.

Capture - Select Simple

Select Simple, then click Ok. The Simple Detail pattern wizard will start.

Simple Data Pattern Wizard - Start

Now you need to configure the pattern fields. Click on the required data on the web page, e.g. the title, then click Add in the pattern wizard.

Simple Data Pattern Wizard - Add Field

Repeat the process for all required fields, and also add the main image. You can rename fields to more meaningful names (double-click a field title to start editing).

Simple Data Pattern Wizard - Rename Fields

Click Finish to complete the Simple Detail pattern wizard. A Capture Block statement based on the Simple Detail pattern will be added to State2.

Agent Graph - Capture Block
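
The idea behind capturing statically located fields can be sketched in Python as follows. This is an assumption-laden illustration and not how WebSundew builds its patterns: the markup, the element ids (`title`, `price`, `main-image`) and the sample values are invented for the example.

```python
from html.parser import HTMLParser

# Hypothetical detail-page markup; the ids and values are made up.
DETAIL_HTML = """
<div>
  <img id="main-image" src="/img/camera.jpg">
  <h1 id="title">Digital Camera</h1>
  <span id="price">$499.00</span>
</div>
"""


class SimpleDetailParser(HTMLParser):
    """Capture the text (or src, for images) of elements with known ids."""

    def __init__(self, field_ids):
        super().__init__()
        self.field_ids = field_ids  # maps element id -> field name
        self.record = {}
        self._current = None        # field whose text is currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        field = self.field_ids.get(attrs.get("id"))
        if field is None:
            return
        if tag == "img":
            # Images have no text; capture the image address instead.
            self.record[field] = attrs.get("src")
        else:
            self._current = field
            self.record[field] = ""

    def handle_data(self, data):
        if self._current:
            self.record[self._current] += data

    def handle_endtag(self, tag):
        # Good enough for flat markup: any closing tag ends the current field.
        self._current = None


def capture_simple(html, field_ids):
    parser = SimpleDetailParser(field_ids)
    parser.feed(html)
    return {name: (value or "").strip() for name, value in parser.record.items()}


# The mapping plays the role of the renamed fields in the wizard.
record = capture_simple(
    DETAIL_HTML,
    {"title": "Title", "price": "Price", "main-image": "Image"},
)
```

Because every product page places these elements in the same location, one mapping works for all detail pages.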

4.2 Capture Name Value Pairs

The Linked Tag pattern captures data that is arranged in Name - Value pairs. To capture such data, click Capture in the application toolbar.

Capture - Toolbar Button

The capture type selector dialog will appear.

Capture - Select Pairs

Select Pairs, then click Ok. The Linked Tag data pattern wizard will start.

Linked Tag Wizard - Start

Now you need to add the Name (tag) - Value pairs. Select the tag element on the web page:

Linked Tag Wizard - Select Tag Node

Now you need to add the value part. Click the Value tab in the wizard.

Linked Tag Wizard - Activate Value Tab

Select the value on the web page.

Linked Tag Wizard - Select Value Node

Click Add in the pattern wizard to finish creating the field.

Linked Tag Wizard - Activate Value Tab

Repeat the same process for all required data.

Linked Tag Wizard - Activate Value Tab

Click Finish to complete the pattern. A Capture Block statement based on the Linked Tag pattern will be added to State2.

Agent Graph - Capture Block
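
To see why Name - Value pairs need a different approach from fixed fields, consider this Python sketch. It assumes, purely for illustration, that the properties sit in table rows with the name in a `<th>` cell and the value in a `<td>` cell; the real page markup, the property names other than Lens Mount, and all values are invented.

```python
from html.parser import HTMLParser

# Hypothetical property tables for two products.
CAMERA_HTML = """
<table>
  <tr><th>Brand</th><td>Acme</td></tr>
  <tr><th>Lens Mount</th><td>EF</td></tr>
</table>
"""
DESKTOP_HTML = """
<table>
  <tr><th>Brand</th><td>Acme</td></tr>
  <tr><th>RAM</th><td>16 GB</td></tr>
</table>
"""


class PairParser(HTMLParser):
    """Pair each <th> (name/tag) with the <td> (value) that follows it."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._cell = None  # "th" or "td" while inside such a cell
        self._name = None  # most recent name still waiting for its value
        self._text = ""

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._cell = tag
            self._text = ""

    def handle_data(self, data):
        if self._cell:
            self._text += data

    def handle_endtag(self, tag):
        if tag != self._cell:
            return
        text = self._text.strip()
        if tag == "th":
            self._name = text
        elif self._name is not None:
            self.pairs[self._name] = text
            self._name = None
        self._cell = None


def capture_pairs(html):
    parser = PairParser()
    parser.feed(html)
    return parser.pairs


camera = capture_pairs(CAMERA_HTML)
desktop = capture_pairs(DESKTOP_HTML)
```

The value is located relative to its name rather than by a fixed position, so the same code handles products with different sets of properties.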

Step 5 - Pagination

To collect all the data, the Agent should visit all pages. Select State1 in the agent's graph. The paging on this site is simple, with a separate next page element.

Web Site Pagination

To configure the Agent to visit all pages we need to create a Pagination. Click Pagination in the application toolbar.

Create New Pagination - Toolbar button

The Pagination wizard dialog will appear:

Pagination Wizard - Select Type

The web site uses simple pagination with a separate next page element, so we can choose the Simple page pattern type and click Ok. The Next page pattern wizard will appear:

Page Pattern - Wizard Start

WebSundew uses a page pattern to find the HTML element that leads to the next page. Click on the next page element in the browser part of the agent's editor, then click Select Next in the wizard.

Page Pattern - Wizard Complete

Click Finish in the wizard part of the agent's editor. The Next page pattern will be added to the project.

The Agent will use this pattern on all similar pages to find the next page element.
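
Following a next page element until none is left can be sketched in Python like this. The sketch is a rough analogy rather than WebSundew's algorithm: the in-memory `PAGES` dictionary stands in for the web site, and the `item` and `next` class names and the URLs are assumptions.

```python
from html.parser import HTMLParser

# Hypothetical three-page site held in memory; the last page has no next link.
PAGES = {
    "/products?page=1": '<a class="item" href="/p/1">One</a>'
                        '<a class="next" href="/products?page=2">Next</a>',
    "/products?page=2": '<a class="item" href="/p/2">Two</a>'
                        '<a class="next" href="/products?page=3">Next</a>',
    "/products?page=3": '<a class="item" href="/p/3">Three</a>',
}


class PageParser(HTMLParser):
    """Collect item links and the next page link from one list page."""

    def __init__(self):
        super().__init__()
        self.items = []
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "item" in classes:
            self.items.append(attrs.get("href"))
        elif "next" in classes:
            self.next_url = attrs.get("href")


def crawl_all_pages(start_url, max_pages=100):
    """Apply the same next page pattern on every page until it stops matching."""
    items, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = PageParser()
        parser.feed(PAGES[url])
        items.extend(parser.items)
        url = parser.next_url  # None on the last page ends the loop
    return items


all_items = crawl_all_pages("/products?page=1")
```

The `seen` set and the `max_pages` limit guard against sites whose next link loops back to an earlier page.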

Step 6 - Store Captured Data

Now we are ready to store the captured data. Click Export in the application toolbar.

Export Data - Toolbar Button

The Data export wizard will appear:

Export Data - Wizard

Select Excel, then click Next.

Export Data - Configure Excel

Configure the Excel properties or leave the default values. Click Finish. The data storage will be added to the Agent.
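
Because products have different properties, the exported sheet needs one column for every property seen on any product. The Python sketch below shows one way this could work. It writes CSV, which Excel can open, instead of a native Excel file so that it needs only the standard library; the sample records are invented, and nothing here describes how WebSundew itself lays out its Excel output.

```python
import csv
import io

# Invented sample records: fixed fields plus product-specific properties.
records = [
    {"Title": "Digital Camera", "Price": "$499.00", "Lens Mount": "EF"},
    {"Title": "Desktop Computer", "Price": "$899.00", "RAM": "16 GB"},
]


def export_csv(records, stream):
    """Write records as CSV, leaving cells empty where a property is missing."""
    # Build the column list as the union of all field names, in first-seen order.
    columns = []
    for record in records:
        for name in record:
            if name not in columns:
                columns.append(name)
    writer = csv.DictWriter(stream, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(records)
    return columns


buffer = io.StringIO()  # stands in for a file opened with newline=""
columns = export_csv(records, buffer)
```

Here the Desktop Computer row gets an empty Lens Mount cell and the Digital Camera row gets an empty RAM cell.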

Step 7 - Run the Agent

Now we are ready to run the Agent. Click Run in the application toolbar.

Run the Agent

The Agent will start extracting the data.

Agent - Extracting the data