Incremental Extraction
Extracting large amounts of data from a web site usually takes a long time. Keeping the data up to date by re-extracting the whole site means processing thousands or even millions of entries on every run. Many web sites change only partially: job boards get new vacancies, forums get new messages, and so on. In such cases only the new or updated data needs to be extracted, because the old data stays the same. The Incremental Extraction feature addresses this.
Incremental Extraction works as follows: the Agent compares the extracted data with the data extracted earlier. When the data match, the Agent performs one of the configured actions: it can stop working altogether, stop working with the current page and go to the specified state, or ignore the extracted data (not store it and continue working).
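The comparison logic can be pictured with a minimal Python sketch. It is not the product's implementation: the names `OnMatch` and `extract_incrementally` and the record layout are hypothetical, and the "go to the specified state" action is simplified here to moving on to the next page.

```python
from enum import Enum


class OnMatch(Enum):
    """Hypothetical names for the three actions described above."""
    STOP = "stop"            # stop the agent entirely
    SKIP_PAGE = "skip_page"  # leave the current page and move on
    IGNORE = "ignore"        # drop the record and keep working


def extract_incrementally(pages, known, on_match):
    """Return records from `pages` that are not already in `known`.

    `pages` is a list of pages, each a list of hashable records;
    `known` is the set of records stored by earlier runs.
    """
    new_records = []
    for page in pages:
        for record in page:
            if record not in known:
                new_records.append(record)
                continue
            if on_match is OnMatch.STOP:
                return new_records
            if on_match is OnMatch.SKIP_PAGE:
                break
            # OnMatch.IGNORE: do not store the record, keep going
    return new_records


pages = [["job-5", "job-4"], ["job-3", "job-2"], ["job-1"]]
known = {"job-3", "job-2", "job-1"}
print(extract_incrementally(pages, known, OnMatch.STOP))  # ['job-5', 'job-4']
```

With `OnMatch.STOP`, the run ends at the first record that was already stored, so the remaining pages are never visited.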
- Create a new Agent. It should have a Datasource, which will be used to identify the existing data.
- Select the Agent in the Workspace View.
- In the Properties View, select the Incremental tab.
- Click Edit...
- The Incremental Extraction dialog appears.
- Click Add to add a condition. The condition defines how the Agent handles data that has already been extracted.
- Select the Datasource that will be used for comparing the extracted data.
- Select the Captures that will be used in the comparison. Only the data linked to the selected Captures is compared.
- Click OK to finish configuring Incremental Extraction.
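What a condition compares can be illustrated with a short Python sketch. It assumes, hypothetically, that each record is a dictionary keyed by Capture name and that the Datasource rows are available as a list of such dictionaries; `make_key` and `already_extracted` are illustrative names, not part of the product.

```python
def make_key(record, captures):
    """Build a comparison key from the values of the selected captures."""
    return tuple(record.get(name) for name in captures)


def already_extracted(record, existing_rows, captures):
    """True if a stored row has the same values for every selected capture."""
    existing_keys = {make_key(row, captures) for row in existing_rows}
    return make_key(record, captures) in existing_keys


existing_rows = [
    {"title": "Python developer", "company": "Acme", "views": "10"},
]
record = {"title": "Python developer", "company": "Acme", "views": "57"}

print(already_extracted(record, existing_rows, ["title", "company"]))  # True
print(already_extracted(record, existing_rows, ["title", "views"]))    # False
```

In this sketch, including a value that changes between runs (such as a view counter) makes a known record look new, which suggests selecting only Captures that identify a record.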
Hints
- The Agent should extract the updated data first, before any other data. Only then can this feature be used optimally, because the Agent can act as soon as it reaches data that has already been extracted.
- The target Datasource should support appending data. For the CSV, Excel, and Text formats, enable Append mode. The file must also have the same name every time you run the Agent (by default, each run creates a new file), so remove all dynamic parts from the file name. The same applies to the Sheet name in an Excel Datasource.
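The reason for Append mode and a fixed file name can be shown with a small Python sketch that uses the standard `csv` module rather than the product itself. The helper `append_rows` and the file name `vacancies.csv` are hypothetical.

```python
import csv
import os
import tempfile


def append_rows(path, fieldnames, rows):
    """Append rows to a CSV file, writing the header only for a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


# A fixed name with no date or counter in it, so every run hits the same file.
path = os.path.join(tempfile.mkdtemp(), "vacancies.csv")
fields = ["title", "company"]

append_rows(path, fields, [{"title": "Python developer", "company": "Acme"}])  # first run
append_rows(path, fields, [{"title": "Data engineer", "company": "Initech"}])  # later run

with open(path, newline="", encoding="utf-8") as f:
    stored = list(csv.DictReader(f))
print(len(stored))  # 2
```

If the file name contained a timestamp, each run would start from an empty file and there would be no earlier data to compare against.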