Agent, Pattern, Capture, Paginator, Datasource

The main program components are:
  • An Agent is the central component of the program. It consists of states. Each state is associated with a web page and contains a list of actions. When the agent runs, it executes the actions of the current state on that web page in sequence. An action that loads a new page leads to a new state. One special state, the Initial State, is not associated with any web page; it loads the page from which the agent starts its work.
  • A Pattern is used to find elements on a web page. The pattern stores information about how the elements are located on the page, which allows it to find data with a similar structure on different web pages. Patterns fall roughly into two groups:
    1. Data Patterns, which search for the HTML document elements to be extracted.
    2. Next Page Patterns, which the Paginator uses to navigate to linked web pages.
  • A Capture allows an agent to convert the HTML elements found by a pattern into data suitable for storing in the proper format. A Capture can also modify the captured data, for example by cleaning the extracted text or converting an image into the desired format.
  • A Datasource is used to store the captured data in the required format. It can be a file (Excel, XML, CSV, etc.), a database (MSSQL, Oracle, MySQL, etc.), or any custom text format.
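
The way these components cooperate can be illustrated with a small sketch. It is not the program's actual API: the names `PAGES`, `Pattern`, `capture_text`, `CsvDatasource`, and `run_agent` are hypothetical, the "web" is an in-memory dictionary, and patterns are reduced to regular expressions. It only shows the flow described above: the Initial State loads the first page, a Data Pattern finds elements, a Capture cleans them, a Datasource stores them, and a Next Page Pattern moves the agent to the next state.

```python
import csv
import io
import re
from dataclasses import dataclass, field

# Hypothetical in-memory "web": URL -> HTML. A real agent would load pages in a browser.
PAGES = {
    "page1": '<li class="item"> Apple </li><li class="item">Pear</li>'
             '<a rel="next" href="page2">Next</a>',
    "page2": '<li class="item">  Plum</li>',
}


@dataclass
class Pattern:
    """Locates elements on a page; here simply a regex with one capture group."""
    regex: str

    def find(self, html):
        return re.findall(self.regex, html)


def capture_text(raw):
    """Capture step: turn a found element into clean, storable text."""
    return " ".join(raw.split())


@dataclass
class CsvDatasource:
    """Stores captured rows as CSV (kept in memory so the sketch is self-contained)."""
    buffer: io.StringIO = field(default_factory=io.StringIO)

    def store(self, row):
        csv.writer(self.buffer, lineterminator="\n").writerow(row)


def run_agent(start_url, data_pattern, next_pattern, datasource):
    """The Initial State loads start_url; every newly loaded page is a new state."""
    url = start_url
    visited = []
    while url is not None and url in PAGES and url not in visited:
        html = PAGES[url]                        # entering a state: load its page
        visited.append(url)
        for element in data_pattern.find(html):  # action: extract and store data
            datasource.store([url, capture_text(element)])
        links = next_pattern.find(html)          # paginator: follow the next page link
        url = links[0] if links else None
    return visited


data_pattern = Pattern(r'<li class="item">(.*?)</li>')
next_pattern = Pattern(r'<a rel="next" href="([^"]+)">')
datasource = CsvDatasource()
visited = run_agent("page1", data_pattern, next_pattern, datasource)
print(datasource.buffer.getvalue())
```

Because the same Data Pattern is applied on every page, items with the same structure are collected from both pages without any per-page configuration, which is the property the Pattern description relies on.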
Page Modified 6/9/17 10:12 AM