Wrappers are specialised program routines that
automatically extract data from Internet websites and convert the information
into a structured format. More specifically, wrappers have three main
functions. First, they must download HTML pages from a website. Second, they
must search for, recognise and extract the specified data. Third, they must
save this data in a suitably structured format to enable further
manipulation [6]. The data can then be imported into other
applications for additional processing. According to [20], over 80% of the
published information on the WWW is based on databases running in the
background. When this data is compiled into HTML documents, the structure of
the underlying databases is completely lost. Wrappers try to reverse this process
by restoring the information to a structured format [21]. With the right
programs, it is even possible to use the WWW as a large database. By using
several wrappers to extract data from the various information sources of the
WWW, the retrieved data can be made available in an appropriately structured
format [4].
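The three functions described above can be illustrated with a minimal sketch. It assumes a hypothetical page that lists its records as rows of an HTML table; the names `download`, `TableRowParser`, `extract`, `save` and the sample markup are illustrative assumptions and are not taken from any of the cited systems. JSON is used here only as one possible structured output format.

```python
import json
import urllib.request
from html.parser import HTMLParser


def download(url):
    """Function 1: fetch an HTML page (needs network access; unused in the demo)."""
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TableRowParser(HTMLParser):
    """Function 2: collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th> cells and are skipped
                self.rows.append(self._row)
            self._row = None


def extract(html, fields):
    """Map each table row onto the given field names."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return [dict(zip(fields, row)) for row in parser.rows]


def save(records):
    """Function 3: serialise the records in a structured format (JSON here)."""
    return json.dumps(records, indent=2)


# Hypothetical page content, used instead of a live download.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

records = extract(SAMPLE_HTML, ["title", "price"])
print(save(records))
```

For a live source, the string passed to `extract` would come from `download(url)` instead of `SAMPLE_HTML`. The field names are supplied by the caller because the HTML itself no longer carries the database schema.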
As a rule, a specially developed wrapper is required for
each individual data source, because of the different and unique structures of
websites. The WWW is also extremely dynamic and continually evolving, which
results in frequent changes in the structures of websites. Consequently,
existing wrappers often have to be updated or even completely rewritten in
order to maintain the desired data extraction capabilities [1].
The Extensible Markup Language (XML) has the potential to alleviate such
problems. Whereas HTML is presentation oriented, XML keeps the data structure separate
from the presentation. However, it may take some time before all data is
provided in the XML format, and it remains to be seen whether XML can establish
itself in all areas of electronic information processing [11]. Moreover,
because XML documents are based on varying Document Type Definitions (DTDs)
or XML Schemas, documents from different sources still differ in structure.
XML can therefore reduce, but not completely resolve, the current problems
of extracting data from HTML documents. Wrappers will,
therefore, retain an important role in the integration of data from WWW sources
for some time to come.
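The point about varying DTDs and schemas can be made concrete with a small sketch. It assumes two hypothetical XML sources that publish the same record under different element and attribute names; both documents and both reader functions are invented for illustration. Parsing is trivial in each case, but a source-specific mapping is still required.

```python
import xml.etree.ElementTree as ET

# Two hypothetical sources describing the same record with different schemas.
SOURCE_A = "<catalog><book><title>Dune</title><price>9.99</price></book></catalog>"
SOURCE_B = '<items><item name="Dune" cost="9.99"/></items>'


def read_source_a(xml_text):
    """Source A stores each field in a child element."""
    root = ET.fromstring(xml_text)
    return [
        {"title": book.findtext("title"), "price": book.findtext("price")}
        for book in root.iter("book")
    ]


def read_source_b(xml_text):
    """Source B stores each field in an attribute with a different name."""
    root = ET.fromstring(xml_text)
    return [
        {"title": item.get("name"), "price": item.get("cost")}
        for item in root.iter("item")
    ]


records_a = read_source_a(SOURCE_A)
records_b = read_source_b(SOURCE_B)
print(records_a == records_b)
```

Neither reader has to search through presentation markup, which is the reduction in effort that XML offers. The need for one reader per schema is the part of the problem that remains.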
Wrapper-Generating Toolkits
Every wrapper can be manually developed from scratch, for example, in an
established programming language using regular expressions. For smaller
applications, this can prove to be a sensible approach. However, a
requirement for a larger number of wrappers inevitably leads to the use of
so-called toolkits, which can generate a complete wrapper for a given data
source from user-defined parameters. One of the most important features of
generated wrappers is the format in which the extracted data can be exported.
If, for example, the extracted data is converted into an XML format, then it
can be imported and processed by a large number
of software applications. Toolkits for generating wrappers can be
differentiated in a number of ways. They can be categorised by their output
methods, interface type, Web crawling capability, use of a graphical user
interface (GUI) and several other characteristics. Laender et al.
[12] categorise a number of toolkits based on the methods used for generating
wrappers. These methods include specially designed wrapper development
languages and algorithms based on HTML-awareness, induction, modelling,
ontology and natural language processing. However, a detailed discussion of
these methods is beyond the scope of this survey paper. The toolkits are
therefore simply divided into two basic categories: commercial and
non-commercial.
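As an illustration of the manual approach mentioned at the start of this section, the following sketch shows a hand-written wrapper that uses a regular expression and exports its result as XML. The page markup, the pattern `ITEM_PATTERN` and the output element names are assumptions made for this example; a real source would need its own pattern.

```python
import re
import xml.etree.ElementTree as ET

# Pattern tailored to one hypothetical page layout; it must be rewritten
# whenever the site changes its markup.
ITEM_PATTERN = re.compile(
    r'<li class="product">\s*<b>(?P<name>[^<]+)</b>'
    r'\s*-\s*(?P<price>\d+\.\d{2})\s*EUR\s*</li>'
)


def regex_wrapper(html):
    """Extract name/price pairs and return them as an XML string."""
    root = ET.Element("products")
    for match in ITEM_PATTERN.finditer(html):
        product = ET.SubElement(root, "product")
        ET.SubElement(product, "name").text = match.group("name").strip()
        ET.SubElement(product, "price", currency="EUR").text = match.group("price")
    return ET.tostring(root, encoding="unicode")


# Hypothetical page content.
SAMPLE_PAGE = """
<ul>
  <li class="product"><b>Widget</b> - 9.99 EUR</li>
  <li class="product"><b>Gadget</b> - 24.50 EUR</li>
</ul>
"""

xml_output = regex_wrapper(SAMPLE_PAGE)
print(xml_output)
```

The pattern encodes the layout of a single site, which is why this approach scales poorly to many sources. The XML output, by contrast, can be read by any XML-aware application without knowledge of the original page.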
The wrapper-generating programs within both of these categories offer several
different means of user interaction. Some toolkits are purely command-line
based and require routines written in a toolkit-specific scripting language
in order to generate an appropriate wrapper for a specified data source.
These wrapper development scripting languages are used in standard text
editors and can be seen as application-specific alternatives to
general-purpose languages such as Perl and Java. A large number of toolkits offer a GUI,
whereby the relevant data within an HTML document is highlighted with a mouse,
and the program then generates a wrapper based on the specified information.
Several toolkits combine both of the features described above. Initially, the
relevant data is highlighted with a mouse and the program generates a wrapper
from this input. If the automatically generated result does not meet the
specified requirements, the user can make changes via an editor integrated
within the toolkit. Whether frequent corrections are necessary depends
largely on the underlying algorithms and the functional maturity of the
toolkit.
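How a wrapper might be generated from highlighted examples can be sketched as follows. This is a deliberately simplified form of delimiter-based induction and is not the algorithm of any particular toolkit: it assumes that the user has highlighted a few values, that each highlighted value occurs only once in the training page, and that all values share the same surrounding markup. The function names and sample pages are hypothetical.

```python
import os
import re


def learn_delimiters(page, highlighted):
    """Find the longest left and right context shared by all highlighted values."""
    lefts, rights = [], []
    for value in highlighted:
        start = page.index(value)  # assumes the value occurs exactly once
        end = start + len(value)
        lefts.append(page[:start][::-1])  # reversed: common prefix = common suffix
        rights.append(page[end:])
    left = os.path.commonprefix(lefts)[::-1]
    right = os.path.commonprefix(rights)
    if not left or not right:
        raise ValueError("highlighted values share no usable context")
    return left, right


def apply_wrapper(page, left, right):
    """Extract every string enclosed by the learned delimiters."""
    pattern = re.compile(re.escape(left) + r"(.*?)" + re.escape(right), re.DOTALL)
    return pattern.findall(page)


# Hypothetical training page on which the user highlighted "Alice" and "Bob".
TRAINING_PAGE = "<ul><li><b>Alice</b> (Sales)</li><li><b>Bob</b> (IT)</li></ul>"
left, right = learn_delimiters(TRAINING_PAGE, ["Alice", "Bob"])

# A different page from the same hypothetical site.
NEW_PAGE = (
    "<ul><li><b>Carol</b> (HR)</li><li><b>Dave</b> (Legal)</li>"
    "<li><b>Erin</b> (IT)</li></ul>"
)
names = apply_wrapper(NEW_PAGE, left, right)
print(left, right, names)
```

If the learned delimiters turn out to be too specific or too loose for other pages of the site, the user would have to correct them by hand, which corresponds to the integrated editor described above.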
Date: 16 June 2008, Monday
