keyword: Web Data Mining - Exploring
Hyperlinks, Contents and Usage Data
Web mining is a rapidly growing research area. It consists of
Web usage mining, Web structure mining, and Web content mining. Web usage
mining refers to the discovery of user access patterns from Web usage logs. Web
structure mining tries to discover useful knowledge from the structure of
hyperlinks. Web content mining aims to extract/mine useful information or
knowledge from web page contents. This tutorial focuses on Web Content Mining.
Web content mining is related to, but different from, data
mining and text mining. It is related to data mining because many data mining
techniques can be applied in Web content mining. It is related to text mining
because much Web content is text. However, it differs
from data mining because Web data are mainly semi-structured and/or
unstructured, while data mining deals primarily with structured data. Web
content mining also differs from text mining because of the semi-structured
nature of the Web, while text mining focuses on unstructured text. Web content
mining thus requires creative applications of data mining and/or text mining
techniques as well as its own unique approaches. In the past few years, there has been
a rapid expansion of activity in the Web content mining area. This is not
surprising given the phenomenal growth of Web content and the significant
economic benefit of such mining. However, due to the heterogeneity and lack
of structure of Web data, automated discovery of targeted or unexpected
knowledge still presents many challenging research problems. In this
tutorial, we will examine the following important Web content mining problems
and discuss existing techniques for solving them. Some other emerging
problems will also be surveyed.
- Data/information extraction: Our focus will be on extraction of structured data from Web
  pages, such as products and search results. Extracting such data allows
  one to provide services. Two main types of techniques, machine learning
  and automatic extraction, are covered.
- Web information integration and schema matching: Although the Web contains a
  huge amount of data, each Web site (or even page) represents similar
  information differently. How to identify or match semantically similar
  data is a very important problem with many practical applications. Some
  existing techniques and problems are examined.
- Opinion extraction from online sources: There are many online opinion
  sources, e.g., customer reviews of products, forums, blogs and chat rooms.
  Mining opinions (especially consumer opinions) is of great importance for
  marketing intelligence and product benchmarking. We will introduce a few
  tasks and techniques to mine such sources.
- Knowledge synthesis: Concept hierarchies or ontologies are useful in many
  applications. However, generating them manually is very time consuming. A
  few existing methods that exploit the information redundancy of the Web
  will be presented. The main application is to synthesize and organize the
  pieces of information on the Web to give the user a coherent picture of
  the topic domain.
- Segmenting Web pages and detecting noise: In many Web applications, one
  only wants the main content of the Web page, without advertisements,
  navigation links, or copyright notices. Automatically segmenting Web pages to
  extract their main content is an interesting problem. A number of
  interesting techniques have been proposed in the past few years.
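To make the last problem concrete, here is a minimal Python sketch of one simple noise-detection heuristic: split the page into blocks and drop blocks that are short or consist mostly of link text. This is an illustration only, not a technique taken from the tutorial. The names `BlockCollector` and `main_content`, the set of block tags, and the thresholds `min_chars` and `max_link_density` are all assumptions chosen for the example.

```python
from html.parser import HTMLParser

# Assumption: tags treated as block boundaries for this illustration.
BLOCK_TAGS = {"p", "div", "td", "li", "ul", "ol", "table", "tr",
              "h1", "h2", "h3", "nav", "footer"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts link characters per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_chars) pairs
        self._text = []         # text fragments of the current block
        self._link_chars = 0    # characters of the current block inside <a>
        self._in_link = 0       # nesting depth of open <a> tags

    def _flush(self):
        # Close the current block, normalizing whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def main_content(html, min_chars=40, max_link_density=0.3):
    """Returns blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for text, link_chars in parser.blocks:
        density = link_chars / len(text)
        if len(text) >= min_chars and density <= max_link_density:
            kept.append(text)
    return kept


sample = """
<html><body>
<div><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
<p>Web content mining aims to extract useful information or knowledge from Web page contents.</p>
<p>Copyright 2008</p>
</body></html>
"""

print(main_content(sample))
```

On the sample page, the navigation bar is rejected because it is short and almost entirely link text, and the copyright line is rejected because it is short, so only the descriptive paragraph is kept. Real pages are far messier, and fixed thresholds like these would need tuning per site.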
All these tasks present major research challenges, and their
solutions also have immediate real-life applications. The tutorial will start
with a short motivation for Web content mining. We then discuss the
difference between Web content mining and text mining, and between Web content
mining and data mining. This is followed by a presentation of the above problems and
current state-of-the-art techniques. Various examples will also be given to
help participants better understand how this technology can be deployed
to help businesses. All parts of the tutorial will have a mix of research and
industry flavor, addressing seminal research concepts and looking at the
technology from an industry angle.
For more information, please visit our
website: http://www.knowlesys.com
Date: 16 June 2008, Monday
