
Data Warehousing & Data Mining [10CS755] Unit 8
UNIT 8: Web Mining
8.1 Introduction
Web mining is the application of data mining techniques to discover patterns from the Web. According to the analysis
target, web mining can be divided into three types: Web usage mining, Web content mining, and
Web structure mining.
8.2 Web Mining
Web content mining is the mining, extraction, and integration of useful data, information, and knowledge from Web
page content. The heterogeneity and lack of structure that permeate much of the ever-expanding information
sources on the World Wide Web, such as hypertext documents, make automated discovery, organization, and
management of Web-based information difficult. Search and indexing tools of the Internet and the World Wide Web,
such as Lycos, Alta Vista, WebCrawler, ALIWEB [6], MetaCrawler, and others, provide some comfort to users, but
they do not generally provide structural information, nor do they categorize, filter, or interpret documents. In
recent years these factors have prompted researchers to develop more intelligent tools for information retrieval,
such as intelligent web agents, as well as to extend database and data mining techniques to provide a higher level
of organization for semi-structured data available on the web.
The agent-based approach to web mining involves the development of sophisticated AI systems that can act
autonomously or semi-autonomously on behalf of a particular user, to discover and organize web-based information.
Web content mining can be approached from two different points of view [1]: the information retrieval view and the
database view. R. Kosala et al. [2] summarized the research done on unstructured and semi-structured data from the
information retrieval view. Their summary shows that most of this research uses a bag-of-words representation,
which is based on statistics about single words in isolation, to represent unstructured text, and takes the single
words found in the training corpus as features. For semi-structured data, all of the works utilize the HTML
structures inside the documents, and some also utilize the hyperlink structure between documents for document
representation. In the database view, the goal is better information management and querying on the web, so the
mining tries to infer the structure of a web site in order to transform the web site into a database.
There are several ways to represent documents; the vector space model is typically used. The documents constitute the
whole vector space. If a term t occurs n(D, t) times in document D, the t-th coordinate of D is n(D, t). The length of
a document vector can be measured with a norm; under the $L_\infty$ norm, for example, $\|D\|_\infty = \max_t n(D, t)$.
This raw-count representation does not capture the importance of words in a document. To resolve this, tf-idf (term
frequency times inverse document frequency) is introduced.
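The term counts n(D, t) and a tf-idf weighting can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the three sample documents, the whitespace tokenizer, and the helper names
term_counts and tf_idf are hypothetical, and the idf variant log(N / df(t)) is one common choice among several.

```python
import math
from collections import Counter

def term_counts(doc):
    """Return n(D, t): how often each term t occurs in document D."""
    return Counter(doc.lower().split())

def tf_idf(docs):
    """Weight each term count by log(N / df(t)), one common idf variant."""
    counts = [term_counts(d) for d in docs]
    n_docs = len(docs)
    # df(t): number of documents that contain term t at least once
    df = Counter(t for c in counts for t in c)
    return [
        {t: n * math.log(n_docs / df[t]) for t, n in c.items()}
        for c in counts
    ]

# Hypothetical sample corpus
docs = [
    "web mining finds patterns in web data",
    "text mining finds patterns in text",
    "spatial data mining",
]
weights = tf_idf(docs)
```

A term such as "mining" that occurs in every document receives weight zero, while "web", which occurs twice in
the first document and nowhere else, receives the largest weight in that document.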
Feature selection can be implemented by scanning the documents multiple times. Provided that the categorization
result is barely affected, a subset of the features is extracted. The general approach is to construct an evaluation
function that scores each feature. Information Gain, Cross Entropy, Mutual Information, and Odds Ratio are commonly
used as evaluation functions. The classifiers and pattern analysis methods of text data mining are very similar to
traditional data mining techniques. The usual evaluation measures are Classification Accuracy, Precision, Recall, and
Information Score.
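As an illustration of an evaluation function, the following sketch scores a term by its information gain, that is,
the reduction in class entropy obtained by splitting the documents on whether they contain the term. The tiny
labelled corpus and the function names are hypothetical; a real system would work on far more documents.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    result = 0.0
    for label in set(labels):
        p = labels.count(label) / total
        result -= p * math.log2(p)
    return result

def information_gain(docs, labels, term):
    """Entropy reduction from splitting documents on presence of `term`."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    remainder = 0.0
    for subset in (with_term, without_term):
        if subset:  # skip an empty side of the split
            remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical training corpus: each document is a set of terms.
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "team"}]
labels = ["sport", "sport", "politics", "politics"]
```

Here "goal" separates the two classes perfectly and gets the maximum gain of one bit, while "team" occurs equally
often in both classes and gets a gain of zero, so a feature selector would keep the former and drop the latter.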
8.3 Text mining
Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of
deriving high-quality information from text. High-quality information is typically derived through the devising of
patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of
structuring the input text (usually parsing, along with the addition of some derived linguistic features and the
removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally
evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of
relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering,
concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and
entity relation modeling (i.e., learning relations between named entities). Text analysis involves information
retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information
extraction, data mining techniques including link and association analysis, visualization, and predictive analytics.
The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing
(NLP) and analytical methods. A typical application is to scan a set of documents written in a natural language and
either model the document set for predictive classification purposes or populate a database or search index with the
information extracted.
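The "structure the text, then populate a search index" application described above can be sketched as follows.
This is a minimal illustration, not a full NLP pipeline: the regular-expression tokenizer, the two sample documents,
and the names tokenize and build_index are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Structure raw text: lowercase it and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(documents):
    """Map each word to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical document set keyed by document id
documents = {
    1: "Text mining derives patterns from text.",
    2: "Spatial mining derives patterns from maps.",
}
# Word frequency distribution over the whole document set
frequencies = Counter(t for text in documents.values() for t in tokenize(text))
index = build_index(documents)
```

The frequency table supports the lexical analysis step, and the inverted index answers queries such as "which
documents mention maps?" without rescanning the text.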
8.4 Generalization of Structured Data
An important feature of object-relational and object-oriented databases is their capability of storing, accessing, and
modeling complex structure-valued data, such as set- and list-valued data and data with nested structures.
How can generalization be performed on such data? Let's start by looking at the generalization of set-valued,
list-valued, and sequence-valued attributes.
A set-valued attribute may be of homogeneous or heterogeneous type. Typically, set-valued data can be generalized
by (1) generalization of each value in the set to its corresponding higher-level concept, or (2) derivation of the
general behavior of the set, such as the number of elements in the set, the types or value ranges in the set, the
weighted average for numerical data, or the major clusters formed by the set. Moreover, generalization can be
performed by applying different generalization operators to explore alternative generalization paths. In this case,
the result of generalization is a heterogeneous set.
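Both generalization options for a set-valued attribute can be illustrated with a short sketch. The concept
hierarchy, the hobby values, and the function names below are hypothetical and serve only to make the two options
concrete.

```python
# Hypothetical concept hierarchy: low-level value -> higher-level concept.
HIERARCHY = {
    "tennis": "sports",
    "hockey": "sports",
    "chess": "games",
    "violin": "music",
}

def generalize_values(values, hierarchy):
    """Option (1): replace each value by its higher-level concept."""
    # Values missing from the hierarchy are kept unchanged.
    return {hierarchy.get(v, v) for v in values}

def summarize(values):
    """Option (2): describe the general behavior of the set, here its size."""
    return {"count": len(values)}

hobbies = {"tennis", "hockey", "chess", "violin"}
by_concept = generalize_values(hobbies, HIERARCHY)
by_behavior = summarize(hobbies)
```

Applying option (1) collapses the four hobbies into three concepts, while option (2) keeps only the fact that the
set has four elements; which path is more useful depends on the analysis at hand.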
8.5 Spatial Data Mining
A spatial database stores a large amount of space-related data, such as maps, preprocessed remote sensing or
medical imaging data, and VLSI chip layout data. Spatial databases have many features distinguishing them from
relational databases. They carry topological and/or distance information, usually organized by sophisticated,
multidimensional spatial indexing structures that are accessed by spatial data access methods and often require
spatial reasoning, geometric computation, and spatial knowledge representation techniques.
Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not
explicitly stored in spatial databases. Such mining demands an integration of data mining with spatial database
technologies. It can be used for understanding spatial data, discovering spatial relationships and relationships
between spatial and nonspatial data, constructing spatial knowledge bases, reorganizing spatial databases, and
optimizing spatial queries. It is expected to have wide applications in geographic information systems,
geomarketing, remote sensing, image database exploration, medical imaging, navigation, traffic control, environmental
studies, and many other areas where spatial data are used. A crucial challenge to spatial data mining is the
exploration of efficient spatial data mining techniques due to the huge amount of spatial data and the complexity of
spatial data types and spatial access methods.
What about using statistical techniques for spatial data mining? Statistical spatial data analysis has been a popular
approach to analyzing spatial data and exploring geographic information. The term geostatistics is often associated
with continuous geographic space, whereas the term spatial statistics is often associated with discrete space. In a
statistical model that handles nonspatial data, one usually assumes statistical independence among different portions
of data. However, unlike traditional data sets, spatially distributed data have no such independence, because in
reality spatial objects are often interrelated, or more exactly spatially co-located, in the sense that the closer
two objects are located, the more likely they are to share similar properties. For example, natural resources,
climate, temperature, and economic situations are likely to be similar in geographically closely located regions.
People even consider this the first law of geography: "Everything is related to everything else, but nearby things
are more related than distant things." Such a property of close interdependency across nearby space leads to the
notion of spatial autocorrelation. Based on this notion, spatial statistical modeling methods
have been developed with good success. Spatial data mining will further develop spatial statistical analysis methods
and extend them for huge amounts of spatial data, with more emphasis on efficiency, scalability, cooperation with
database and data warehouse systems, improved user interaction, and the discovery of new types of knowledge.
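Spatial autocorrelation can be quantified in several ways. The sketch below uses Moran's I, a standard statistic
that the text above does not name and that is brought in here purely for illustration. The layout of six regions
in a row, the binary neighbor weights, and the sample values are assumptions.

```python
def morans_i(values, weights):
    """Moran's I for `values` given a spatial weight matrix `weights`."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    # Weighted sum of products of deviations over all pairs of regions
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n) for j in range(n)
    )
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)

def chain_weights(n):
    """Binary adjacency for n regions in a row: neighbors differ by 1."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]

# Hypothetical measurements for six regions laid out in a row
clustered = [1, 1, 1, 5, 5, 5]      # similar values sit next to each other
alternating = [1, 5, 1, 5, 1, 5]    # neighbors always differ
w = chain_weights(6)
```

With these inputs the clustered layout yields a positive Moran's I and the alternating layout a negative one,
matching the intuition that nearby regions with similar values indicate positive spatial autocorrelation.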
8.6 Mining spatio-temporal data
Spatio-temporal data mining is an emerging research area dedicated to the development and application of novel
computational techniques for the analysis of large spatio-temporal databases. The main impulse to research in this
subfield of data mining comes from the large amount of (1) spatial data made available by GIS, CAD, robotics and
computer vision applications, computational biology, and mobile computing applications, and (2) temporal data
obtained by registering events (e.g., telecommunication or web traffic data) and monitoring processes and workflows.
Both the temporal and spatial dimensions add substantial complexity to data mining tasks.
First of all, the spatial relations, both metric (such as distance) and non-metric (such as topology, direction,
shape, etc.), and the temporal relations (such as before and after) are information-bearing and therefore need to be
considered in the data mining methods.
Secondly, some spatial and temporal relations are implicitly defined, that is, they are not explicitly encoded in a
database. These relations must be extracted from the data, and there is a trade-off between precomputing them
before the actual mining process starts (eager approach) and computing them on the fly when they are actually
needed (lazy approach). Moreover, despite much formalization of space and time relations available in
spatio-temporal reasoning, the extraction of spatial/temporal relations implicitly defined in the data introduces
some degree of fuzziness that may have a large impact on the results of the data mining process.
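The eager versus lazy trade-off can be made concrete with a small sketch that derives one implicit metric relation,
Euclidean distance, from stored point coordinates. The point data and the names below are hypothetical; the lazy
variant uses functools.lru_cache from the Python standard library to memoize results.

```python
import math
from functools import lru_cache

# Hypothetical stored data: named points with (x, y) coordinates
POINTS = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (6.0, 8.0)}

def distance(p, q):
    """Euclidean distance between two named points."""
    (x1, y1), (x2, y2) = POINTS[p], POINTS[q]
    return math.hypot(x2 - x1, y2 - y1)

# Eager: materialize every pairwise distance before mining starts.
EAGER = {(p, q): distance(p, q) for p in POINTS for q in POINTS}

# Lazy: compute a distance only when asked, caching it for reuse.
@lru_cache(maxsize=None)
def lazy_distance(p, q):
    return distance(p, q)
```

The eager table costs time and memory quadratic in the number of objects up front but makes every later lookup
cheap, whereas the lazy function pays only for the pairs the mining algorithm actually requests.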
Thirdly, working at the level of stored data, that is, geometric representations (points, lines, and regions) for
spatial data or time stamps for temporal data, is often undesirable. For instance, urban planning researchers are
interested in possible relations between two roads, which either cross each other, run parallel, or are confluent,
independently of the fact that the two roads are represented by one or more tuples of a relational table of "lines"
or "regions". Therefore, complex transformations are required to describe the units of analysis at higher conceptual
levels, where human-interpretable properties and relations are expressed.
Fourthly, spatial resolution or temporal granularity can have a direct impact on the strength of patterns that
can be discovered in the datasets. Interesting patterns are more likely to be discovered at the lowest
resolution/granularity level. On the other hand, large support is more likely to exist at higher levels.
Fifthly, many rules of qualitative reasoning on spatial and temporal data (e.g., transitive properties for the
temporal relations after and before), as well as spatio-temporal ontologies, provide a valuable source of
domain-independent knowledge that should be taken into account when generating patterns. How to express these rules
and how to integrate spatio-temporal reasoning mechanisms in data mining systems are still open problems.
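As a small illustration of such a qualitative rule, the sketch below applies the transitivity of the temporal
relation "before" to derive facts that are not stored explicitly. The event names and the function
transitive_closure are hypothetical, and the naive fixed-point loop is meant for clarity rather than efficiency.

```python
def transitive_closure(before):
    """Add (a, c) whenever (a, b) and (b, c) are present, until stable."""
    closure = set(before)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Hypothetical event log: each pair (x, y) states "x happened before y".
observed = {("login", "search"), ("search", "purchase"), ("purchase", "logout")}
inferred = transitive_closure(observed) - observed
```

From three observed pairs the rule infers three more, for example that "login" happened before "logout". A
pattern generator that knows this rule can avoid reporting such derived facts as new discoveries.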