Data Warehousing & Data Mining Code 53
FAQ
1. What are the disadvantages of a file management system compared to a database management system?
Ans:
Some of the disadvantages of a file management system compared to a database management system are:
- Data redundancy and inconsistency
- Difficulty in accessing data
- Difficulty in integrating data into new enterprise-level applications because of varying formats
- Lack of support for concurrent updates by multiple users
- Lack of inherent security
2. Are relational databases the only possible type of database model?
Ans:
No. Apart from the relational model, other models include the network and hierarchical models. However, these two models are obsolete. Nowadays, relational and object-oriented models are preferred.
3. What is referential integrity and how is it achieved in a relational database?
Ans:
Referential integrity is a feature of a DBMS that prevents the user from entering inconsistent data. It is mainly achieved by defining foreign key constraints on tables.
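The following is a minimal Python sketch (not from the original text) showing how a foreign key constraint enforces referential integrity; the table and column names are illustrative only.

```python
import sqlite3

# A foreign key constraint enforcing referential integrity.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only with this pragma

conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT,
        dept_id INTEGER REFERENCES department(dept_id)
    )
""")

conn.execute("INSERT INTO department VALUES (10, 'Sales')")
conn.execute("INSERT INTO employee VALUES (1, 'Asha', 10)")      # valid: dept 10 exists

try:
    conn.execute("INSERT INTO employee VALUES (2, 'Ravi', 99)")  # invalid: no dept 99
except sqlite3.IntegrityError as err:
    print("Rejected inconsistent row:", err)
```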
4. What are the higher normal forms?
Ans:
A normal form higher than 3NF is the Boyce-Codd Normal Form (BCNF). BCNF differs from 3NF when a relation has more than one composite, overlapping candidate key.
------------------------------------------------------------------------------------------------------------
5. On which layer of the application architecture does a data warehouse operate?
Ans:
A data warehouse operates on the server side: it is a server-side repository used to store data.
6. What is a data warehouse?
Ans:
A data warehouse is a huge repository of data used for very complex business analysis.
7. What are the benefits of data warehousing?
Ans:
The main benefit of data warehousing is that it keeps data in a form that allows complex business analysis to be performed in the minimum amount of time.
8. What are the application areas of a data warehouse?
Ans:
There are various application areas of a data warehouse. Some of these are:
- Airlines
- Meteorology
- Logistics
- Insurance
9. Where can you use a data warehouse successfully?
Ans:
It is ideal to implement data warehousing when there is a large amount of historical data that needs to be processed for extensive analytical purposes.
10. When and where is a data mart useful?
Ans:
A data mart helps to provide data for conducting analysis at a specialized level. It is typically used at the strategic business unit level, such as the department level of a business unit.
11. What does historical data support in a data warehouse?
Ans:
Historical data supplies both pre-processed and non-pre-processed data for conducting business analysis.
-----------------------------------------------------------------------------------------------------------
Solutions to Chapter Three Questions
12. What is query redirection?
Ans:
When the available data grows beyond a manageable size, partitioning becomes essential. Query redirection means that queries are directed to the appropriate partitions that store the data required by the query.
FAQ
13. What are the benefits of OLTP?
Ans:
On-Line Transaction Processing (OLTP) assists in storing current business transactional data. It also supports a large number of concurrent users accessing the data at the same time.
14. Why can the OLTP system not provide historical data for analysis?
Ans:
Data in a data warehouse comes from an OLTP system, but it cannot be used directly for analysis. The reason is that data in OLTP systems is not organized to return results quickly from billions of records. In a data warehouse, data is classified into various categories, so results can be returned quickly.
15. Why is the data in the data warehouse not stored in a normalized form as in OLTP?
Ans:
The objective of storing data in a normalized form in OLTP is to reduce redundancy and minimize disk storage. The key objective in a data warehouse is to improve query response time. The easier the access to data, the better the query response time. Hence, the normalization rules do not matter in a data warehouse.
16. An integral part of OLTP is its support for hundreds of concurrent users. The number of concurrent users supported by a data warehouse is comparable to OLTP. Is this statement true or false? Justify your answer.
Ans:
The statement is false. The number of people involved in data analysis is very low compared to the front-end users who work with transactional data. Moreover, the CPU usage per user is much higher in data warehousing than for OLTP users.
17. Explain why a data warehouse does not use current or OLTP data for analysis.
Ans:
The main purpose of a data warehouse is to provide historical data to analyze business trends. Therefore, the data needs to be a snapshot of events over time and not only the current data.
18. What is the advantage of MOLAP as a storage model?
Ans:
MOLAP dimensions provide better query performance. Here the contents of the dimension are processed and stored on the Analysis Server and not on a relational server.
19. What kind of data does a fact table contain?
Ans:
A fact table contains numeric data.
20. What are the different OLAP storage models?
Ans:
The different OLAP storage models are MOLAP, ROLAP, and HOLAP.
21. A data analysis has to be done in the fastest possible way on data stored in a multidimensional format. Which storage model is best suited in this case?
Ans:
MOLAP.
---------------------------------------------------------------------------------------------------------
FAQ
22. Name the two important parameters that decide the granularity of partitions.
Ans:
Two important factors that decide the granularity of partitions are the overall size of the system and its manageability. Both parameters have to be balanced against each other while deciding on a partitioning strategy. For example, if data containing information about a population is partitioned by state, the administrator must weigh how large each partition becomes against how easy the resulting partitions are to maintain.
23. Are there any disadvantages of data partitioning?
Ans:
Data partitioning is by and large an advantageous technique for improving performance. However, it increases implementation complexity and imposes constraints on query design.
24. Can partitions be indexed?
Ans:
Yes, partitions can be indexed if supported by the platform. For example, in Oracle 9i you can create various types of partitioned indexes.
25. If you have a huge amount of historical data, which is too old to be useful often but cannot be discarded, can partitioning help?
Ans:
Essentially, the answer depends on various factors such as the availability of resources and the design strategy. However, you can partition data on the basis of the date it was last accessed and keep the historical data in a separate partition. In fact, you can use striping to keep it on a separate disk and improve access speed to the more frequently used data.
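A minimal Python sketch (not from the original text) of splitting records into a historical and a current partition by last-access date; the column names and cutoff date are illustrative only.

```python
import pandas as pd

records = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "last_accessed": pd.to_datetime(
        ["2001-03-10", "2004-07-22", "1998-11-05", "2005-01-30"]
    ),
})

cutoff = pd.Timestamp("2003-01-01")

# Rows accessed before the cutoff go to a "historical" partition (e.g. a slower disk);
# the rest stay in the "current" partition that serves most queries.
historical = records[records["last_accessed"] < cutoff]
current = records[records["last_accessed"] >= cutoff]

print(len(historical), "historical rows;", len(current), "current rows")
```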
------------------------------------------------------------------------------------------------------------
FAQ
26. Are there any risks associated with aggregation?
Ans:
The main risk associated with aggregates is the increase in disk storage space.
27. Once created, is an aggregate permanent?
Ans:
No, aggregates keep changing as per the needs of the business. In fact, they can be taken offline or put online at any time by the administrator. Aggregates that have become obsolete can also be deleted to free up disk space.
28. Can operations such as MIN and MAX be determined once a summary table has been created?
Ans:
Operations such as MIN and MAX cannot be determined correctly from a summary table after it has been created. To be available, their values must be calculated and stored at the time the summary table is derived from the base table.
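A minimal Python sketch (not from the original text) showing why MIN and MAX must be computed while the summary table is being built; the table and column names are illustrative only.

```python
import pandas as pd

# Base-table sales detail.
base = pd.DataFrame({
    "region": ["N", "N", "S", "S", "S"],
    "amount": [120, 80, 300, 50, 90],
})

# Summary table built from the base table. MIN and MAX are only available later
# if they are computed and stored now, alongside SUM and COUNT.
summary = base.groupby("region")["amount"].agg(
    total="sum", rows="count", min_amount="min", max_amount="max"
).reset_index()

print(summary)
# A summary holding only total/rows could never tell us that the smallest
# southern sale was 50; storing min_amount preserves that information.
```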
29. How much storage increase might be required in the data warehouse system when using aggregates?
Ans:
The storage needs typically increase by a factor of one, or sometimes even two, when aggregates are used.
---------------------------------------------------------------------------------------------------------
FAQ
30. What are conformed dimensions?
Ans:
A conformed dimension is one whose meaning is independent of the fact table from which it is referred.
31. What are virtual data marts?
Ans:
Virtual data marts are logical views of multiple physical data marts, built according to user requirements.
32. Which tool supports data mart based data warehouse architectures?
Ans:
Informatica is commonly used for implementing data mart based data warehouse architectures.
33. Is the data in data marts also historical, like in data warehouses?
Ans:
The data in data marts is historical only to some extent. It is not the same as the data in the data warehouse because of the difference in the purpose and approach of the two.
------------------------------------------------------------------------------------------------------------
FAQ
34. How can you classify metadata?
Ans:
You can classify metadata according to its use as:
- Administrative metadata: metadata used for managing the data, in terms of statistics such as time of creation, access rights, and last access time.
- Structural metadata: metadata that describes the structure of the data.
- Descriptive metadata: metadata that describes the purpose or functionality of the data.
35. What is backroom metadata?
Ans:
Backroom metadata is the metadata related to the processes of extracting, cleaning, and loading. It is of use to the DBA and business users but not to the end user.
36. What is a metadata catalogue?
Ans:
A metadata catalogue is the same as a metadata repository. It is also called a metadatabase.
37. Are there any tools for metadata management?
Ans:
Yes, there are various tools that facilitate metadata management. One such Windows-based tool is Saphir. SQL Server 2000 also enables metadata management to some extent.
-----------------------------------------------------------------------------------------------------------
FAQ
38. Are system and process management devoid of any manual intervention, considering that the process manager is a tool and not a person?
Ans:
No. Although the system and process managers are themselves tools that automate system and process management in data warehouses, they must be configured and at times handled through manual intervention. These tasks are typically done by the database administrator.
39. Does SQL Server also provide system managers?
Ans:
Yes. SQL Server includes various components that enable system management through its management and security services.
40. What is Oracle Warehouse Builder (OWB)?
Ans:
It is one of the commonly used data warehouse development tools, with advanced features such as support for large databases, automated summary management, and an embedded multidimensional OLAP engine. Unlike SQL Server, which runs only on the Windows platform, OWB can be used on multiple platforms. It is also considered faster, more reliable, and more scalable than SQL Server.
41. What is replication?
Ans:
Replication is the process of copying data and keeping the copies synchronized across multiple databases or servers, so that the same data is available at more than one site.
------------------------------------------------------------------------------------------------------------
FAQ
42. What is the KDD process?
Ans:
The unifying goal of the KDD process is to extract knowledge from data in the context of large databases. It does this by using data mining methods (algorithms) to extract (identify) what is deemed knowledge, according to specified measures and thresholds, using a database along with any required preprocessing, subsampling, and transformation of that database.
43. What is data visualization?
Ans:
Data visualization presents data in three dimensions and in colour to help users view complex patterns. Visualization tools also provide advanced manipulation capabilities to slice, rotate, or zoom the objects to identify patterns.
44. What are the constituents of multidimensional objects?
Ans:
Dimensions and measures.
45. What does a level specify within a dimension?
Ans:
Levels specify the contents and structure of the dimension's hierarchy.
46. What is data mining?
Ans:
Data mining is the process of finding new and potentially useful knowledge from data.
47. What does data mining software do?
Ans:
Data mining software searches large volumes of data, looking for patterns that accurately predict behaviour, such as which customers are most likely to maintain a relationship with the company. Common techniques employed by data mining software include neural networks, decision trees, and standard statistical modelling.
48. What is Oracle Data Mining?
Ans:
Oracle Data Mining is enterprise data mining software that combines the ease of a Windows-based client with the power of a fully scalable, multi-algorithm, UNIX server-based solution. Oracle Data Mining provides comprehensive predictive modeling capabilities that take advantage of parallel computing techniques to rapidly extract valuable customer intelligence. It can optionally generate deployable models in C, C++, or Java code, delivering the "power of prediction" to call centre, campaign management, and Web-based applications enterprise-wide.
49. How does data mining differ from OLAP?
Ans:
Simply put, OLAP compares and data mining predicts. OLAP performs roll-ups, aggregations, and calculations, and it compares multiple results in a clearly organized graphical or tabular display. Data mining analyzes data on historical cases to discover patterns and uses those patterns to make predictions or estimates of outcomes for unknown cases. An analyst may use OLAP to discover a business problem, and then apply data mining to make the predictions necessary for a solution. An OLAP user can apply data mining to discover meaningful dimensions that should be compared. OLAP may be used to perform the roll-ups and aggregations needed by the data mining tool. Finally, OLAP can compare data mining predictions or values derived from predictions.
50. What are some typical data mining applications?
Ans:
Following are some typical data mining applications:
- Customer retention
- Cross-selling
- Response modelling / target marketing
- Profitability analysis
- Product affinity analysis
------------------------------------------------------------------------------------------------------------
FAQ
51. What is noisy data?
Ans:
Noise is a random error or variance in data. It can arise because of:
- Faulty data collection and data entry mistakes, such as typing errors
- Data transmission and storage problems
- Inconsistency in naming conventions
Noise makes data inaccurate for predictions and renders it futile for mining systems.
52. Which are the major data mining tasks?
Ans:
The main data mining tasks include:
- Classification
- Clustering
- Association
- Prediction
- Characterization and discrimination
- Evolution analysis
53. What are some other data mining languages and standardizations of primitives apart from DMQL?
Ans:
Some other data mining languages and standardizations of primitives apart from DMQL include:
- MSQL
- MINE RULE
- Query flocks based on Datalog syntax
- OLE DB for DM
- CRISP-DM
54. Which data mining tools are used commercially?
Ans:
Some data mining tools used commercially are:
- Clementine
- Darwin
- Enterprise Miner
- Intelligent Miner
- MineSet
55. How can noisy data be smoothed?
Ans:
Noisy data can be smoothed using the following techniques (a simple binning sketch follows the list):
- Binning
- Clustering
- Combined computer/human inspection
- Regression
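A minimal Python sketch (not from the original text) of smoothing by equal-frequency binning, where each value is replaced by its bin mean; the values and bin size are illustrative only.

```python
# Smoothing by bin means on sorted data.
values = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for start in range(0, len(values), bin_size):
    bin_vals = values[start:start + bin_size]
    mean = sum(bin_vals) / len(bin_vals)
    # Every value in the bin is replaced by the bin mean.
    smoothed.extend([round(mean, 1)] * len(bin_vals))

print(smoothed)   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```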
---------------------------------------------------------------------------------------------------------
FAQ
56. What are the variations to the Apriori algorithm?
Ans:
Following are some of the variations to the Apriori algorithm that improve the efficiency of the original algorithm (a small Apriori sketch follows the list):
- Transaction reduction: reducing the number of transactions scanned in future iterations
- Partitioning: partitioning the data to find candidate itemsets
- Sampling: mining on a subset of the given data
- Dynamic itemset counting: adding candidate itemsets at different points during the scan
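A minimal Python sketch (not from the original text) of the basic Apriori idea of level-wise frequent itemset mining; the transactions and the support threshold are illustrative only.

```python
# Basic Apriori: grow frequent itemsets level by level.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]
min_support = 3  # minimum number of transactions an itemset must appear in

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Level 1: frequent single items.
items = {i for t in transactions for i in t}
frequent = [{i} for i in items if support({i}) >= min_support]

k = 2
all_frequent = list(frequent)
while frequent:
    # Candidate generation: unions of frequent (k-1)-itemsets that have size k.
    candidates = {frozenset(a | b) for a in frequent for b in frequent
                  if len(a | b) == k}
    frequent = [set(c) for c in candidates if support(set(c)) >= min_support]
    all_frequent.extend(frequent)
    k += 1

print(all_frequent)
```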
57. Which is the best approach when we are interested in finding all possible interactions among a set of attributes?
Ans:
The best approach to finding all possible interactions among a set of attributes is association rule mining.
58. What is overfitting in a neural network?
Ans:
Overfitting is a common problem in neural network design. It occurs when a network has memorized the training set but has not learned to generalize to new inputs. Overfitting produces a relatively small error on the training set but a much larger error when new data is presented to the network.
59. What is a backpropagation neural network?
Ans:
Backpropagation is a neural network algorithm for classification that employs a method of gradient descent. It searches for a set of weights that models the data so as to minimize the mean squared distance between the network's class prediction and the actual class label of the data samples. Rules may be extracted from trained neural networks to help improve the interpretability of the learned network.
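A minimal Python sketch (not from the original text) of the gradient-descent weight updates behind backpropagation, reduced to a single sigmoid unit learning the OR function; the data, learning rate, and epoch count are illustrative only.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)   # OR function as class labels

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=2)         # weights
b = 0.0                                   # bias
lr = 0.5                                  # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    pred = sigmoid(X @ w + b)             # forward pass
    error = pred - y                      # prediction minus target
    delta = error * pred * (1 - pred)     # gradient of squared error w.r.t. the net input
    w -= lr * (X.T @ delta) / len(X)      # gradient-descent weight update
    b -= lr * delta.mean()

print(np.round(sigmoid(X @ w + b), 2))    # predictions move toward [0, 1, 1, 1]
```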
----------------------------------------------------------------------------------------------------------
FAQ
60. What is the difference between KDD and data mining?
Ans:
KDD refers to the overall process of discovering useful knowledge from data. It includes the choice of encoding schemes, preprocessing, sampling, and projections of the data prior to the data mining step.
Data mining refers to the application of algorithms for extracting patterns from data without the additional steps of the KDD process. It is essentially the modeling step within the KDD process.
61. Is data stored for data mining different from that in other operational systems?
Ans:
The two systems differ in the usage patterns associated with data mining. Some of the basic differences are:
- Operational systems support transactional processing, whereas data mining systems support analytical processing.
- Operational systems are process-oriented, whereas data mining systems are subject-oriented.
- Operational systems are concerned with current data, whereas data mining systems deal with historical data.
- Operational systems are updated frequently and have volatile data, whereas data mining systems have non-volatile data that is rarely changed.
- Operational systems are optimized for fast updates, whereas data mining systems are optimized for fast retrieval.
62. What are the primary goals of the KDD process?
Ans:
The two primary goals of the KDD process are:
- Prediction of unknown or future values
- Description, which focuses on finding human-interpretable patterns
63. Does KDD need to be scalable?
Ans:
KDD should be scalable because of the ever-increasing data in enterprises. Limiting the size of a data mining system will affect the accuracy of its predictions.
------------------------------------------------------------------------------------------------------------
FAQ
64. What are the different types of query answering mechanisms?
Ans:
Query answering can be classified into two categories based on the method of response:
- Direct query answering: the query is answered by returning exactly what is being asked.
- Intelligent query answering: the intent of the query is analyzed, and generalized, neighborhood, or associated information relevant to the query is provided.
65. Are there any social concerns related to data mining?
Ans:
The most important social issue is the privacy of data pertaining to individuals. KDD poses a threat to privacy: the discovered patterns often classify individuals into categories, revealing confidential personal information. Moreover, it raises very sensitive and controversial issues, such as those that involve race, gender, or religion. In addition, it may correlate and disclose confidential, sensitive facts about individuals.
66. What is market basket analysis?
Ans:
Market basket analysis is one of the most common and useful techniques of data analysis for marketing. Its purpose is to determine what products customers purchase together. This can be helpful to a retailer, merchant, manufacturer, or any other organization interested in studying consumer buying patterns.
67. What is visual data mining?
Ans:
Visual data mining integrates data mining and data visualization in order to discover implicit and useful knowledge from large data sets.
------------------------------------------------------------------------------------------------------------
I. Note: Answer all the questions.
68. What is normalization? What are the different forms of normalization?
The usual approach to normalization in database applications is to ensure that the data is divided into two or more tables, such that when the data in one of them is updated, it does not lead to anomalies in the data (the student is advised to refer to any book on database management systems for details, if interested). The common forms are First (1NF), Second (2NF), Third (3NF), and Boyce-Codd (BCNF) normal forms.
The idea is to ensure that, when combined, the data available is consistent. However, in data warehousing, one may even tend to break a large table into several "denormalized" smaller tables. This may use a lot of extra space, but it helps in an indirect way: it avoids the overhead of joining the data during queries.
69. Define data warehouse. What are the roles of education in a data warehousing delivery process?
Data warehouse: In its simplest form, a data warehouse is a collection of key pieces of information used to manage and direct the business for the most profitable outcome. It would decide the amount of inventory to be held, the number of employees to be hired, the amount to be procured on loan, and so on.
The above definition may not be precise, but that is how data warehouse systems are. Different authors give different definitions, but we keep this idea in mind and proceed. It is a large collection of data and a set of process managers that use this data to make information available. The data can be metadata, facts, dimensions, and aggregations. The process managers can be load managers, warehouse managers, or query managers. The information made available is such that it allows the end users to make informed decisions.
Roles of education in a data warehousing delivery process:
Education has two roles to play: one is to make people, especially top-level policy makers, comfortable with the concept; the second is to aid the prototyping activity. To take care of the education role, an initial (usually scaled-down) prototype is created and people are encouraged to interact with it. This helps achieve both of the activities listed above: the users become comfortable with the use of the system, and the warehouse developer becomes aware of the limitations of the prototype, which can then be improved upon.
70. What are process managers? What are the different types of process managers?
Process managers: These are responsible for the smooth flow, maintenance, and upkeep of data into and out of the database.
The main types of process managers are:
i). Load manager: takes care of source interaction, data transformation, and data load.
ii). Warehouse manager: takes care of data movement, metadata management, and performance monitoring.
iii). Query manager: controls query scheduling and monitoring.
We shall look into each of them briefly. Before that, we look at a schematic diagram that defines the boundaries of the three types of managers.

71. Give the architectures of data mining systems.

72. What are the guidelines for a KDD environment?
It is customary in the computer industry to formulate rules of thumb that help information technology (IT) specialists apply new developments. In setting up a reliable data mining environment we may follow these guidelines so that the KDD system works in the manner we desire:
i). Support extremely large data sets
ii). Support hybrid learning
iii). Establish a data warehouse
iv). Introduce data cleaning facilities
v). Facilitate working with dynamic coding
vi). Integrate with decision support systems
vii). Choose an extendible architecture
viii). Support heterogeneous databases
ix). Introduce client/server architecture
x). Introduce cache optimization
PART - B
II. Answer any FIVE full questions.
73. With the help of a diagram explain the architecture of a data warehouse.
The architecture of a data warehouse is indicated below. Before we proceed further, we should be clear about the concept of architecture. It only gives the major items that make up a data warehouse. The size and complexity of each of these items depend on the actual size of the warehouse itself, the specific requirements of the warehouse, and the actual details of implementation.

Before looking into the details of each of the managers, we can get a broad idea of their functionality by mapping the processes studied in the previous chapter to the managers. The extracting and loading processes are taken care of by the load manager. The processes of cleanup and transformation of data, as also of backup and archiving, are the duties of the warehouse manager, while the query manager, as the name implies, takes care of query management.
74. Indicate the important functions of a Load Manager and a Warehouse Manager.
Important functions of the Load Manager:
i) To extract data from the source(s).
ii) To load the data into a temporary storage device.
iii) To perform simple transformations to map it to the structures of the data warehouse.
Important functions of the Warehouse Manager:
i) Analyze the data to confirm data consistency and data integrity.
ii) Transform and merge the source data from the temporary data storage into the warehouse.
iii) Create indexes, cross references, partition views, etc.
iv) Check for normalization.
v) Generate new aggregations, if needed.
vi) Update all existing aggregations.
vii) Create backups of data.
viii) Archive the data that needs to be archived.
75. Differentiate between vertical partitioning and horizontal partitioning.
In horizontal partitioning, we simply put the first few thousand entries in one partition, the second few thousand in the next, and so on. This can be done by partitioning by time, wherein all data pertaining to the first month / first year is put in the first partition, the second in the second partition, and so on. Other alternatives are partitioning on different-sized dimensions, partitioning on other dimensions, partitioning on the size of the table, and round-robin partitioning. Each of them has certain advantages as well as disadvantages.
In vertical partitioning, some columns are stored in one partition and certain other columns of the same rows in a different partition. This can be achieved either by normalization or by row splitting. We will look into their relative trade-offs. A small sketch contrasting the two is given below.
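A minimal Python sketch (not from the original text) contrasting horizontal and vertical partitioning; the table and column names are illustrative only.

```python
import pandas as pd

sales = pd.DataFrame({
    "sale_id":  [1, 2, 3, 4],
    "month":    ["Jan", "Jan", "Feb", "Feb"],
    "amount":   [100, 250, 175, 300],
    "comments": ["ok", "repeat buyer", "", "bulk order"],
})

# Horizontal partitioning: whole rows are split by a value (here, by month).
jan_partition = sales[sales["month"] == "Jan"]
feb_partition = sales[sales["month"] == "Feb"]

# Vertical partitioning: columns are split; the key column is kept in both
# partitions so the original rows can be reassembled with a join.
facts_part = sales[["sale_id", "month", "amount"]]
text_part = sales[["sale_id", "comments"]]

rebuilt = facts_part.merge(text_part, on="sale_id")
print(rebuilt.equals(sales[rebuilt.columns]))   # True: nothing was lost
```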
76. What is a schema? Distinguish between facts and dimensions.
A schema, by definition, is a logical arrangement of facts that facilitates ease of storage and retrieval, as described by the end users. The end user is not bothered about the overall arrangement of the data or the fields in it. For example, a sales executive trying to project the sales of a particular item is only interested in the sales details of that item, whereas a tax practitioner looking at the same data will be interested only in the amounts received by the company and the profits made.
The star schema looks like a good solution to the problem of warehousing. It simply states that one should identify the facts and store them in the read-only area, with the dimensions surrounding that area. Whereas the dimensions are liable to change, the facts are not. But given a set of raw data from the sources, how does one identify the facts and the dimensions? It is not always easy, but the following steps can help in that direction:
i) Look for the fundamental transactions in the entire business process. These basic entities are the facts.
ii) Find out the important dimensions that apply to each of these facts. They are the candidates for dimension tables.
iii) Ensure that facts do not include those candidates that are actually dimensions, with a set of facts attached to them.
iv) Ensure that dimensions do not include those candidates that are actually facts.
77. What is an event in data warehousing? List any five events.
An event is defined as a measurable, observable occurrence of a defined action. If this definition seems quite vague, it is because it encompasses a very large set of operations. The event manager is a software component that continuously monitors the system for the occurrence of an event and then takes whatever action is suitable (note that the event is a "measurable and observable" occurrence). The action to be taken is normally specific to the event.
A partial list of the common events that need to be monitored is as follows:
i). Running out of memory space
ii). A process dying
iii). A process using excessive resources
iv). I/O errors
v). Hardware failure
78. What is a summary table? Describe the aspects to be looked into while designing a summary table.
The main purpose of using summary tables is to cut down the time taken to execute a specific query. The main methodology involves minimizing the volume of data being scanned each time the query is to be answered; in other words, partial answers to the query are made available in advance. For example, in the earlier cited example of the mobile market, suppose one expects that
i) citizens above 18 years of age,
ii) with salaries greater than 15,000, and
iii) with professions that involve travelling
are the potential customers. Then, every time the query is to be processed (maybe every month or every quarter), one would have to look at the entire database to compute these values and then combine them suitably to get the relevant answers. The other method is to prepare summary tables, which hold the values pertaining to each of these sub-queries beforehand, and then combine them as and when the query is raised.
Summary tables are designed by following the steps given below (a small sketch follows the list):
i) Decide the dimensions along which aggregation is to be done.
ii) Determine the aggregation of multiple facts.
iii) Aggregate multiple facts into the summary table.
iv) Determine the level of aggregation and the extent of embedding.
v) Design time into the table.
vi) Index the summary table.
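A minimal Python sketch (not from the original text) of building and indexing a summary table; the fact table, dimensions, and values are illustrative only.

```python
import pandas as pd

fact = pd.DataFrame({
    "month":   ["Jan", "Jan", "Jan", "Feb", "Feb"],
    "region":  ["N", "S", "S", "N", "S"],
    "revenue": [100, 250, 175, 300, 125],
})

# Steps i-iii: aggregate the revenue fact along the month and region dimensions.
summary = fact.groupby(["month", "region"], as_index=False)["revenue"].sum()

# Step vi: index the summary table so queries on (month, region) become direct lookups.
summary = summary.set_index(["month", "region"]).sort_index()

print(summary.loc[("Feb", "S"), "revenue"])   # 125, answered without rescanning the fact table
```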
79. List the significant issues in automatic cluster detection.
Most of the issues related to automatic cluster detection are connected to the kinds of questions we want answered in the data mining project, or to the data preparation needed for their successful application. A small k-means sketch follows the list.
i). Distance measure
Most clustering techniques use the Euclidean distance formula as the distance measure (the square root of the sum of the squares of the distances along each attribute axis). Non-numeric variables must be transformed and scaled before clustering can take place. Depending on these transformations, the categorical variables may dominate the clustering results or may even be completely ignored.
ii). Choice of the right number of clusters
If the number of clusters k in the K-means method is not chosen to match the natural structure of the data, the results will not be good. The proper way to alleviate this is to experiment with different values of k. In principle, the best k value will exhibit the smallest intra-cluster distances and the largest inter-cluster distances.
iii). Cluster interpretation
Once the clusters are discovered, they have to be interpreted in order to have some value for the data mining project.
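A minimal Python sketch (not from the original text) of experimenting with different values of k in K-means; the synthetic two-dimensional points are illustrative only, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated groups of points, so the "natural" k is 3.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(30, 2)),
    rng.normal(loc=(3, 3), scale=0.3, size=(30, 2)),
    rng.normal(loc=(0, 3), scale=0.3, size=(30, 2)),
])

for k in (2, 3, 4, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the total intra-cluster squared distance; it drops sharply
    # until k matches the natural structure of the data.
    print(k, round(model.inertia_, 1))
```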
81. Define data marting. List the reasons for data marting.
A data mart stores a subset of the data available in the warehouse, so that one need not always scan through the entire content of the warehouse. It is similar to a retail outlet. A data mart speeds up queries, since the volume of data to be scanned is much less. It also helps to have tailor-made processes for different access tools, to impose control strategies, etc.
Following are the reasons for which data marts are created:
i) Since the volume of data scanned is small, they speed up query processing.
ii) Data can be structured in a form suitable for a user access tool.
iii) Data can be segmented or partitioned so that it can be used on different platforms and so that different control strategies become applicable.
82. Explain how to categorize data mining systems.
There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or confined to limited data mining functionalities; others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria, among which are the following:
a) Classification according to the type of data source mined: this categorizes data mining systems according to the type of data handled, such as spatial data, multimedia data, time-series data, text data, World Wide Web data, etc.
b) Classification according to the data model drawn on: this categorizes data mining systems based on the data model involved, such as relational database, object-oriented database, data warehouse, transactional database, etc.
c) Classification according to the kind of knowledge discovered: this categorizes data mining systems based on the kind of knowledge discovered or the data mining functionalities, such as characterization, discrimination, association, classification, clustering, etc. Some systems tend to be comprehensive, offering several data mining functionalities together.
d) Classification according to the mining techniques used: data mining systems employ and provide different techniques. This categorizes data mining systems according to the data analysis approach used, such as machine learning, neural networks, genetic algorithms, statistics, visualization, database-oriented or data warehouse-oriented, etc.
83. List and explain the different kinds of data that can be mined.
Different kinds of data that can be mined are listed below:
i). Flat files: flat files are actually the most common data source for data mining algorithms, especially at the research level.
ii). Relational databases: a relational database consists of a set of tables containing either values of entity attributes, or values of attributes from entity relationships.
iii). Data warehouses: a data warehouse, as a storehouse, is a repository of data collected from multiple data sources (often heterogeneous) that is intended to be used as a whole under the same unified schema.
iv). Multimedia databases: multimedia databases include video, image, audio, and text media. They can be stored on extended object-relational or object-oriented databases, or simply on a file system.
v). Spatial databases: spatial databases are databases that, in addition to usual data, store geographical information like maps and global or regional positioning.
vi). Time-series databases: time-series databases contain time-related data such as stock market data or logged activities. These databases usually have a continuous flow of new data coming in, which sometimes causes the need for challenging real-time analysis.
vii). World Wide Web: the World Wide Web is the most heterogeneous and dynamic repository available. A very large number of authors and publishers are continuously contributing to its growth and metamorphosis, and a massive number of users are accessing its resources daily.
84. Give the syntax for task-relevant data specification.
Syntax for task-relevant data specification:
The first step in defining a data mining task is the specification of the task-relevant data, that is, the data on which mining is to be performed. This involves specifying the database and tables or data warehouse containing the relevant data, conditions for selecting the relevant data, the relevant attributes or dimensions for exploration, and instructions regarding the ordering or grouping of the data retrieved. DMQL provides clauses for the specification of such information, as follows:
i). use database (database_name) or use data warehouse (data_warehouse_name): the use clause directs the mining task to the database or data warehouse specified.
ii). from (relation(s)/cube(s)) [where (condition)]: the from and where clauses respectively specify the database tables or data cubes involved, and the conditions defining the data to be retrieved.
iii). in relevance to (attribute_or_dimension_list): this clause lists the attributes or dimensions for exploration.
iv). order by (order_list): the order by clause specifies the sorting order of the task-relevant data.
v). group by (grouping_list): the group by clause specifies criteria for grouping the data.
vi). having (conditions): the having clause specifies the conditions by which groups of data are considered relevant.
85. Explain the design of a GUI based on a data mining query language.
A data mining query language provides the necessary primitives that allow users to communicate with data mining systems. However, novice users may find a data mining query language difficult to use and its syntax difficult to remember. Instead, users may prefer to communicate with data mining systems through a graphical user interface (GUI). In relational database technology, SQL serves as a standard core language for relational systems, on top of which GUIs can easily be designed. Similarly, a data mining query language may serve as a core language for data mining system implementations, providing a basis for the development of GUIs for effective data mining.
A data mining GUI may consist of the following functional components:
a) Data collection and data mining query composition: this component allows the user to specify task-relevant data sets and to compose data mining queries. It is similar to GUIs used for the specification of relational queries.
b) Presentation of discovered patterns: this component allows the display of the discovered patterns in various forms, including tables, graphs, charts, curves, and other visualization techniques.
c) Hierarchy specification and manipulation: this component allows for concept hierarchy specification, either manually by the user or automatically. In addition, this component should allow concept hierarchies to be modified by the user or adjusted automatically based on a given data set distribution.
d) Manipulation of data mining primitives: this component may allow the dynamic adjustment of data mining thresholds, as well as the selection, display, and modification of concept hierarchies. It may also allow the modification of previous data mining queries or conditions.
e) Interactive multilevel mining: this component should allow roll-up or drill-down operations on discovered patterns.
f) Other miscellaneous information: this component may include on-line help manuals, indexed search, debugging, and other interactive graphical facilities.
86. Explain how decision trees are useful in data mining.
Decision trees are powerful and popular tools for classification and prediction. The attractiveness of tree-based methods is due in large part to the fact that they are simple and that decision trees represent rules. Rules can readily be expressed so that humans can understand them, or in a database access language like SQL, so that records falling into a particular category may be retrieved. A small classification sketch is given below.
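A minimal Python sketch (not from the original text) of a decision tree classifier whose learned rules can be read back; the tiny "will the customer respond?" data set and its two attributes (age, income) are illustrative only, and scikit-learn is assumed to be available.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 20000], [30, 45000], [45, 80000], [50, 30000], [35, 90000], [60, 60000]]
y = [0, 0, 1, 0, 1, 1]   # 1 = responded to the campaign, 0 = did not

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The learned tree can be printed as human-understandable rules.
print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[40, 75000]]))   # prediction for a new, unseen customer
```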
87. Identify an application and also explain the techniques that can be incorporated in solving the problem using data mining techniques.
Write yourself...
88. Write short notes on:
i) Data Mining Query Language
ii) Schedule Manager
iii) Data Formatting
i) Data Mining Query Language
A data mining query language helps in effective knowledge discovery from data mining systems. Designing a comprehensive data mining language is challenging because data mining covers a wide spectrum of tasks, from data characterization to mining association rules, data classification, and evolution analysis. Each task has different requirements. The design of an effective data mining query language therefore requires a deep understanding of the power, limitations, and underlying mechanisms of the various kinds of data mining tasks.
ii) Schedule Manager
Scheduling is the key to successful warehouse management. Almost all operations in the warehouse need some type of scheduling. Every operating system has its own scheduler and batch control mechanism, but these schedulers may not be capable of fully meeting the requirements of a data warehouse. Hence it is more desirable to have specially designed schedulers to manage the operations.
iii) Data Formatting
Data formatting is the final data preparation step; it consists of syntactic modifications to the data that do not change its meaning but are required by the particular modelling tool chosen for the DM task. These include (a small sketch follows the list):
a). Reordering of the attributes or records: some modelling tools require reordering of the attributes (or records) in the dataset, such as putting the target attribute at the beginning or at the end, or randomizing the order of records (required by neural networks, for example).
b). Changes related to the constraints of modelling tools: removing commas, tabs, or special characters, trimming strings to the maximum allowed number of characters, and replacing special characters with an allowed set of special characters.
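A minimal Python sketch (not from the original text) of two such syntactic data-formatting steps; the column names and the 20-character limit are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "target":  [1, 0, 1],
    "comment": ["good, will buy again!!", "too\texpensive", "ok"],
    "amount":  [120, 80, 45],
})

# (a) Reorder attributes (target column last) and randomize the record order.
cols = [c for c in df.columns if c != "target"] + ["target"]
df = df[cols].sample(frac=1.0, random_state=0).reset_index(drop=True)

# (b) Respect tool constraints: replace commas/tabs and trim strings to 20 characters.
df["comment"] = (df["comment"]
                 .str.replace(r"[,\t]", " ", regex=True)
                 .str.slice(0, 20))

print(df)
```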
---------------------------------------------------------------------------------------------------------------
89.With neat diagram explain the main
parts of the computer
A Computer will have 3 basic main parts –
i). A central processing unit that does all the arithmetic and
logical operations. This can be
thought of as the heart of any computer and computers are
identified by the type of CPU
that they use.
ii). The memory is supposed to hold the programs and data. All
the computers that we came
across these days are what are known as “stored program
computers”. The programs are
to be stored before hand in the memory and the CPU accesses
these programs line by line
and executes them.
iii). The Input/output devices: These devices facilitate the
interaction of the users with the computer.
The input devices are used to send information to the computer,
while the output devices
accept the processed information form the computer and make it
available to the user.
Diagram:-

91. Briefly explain the types of
memories.
There are two types of memories – Primary memory, which is
embedded in the computer
and which is the main source of data to the computer and the
secondary memory like floppy disks, CDs etc., which can be carried around
and used in different computers. They cost much less than the
primary memory, but the CPU can access data only from the primary memory.
The main advantage of computer memories, both primary and secondary, is
that they can store data indefinitely and accurately
92. Describe
the basic concept of databases.
The
Concept of Database :-
We
have seen in the previous section how data can be stored in computer. Such
stored data becomes
a
“database” – a collection of data. For example, if all the marks scored by all
the students of a class are
stored
in the computer memory, it can be called a database. From such a database, we
can answer
questions
like – who has scored the highest marks? ; In which subject the maximum number
of students
have
failed?; Which students are weak in more than one subject? etc. Of course,
appropriate programs
have
to be written to do these computations. Also, as the database becomes too large
and more and more
data
keeps getting included at different periods of time, there are several other
problems about “maintaining”
these
data, which will not be dealt with here.
Since
handling of such databases has become one of the primary jobs of the computer
in recent years,
it
becomes difficult for the average user to keep writing such programs. Hence,
special languages –
called
database query languages- have been deviced, which makes such programming easy,
there languages
help
in getting specific “queries” answered easily.
93. With example explain the different
views of a data.
Data
is normally stored in tabular form, unless storage in other formats becomes
advantageous, we
store
data in what are technically called “relations” or in simple terms as “tables”.
The views are Mainly 2
types .
i).
Simple View
ii).
Complex View
Simple view:
- It is created by selecting only one table.
- It does not contains functions.
- it can perform DML (SELECT,INSERT,UPDATE,DELETE,MERGE, CALL,LOCK
TABLE) operations through simple view.
Complex view :
-It is created by selecting more than one table.
-It can performs functions.
-You can not perform always DML operations through
94 Briefly
explain the concept of normalization.
Normalization
is dealt with in several chapters of any books on database management systems.
Here, we will take the simplest definition, which suffices our purpose
namely any field should not have subfields.

Again consider the following student table.
Here under the field marks, we have 3 sub fields: marks for
subject1, marks for subject2 and subject3.

However, it is preferable split these subfields to regular
fields as shown below
Quite often, the original table which comes with subfields will
have to be modified suitable, by the
process of “normalization”.
95 Explain
the concept of data ware house delivery process in detail.
The concept of data ware
house delivery process :-
This
section deals with the dataware house from a different view point - how the
different components that go into it enable the building of a data ware
house. The study helps us in two ways:
i) to have a clear view of the data ware house building process.
ii) to understand the working of the data ware house in the context of
the components.
Now we look at the
concepts in details :-
i). IT Strategy : The company must and
should have an overall IT strategy and the data ware housing has to be a
part of the overall strategy.
ii). Business case analysis : This
looks at an obvious thing, but is most often misunderstood. The overall
understanding of the business and the importance of various components there in
is a must. This will ensure that one can clearly justify the appropriate
level of investment that goes into the data ware house design and also
the amount of returns accruing.
iii). Education : This has
two roles to play - one to make people, specially top level policy makers,
comfortable with the concept. The second role is to aid the prototyping
activity.
iv). Business Requirements : As
has been discussed earlier, it is essential that the business requirements are
fully understood by the data ware house planner. This would ensure that the
ware house is incorporated adequately in the overall setup of the
organization.
v). Technical blue prints : This is
the stage where the overall architecture that satisfies the requirements is
delivered.
vi). Building the vision : Here
the first physical infrastructure becomes available. The major infrastructure
components are set up, first stages of loading and generation of data start up.
vii). History load : Here
the system is made fully operational by loading the required history into the
ware house - i.e. what ever data is available over the previous years is
put into the dataware house to make is fully operational.
viii). Adhoc Query : Now
we configure a query tool to operate against the data ware house.
ix). Automation : This phase automates
the various operational processes like -
a) Extracting and loading
of data from the sources.
b) Transforming the data
into a suitable form for analysis.
c) Backing up, restoration
and archiving.
d) Generate aggregations.
e) Monitoring query
profiles.
x). Extending Scope : There
is not single mechanism by which this can be achieved. As and when needed, a
new set of data may be added, new formats may be included or may be even
involve major changes.
xi). Requirement Evolution : Business
requirements will constantly change during the life of the ware house. Hence,
the process that supports the ware house also needs to be constantly monitored
and modified.
96. What are three major activities of data ware house? Explain.
Three major activities of data ware house are :-
i) Populating the ware house (i.e. inclusion of data)
ii) day-to-day management of the ware house.
iii) Ability to accommodate the changes.
i). The processes to populate the ware house have to be able to extract the data, clean it up, and make it available to the analysis systems. This is done on a daily / weekly basis depending on the quantum of the data population to be incorporated.
ii). The day to day management of data ware house is not to be confused with maintenance and management of hardware and software. When large amounts of data are stored and new data are being continually added at regular intervals, maintaince of the “quality” of data becomes an important element.
iii). Ability to accommodate changes implies the system is structured in such a way as to be able to cope with future changes without the entire system being remodeled. Based on these, we can view the processes that a typical data ware house scheme should support as follows.
96. What are three major activities of data ware house? Explain.
Three major activities of data ware house are :-
i) Populating the ware house (i.e. inclusion of data)
ii) day-to-day management of the ware house.
iii) Ability to accommodate the changes.
i). The processes to populate the ware house have to be able to extract the data, clean it up, and make it available to the analysis systems. This is done on a daily / weekly basis depending on the quantum of the data population to be incorporated.
ii). The day to day management of data ware house is not to be confused with maintenance and management of hardware and software. When large amounts of data are stored and new data are being continually added at regular intervals, maintaince of the “quality” of data becomes an important element.
iii). Ability to accommodate changes implies the system is structured in such a way as to be able to cope with future changes without the entire system being remodeled. Based on these, we can view the processes that a typical data ware house scheme should support as follows.

97.
Explain the extract and load process of data ware house.
Extract and Load Process : This forms
the first stage of data ware house. External physical systems like the sales
counters which give the sales data, the inventory systems that give inventory
levels etc. constantly feed data to the warehouse. Needless to say, the format
of these external data is to be monitored and modified before loading it into
the ware house. The data ware house must extract the data from the source
systems, load them into their data bases, remove unwanted fields (either
because they are not needed or because they are already there in the data
base), adding new fields / reference data and finally reconciling with the
other data. We shall see a few more details of theses broad actions in the
subsequent paragraphs.
i). A mechanism should be
evolved to control the extraction of data, check their consistency
etc. For example, in some systems, the data is not authenticated
until it is audited.
ii). Having a
set of consistent data is equally important. This especially happens when we
are
having several online systems feeding the data.
iii). Once data is
extracted from the source systems, it is loaded into a temporary data storage
before it is “Cleaned” and loaded into the warehouse.
98. In what ways data needs to be cleaned up and checked? Explain briefly.
Data
needs to be cleaned up and checked in the following ways :-
i) It should be consistent with itself.
ii) It should be consistent with other data
from the same source.
iii) It should be consistent with other data
from other sources.
iv) It should be consistent with the
information already available in the data ware house.
While it is easy to list act the needs of a “clean” data, it is
more difficult to set up systems that
automatically cleanup the data. The normal course is to suspect
the quality of data, if it does not meet the normally standards of commonsense
or it contradicts with the data from other sources, data already available in
the data ware house etc. Normal intution doubts the validity of the new data
and effective measures like rechecking, retransmission etc., are undertaken.
When none of these are possible, one may even resort to ignoring the entire set
of data and get on with next set of incoming data.
99. Explain
the architecture of data warehouse.
The architecture for a data ware is indicated below. Before we
proceed further, we should be clear about the concept of architecture. It only
gives the major items that make up a data ware house. The size and complexity
of each of these items depend on the actual size of the ware house itself, the
specific requirements of the ware house and the actual details of
implementation.

100. Briefly explain the functions of each manager of data warehouse.
The Warehouse Manager : The ware house manager is a component that performs all operations necessary to support the ware house management process. Unlike the load manager, the warehouse management process is driven by the extent to which the operational management of the data ware house has been automated.

The ware house manger can be easily termed to be the most
complex of the ware house components, and performs a variety of tasks. A few of
them can be listed below.
i) Analyze the data to confirm data
consistency and data integrity.
ii) Transform and merge the source data from
the temporary data storage into the ware house.
iii) Create indexes, cross references,
partition views etc.,.
iv) Check for normalization’s.
v) Generate new aggregations, if needed.
vi) Update all existing aggregations
vii) Create backups of data.
viii) Archive the data that needs to be
archived.
101.
Explain the star schema to represent the sales analysis.
Star schemes are data base schemas that structure the data to
exploit a typical decision support
enquiry. When the components of typical enquiry’s are examined, a few similarities stand out.
i) The queries examine a set of factual transactions - sales for example.
ii) The queries analyze the facts in different ways - by aggregating them on different bases /
graphing them in different ways.
The central concept of most such transactions is a “fact table”. The surrounding references are called dimension tables. The combination can be called a star schema.
enquiry. When the components of typical enquiry’s are examined, a few similarities stand out.
i) The queries examine a set of factual transactions - sales for example.
ii) The queries analyze the facts in different ways - by aggregating them on different bases /
graphing them in different ways.
The central concept of most such transactions is a “fact table”. The surrounding references are called dimension tables. The combination can be called a star schema.

102. What do you mean by partition of
data? Explain briefly.
Partitioning of data :-
In most ware houses, the size of the fact data tables tends to
become very large. This leads to several problems of management, backup,
processing etc. These difficulties can be over come by partitioning each fact
table into separate partitions.
Data ware houses tend to exploit these ideas by partitioning the
large volume of data into data sets. For example, data can be partitioned on
weekly / monthly basis, so as the minimize the amount of data scanned before
answering a query. This technique allows data to be scanned to be minimized,
without the overhead of using an index. This improves the overall efficiency of
the system. However, having too many partitions can be counter productive and
an optimal size of the partitions and the number of such partitions is of vital
importance.
Participating generally helps in the following ways.
i) Assists in better management of data
ii) Ease of backup / recovery since the volumes are
less.
iii) The star schemes with partitions produce
better performance.
iv) Since several hardware architectures operate
better in a partitioned environment, the overall
system performance improve.
103. Describe the terms data mart and
Meta data.
Data mart :-
A data mart is a
subset of information content of a data ware house, stored in it’s own data
base. The data of a data ware house may have been collected through a ware
house or in some cases, directly from the source. In a crude sense, if you
consider a data ware house as a whole sale shop of data, a data mart can be
thought of as a retailer.
Meta data
:-
Meta data is simply
data about data. Data normally describe the objects, their
quantity, size, how
they are stored etc. Similarly meta data stores data about how data (of
objects) is stored, etc.
Meta data is useful in
a number of ways. It can map data sources to the common view of information
within the warehouse. It is helpful in query management, to direct query to
most appropriate source etc.,.
The structure of meta
data is different for each process. It means for each volume of data, there are
multiple sets of meta data describing the same volume. While this is a very
convenient way of managing data, managing meta data itself is not a very easy
task.
104. Enlist the differences between fact
and dimension.
This ensures that key
dimensions are no fact tables.
Consider the following
example :-

Let us elaborate a
little on the example. Consider a customer A. If there is a situation, where
the
warehouse is building
the profiles of customer, then A becomes a fact - against the name A, we can
list his address, purchases, debts etc. One can ask questions like how many
purchases has A made in the last 3 months etc. Then A is fact. On the other hand,
if it is likely to be used to answer questions like “how many customers have
made more than 10 purchases in the last 6 months”, and one uses the data of A,
as well as of other customers to give the answer, then it becomes a fact table.
The rule is, in such cases, avoid making A as a candidate key.
105. Explain the designing of star-flake
schema in detail.
A star flake schema, as we have defined previously, is
a schema that uses a combination of denormalised star and normalized snow flake
schemas. They are most appropriate in decision support data ware houses.
Generally, the detailed transactions are stored within a central fact table,
which may be partitioned horizontally or vertically. A series of combinatory
data base views are created to allow the user to access tools to treat the fact
table partitions as a single, large table.
The key reference data
is structured into a set of dimensions. Theses can be referenced from the fact
table. Each dimension is stored in a series of normalized tables (snow flakes),
with an additional denormalised star dimension table.
106. What is query redirection? Explain.
Query Redirection :-
One of the basic
requirements for successful operation of star flake schema (or any schema, for
that matter) is the ability to direct a query to the most appropriate source.
Note that once the available data grows beyond a certain size, partitioning
becomes essential. In such a scenario, it is essential that, in order to
optimize the time spent on querying, the queries should be directed to the
appropriate partitions that store the date required by the query.
The basic method is to
design the access tool in such away that it automatically defines the locality
to which the query is to be redirected.
107. In detail, explain the
multidimensional schema.
Multidimensional schemas :-
Before we close, we
see the interesting concept of multi dimensions. This is a very convenient
method of analyzing
data, when it goes beyond the normal tabular relations.
For example, a store
maintains a table of each item it sells over a month as a table, in each of
it’s 10 outlets..

This is a 2
dimensional table. One the other hand, if the company wants a data of all items
sold by it’s outlets, it can be done by simply by superimposing the 2
dimensional table for each of these items – one behind the other. Then it
becomes a 3 dimensional view.
Then the query,
instead of looking for a 2 dimensional rectangle of data, will look for a 3
dimensional cuboid of data.
There is no reason why
the dimensioning should stop at 3 dimensions. In fact almost all queries can be
thought of as approaching a multi-dimensioned unit of data from a
multidimensioned volume of the schema.
108. Why partitioning is needed in large data warehouse?
Partitioning is needed in any large data ware house to ensure that the performance and manageability is improved. It can help the query redirection to send the queries to the appropriate partition, thereby reducing the overall time taken for query processing.
108. Why partitioning is needed in large data warehouse?
Partitioning is needed in any large data ware house to ensure that the performance and manageability is improved. It can help the query redirection to send the queries to the appropriate partition, thereby reducing the overall time taken for query processing.
109. Explain the types
of partitioning in detail.
i). Horizontal partitioning :-
i). Horizontal partitioning :-
This is essentially means that the table is
partitioned after the first few thousand entries, and the next
few thousand entries etc. This is because in most
cases, not all the information in the fact table needed all the time. Thus
horizontal partitioning helps to reduce the query access time, by directly
cutting down the amount of data to be scanned by the queries.
ii).
Vertical partitioning :-
As the name suggests, a vertical partitioning scheme divides the table vertically – i.e. each row is
divided into 2 or more partitions.
iii). Hardware partitioning :-
Needless to say, the dataware design process should try to maximize the performance of the system. One of the ways to ensure this is to try to optimize by designing the data base with respect to specific hardware architecture.
110. Explain the mechanism of row splitting.
Row Splitting :-
The method involved identifying the not so frequently used fields and putting them into another table.
This would ensure that the frequently used fields can be accessed more often, at much lesser computation time.
It can be noted that row splitting may not reduce or increase the overall storage needed, but normalization may involve a change in the overall storage space needed. In row splitting, the mapping is 1 to 1 whereas normalization may produce one to many relationships.
As the name suggests, a vertical partitioning scheme divides the table vertically – i.e. each row is
divided into 2 or more partitions.
iii). Hardware partitioning :-
Needless to say, the dataware design process should try to maximize the performance of the system. One of the ways to ensure this is to try to optimize by designing the data base with respect to specific hardware architecture.
110. Explain the mechanism of row splitting.
Row Splitting :-
The method involved identifying the not so frequently used fields and putting them into another table.
This would ensure that the frequently used fields can be accessed more often, at much lesser computation time.
It can be noted that row splitting may not reduce or increase the overall storage needed, but normalization may involve a change in the overall storage space needed. In row splitting, the mapping is 1 to 1 whereas normalization may produce one to many relationships.

111. Explain the guidelines used for
hardware partitioning.
Guidelines used for
hardware partitioning :-
Needless
to say, the dataware design process should try to maximize the performance of
the system. One of the ways to ensure this is to try to optimize by
designing the data base with respect to specific hardware architecture.
Obviously, the exact details of optimization depends on the hardware
platforms. Normally the following guidelines are useful:-
i).
maximize the processing, disk and I/O operations.
ii).
Reduce bottlenecks at the CPU and I/O
112. What is aggregation? Explain the
need of aggregation. Give example.
Aggregation
: Data
aggregation is an essential component of any decision support data ware house.
It helps us to ensure a cost – effective query performance, which in other
words means that costs incurred to get the answers to a query would be
more than off set by the benefits of the query answer. The data
aggregation attempts to do this by reducing the processing power needed to
process the queries. However, too much of aggregations would only lead to
unacceptable levels of operational costs.
Too
little of aggregations may not improve the performance to the required levels.
A file balancing of
the
two is essential to maintain the requirements stated above. One thumbrule that
is often suggested is that about three out of every four queries would be
optimized by the aggregation process, whereas the fourth will take it’s
own time to get processed. The second, though minor, advantage of
aggregations is that they allow us to get the overall trends in the data.
While looking at individual data such overall trends may not be obvious,
whereas aggregated data will help us draw certain conclusions easily.
113. Explain the different aspects for designing the summary table.
Summary table are designed by following the steps given below :-
i). Decide the dimensions along which aggregation is to be done.
ii). Determine the aggregation of multiple facts.
iii). Aggregate multiple facts into the summary table.
iv). Determine the level of aggregation and the extent of embedding.
v). Design time into the table.
vi). Index the summary table.
114. Give the reasons for creating the data mart.
The following are the reasons for which data marts are created :-
i). Since the volume of data scanned is small, they speed up the query processing.
ii). Data can be structured in a form suitable for a user access too
iii). Data can be segmented or partitioned so that they can be used on different platforms and
also different control strategies become applicable.
115. Explain the two stages in setting up data marts.
There are two stages in setting up data marts :-
i). To decide whether data marts are needed at all. The above listed facts may help you to
decide whether it is worth while to setup data marts or operate from the warehouse itself.
The problem is almost similar to that of a merchant deciding whether he wants to set up retail
shops or not.
ii). If you decide that setting up data marts is desirable, then the following steps have to be gone
through before you can freeze on the actual strategy of data marting.
a) Identify the natural functional splits of the organization.
b) Identify the natural splits of data.
c) Check whether the proposed access tools have any special data base structures.
d) Identify the infrastructure issues, if any, that can help in identifying the data marts.
e) Look for restrictions on access control. They can serve to demarcate the warehouse
details.
116. What are disadvantages of data mart?
There are certain disadvantages :-
i). The cost of setting up and operating data marts is quite high.
ii). Once a data strategy is put in place, the datamart formats become fixed. It may be fairly difficult to change the strategy later, because the data marts formats also have to be changes.
117. What is role of access control issue in data mart design?
Role of access control issue in data mart design :-
This is one of the major constraints in data mart designs. Any data warehouse, with it’s huge volume
of data is, more often than not, subject to various access controls as to who could access which part of data. The easiest case is where the data is partitioned so clearly that a user of each partition cannot access any other data. In such cases, each of these can be put in a data mart and the user of each can access only his data .
In the data ware house, the data pertaining to all these marts are stored, but the partitioning are retained. If a super user wants to get an overall view of the data, suitable aggregations can be generated.
118. Explain the purpose of using metadata in detail.
Metadata will be used for the following purposes :-
i). data transformation and loading.
ii). data management.
iii). query generation.
119. Explain the concept of metadata
management.
Meta
data should be able to describe data as it resides in the data warehouse. This
will help the warehouse manager to control data movements. The purpose of
the metadata is to describe the objects in the database. Some of the
descriptions are listed here.
· Tables
- Columns
* Names
* Types
· Indexes
- Columns
* Name
* Type
· Views
- Columns
* Name
* Type
· Constraints
- Name
- Type
- Table
* Columns
120. How the query manager uses the Meta data? Explain in detail.
Meta data is also required to generate queries. The query manger uses the metadata to build a history of all queries run and generator a query profile for each user, or group of uses.
We simply list a few of the commonly used meta data for the query. The names are self explanatory.
o Query
o Table accessed
§ Column accessed
· Name
· Reference identifier
o Restrictions applied
o Column name
o Table name
o Reference identifier
o Restrictions
o Join criteria applied
o Column name
o Table name
o Reference identifier
o Column name
o Table name
o Reference identifier
o Aggregate function used
o Column name
o Reference identifier
o Aggregate function
o Group by criteria
o Column name
o Reference identifier
o Sort direction
o Syntax
o Resources
o Disk
o Read
o Write
o Temporary
121. Why we need different managers to a data ware house? Explain.
Need for managers to a data ware house :-
Data warehouses are not just large databases. They are complex environments that integrate many
technologies. They are not static, but will be continuously changing both contentwise and structurewise. Thus, there is a constant need for maintenance and management. Since huge amounts of time, money and efforts are involved in the development of data warehouses, sophisticated management tools are always justified in the case of data warehouses.
When the computer systems were in their initial stages of development, there used to be an army of
human managers, who went around doing all the administration and management. But such a scheme became both unvieldy and prone to errors as the systems grew in size and complexity. Further most of the management principles were adhoc in nature and were subject to human errors and fatigue.
122. With neat diagram explain the boundaries of process managers.
A schematic diagram that defines the boundaries of the three types of managers :-
· Tables
- Columns
* Names
* Types
· Indexes
- Columns
* Name
* Type
· Views
- Columns
* Name
* Type
· Constraints
- Name
- Type
- Table
* Columns
120. How the query manager uses the Meta data? Explain in detail.
Meta data is also required to generate queries. The query manger uses the metadata to build a history of all queries run and generator a query profile for each user, or group of uses.
We simply list a few of the commonly used meta data for the query. The names are self explanatory.
o Query
o Table accessed
§ Column accessed
· Name
· Reference identifier
o Restrictions applied
o Column name
o Table name
o Reference identifier
o Restrictions
o Join criteria applied
o Column name
o Table name
o Reference identifier
o Column name
o Table name
o Reference identifier
o Aggregate function used
o Column name
o Reference identifier
o Aggregate function
o Group by criteria
o Column name
o Reference identifier
o Sort direction
o Syntax
o Resources
o Disk
o Read
o Write
o Temporary
121. Why we need different managers to a data ware house? Explain.
Need for managers to a data ware house :-
Data warehouses are not just large databases. They are complex environments that integrate many
technologies. They are not static, but will be continuously changing both contentwise and structurewise. Thus, there is a constant need for maintenance and management. Since huge amounts of time, money and efforts are involved in the development of data warehouses, sophisticated management tools are always justified in the case of data warehouses.
When the computer systems were in their initial stages of development, there used to be an army of
human managers, who went around doing all the administration and management. But such a scheme became both unvieldy and prone to errors as the systems grew in size and complexity. Further most of the management principles were adhoc in nature and were subject to human errors and fatigue.
122. With neat diagram explain the boundaries of process managers.
A schematic diagram that defines the boundaries of the three types of managers :-

123. Explain the responsibilities of each manager of data ware house.
Ware house Manager :-
The warehouse manager is responsible for maintaining data of the ware house. It should also create
and maintain a layer of meta data. Some of the responsibilities of the ware house manager are
o Data movement
o Meta data management
o Performance monitoring
o Archiving.
Data movement includes the transfer of data within the ware house, aggregation, creation and
maintenance of tables, indexes and other objects of importance. It should be able to create new aggregations as well as remove the old ones. Creation of additional rows / columns, keeping track of the aggregation processes and creating meta data are also it’s functions.
124. What are the different system management tools used for data warehouse?
The different system management tools used for data warehouse :-
i). Configuration managers
ii). schedule managers
iii). event managers
iv). database mangers
v). back up recovery managers
vi). resource and performance a monitors.
No comments:
Post a Comment