Data Modeling Interview Questions

Q. What is data warehousing?
Ans: A data warehouse is a storage area where subject-specific or relevant data is kept, irrespective of its source. Data warehousing is the process of building and maintaining such a warehouse: it merges data from multiple sources into a single, consistent and complete form.

Q. What are fact tables and dimension tables?
Ans: Data in a warehouse typically originates from business transactions. A fact table holds the facts or measures of those transactions, and its data is usually numerical. A dimension table holds the fields that describe the data in the fact table, providing additional descriptive information (dimensions) about it. E.g. if I want to know the number of resources used for a task, the fact table stores the actual measure (number of resources) while the dimension tables store the task and resource details. Hence, the relationship from a dimension table to a fact table is one-to-many: one dimension row can be referenced by many fact rows.

Q. What is the ETL process in data warehousing?
Ans: ETL stands for Extract, Transform, Load. It is the process of fetching data from different sources, converting it into a consistent and clean form, and loading it into the data warehouse. Different tools are available in the market to perform ETL jobs.

Q. Explain the difference between data mining and data warehousing.
Ans: Data warehousing is the extraction of data from different sources, cleaning it and storing it in the warehouse, whereas data mining examines or explores that data, for example with queries fired on the data warehouse. Exploring the data through data mining helps in reporting, planning strategies, finding meaningful patterns, etc. E.g. a company's data warehouse stores all the relevant information about projects and employees. Using data mining, one can use this data to generate different reports, such as profits generated.

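The fact/dimension split and the ETL steps described above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration under stated assumptions: the table names (dim_task, fact_resource_usage), the column names and the sample rows are hypothetical and do not come from any particular ETL tool.

```python
import sqlite3

# Hypothetical raw rows "extracted" from two source systems: (task name, resources used).
SOURCE_A = [("Design ", "3"), ("build", "5")]
SOURCE_B = [("BUILD", "2"), ("Test", None)]


def extract():
    """Extract: gather raw rows from every source."""
    return SOURCE_A + SOURCE_B


def transform(rows):
    """Transform: normalize task names, cast measures to int, drop incomplete rows."""
    clean = []
    for task, used in rows:
        if used is None:
            continue  # cleansing step: skip rows with a missing measure
        clean.append((task.strip().lower(), int(used)))
    return clean


def load(conn, rows):
    """Load: descriptive data goes to the dimension table, measures to the fact table."""
    conn.execute("CREATE TABLE dim_task (task_id INTEGER PRIMARY KEY, task_name TEXT UNIQUE)")
    conn.execute(
        "CREATE TABLE fact_resource_usage ("
        "task_id INTEGER REFERENCES dim_task(task_id), resources_used INTEGER)"
    )
    for task, used in rows:
        # One dimension row per task; many fact rows may point at it (one-to-many).
        conn.execute("INSERT OR IGNORE INTO dim_task (task_name) VALUES (?)", (task,))
        (task_id,) = conn.execute(
            "SELECT task_id FROM dim_task WHERE task_name = ?", (task,)
        ).fetchone()
        conn.execute("INSERT INTO fact_resource_usage VALUES (?, ?)", (task_id, used))


def resources_per_task(conn):
    """Join the fact table to its dimension and total the measure per task."""
    return conn.execute(
        "SELECT d.task_name, SUM(f.resources_used) "
        "FROM fact_resource_usage AS f JOIN dim_task AS d ON d.task_id = f.task_id "
        "GROUP BY d.task_name ORDER BY d.task_name"
    ).fetchall()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
totals = resources_per_task(conn)
print(totals)
```

With the sample rows above, the two "build" rows from different sources end up as two fact rows that share one dimension row, and the incomplete "Test" row is dropped during transformation.
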
Q. What is an OLTP system and an OLAP system?
Ans: OLTP: Online Transaction Processing manages transaction-oriented applications that handle a high volume of short transactions. Typical examples are banking and airline ticket booking. Because OLTP uses a client-server architecture, it supports transactions that run across a network.
OLAP: Online Analytical Processing performs analysis of business data and provides the ability to perform complex calculations, usually with a low volume of transactions over large amounts of historical data. OLAP helps the user gain insight into data coming from different sources (multidimensional).

Q. What is PDAP?
Ans: A data cube stores data in a summarized form, which allows faster analysis and easy reporting. E.g. using a data cube, a user may want to analyze the weekly or monthly performance of an employee. Here, month and week can be considered the dimensions of the cube.

Q. What is snowflake schema design in a database?
Ans: A snowflake schema in its simplest form is an arrangement of fact tables and dimension tables. The fact table is usually at the center, surrounded by the dimension tables. In a snowflake schema the dimension tables are normally broken down further into more dimension tables. E.g. the dimension tables include employee, projects and status. The status table can be further broken into status_weekly and status_monthly.

Q. What is analysis service?
Ans: Analysis service provides a combined view of the data used in OLAP or data mining. "Services" here refers to OLAP and data mining.

Q. Explain sequence clustering algorithm.
Ans: A sequence clustering algorithm groups similar or related paths, i.e. sequences of data containing events. E.g. a sequence clustering algorithm may help find the path to store products of a “similar” nature in a retail warehouse.

Q. Explain discrete and continuous data in data mining.
Ans: Discrete data can be considered as defined or finite data. E.g. mobile numbers, gender.
Continuous data can be considered as data which changes continuously and in an ordered fashion. E.g. age.

Q. Explain time series algorithm in data mining.
Ans: A time series algorithm can be used to predict continuous values of data. Once the algorithm has been trained to predict one series of data, it can predict the outcome of other series. E.g. the performance of one employee can be used to forecast profit.

Q. What is XMLA?
Ans: XMLA is XML for Analysis, a standard for accessing data in OLAP, data mining or other data sources over the internet. It is based on the Simple Object Access Protocol (SOAP). XMLA uses the Discover and Execute methods. Discover fetches information about the data source, while Execute allows applications to run commands against the data source.

Q. Explain the difference between data warehousing and business intelligence.
Ans: Data warehousing helps you store the data, while business intelligence helps you use the data for decision making, forecasting, etc. Data warehousing, using ETL jobs, stores data in a meaningful form. Business intelligence tools were created to query that data for reporting and forecasting.

Q. What is Dimensional Modeling?
Ans: Dimensional modeling (DM) is often used in data warehousing. In simpler words, it is a rational, consistent design technique used to build a data warehouse. DM uses the facts and dimensions of a warehouse for its design. Star and snowflake schemas are examples of dimensional models.

Q. What is surrogate key? Explain it with an example.
Ans: Data warehouses commonly use a surrogate key to uniquely identify an entity. A surrogate key is generated by the system, not by the user. A primary difference between a primary key (PK) and a surrogate key (SK) in a few databases is that the PK uniquely identifies a record while the SK uniquely identifies an entity. E.g. an employee may be recruited before the year 2000 while another employee with the same name may be recruited after the year 2000.
Here, the primary key will uniquely identify the record, while the surrogate key will be generated by the system (say, a serial number), since the SK is NOT derived from the data.

Q. What is the purpose of a factless fact table?
Ans: Factless fact tables are so called because they contain only keys that refer to the dimension tables. They do not hold any facts or measures, and are commonly used for tracking the occurrence of an event. E.g. to find the number of leaves taken by an employee in a month.

Q. What is the level of granularity of a fact table?
Ans: A fact table is usually designed at a low level of granularity. This means that we need to find the lowest (most detailed) level of information that can be stored in the fact table. E.g. employee performance is a very high level of granularity. employee_performance_daily and employee_performance_weekly can be considered lower levels of granularity.

Q. Explain the difference between star and snowflake schemas.
Ans: A snowflake schema design is usually more complex than a star schema. In a star schema, a fact table is surrounded by multiple dimension tables. The snowflake schema is designed the same way, except that its dimension tables can be further broken down into sub-dimensions. Hence, data in a snowflake schema is more normalized and standardized than in a star schema. E.g. Star schema: performance report is a fact table. Its dimension tables include performance_report_employee and performance_report_manager. Snowflake schema: the dimension tables can be broken down into performance_report_employee_weekly, monthly, etc.

Q. What is the difference between view and materialized view?
Ans: A view is created by combining data from different tables. Hence, a view does not have data of its own. A materialized view, on the other hand, is commonly used in data warehousing and does hold data. This data helps in decision making, performing calculations, etc. The data is calculated beforehand using queries and then stored.
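The difference between a view and a materialized view can be sketched with Python's sqlite3 module. SQLite has ordinary views but no native materialized views, so this sketch emulates one by storing the query result in a table and rebuilding it on demand. That emulation, and the names sales, v_sales_summary and mv_sales_summary, are assumptions made purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("east", 10), ("west", 20)])

SUMMARY_QUERY = "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"

# A view stores only the query; its rows are computed each time the view is read.
conn.execute(f"CREATE VIEW v_sales_summary AS {SUMMARY_QUERY}")


def refresh_materialized_summary(conn):
    """Emulate a materialized view: run the query once and store its result rows."""
    conn.execute("DROP TABLE IF EXISTS mv_sales_summary")
    conn.execute(f"CREATE TABLE mv_sales_summary AS {SUMMARY_QUERY}")


def read(conn, name):
    """Read the summary rows from a view or table (name is a fixed internal string)."""
    return conn.execute(f"SELECT region, total FROM {name} ORDER BY region").fetchall()


refresh_materialized_summary(conn)

# New source data arrives after the materialized copy was built.
conn.execute("INSERT INTO sales VALUES ('east', 5)")

view_rows = read(conn, "v_sales_summary")    # recomputed now, includes the new row
stale_rows = read(conn, "mv_sales_summary")  # stored earlier, still the old totals
refresh_materialized_summary(conn)
fresh_rows = read(conn, "mv_sales_summary")  # matches the view again after a refresh
```

The stored copy answers queries without re-running the aggregation, at the price of going stale until it is refreshed. Databases with built-in materialized views manage that refresh themselves.
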
When a view is created, the data is not stored in the database. The data is produced when a query is fired on the view, whereas the data of a materialized view is stored.

Q. What is junk dimension?
Ans: In scenarios where certain data may not be appropriate to store in the schema, this data (or these attributes) can be stored in a junk dimension. The data in a junk dimension is usually Boolean or flag values. E.g. whether the performance of an employee was up to the mark, or comments on performance.

Q. What are fundamental stages of Data Warehousing?
Ans: The stages of a data warehouse help to find and understand how the data in the warehouse changes.
- At the initial stage, transaction data is merely copied to another server. Even if the copied data is processed for reporting, the source system’s performance is not affected.
- In the next stage, the data in the warehouse is updated regularly from the source data.
- In the real-time data warehouse stage, data in the warehouse is updated for every transaction performed on the source data (e.g. booking a ticket).
- At the integrated stage, the warehouse not only updates data as and when a transaction is performed but also generates transactions which are passed back to the source online systems.

Q. What is Data Scheme?
Ans: A data scheme is a diagrammatic representation that illustrates data structures and their relationships to each other in the relational database within the data warehouse. The data structures are named and defined with their data types. Data schemes are handy guides for database and data warehouse implementation. A data scheme may or may not represent the real layout of the database; it is a structural representation of the physical database. Data schemes are also useful in troubleshooting databases.

Q. What is Bit Mapped Index?
Ans: Bitmap indexes make use of bit arrays (bitmaps) and answer queries by performing bitwise logical operations on them. They work well with data that has low cardinality, that is, data that takes few distinct values. Bitmap indexes are useful in data warehousing applications, where they have a significant space and performance advantage over other structures for such data. Tables that have few insert or update operations are good candidates. The advantages of bitmap indexes are:

  • They have a highly compressed structure, making them fast to read.
  • Their structure makes it possible for the system to combine multiple indexes together so that they can access the underlying table faster.

The disadvantage of bitmap indexes is:

  • The overhead of maintaining them is high, especially when the indexed data is updated frequently.
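
The way a bitmap index answers a query with bitwise operations can be sketched in a few lines of Python. This is only an illustrative sketch: it uses plain Python integers as uncompressed bit arrays, and the column names and sample values are hypothetical. Real database engines use compressed, disk-based structures.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)


def matching_rows(bitmap):
    """Decode a bitmap back into the list of row numbers whose bit is set."""
    rows, row_id = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_id)
        bitmap >>= 1
        row_id += 1
    return rows


# Hypothetical low-cardinality columns for six rows of a table.
gender = ["F", "M", "F", "F", "M", "F"]
active = [1, 1, 0, 1, 0, 1]

gender_index = build_bitmap_index(gender)
active_index = build_bitmap_index(active)

# WHERE gender = 'F' AND active = 1 becomes a single bitwise AND of two bitmaps.
both = gender_index["F"] & active_index[1]
# WHERE gender = 'M' OR active = 0 becomes a bitwise OR.
either = gender_index["M"] | active_index[0]

print(matching_rows(both))
print(matching_rows(either))
```

Combining two indexes costs one integer operation here, which mirrors the second advantage listed above. The maintenance cost also shows: changing one row's value means updating the bitmaps of both the old and the new value.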

Q. What is Bi-directional Extract?
Ans: In hierarchical, networked or relational databases, data can be extracted, cleansed and transferred in two directions. The ability of a system to do this is referred to as bidirectional extract. This functionality is extremely useful in data warehousing projects.

Data extraction: The source systems from which data is extracted vary widely, from their structures and file formats to the department and business segment they belong to. Common source formats include flat files, relational databases and non-relational database structures such as IMS, VSAM or ISAM.

Data transformation: The extracted data may undergo transformation, with possible addition of metadata, before it is exported to another large storage area. In the transformation phase, various functions related to business needs, requirements, rules and policies are applied to the data. During this process some values are also translated and encoded. Care is taken to avoid redundancy of data.

Data cleansing: In data cleansing, incorrect or corrupted data is scrutinized and the inaccuracies are removed, which ensures data consistency. It involves activities like:
- removing typographical errors and inconsistencies
- comparing and validating data entries against a list of entities

Data loading: This is the last process of bidirectional extracts. The cleansed, transformed source data is loaded into the data warehouse.

Advantages:
- Updates and data loading become very fast due to bidirectional extracting.
- As timely updates are received in a useful pattern, companies can make good use of this data to launch new products and formulate market strategies.

Disadvantages:
- More investment is needed in advanced and faster IT infrastructure.
- Lack of fault tolerance may mean an unexpected stoppage of operations when the system breaks.
- A skilled data administrator needs to be hired to manage the complex process.

Q. What is Data Collection Frequency?
Ans: Data collection frequency is the rate at which data is collected. However, the data is not just collected and stored. It goes through various stages of processing: extraction from various sources, cleansing, transformation and then storage in useful patterns. It is important to keep a record of the rate at which data is collected, for various reasons:
- Companies can use these records to keep track of the transactions that have occurred. Based on these records, a company can tell whether any invalid transactions ever occurred.
- In scenarios where the market changes rapidly, companies need very frequently updated data to enable them to make decisions based on the state of the market and then invest appropriately.
- A few companies keep launching new products and keep updating their records so that their customers can see them, which in turn increases their business.
- When data warehouses face technical problems, the logs as well as the data collection frequency can be used to determine the time and cause of the problem.
- Due to real-time data collection, database managers and data warehouse specialists can make more room for recording data collection frequency.

Q. What is Data Cardinality?
Ans: Cardinality is the term used in database relations to denote the occurrences of data on either side of the relation. There are 3 basic types of cardinality:
- High data cardinality: values of a data column are very uncommon (unique or nearly unique). E.g. email IDs and user names.
- Normal data cardinality: values of a data column are somewhat uncommon but never unique. E.g. a data column containing LAST_NAME (there may be several entries of the same last name).
- Low data cardinality: values of a data column are very common (few distinct values). E.g. flag statuses: 0/1.
Determining data cardinality is a substantial aspect of data modeling.
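One simple way to estimate the data cardinality of a column is to compare the number of distinct values with the total number of values. The sketch below does that in Python. The 0.9 and 0.1 thresholds and the sample columns are hypothetical choices for illustration, not standard cut-offs.

```python
def cardinality_ratio(values):
    """Return distinct values divided by total values (1.0 means every value is unique)."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def classify_cardinality(values, high=0.9, low=0.1):
    """Label a column as high, normal or low cardinality using illustrative thresholds."""
    ratio = cardinality_ratio(values)
    if ratio >= high:
        return "high"
    if ratio <= low:
        return "low"
    return "normal"


# Hypothetical sample columns of 100 rows each.
emails = [f"user{i}@example.com" for i in range(100)]  # every value distinct
last_names = [f"name{i % 30}" for i in range(100)]     # 30 distinct values, repeated
flags = [i % 2 for i in range(100)]                    # only 0 and 1

for column_name, column in [("email", emails), ("last_name", last_names), ("flag", flags)]:
    print(column_name, classify_cardinality(column))
```

A column that comes out as low cardinality, such as the flag column, is the kind of data for which a bitmap index is usually suggested.
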
Cardinality is used to determine relationships. Types of cardinalities:
- The Link Cardinality: 0:0 relationship
- The Sub-type Cardinality: 1:0 relationship
- The Physical Segment Cardinality: 1:1 relationship
- The Possession Cardinality: 0:M relationship
- The Child Cardinality: 1:M mandatory relationship
- The Characteristic Cardinality: 0:M relationship
- The Paradox Cardinality: 1:M relationship

Q. What is Chained Data Replication?
Ans: In chained data replication, the non-official data set distributed among many disks provides for load balancing among the servers within the data warehouse. Blocks of data are spread across clusters, and each cluster can contain a complete set of replicated data. Every data block in every cluster is a unique permutation of the data in other clusters. When a disk fails, all calls made to the data on that disk are redirected to the other disks where the data has been replicated. At times replicas and disks are added online without having to move the data in the existing copy or affect the arm movement of the disk. For load balancing, chained data replication lets multiple servers within the data warehouse share data request processing, since the data already has replicas on each server's disk.

Q. What are Critical Success Factors?
Ans: Critical success factors (CSFs) are the key areas of activity in which favorable results are necessary for a company to reach its goal. There are four basic types of CSFs:
- Industry CSFs
- Strategy CSFs
- Environmental CSFs
- Temporal CSFs
A few CSFs are:
- Money
- Your future
- Customer satisfaction
- Quality
- Product or service development
- Intellectual capital
- Strategic relationships
- Employee attraction and retention
- Sustainability
The advantages of identifying CSFs are: they are simple to understand; they help focus attention on major concerns; they are easy to communicate to coworkers; they are easy to monitor; and they can be used in concert with strategic planning methodologies.

Q. What is Virtual Data Warehousing?
Ans: A virtual data warehouse provides a collective view of the completed data. A virtual data warehouse has no historic data. It can be considered a logical data model of the containing metadata.

Q. What is active data warehousing?
Ans: An active data warehouse represents a single state of the business. Active data warehousing considers the analytic perspectives of customers and suppliers. It helps to deliver updated data through reports.

Q. What is data modeling and data mining? What is this used for?
Ans: Data modeling is a technique used to define and analyze the data requirements that support an organization’s business processes. In simple terms, it is used to analyze data objects in order to identify the relationships among them in any business.
Data mining is a technique used to analyze datasets to derive useful insights/information. It is mainly used in retail, consumer goods, telecommunication and financial organizations that have a strong consumer orientation, in order to determine the impact on sales, customer satisfaction and profitability. Data mining is very helpful in determining the relationships among different business attributes.

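As a small illustration of finding relationships in retail data, the sketch below counts how often pairs of items are bought together, often called the support of an item pair. It is one very simple data mining technique, chosen here only as an example. The basket contents and the 0.5 threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return the fraction of baskets in which each pair of items appears together."""
    if not baskets:
        return {}
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


# Hypothetical purchase baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "milk"],
]

support = pair_support(baskets)
# Keep only the pairs that occur in at least half of the baskets.
frequent = sorted(pair for pair, s in support.items() if s >= 0.5)
print(frequent)
```

Pairs that pass the threshold are candidates for decisions such as shelf placement or joint promotions. A real analysis would run against far more data and usually consider larger item sets as well.
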
Q. What is the difference between ER modeling and dimensional modeling?
Ans: The entity-relationship model is a method used to represent the logical flow of entities/objects graphically, which in turn is used to create a database. It has both a logical and a physical model, and it is good for reporting and point queries.
The dimensional model is a method in which data is stored in two types of tables, namely fact tables and dimension tables. It has only a physical model. It is good for ad hoc query analysis.

Q. What is the difference between data warehousing and business intelligence?
Ans: Data warehousing relates to all aspects of data management, starting from the development, implementation and operation of the data sets. It is a backup of all data relevant to the business context, i.e. a way of storing data.
Business intelligence is used to analyze the data from a business point of view in order to measure an organization’s success. Factors like sales, profitability, marketing campaign effectiveness, market share and operational efficiency are analyzed using business intelligence tools like Cognos, Informatica, SAS, etc.

Q. Describe dimensional modeling.
Ans: The dimensional model is a method in which data is stored in two types of tables, namely fact tables and dimension tables. The fact table comprises the information used to measure business success, and the dimension tables comprise the information on which business success is calculated. It is mainly used by data warehouse designers to build data warehouses. It represents the data in a standard and sequential manner that enables high-performance access.

Q. What is snapshot with reference to data warehouse?
Ans: A snapshot refers to a complete picture of the data at the time of extraction. It occupies less space and can be used to back up and restore data quickly.