What is MSBI?
The need for Business Intelligence tools will not fade as long as IT and the internet exist around us, so business intelligence vendors keep adding features to these tools for faster data analysis. This article on MSBI explains the need for and importance of business intelligence tools in the market. Without wasting much time, let us move into the actual topic.
What is MSBI?
MSBI (Microsoft Business Intelligence) is Microsoft's business intelligence suite. It provides end-to-end solutions for data mining and business queries. Besides, it gives companies various ways to access their data so that they can make better business decisions and plan for the future.
Out of the box, the suite provides ways to work with existing data and analyze it, and it also lets you bring in new data. This powerful suite is composed of many tools that together provide solutions for business intelligence and data mining. It is built on Visual Studio and SQL Server, and it offers a different tool for each process that a Business Intelligence solution needs. Moreover, it can handle complex data, integrate and analyze it, and produce proper reports that help in making business decisions.
Why is MSBI necessary?
Many data analysts and data scientists use MSBI for the following reasons:
a) This Microsoft suite can store and retrieve data to support smart decisions.
b) It consists of several resourceful tools that together provide a complete business intelligence solution.
c) Since the market is short of SSAS, SSIS, and SSRS professionals, demand for this business intelligence suite has increased considerably.
d) It provides a separate tool for each of the processes that a business intelligence solution needs.
Besides, there are many other reasons to learn MSBI to become a successful IT professional. Are you curious to know what they are? Then visit MSBI Online Training.
Architecture:
This Microsoft business intelligence suite has 3 components. They are:
SSIS – SQL Server Integration Services
SSAS – SQL Server Analysis Services
SSRS – SQL Server Reporting Services
Let us discuss each of them briefly:
SSIS:
As the name suggests, this component is used to integrate data coming from different sources, and it is the workhorse of data warehousing. Since data is collected from various sources, this component uses the Extract, Transform, and Load (ETL) process: it extracts data from different locations, integrates (transforms) it, and ultimately loads it into a data warehouse. SSIS can build high-performance integration packages and workflows, and it ships with graphical tools and wizards for building those packages. In simple words, this component suits bulk data movement best, and the data it consolidates feeds trend reports, predictive analysis, and comparison reports. Hence it helps business analysts make quick decisions.
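To make the ETL idea concrete, here is a minimal sketch in Python of the extract-transform-load pattern that an SSIS package automates through its graphical designer (in a real MSBI project the same flow is modeled as a package in SQL Server Data Tools rather than hand-written code). The file name, table name, and connection string are hypothetical assumptions, and the pyodbc connection details would need to match your own SQL Server instance.

```python
import csv
import pyodbc  # ODBC client for SQL Server; assumes an ODBC driver is installed

# Extract: read raw rows from a source file (hypothetical sales_raw.csv)
with open("sales_raw.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean and reshape the data before loading
cleaned = [
    (row["order_id"], row["region"].strip().title(), float(row["amount"]))
    for row in rows
    if row["amount"]  # drop rows with a missing amount
]

# Load: bulk insert into a SQL Server staging table (hypothetical dbo.SalesStaging)
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=SalesDW;Trusted_Connection=yes;"
)
with conn:
    cursor = conn.cursor()
    cursor.fast_executemany = True
    cursor.executemany(
        "INSERT INTO dbo.SalesStaging (OrderId, Region, Amount) VALUES (?, ?, ?)",
        cleaned,
    )
```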
SSAS:
This component converts two-dimensional data into a multidimensional data model, which makes it well suited to analyzing large volumes of data. Besides, it helps analyze how SQL Server performs in terms of load balancing, heavy data, and transactions. Its main responsibility is to develop Online Analytical Processing (OLAP) solutions. Hence this analytical component fits administrators who need to analyze data: with it, an admin can examine data before it moves into the warehouse and can see, for example, how many transactions happen per second. SSAS has many advantages, among them multidimensional analysis, key performance indicators, scorecards, good performance, and security.
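As a rough illustration of turning flat, two-dimensional rows into a multidimensional view (the kind of cube-style aggregation SSAS serves through OLAP), here is a small sketch using pandas. The column names and figures are invented; a real SSAS cube is defined on the server and queried with MDX rather than built in memory like this.

```python
import pandas as pd

# Flat, two-dimensional fact rows (hypothetical sales data)
sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East"],
    "product": ["Bikes", "Helmets", "Bikes", "Helmets", "Bikes"],
    "year":    [2023, 2023, 2023, 2024, 2024],
    "amount":  [1200.0, 150.0, 900.0, 200.0, 1500.0],
})

# Pivot into a multidimensional view: region x (year, product) with totals,
# roughly what an OLAP cube precomputes so slicing and dicing stays fast.
cube = pd.pivot_table(
    sales,
    values="amount",
    index="region",
    columns=["year", "product"],
    aggfunc="sum",
    margins=True,      # adds an "All" row/column of grand totals
    fill_value=0,
)
print(cube)
```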
SSRS:
As the name indicates, this component prepares reports that contain visuals. This reporting platform presents both modern and traditional reports through suitable built-in or custom applications. It is platform-independent and efficient, can retrieve data from various sources, and can export reports in many formats. Besides, it supports web-based reports and can display them as gauges, tables, charts, and many more. SSRS has many excellent benefits, among them retrieving data from multiple sources, support for ad hoc reporting, and export functionality with a variety of formats.
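For example, a published SSRS report can be exported programmatically through the Report Server's URL access endpoint. The sketch below is a minimal Python example under the assumption that a report named /Sales/MonthlySales exists on a server called reportsrv and that Windows (NTLM) authentication is in use; the requests_ntlm package and all names here are placeholders for your own environment.

```python
import requests
from requests_ntlm import HttpNtlmAuth  # NTLM auth for the Report Server (assumed setup)

# URL-access syntax: report path plus rs:Format to pick the export format
url = (
    "http://reportsrv/ReportServer"
    "?%2FSales%2FMonthlySales"      # URL-encoded report path (hypothetical)
    "&rs:Format=PDF"                # other formats include CSV, EXCELOPENXML, WORDOPENXML
)

response = requests.get(url, auth=HttpNtlmAuth(r"DOMAIN\report_user", "password"))
response.raise_for_status()

# Save the rendered report to disk
with open("MonthlySales.pdf", "wb") as f:
    f.write(response.content)
```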
MSDN Library:
The MSDN Library is a collection of sites delivered by Microsoft that provides documentation, information, and discussion for development teams. Here, Microsoft has given particular importance to incorporating forums, blogs, social bookmarking, and library annotations.
What are the features of MSBI?
There are many features of MSBI. Let us discuss some of them:
- Provides a single version of the truth for making effective decisions
- Eliminates or reduces 'instinctive decisions'
- Provides quick and timely answers to the business, making it more responsive to business trends
- Minimizes manual work
- Supports historical and summarized data
- Robust support for advanced analysis
Likewise, there are many more real-time features of this Microsoft business intelligence suite. You can acquire hands-on experience with it from live experts online at MSBI Online Course. I hope you now have a basic idea of the need for MSBI and how it is used in the IT industry. In the upcoming posts of this blog, I'll cover each component in detail individually. Meanwhile, have a glance at our MSBI Interview Questions and get placed in an MNC.
How does Hadoop work?
Apache Hadoop is a framework that can store and process huge amounts of unstructured data ranging from terabytes to petabytes. This file system is highly available and fault-tolerant to its users. This platform is capable of storing a massive amount of data in a distributed manner in HDFS. The Hadoop map-reduce is a processing unit in Hadoop that processes the data in parallel. Hadoop YARN is another component of the Hadoop framework that is good at managing the resources amongst applications running in a cluster and scheduling a task. Hadoop has overcome its dependency as it does not rely on hardware but instead achieves high availability and also detects the point of failures in the software itself. Hadoop has also given birth to countless innovations in the big data space. Apache Spark that has been talked about most about the technology was born out of Hadoop. How do Hadoop works? Hadoop does distribute processing of huge data across the cluster of commodity of servers that work on multiple servers simultaneously. To process any data, the client submits the data and program to Hadoop. In the Hadoop Ecosystem, HDFS is good at Data Storage, Map Reduce is good at Data Processing, and YARN is good at task dividing. Are you new to the concept of Hadoop?, then check out our post on What is Hadoop? How does HDFS work in Hadoop? HDFS: HDFS is a distributed file system that runs on master-slave technology. This component has two daemons namely the namenode as well as the data node. Name Node: The name node is a daemon that is running on the master machine. It is the centerpiece of the HDFS file system. The name node store the directory tree of all file in the file system. This Namenode comes into the picture where the client wants to add/copy/move/ delete the file. Whenever the clients request the Name Node return the list of Data Node servers where the actual data resides. Data Node: This daemon runs on the Slave node where it stores the data in Hadoop File System. In a functional file system, the data replicates across many Data Nodes. Initially, the Data Node was connected to the Name Node. It keeps on looking for the request to access the data. Once the Namenode provides the location of the data, the client applications can interact with Data Node directly. And during the data replication, the data node instances can talk to each other. Replica Placement: The Replica placement decides the HDFS performance and reliability. Huge HDFS clusters instance runs on a cluster of computers spread across the racks. The communication here happens by switching between the nodes. The rack awareness algorithm determines the rack id of each Data Node. The replicas get placed on unique racks in a simple policy. It prevents data loss in the event of failure. During the Data retrieval, it utilizes the bandwidths from multiple racks. Map Reduce: The map-reduce algorithm is to processes the data parallelly on a distributed cluster. It is subsequently combined into a desired output/ result. This Map Reduce consists of several stages: In the first step, the program locates and reads the file containing the raw data. Since the file format is arbitrary, there is a need to convert the data into something where something can process. Here the and does this job. Here the Input format uses Input Split function to split the file into smaller pieces. Then the Record Reader transforms the raw data for processing by map. Here the Record Reader outputs a list of key-value pairs. 
Once the mapper process these key-value pairs the result goes to the . Here we have another function called which intimates the user when the mapper finishes the task In the next step, the reduce function performs its task on each key-value pair from the mapper. Finally, the output pair organizes the key-value pair from the reducer for writing on HDFS. Do you want to know the practical working of Map Reduce? If Yes, visit Hadoop Online Training YARN: YARN is responsible for diving the task on job monitoring/scheduling and resource management into separate daemons. Besides, there is one Resource Manager and per-application Application Master. Here the application can be Job (or) a DAG of jobs. The resource manager has two components. A scheduler and the Application Manager. Here the scheduler is a pure scheduler that does not track the status of the application. Moreover, there is no need of restarting the application in case of application (or) hardware failure. Here the scheduler allocates the resources based on the abstract notation of the computer. Here the container is nothing but a fraction of resources like CPU, memory, disk, network, and many more. The Application Manager does the following tasks: Accept the Job submission by the client Negotiates the first container for a specific application master. Restarts the container after an application failure On the other side Application Master does the following tasks: Negotiates containers from the scheduler Tracks the container status and monitors its progress. This YARN Supports the concepts of Resource Reservation via a Reservation System. Here the users can fix the number of resources for the execution of a particular job over time and temporal constraints. This Reservation system ensures that the resources were available to the job until its completion. YARN is capable of scaling beyond the thousand nodes viaYARN federation. These YARN federations allow wiring multiple subclusters into a single massive cluster. Here, we can use many independent clusters together to form a single large job. Moreover, it is capable of achieving large-scale systems. What is Hadoop used for? Hadoop has become a distributed framework for processing large amounts of structured and semi-structured data. This platform is not good enough in dealing with small data sets. But when compared with a large amount of data, this platform suits best in the following cases: This platform suits well in a variety of big data applications that gather data from different sources in different formats. This platform is very flexible in storing the various data types, irrespective of the data type contains in the data. Hadoop in the big data application has to join the data through any format. Large scale enterprises require clusters of servers, where specialized data management and programming skills were limited where its implementation is a costly affair. What can Hadoop do? Hadoop can be fit into multiple roles depending on the movie. These platforms suit best product recommendations, fraud detection, and identifying diseases, sentiment analysis, infrastructure management, and many more. Hadoop distributes the same job across the cluster and gets done within a limited time that runs on commodity hardware. Timing and money save is the ultimate goal of any business. This is how Hadoop works in big data. By reaching the end of this post, I hope you people have gained enough knowledge on working with the Hadoop Ecosystems. 
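To make the map and reduce stages described above more concrete, here is a tiny, self-contained Python simulation of the word-count pattern: a map step that emits key-value pairs, a shuffle step that groups them by key, and a reduce step that aggregates each group. This is only a sketch of the programming model; on a real cluster the same logic would be written against Hadoop's MapReduce API (or a framework such as mrjob) and run in parallel across many nodes.

```python
from collections import defaultdict

lines = [
    "hadoop stores big data",
    "hadoop processes big data in parallel",
]

# Map: emit (key, value) pairs - here (word, 1) for every word in every line
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group all values belonging to the same key together
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: aggregate the values for each key into the final result
word_counts = {key: sum(values) for key, values in grouped.items()}

print(word_counts)  # e.g. {'hadoop': 2, 'big': 2, 'data': 2, ...}
```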
You can get practical knowledge of Hadoop from real-time industry professionals through Hadoop Online Course. In the upcoming posts of this blog, I'll share a detailed explanation of each component of the Hadoop file system. You can also check out our Hadoop Interview Questions prepared by experts on our website.
How does Tableau work?
Data Analysis is the art of presenting the data in a manner that even a non-analyst can understand. A perfect blend of aesthetic elements like colors, dimensions as well as labels is good at creating visual masterpieces, that reveal the surprising business insights that in turn helps the businesses to make the informed decisions. Data Analysis is an unavoidable part of business analytics. Since more and more sources of data we're getting discovered, business managers at various levels use the data visualization software that allows them to analyze the trends quickly visually and take quick business decisions. Tableau is one of the fastest-growing business intelligence and data visualization tools. In this blog post, today we were going to discuss the working of tableau in real-time. Tableau is a business intelligence tool for the visual analysis of data. Through tableau, users can create and distribute an interactive and shareable dashboard. Through this business intelligence tool, we can depict the trends, variations, and density of data that can be represented in the form of charts and graphs. Through this tool, users can connect the files, relational databases, and other big data sources to acquire and process the data. This software allows data blending and real-time collaboration that makes it unique. This data analysis is used by businesses, academic researchers, and many government organizations for visual data analysis. Are you new to the phrase tableau? If so check out our post on What is tableau? How does tableau work? The working of data in tableau with the real-time data can be understood through the following steps: Tableau offers five different products to diverse the visualization needs for professionals as well as organizations. They are: Tableau Desktop: Made for Individual use Tableau Server: Collaboration for any organization Tableau Online: Business Intelligence in the Cloud Tableau Reader: lets you read the files saved in Tableau Desktop This business intelligence tool has the following highlights: Tableau Public and the tableau reader were free to use, while the tableau server and the tableau desktop come with the 14 days fully functional trial period. Once the trial period is completed the user will be charged as per the package. Tableau Desktop comes with both the professional as well as the personal edition at a lower cost. Besides tableau online is available with an annual subscription for a single user and scales to support thousands of users. Users can get the desktop version of tableau from their official website, and get full access to the various options for 14 days. Once the trial period finishes, the data visualization can be done with tableau public where the user’s data will be shared publicly. Once you install the software on the machine you can start the data visualization journey. Once you logged in to the tableau desktop, the starting page is divided into 7 sections: Connect to the File: This section helps you to connect with files, which allow you to extract data from different sources such as Excel, Text, Spatial Files, PDF, and so on. Connect to the Server: This section helps you to connect with the Tableau Servers (or) allows you to extract data from different servers such as SQL Server, MySQL, and Tableau Server, and so on. Saved Data Sources: This section contains the existing (or) Saved Data Sources. Open: This section contains the most recently used workbook under this section. 
Sample Workbooks: These were the sample workbooks that come with the tableau desktop installation. Training and Videos: This section contains some useful blogs as well as videos. Resources: This section contains the content generated by the tableau community. Note: If the server is not present under connect to the server section, then click more. A hyperlink that shows the list of supporting servers. Do you want to get a practical explanation of the tableau? If so, visit Tableau Online Training What are the exciting features of tableau? Tableau provides the solution for all kinds of industries, environments as well as departments. The following are the highlighting features of tableau that enable to handle the diverse scenarios: Centralized Data: The tableau server provides the centralized location to manage all the organizations published data sources. Through this centralized data, users can delete, change permissions, add tags and manage schedules in one convenient location. Through this centralized data, users can schedule, extract refreshes, and manages them in the data server. In addition, administrators can centrally define a schedule for extracts on the server for both full and incremental refreshes. Self-Reliant: This business intelligence tool does not require a complex software setup. The desktop version is opted by most users that can be installed easily and contain all the features needed to start and complete data analysis. Visual Discovery: This tool is good at exploring and analyzing the data from different tools like graphs, colors, and trend colors. Many options in this tool were drag and drop and require a small piece of code. Architecture Agnostic: Tableau works well with all kinds of data where the data flows. Hence the user need not to worry about the specific hardware (or) software requirements. Real-Time Collaboration: Tableau can sort, filter, and discuss the data on the fly and can also embed the live dashboard using different portals like Salesforce (or) Sharepoint. In addition, you can save and view your data, allow the colleagues to subscribe to your interactive dashboards where the subscribers can see the latest data just by refreshing the browser. Blend Diverse Data Sets: Tableau allows you to blend different relational, semi-structured, and rata data sources in real-time, without an expensive upfront integration cost. In addition, the user does need not to know the details of how the data is stored. Likewise, there are many highlighting features of the tableau. By reaching the end of this blog post, I expect you people have gained enough information on tableau working in real-time. Readers can get a practical explanation of this by real-time experts through Tableau Online Course. In the upcoming post of this blog, I'll be sharing some additional features of the tableau. Meanwhile, you can also check out our Tableau Interview questions prepared by experts on our website.
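The walkthrough above focuses on the Tableau Desktop start page, but much of the collaboration story runs through Tableau Server. As a small illustration, the sketch below uses the tableauserverclient Python package (an assumption: it is Tableau's REST API client, separate from Desktop itself) to sign in and list published workbooks. The server URL, site, and credentials are placeholders for your own environment.

```python
import tableauserverclient as TSC  # Tableau's Python client for the Server REST API

# Placeholder credentials and server address - replace with your own environment
tableau_auth = TSC.TableauAuth("analyst", "password", site_id="marketing")
server = TSC.Server("https://tableau.example.com", use_server_version=True)

# Sign in, list the workbooks published to the site, then sign out automatically
with server.auth.sign_in(tableau_auth):
    all_workbooks, pagination = server.workbooks.get()
    for workbook in all_workbooks:
        print(workbook.name, workbook.project_name)
```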
What is Ab initio?
Data plays a major role today. Whether it’s a start-up (or) a well-established company data is essential. Depending on the size of the company, people store this data in various places like a data warehouse, data mart with secured encryption. Since the data gets generated from different sources, it would be in multiple formats like xlsx, PDFs,.doc. .txt . So we need to convert these multiple forms of data to a single format and then remove the redundancies and finally sent them to the data marts. To perform all these functions IT companies use ETL tools like Ab initio to process the data. What is the need for Business Intelligence? Running a business in the IT industry is like walking in a dark room. In simple terms, Business Intelligence (BI) is the process of deriving meaningful information (or) insights from raw data. From the past one year, it has gained high significance in many verticals across the globe. Today many BI tools like Ab initio, Informatica, Cognos, Data Stage were widely used in the market. Today in this article. I'm going to discuss with you the need and importance of ab initio when compared to the other bi tools in the market. Let us start our discussion with What is ab initio? Ab initio stands for Start from the beginning. It is a business intelligence platform comprises of several data processing products. This platform has a powerful GUI based parallel processing for ETL management and data analysis. This platform works with the client- Serer model. Here the client is a Graphical Development Environment (which is an IDE) that resides on the client system. And the server system is called Co-Operating System. This Co-Operating System resides on the mainframe (or) the remote UNIX system. The abinitio code is known as a graph that has a .mp extension. In ab initio etl the graph from the GDE needs to be deployed in the corresponding .ksh version. This Ab initio ETL platform provides a robust architecture that allows simple, fast, and highly secure integration of systems and applications. This tool is capable of running heterogeneous applications parallelly over the distributed networks. Besides, it is capable of integrating diverse, complex, and continuous data streams ranging from gigabytes to petabytes to provide both ETL and EAI tasks within a single and consistent framework. Moreover, it is capable of integrating arbitrary data sources and programs and supplies entire metadata management across the enterprise. Besides, it is capable of integrating arbitrary data sources and programs and supplies the entire metadata management across the enterprise. This Business Intelligence tool solves the most challenging data processing issues for leading organizations in many areas like telecommunications, finance, insurance, eCommerce, retail, transport. In these areas, ab initio solutions are constructed and employed incredibly speedily and provide the best performance and scalability. Know more on Ab initio at Ab iniito Online Training Architecture: This Ab initio business Intelligence software consists of 6 different processing products. They are: Co-Operation Systems: It is the root for all ab initio functions and base for all the ab initio process. And this platform is capable of running different environments like AIX, HP-UX, Linux, Solaris as well as Windows Component Library: It is a reusable program module for sorting, data transformation, and excessive space database loading and unloading. 
It is a flexible and extensible tool that adapts at run time to the various formats of data entered. Graphical Development Environment: Graphical Development Environment provides an intuitive graphical user interface for editing and executing applications. Here you can easily drag and drop components from the library to the canvas and configure them and then connect to the canvas. Besides, its GDC graph compilation system comes with the new release of UNIX Shell script that can be executed on the machine where GDE is not installed. Besides, it provides an easy to use front end application for designing ETL graphs. It facilitates to run and debug ab initio jobs and traces the execution jobs. Enterprise Meta Environment: It is a data store that allows the tracking of changes in developed graphs and the metadata used in their development. Besides, it offers tools such as dependence analysis, metadata management, statistical analysis as well as version controlling. It is capable of storing both technical and business metadata. And this data can be accessed from a Web browser, Ab initio GDE, and Co-Operating System command line. Data profiler: It is an analytical application that can specify the data range, scope, distribution, variation as well as quality. It can run in a graphical environment, on top of the co-operating system. Condcut>It: It is high volume data processing systems developing tool. Besides,it allows the user to combine the graphics from a graphical development environment with custom scripts anf programs from various vendors. Besides, it provides both graphical and command-line interface to Conduct>it. Why Should you Ab initio over the other? Even though many ETL Tools were available in the market, people opt for ab initio due to the following reasons: One-Stop Solution: This tool provides a one-stop solution to a wide range of data processing solutions. Performance: It is capable of handling distributed processing as well as processing and loading the data in real-time. And its parallelism techniques process the data much faster. Reliable Customer Base: It is the tool used in big data industries like insurance, banking, logistics, stock market, retail, finance to process the complex and enormous volume of data. Development Time: The development time in handling the errors is less than many of its competitors. Efficiency: This platform provides a lot of features from the built-in components. Here the data is sorted and processes quickly. Here the parallel processing and error processing parameters were highly useful Easy Maintenance: The maintenance of these ETL tools is much easier and cheaper than many other ETL tools. Moreover, here the transformations were most advanced as well. Likewise, there are many reasons to use ab initio when compared to the other ETL tools in the market. By reaching the end of this blog, I hope you people have acquired the best knowledge on ab initio ETL and its application in the real-time industry. In the upcoming post of this blog, I'll be sharing the details of the interaction of real-time data through ETL. You people can get practical knowledge on ab initio through Ab initio Online Course . Also, check our later internet questions at Ab initio Interview Questions and get placed in an MNC.
What is Big Data?
Big Data has become the buzzword over the past few years. Do you know why this word has become a buzzword? Why big data becomes more popular? Are you curious to get answers to all these questions? Read the complete article to get answers to all those questions Before talking about this buzzword, let us initially discuss, What is Data? The Quantities, character (or) symbols on which operations the computer can perform the operations is known as Data. In computer, data gets stored, transmitted in the form of electrical signals and gets recorded on magenetic, optimal, mechanical recording media. So now let us move on to the actual topic What is Big Data? Big Data is the term which describes the huge volume of data. This huge volume of data includes both structured as well as unstructured data. In organization these volumes of data get generated on day to day basis through various sources. Moreover, organizations does not require this entire data. This is because this data may contain the reduntant as well the useless data. So the organizations segregate all these data into useful information as per their needs. In other words, this big data is also refer to the fast, larger (or) complex data that is impossible to process with the traditional methods. Even though the process of storing the large amount of information is happening from the long time, the concept of big data came into existence due to 3V’s in early 2000. These 3V’s refer as follows: Volume: Organization gets the data from various sources. In some cases, the data can be in terms of tens of terabytes. And in somecases it might be in terms of hundreds of petabytes. This includes business transactions, small IOT devices, industrial equipment, videos, social media, and so on. But, storage has become the most common problem in the previous decade, but today with the emergence of storage platform like Hadoop have eased this problem. Here the size of data plays a major role in determining the cost of data analysis. On the basis of volume, we can say that this volume is big data (or) not. Hence we can say volume is one of the most important characteristic in dealing with big data. Velocity: It refers to the rate with which the data is received and perhaps acted upon. Normally biggest velocity of data steams directly to be memory rather than being written on the disk. Moreover, some internet-enabled smart products operate in real time(or) near real-time and require a real-time evaluation as well as action. Moreover, this big data deals with the speed at which the data flows from different sources. These different sources includes business process, application logic, networks, social media sites, mobile devices and so on. Here the data flow is massive and continuous. Variety: Data in the Big data gets generated from multiple sources in various formats. Here the data formats can be as follows: Types of Big Data: Structured: Any data that can be stored, accessed, and processed in the form of fixed-format is known as Structured data. Over a period of time, the talent in the computer science has achieved great success in developing the techniques to work with such kind of data. But today there are some spacing issues in dealing with huge amounts of data( i.e data in the form of multiple zeta bytes). Unstructured: Any data with unknown form (or) structure is known as unstructured data. Other than dealing with the huge amount of data, unstructured data poses multiple challenges in its processing for deriving the value of it. 
In other words, an unstructured data is the heterogenous data source. This data source contains a combination of simple text files, images as well as videos. Moreover, the unstructured data may also includes log files, transaction history file and so on. In some organization even though the data is present in large amounts, they were not in a position to derive results. This is due to presence of data in irregular format. Semi- Structured: A semi structuted data contains data in both the forms. Moreover, we can say that the semi strcuted data as structured form of data but not actually defined. For instance, the table definition in the Relational Data base management Systems (RDBMS). Get more information on big data at Big Data Online course How Big Data works? Big data provides the way to open up the new opportunities and business models. This involves three phases as follows: a)Integrate: Big data is responsible to bring the data from different sources and applications. Traditional data integration mechanisms such as ETL generally were not up to the mark. And it requires new strategies and technologies to analyze big data sets in terms of terabytes (or) the petabytes of data. Moreover, during the integration, the analyst is responsible to bring the data in a single format before the data analysis starts b)Manage: As mentioned above, big data contains data in large volume. Hence the big data contains a large volumes of storage. And this storage can be on local (or) on-premises (or) on both. Moreover, through big data you are responsible to store any amount of data in any format as per the demand. Most of the people choose their storage solution according to the data currently residing. According to the recent stats, cloud is gradually popularity as they the flexible to scale the resources as per the requirement c)Analyze: Through big data,people can get the best analysis of the business. Moreover, people do get the visual data analysis of varied data analysis.Besides you can explore the data to make the new discoveries Benefits: Processing the big data includes multiple benefits as follows: a)Intelligence utilization and decision making: Access to social media sites like facebook and twitter are enabling organization to fine tune their business strategies b)Improved Customer Service: Traditional customer feedback systems were replaced by new systems designed with these technologies. Moreover, in these new systems Big Data and natural processing technologies are being used to read and evaluate the customer responses. Moreovoer, this technologies are responsible for the creation of staging area (or) landing the zone for new data. Hence likewise, there are many use cases of big data. And you people can get hands-on experience on big data from live experts through online at Big Data Online Course. I hope you people have got an enough idea regarding on big data. In my next article of this blog ill be sharing the knowledge on big data applications in real world.Mean while have a glance at our Big Data Interview Questions and grab the job in your dream firm
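Since the "variety" characteristic above is easiest to see in code, here is a minimal Python sketch that reads the same kind of information from a structured CSV source and a semi-structured JSON source and combines them. The file contents and field names are invented purely for illustration.

```python
import csv
import io
import json

# Structured data: fixed columns, as it might come from an RDBMS export
structured_csv = "customer_id,amount\n101,250.0\n102,99.5\n"
orders = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured data: JSON with nested, optional fields (e.g. clickstream events)
semi_structured = json.loads(
    '[{"customer_id": "101", "events": ["view", "buy"]},'
    ' {"customer_id": "103", "events": ["view"]}]'
)

# Combine both varieties into one simple summary per customer
summary = {}
for row in orders:
    summary.setdefault(row["customer_id"], {})["spend"] = float(row["amount"])
for record in semi_structured:
    summary.setdefault(record["customer_id"], {})["clicks"] = len(record["events"])

print(summary)  # {'101': {'spend': 250.0, 'clicks': 2}, '102': {'spend': 99.5}, '103': {'clicks': 1}}
```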
What is Cognos?
The availability of the internet has made many changes in IT people’s life. Data generation is one of those. Today data gets generated exponentially. This data may contain useful as well as useless data. Besides this data may also contain redundant data. So we need to analyze this data and remove duplicate data. And this filtered data may contain raw data in different formats. And we need to convert these multiple forms of data into a single form. Processing all these manually is a tedious task and requires high effort. Hence to get rid of all these problems we need a tool to perform all those. Then Cognos came into existence to do all these tasks. Hence let us have a quick look at those Before going to about what is Cognos, let us have a quick look at Why do we need Cognos? Any enterprise (or) an individual needs this tool to automate the data analysis as well as the Visualization task. This tool also helps in taking the predicting the future trends that could ensure in taking the key business decision. What is Cognos reporting tool? IBM’s Cognos is a web-based reporting and analytics tool. This tool helps you to perform data aggregation and create user-friendly detailed reports. This tool offers an option to export the report in XML (or) PDF format. This analytics software tool was founded in 1969 by Alan Rushforth and peter Glenister. It has begun as a consulting firm for the Canadian government and offers its first Software product under the name QUIZ in 1979. This software enables business users without technical knowledge to extract the corporate data to analyze it and assemble reports. Since Cognos is built on open standards, we can utilize this software products with relational and multidimensional data from multiple sources like Microsoft, NCR Teradata, SAP, and Oracle. This business intelligence performance management tool for IBM allows technical and non-technical employees to analyze, extract, and create interactive dashboards that enable companies to take business decisions. Cognos contains three dozens of software products. Since it consists of several different products, it enables communication with different third parties. This intelligence platform provides an analytical solution for business that is scalable and self- service. Moreover, this framework interactive feature makes it a good way of creating a user-friendly dashboard and reports for every company. This powerful business intelligence tool suits well for data mining, data analysis, event monitoring, metric collection for the visualization of data. Hence for any business to stay ahead in the market it gives powerful medium analytics to predict the market trends and take appropriate actions Would you like to know the working of this tool practically, then visit Cognos Online Training What can you do with Cognos? Cognos allows the creation of intelligent interactive dashboards to make the business informed decisions. Since the system is inbuilt with machine learning and AI, this tool is capable of data creation and analysis and also enables users to get relevant answers to the questions How does Cognos make the working easy? This business intelligence tool allows users to create interactive dashboards. Moreover, these tools are capable of predicting changes in the market. Today many analysts suggest the newbies that this tool is capable of taking the right decisions at the right time and make your firm on the top of the market. 
Moreover, many companies opt for these business intelligence tools due to their features. Let us have a quick look at those features. Cognos Features: a)Since the tool is equipped with machine learning and Artificial Intelligence, it enables us to make future predictions and create intelligent dashboards. b)It uses the pattern detection property to discover the hidden patterns in data that could not be unheard (or) not expected in normal circumstances c)As mentioned above, this tool is not only capable of creating the interactive dashboards, It has a capability of creating the interactive dashboards in multiple formats. Hence it enables the stakeholders to analyze the way they require and helps in the business making process. d)Natural language processing is a way to extract information from raw text and make intelligent predictions. So Cognos using natural language powered by AI assist users to get the intelligent response to the questions posted by them. The manual task usually takes time and effort. But this Cognos BI tool eliminates the need for human interventions by automating the data preparation process through its built-in ecosystem Advantages of Cognos: Cognos has several advantages. Let us discuss some of them a)Data Preparation: Files of different formats like CSV, spreadsheets can be easily accessed and uploaded. It also helps in finding the relevant data sources with the help of its intelligent system using natural processing language. Besides, these tools are also capable of automation of the integration of different sources. b)Data Explorations: The Data could be visualized and reported in a professional manner using Cognos. Also, its intelligent features allow plotting the correct chart for the particular business problem. Moreover, this tools contains various it's geospatial features in the dashboard. c)Data Sharing : Once the data is prepared and explored, it could be shared over different cloud platforms (or) Cloud. It also enables users to report subscriptions. Moreover, this platform integrates different charts that can integrate to create a story using various features like Voice overs, overlays, and so on. Disadvantages of Cognos: Even though there are many beautiful advantages of Cognos, there are some minor disadvantages as mentioned below: a)This tool does not support multi-layer dimension analysis b)These tools cannot be accepted very eagerly in departmental (or) the divisional employments. We cannot except 100 % pros everywhere. Each tools has some pros and cons. Irrespective of the above mentioned minor cons, this tool suit best in the analysis platform. I hope you people have got enough idea regarding the Cognos reporting tool. You people can get practical knowledge of this tool by live industry experts through the Cognos Online Course. In the upcoming articles of this blog, ill be sharing the details of the installation, components, and so on. Meanwhile, have a glance at our Cognos Interview Questions and crack the interview.
What is Data Science?
As big data came into the picture, storage has become a major concern in the IT world. This storage has taken as the primary concern since 2010. It is taken as the primary consideration due to increase in rapid exponential amount of data. And we cannot clone this data whenever its utilization was finished. Because, there are many chances for the re utilization of its data. So we need to store this data for future utilization. An analyst usually filters this data and utilizes this as per the requirement. Do you know "how do analysts filter this data"? Also, Are you aware of "Which algorithm is used to analyze this huge amount of data"? If no read this complete article on Data science and get answers to all these questions. Let us start knowing about data science through data science definition What is Data Science and analytics? Data Science is the blend of various tools, algorithms, and machine learning principles. Its goal is to discover the hidden patterns of data. It is primarily used to make business decisions and predictions. As mentioned you earlier, data gets generated from various sources. This includes financial logs, text files, multimedia forms, sensors as well as instruments. Simple BI tools were not capable of analyzing this huge volume as well as a variety of data. Hence there is a need for advanced complex and advanced analytical tools and algorithms for processing, analyzing, and drawing meaningful insights of it. So here data science came into the picture with various algorithms to process this huge amount of data. It makes use of predictive casual analytics, perspective analytics, and machine learning. Get more information on Data Science by live experts at Data Science Online Training let us have a quick look at those briefly. Predictive casual analytics: If you want a model that can predict the possibilities of the particular model in the future, predictive casual analytics comes into the picture. For example, if you are providing the money on a credit basis,then the probability of making credit card payments on time comes into the picture. Here you can build a model that can perform predictive analytics based on the payment history of the customer to predict the future payments of the customer. Perspective Analytics: This analytics comes into the picture if you want a model that has the intelligence of taking its own decisions. In other words, it not only predicts but also suggest the range of prescribed actions and the associated outcomes. The best example of this analytics is self-driving cars. Here the data gets generated by vehicles to train the self- driving cars. You can algorithms on this data to bring intelligence to it. Using intelligence with the data, it can make better decisions in different situations like taking U-turn, car reversing, speed regulation, and so on. Machine learning for making decisions: If you have the transactional data of the finance company and need to build a model to determine the future trend then machine learning algorithms comes into the picture. This machine learning comes under supervised learning. It is so-called supervised machine learning because you have data where you can train your machines. For instance, a fraud detection model can be trained using the historical data of the fraudulent purchases. Who is a Data Scientist? Data scientists can be defined in multiple ways. One of them is as follows: A Data Scientist is the one who practices and implements the Data Science art. 
Data Scientist roles combine computer science, statistics, and mathematics. They analyze the process as well as model the data and interpret the results to create actionable plans for companies and other organizations. Data Scientist were the analytical experts who utilize the skills in both technology and social science to find trends as well as manage the data. What Does Data Scientist do? A Data Scientist work typically involves making a sense of messy, unstructured data from various sources like smart devices, social media feeds, and emails that don’t fit into the databases. A data scientist usually cracks complex problems with their strong enterprises in certain disciplines. A Data scientist usually works with several elements related to mathematics, statistics, computer science, and so on. Besides, these people use a lot of tools and technologies in finding solutions and reaching solutions that were crucial for organization growth and development. Data Scientist presents the data in much useful form when compared to the raw data available to them from both structured as well as unstructured form. Life Cycle of data science: The life cycle of data science involves various activities as follows: a)Discovery: Before beginning your project, it is important to understand various specifications, requirements, priorities, and required budget. Here you should assess yourself whether you have the required resources present in terms of people, technology, time, and data to support the project. Moreover, here you need to frame the business problem and formulate an initial hypothesis to test. b)Data preparation: In this phase, you require a sandbox, where you can perform the analytics for the entire project. Besides, you need to explore, pre process, and condition data before modeling, Besides, you will perform ETL(Extract, Transform, Load) to get data into the sandbox. c)Model planning: Here, in this phase, you will determine various methods and techniques to draw the relationships between the variables. These relationships will set the base for the algorithms which will be implemented in the next phase. Here you will apply exploratory data analysis using statistical formulas and visualization tools. d)Model Building: In this phase, you will develop the data sets for training as well as testing purposes. Moreover, you will be checking whether your existing environment suits get for running the models. Besides, you will also analyze various learning techniques like classification, association, and clustering to build the model. e)Operationalize: In this phase, you will deliver final reports, code, and other technical documents. Besides, in some cases, a demo project is also implemented in a real-time project. So with this demo project, you will be getting an idea of the project outcome and also the probable loopholes of the project. f)Results Communication: We can consider this phase also as an verification phase. Here in this phase, you will be evaluating your project success. i.e checking your goals whether they meet the project requirement (or), not? that was expected in the first phase. Besides, in this phase, you will be also thinking of various findings, communication to the stakeholders and determines the outcome of the project based on the criteria developed in the first phase. Hence with this, the project of the life cycle of the data science goes on. You people can get the practical working of this data science cycle at the Data Science Online Course. 
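As a small, concrete illustration of the model-building phase described above (splitting data into training and testing sets and fitting a classifier), here is a sketch using scikit-learn on its built-in breast cancer dataset. It is a generic supervised-learning example under the assumption that scikit-learn is installed, not a prescribed part of the life cycle.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small labeled dataset (features X, target y)
X, y = load_breast_cancer(return_X_y=True)

# Model building: hold out a test set so the model is evaluated on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Train a classifier on the training split
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluate: check how well the predictions generalize before operationalizing
predictions = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, predictions))
```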
With this, I hope you have got a good overview of data science and its life cycle. In the upcoming articles, I will share the applications of data science in various fields with practical use cases. Meanwhile, have a glance at our Data Science Interview Questions and get placed in your dream company.
What is Hadoop?
In the previous articles of this blog, we people have seen the need and importance of big data and its application in the IT industry. But there are some problems related to big data. Hence to overcome those problems, we need a framework like Hadoop to process the big data. This article on Hadoop gives you detailed information regarding the problems of big data and how this framework provides the solution to bigdata. Let us discuss all those one by one in detail Importance of Big data: Big data is emerging as an opportunity for many organizations. Through big data, analysts today can get the hidden insights of data, unknown correlations, market trends, customer preferences, and other useful business information. Moreover, these big analytics helps organizations in making effective marketing, new revenue opportunities, better customer service. Even though this bigdata has excellent opportunities, there are some problems. Let us have a look at Problems with Big data: The main issue of big data is heterogeneous data. It means the data gets generated in multiple formats from multiple sources. i.e data gets generated in various formats like structured, semi-structured, and unstructured. RDBMS mainly focuses on structured data like baking transactions, operation data, and so on. Since we cannot expect the data to be in a structured format, we need a tool to process this unstructured data. And there are ‘n’ number of problems with big data. Let us discuss some of them. a)Storage: Storing this huge data in the traditional databases is not practically possible. Moreover, in traditional databases, stores will be limited to one system where the data is increasing at a tremendous rate. b)Data gets generated in heterogeneous amounts: In traditional databases, data is presented in huge amounts. Moreover, data gets generated in multiple formats. This may be structured, semi-structured, and unstructured. So you need to make sure that you have a system that is capable of storing all varieties of data generated from various sources. c)processing speed: This is a major drawback of leaving the traditional databases. i.e accessibility rate is not proportional to the disk storage. So w.r.t to data increment, access rate is not increasing. Moreover, since all formats of data present at a single place, the accessibility rate will be inversely proportional to data increment. Then Hadoop came into existence to process the unstructured data like text, audios, videos, etc. But before going to know about this framework, let us have an initially have a look at the evolution Evolution: The evolution of the Hadoop framework has gone through various stages in various years as follows: a)2003- Douge cutting launches project named nutch to handle billions of searches and indexes millions of web pages. Later in this year, Google launches white papers with Google File Systems(GFS) b)2004 – In December, Google releases the white paper with Map Reduce c)2005 - Nutch uses GFS and Map Reduce to perform operations d)2006 - Yahoo created Hadoop based on GFS and Map Reduce with Doug cutting and team. e)2007 - Yahoo started using Hadoop on a 1000 node cluster. f)2008 - yahoo released Hadoop as an open-source project to an apache software foundation. Later in July 2008, apache tested a 4000 node with Hadoop successfully g)2009 – Hadoop successfully stored a petabyte of data in less than 17 hrs to handle billions of searches and index millions of webpages. 
From then on, it has been releasing various versions to handle billions of web pages. So far we have discussed the evolution; now let us move into the actual concept.
What is Hadoop?
Hadoop is a framework to store big data and process it in parallel in a distributed environment. This framework is capable of storing data and running applications on clusters of commodity hardware. It was written in Java and works on batch processing. Besides, this framework provides massive storage for any kind of data along with enormous computing power, and it can handle a virtually limitless number of tasks (or) jobs. It efficiently stores and processes large datasets ranging from gigabytes to petabytes. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive data sets in parallel, more quickly. Here the data is stored on inexpensive commodity servers that run as a cluster, and its distributed file system enables concurrent processing and fault tolerance. This framework uses the MapReduce programming model for faster data storage and retrieval from its nodes. Today many applications generate big data to be processed, and Hadoop plays a significant role in providing a much-needed makeover to the database world. Get more information on big data from live experts at Hadoop Online Training.
This framework has four modules: HDFS, YARN, MapReduce, and Hadoop Common (the shared Java libraries and utilities used by the other modules). The first three are described below.
HDFS – This stands for Hadoop Distributed File System. It allows you to store data of various formats across the cluster. This component creates an abstraction: like virtualization, you can see HDFS as a single logical unit for storing big data. HDFS uses a master-slave architecture, where the Name Node is the master node and the Data Nodes are the slave nodes. The Name Node contains metadata about the data stored in the Data Nodes, such as which data block is stored on which Data Node; the actual data is stored in the Data Nodes. Moreover, this framework has a default replication factor of 3, so even though it runs on commodity hardware, if one of the Data Nodes fails, HDFS still has copies of the lost data blocks. You can also configure the replication factor based on your requirements.
YARN: YARN stands for Yet Another Resource Negotiator. It is Hadoop's resource-management layer and acts like an operating system for the cluster, sitting on top of HDFS. It is responsible for managing the cluster resources to make sure you don't overload one machine, and it performs all your processing activities by allocating resources and scheduling tasks. It has two major components, the Resource Manager and the Node Manager. The Resource Manager is again a master node: it receives processing requests and passes parts of each request to the corresponding Node Managers, where the actual processing takes place. A Node Manager is installed on every Data Node; it is responsible for executing tasks on that node and monitors its resource usage. Processing runs inside containers, which bundle physical resources such as RAM, CPU, and disk.
Map Reduce: It is a framework that helps Java programs do parallel computation on data using key-value pairs. The Map takes the input data and converts it into a data set that can be computed as key-value pairs; the output of the Map is consumed by the Reducer, which then gives the desired result. So in the MapReduce approach, the processing is done at the slave nodes and the final result is sent to the master node. Moreover, the code that processes the data is shipped to the data, and this code is tiny (a few kilobytes) compared to the actual data. The input is divided into small groups of data called data chunks. A small sketch of this key-value flow is shown below.
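To make the key-value idea concrete, here is a minimal word-count sketch in the style of a Hadoop Streaming job, where the mapper and reducer are plain scripts. This is only an illustration of the programming model, not the Java MapReduce API itself; the sample input is invented for the example.

```python
# Minimal MapReduce-style word count (illustration of the key-value model only).
from collections import defaultdict

def mapper(lines):
    """Map step: emit a (word, 1) key-value pair for every word in the input."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce step: sum the values for each key (word) to get its total count."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

if __name__ == "__main__":
    sample = ["Hadoop stores big data", "Hadoop processes big data in parallel"]
    # On a real cluster the map output is shuffled and sorted by key between the two steps,
    # and mappers/reducers run in parallel on different nodes.
    print(reducer(mapper(sample)))
```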
Likewise, each component of this framework has its own function in processing big data. You can see the practical working of this framework, with practical use cases from live experts, at Hadoop Online Course.
Final Words:
By reaching the end of this blog, I hope you have got a good idea of Hadoop and its application in the IT industry. In the upcoming posts of this blog, I'll be sharing the details of the Hadoop architecture and its working. Meanwhile, have a look at our Hadoop Interview Questions and get placed in a reputed firm.
Continue reading
What is Tableau?
Tableau is one of the fastest-growing data visualization tools currently in use in the BI industry. This business intelligence tool is best for transforming raw data into an easily understandable format, and people with zero technical skills and no coding knowledge can easily analyze data with it. This article starts with data visualization and the importance of Tableau as a data visualization tool. Are you looking for the same? Then this article is for you! Without wasting much time, let's start our discussion.
What is Data Visualization?
Data visualization is the art of representing data in a manner that even a non-analyst can understand. Elements like colours, labels, and dimensions can create masterpieces, and the resulting business insights help people make informed decisions. Data visualization is an important part of business analytics: as data from various sources is surfaced, business managers at all levels can analyze trends visually and take quick decisions. Among the multiple data visualization tools available in the market today, Tableau is one of the best business intelligence (BI) and data visualization tools.
What is Tableau?
Tableau is one of the fastest-growing business intelligence (BI) and data visualization tools. It is very fast to deploy, easy to learn, and very intuitive for the customer. Any data analyst who works with Tableau helps people understand their data, and Tableau is widely used because data can be analyzed easily; all the visualizations are organized as dashboards and worksheets. Tableau Online allows one to create dashboards that provide actionable insights and drive the business forward. Tableau business intelligence products also operate well in virtualized environments when they are configured with the proper underlying system and hardware. It is used to explore data with limitless visual analytics: the Tableau reporting tool helps convert your textual and numerical information into beautiful visualizations through interactive dashboards. It is popular, fast, interactive, dynamic, and has a huge fan base in both the public and enterprise worlds. Moreover, it has effective documentation for each issue, with steps to solve it. Would you like to know about practical analysis using Tableau business intelligence? Then visit Tableau Online Training.
Why Tableau?
Tableau is software that helps people understand the patterns in their data and gives those patterns a visual representation. Tableau analysts need to understand the patterns, derive meaningful insights, use statistics to represent the data, and clarify the findings to business people who do not have technical knowledge. Tableau analytics helps non-technical people understand the data and make data-driven decisions in their organizations. Since people can analyze data far more quickly this way than by digging through reports, Tableau suits business analysis very well. Many analysts say that Tableau is the best tool for business analysis and stands as one of the most popular data visualization tools in the industry. Moreover, when comparing Tableau pricing with other business intelligence tools, Tableau costs less.
What are the products of Tableau?
This Tableau business intelligence suite has the following products, categorized into:
a) Visualization development products: Tableau Desktop and Tableau Public
b) Visualization publishing products: Tableau Server, Tableau Reader, and Tableau Online
Let us have a look at all of them:
a) Tableau Desktop: It allows users to create, format, and integrate various interactive views and dashboards using a rich set of primitives. It supports live, up-to-date data analysis by querying data residing in various native and live-connected databases. The created visualizations are then published by sharing a Tableau packaged workbook, which has the extension .twbx. A packaged workbook comprises: 1) a Tableau workbook, a file with the extension .twb, which is an XML document describing the visualization templates, and 2) the supporting files, such as images and local data extracts.
b) Tableau Server: It is a secure, reliable, and well-governed enterprise-level environment to share and publish the visualizations built with Tableau Desktop. The server acts as a central repository for the various data sources and data engines and holds the access-privilege details across the firm.
c) Tableau Public: It is a cloud-hosted free version with some usage limitations. It has two public products, namely Tableau Public Desktop and Tableau Public Server. The limitations of Tableau Public are: 1) it supports only locally available data extracts, 2) it allows an input of up to one million rows, and 3) unlike Tableau Desktop, users cannot save a report locally; they are restricted to saving the workbook on the Tableau Public server, where it is accessible to all its users.
d) Tableau Online: It is a cloud-hosted sharing platform. It can connect to cloud databases like Amazon Redshift, Google BigQuery, etc. It refreshes extracts and maintains live connections with on-premises data stores using Tableau Bridge. Unlike Tableau Server, editing of workbooks and visualizations needs a data server connection, and these operations are limited by a maximum bound on row count.
e) Tableau Reader: It is a desktop application that allows users to perform view interactions like drill-down and roll-up of OLAP cubes, but it cannot edit the embedded content in published visualizations built in Tableau Desktop.
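Publishing to Tableau Server or Tableau Online does not have to be manual; it can also be scripted over the REST API. The sketch below uses the tableauserverclient Python library purely as an illustration and is not taken from the article itself; the server URL, credentials, site name, project id, and file name are all placeholders.

```python
# Hypothetical sketch: publishing a packaged workbook (.twbx) to Tableau Server / Online.
import tableauserverclient as TSC

tableau_auth = TSC.TableauAuth("analyst", "secret-password", site_id="marketing")
server = TSC.Server("https://tableau.example.com", use_server_version=True)

with server.auth.sign_in(tableau_auth):
    # List the workbooks already published to this site.
    workbooks, _pagination = server.workbooks.get()
    for wb in workbooks:
        print(wb.name)

    # Publish a .twbx into a project, overwriting any existing copy with the same name.
    item = TSC.WorkbookItem(project_id="my-project-id")
    server.workbooks.publish(item, "sales_dashboard.twbx",
                             mode=TSC.Server.PublishMode.Overwrite)
```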
Advantages of Tableau:
Using Tableau for reporting has the following advantages:
Fantastic visualization: You can work with a lot of data that has no particular order and still create visualizations, with the option of switching between different visualization types. Moreover, it is capable of exploring the data at a very granular level.
In-depth analysis: It helps enterprises analyze data without any specific goal in mind. You can explore various visualizations, look at the same data from different angles, frame "what if" queries, and work with the data by hypothetically visualizing it differently and adding components for comparison and analysis.
User-friendly approach: This is the greatest strength of Tableau. It is built from scratch for people who do not have coding experience, so everyone can use it without any prior experience; since most of it is drag and drop, each visualization is intuitive.
Working with disparate data sources: Tableau is a powerful choice for organizations where the data comes from disparate sources. It can connect to various data sources, data warehouses, and files, including those that exist in the cloud, and it can blend all of these kinds of sources in its visualizations to help the organization grow.
Adding a data set: Whether it is a database or an Excel workbook, Tableau can add new data sets and blend them with existing ones using common fields.
Likewise, there are many advantages of this Tableau visualization tool. By reaching the end of this blog, I hope you have got enough knowledge of Tableau regarding its need and application in the IT industry. You can get practical knowledge of Tableau visualization at Tableau Online Course. Also, check our latest Tableau Interview Questions and get ready for the interview. In the upcoming articles of this blog, I'll be sharing the details of the various Tableau products and their applications in the real world.
Continue reading
Ab Initio Interview Questions
Q. What is a surrogate key?
Answer: A surrogate key is a system-generated sequential number which acts as a primary key.
Q. What are the differences between Ab Initio and Informatica?
Answer: Informatica and Ab Initio both support parallelism, but Informatica supports only one type of parallelism while Ab Initio supports three: component parallelism, data parallelism, and pipeline parallelism. Ab Initio has no built-in scheduler like Informatica; you need to schedule graphs through scripts or run them manually. Ab Initio supports different types of text files, meaning you can read the same file with different structures, which is not possible in Informatica, and Ab Initio is also more user friendly. Informatica is an engine-based ETL tool: its power is in its transformation engine, and the code it generates after development cannot be seen or modified. Ab Initio is a code-based ETL tool: it generates ksh or bat code, which can be modified to achieve goals that cannot be handled through the ETL tool itself. The initial ramp-up time with Ab Initio is quick compared to Informatica; when it comes to standardization and tuning, both probably fall into the same bucket. Ab Initio doesn't need a dedicated administrator — a UNIX or NT admin will suffice — whereas Informatica needs a dedicated administrator. With Ab Initio you can read data with multiple delimiters in a given record, whereas Informatica forces all fields to be delimited by one standard delimiter. Error handling: in Ab Initio you can attach error and reject files to each transformation and capture and analyze the messages and data separately; Informatica has one huge log, which is very inefficient when working on a large process with numerous points of failure.
Q. What is the difference between rollup and scan?
Answer: Rollup produces one summary record per group; it cannot generate cumulative (running) summary records — for that we use scan.
Q. Why do we go for Ab Initio?
Answer: Ab Initio is designed to support the largest and most complex business applications. We can easily develop applications for business requirements using the GDE, data processing is very fast and efficient compared to other ETL tools, and it is available on both Windows NT and UNIX.
Q. What is the difference between partitioning with key and round robin?
Answer: Partition by key: here we have to specify the key on which the partition will occur. Since it is key based, it results in well-balanced data when the key values are well distributed, and it is useful for key-dependent parallelism. Partition by round robin: here the records are partitioned in a sequential way, distributing data evenly in blocksize chunks across the output partitions. It is not key based and results in well-balanced data, especially with a blocksize of 1; it is useful for record-independent parallelism.
Q. How do you create a surrogate key using Ab Initio?
Answer: A key is a field or set of fields that uniquely identifies a record in a file or table. A natural key is a key that is meaningful in some business or real-world sense — for example, a social security number for a person, or a serial number for a piece of equipment. A surrogate key is a field that is added to a record, either to replace the natural key or in addition to it, and has no business meaning. Surrogate keys are frequently added to records when populating a data warehouse, to help isolate the records in the warehouse from changes to the natural keys by outside processes.
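In Ab Initio a surrogate key is commonly generated with the next_in_sequence() DML function (offset by the partition number when the graph runs in parallel). The snippet below is only a language-neutral sketch of that idea in Python; the field names and the starting value are invented for illustration.

```python
# Illustrative sketch: attach a system-generated sequential surrogate key to each record,
# continuing from the highest key already used in the warehouse (hard-coded here).
LAST_USED_KEY = 1000

def assign_surrogate_keys(records, start=LAST_USED_KEY):
    out = []
    for offset, rec in enumerate(records, start=1):
        rec = dict(rec)                       # leave the caller's record untouched
        rec["customer_sk"] = start + offset   # surrogate key: no business meaning
        out.append(rec)
    return out

incoming = [
    {"ssn": "123-45-6789", "name": "A"},      # natural key: social security number
    {"ssn": "987-65-4321", "name": "B"},
]
for row in assign_surrogate_keys(incoming):
    print(row)
```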
Q. What are the most commonly used components in an Ab Initio graph?
Answer: Input File / Output File, Input Table / Output Table, Lookup / Lookup Local, Reformat, Gather / Concatenate, Join, Run SQL, Join with DB, compression components, Filter by Expression, Sort (single or multiple keys), Rollup, and Partition by Expression / Partition by Key.
Q. How do we handle a DML that changes dynamically?
Answer: There are many ways to handle DMLs that change dynamically within a single file. Some of the suitable methods are to use a conditional DML or to call the vector functionality while calling the DMLs; we can also use the MULTIREFORMAT component to handle dynamically changing DMLs.
Q. What is meant by limit and ramp in Ab Initio? In which situations are they used?
Answer: Limit and ramp are the variables used to set the reject tolerance for a particular graph. They are one of the options for the reject-threshold property, and both values must be supplied when this option is enabled. The graph stops execution when the number of rejected records exceeds limit + (ramp * number_of_records_processed). The default value is 0.0. The limit parameter is an integer representing an absolute number of reject events, while the ramp parameter is a real number representing a rate of reject events per record processed. Typical limit and ramp settings:
Limit = 0, Ramp = 0.0 – abort on any error
Limit = 50, Ramp = 0.0 – abort after 50 errors
Limit = 1, Ramp = 0.01 – abort if more than about 2 in 100 records cause errors
Limit = 1, Ramp = 1 – never abort
(A short worked example of this formula appears after this group of questions.)
Q. What are data mapping and data modeling?
Answer: Data mapping deals with the transformation of the extracted data at the field level, i.e. the transformation of a source field to a target field is specified by the mapping defined on the target field. The data mapping is specified during the cleansing of the data to be loaded. For example:
source: string(35) name = "Siva Krishna ";
target: string("01") nm = NULL(""); /* maximum length is string(35) */
Then we can have a mapping like: straight move, trim the leading or trailing spaces. The above mapping specifies the transformation of the field nm.
Q. What is the difference between a DB config (.dbc) and a CFG file?
Answer: A .dbc file has the information required for Ab Initio to connect to the database to extract or load tables or views, while a .cfg file is the table configuration file created by db_config when using components like Load DB Table.
Q. What is meant by layout?
Answer: A layout is a list of host and directory locations, usually given by the URL of a file or multifile. If a layout has multiple locations but is not a multifile, it is a list of URLs called a custom layout. A program component's layout is the list of hosts and directories in which the component runs; a dataset component's layout is the list of hosts and directories in which the data resides. Layouts are set on the Properties > Layout tab. The layout defines the level of parallelism, and parallelism is achieved by partitioning data and computation across processors.
Q. What are Cartesian joins?
Answer: A Cartesian join gives you a Cartesian product: you join every row of one table to every row of another table. You can also get one by joining every row of a table to every row of itself.
Q. What function would you use to transfer a string into a decimal?
Answer: For converting a string to a decimal we need to typecast it using the following syntax:
out.decimal_field :: (decimal(size_of_decimal)) string_field;
The above statement converts the string to a decimal and populates the decimal field in the output.
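As promised above, here is a small worked sketch of the reject-threshold formula limit + (ramp * records_processed). It is plain Python for illustration only; Ab Initio evaluates this internally, and the function name here is invented.

```python
# Illustrative sketch (not Ab Initio code): when a graph aborts under the
# "Use limit/ramp" reject-threshold, per the formula  limit + ramp * records_processed.

def should_abort(rejects, records_processed, limit, ramp):
    """Return True when the reject count exceeds the tolerated threshold."""
    threshold = limit + ramp * records_processed
    return rejects > threshold

# Limit = 50, Ramp = 0.0  ->  abort after 50 errors, regardless of how many records were read.
print(should_abort(rejects=51, records_processed=1_000_000, limit=50, ramp=0.0))   # True

# Limit = 1, Ramp = 0.01  ->  tolerance grows by one reject for every 100 records processed.
print(should_abort(rejects=2, records_processed=100, limit=1, ramp=0.01))          # False (2 is not > 2.0)
print(should_abort(rejects=3, records_processed=100, limit=1, ramp=0.01))          # True
```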
Q. Explain the differences between API and utility mode.
Answer: API and UTILITY are the two possible interfaces for connecting to databases to perform certain user-specific tasks. These interfaces allow the user to access certain functions (provided by the database vendor) to perform operations on the databases, and the exact functionality of each interface depends on the database. API mode has more flexibility but is often considered slower than UTILITY mode, so the trade-off is between performance and usage. Contact us for Ab Initio training.
Q. What are the uses of the is_valid and is_defined functions?
Answer: is_valid and is_defined are predefined functions.
is_valid(): tests whether a value is valid. It returns 1 if expr is a valid data item and 0 otherwise (for example, when the expression evaluates to NULL). If expr is a record type that has field validity checking functions, is_valid calls each field validity checking function and returns 0 if any of them returns 0 or NULL. Examples:
is_valid(1) → 1
is_valid("oao") → 1
is_valid((decimal(8))"1,000") → 0
is_valid((date("YYYYMMDD"))"19960504") → 1
is_valid((date("YYYYMMDD"))"abcdefgh") → 0
is_valid((date("YYYY MMM DD"))"1996 May 04") → 1
is_valid((date("YYYY MMM DD"))"1996*May&04") → 0
is_defined(): tests whether an expression is not NULL. It returns 1 if expr evaluates to a non-NULL value and 0 otherwise. The inverse of is_defined is is_null.
Q. What is meant by merge join and hash join? Where are they used in Ab Initio?
Answer: The command-line syntax for the Join component consists of two commands. The first one calls the component and is one of: mp merge join, to process sorted input, or mp hash join, to process unsorted input.
Q. What is the difference between a sandbox and the EME? Can we perform check-in and check-out through the sandbox? Can anybody explain check-in and check-out?
Answer: Sandboxes are work areas used to develop, test, or run code associated with a given project. Only one version of the code can be held within the sandbox at any time, whereas the EME datastore contains all versions of the code that have been checked into it. A particular sandbox is associated with only one project, whereas a project can be checked out to a number of sandboxes.
Q. What are graph parameters?
Answer: Graph parameters are the parameters added to a particular graph. You can add them by selecting Edit > Parameters from the menu. As an example, if you want to run the same graph for n number of files in a directory, you can assign a graph parameter to the input file name and supply the parameter value from the script before invoking the graph.
Q. How do you schedule graphs in Ab Initio, like workflow scheduling in Informatica?
And where must we use Unix shell scripting in Ab Initio?
Q. How do you improve the performance of graphs in Ab Initio? Give some examples or tips.
Answer: There are many ways to improve the performance of graphs in Ab Initio. Here are a few points:
Use a multifile system (MFS) and go parallel as soon as possible using Ab Initio partitioning techniques, for example partition by round robin.
If needed, use Lookup Local rather than Lookup when there is a large amount of data.
Take out unnecessary components like Filter by Expression; instead apply the conditions inside Reformat, Join, or Rollup.
Use Gather instead of Concatenate.
Tune MAX_CORE for optimal performance and try to avoid too many phases.
Once data is partitioned, do not bring it back to serial and then to parallel again; repartition instead. For small processing jobs, serial may be better than parallel.
Do not access large files across NFS; use the FTP component instead.
Use an ad hoc MFS to read many serial files in parallel and then use the Concatenate component.
Phase breaks let you allocate more memory to individual components and make the graph run faster; use a checkpoint after a Sort rather than landing the data on disk.
Use the in-memory feature of Join and Rollup. The best performance is gained when components can work in memory within MAX_CORE. MAX_CORE for Sort is calculated from the size of the input data file; for an in-memory Join the memory needed equals the non-driving data size plus overhead. If an in-memory Join cannot fit its non-driving inputs in the provided MAX_CORE, it drops all its inputs to disk and running in memory no longer makes sense.
Use Rollup and Filter by Expression as early as possible to reduce the number of records.
When joining a very small dataset to a very large dataset, it is more efficient to broadcast the small dataset to the MFS using the Broadcast component, or to use the small file as a lookup. In general, use lookups instead of joins when joining a small table to a large table.
Use an MFS with round-robin partitioning or load balancing if you are not joining or rolling up.
Filter the data, and take out unwanted fields and records, as early as possible in the graph.
Replace older components with newer ones, for example use Join instead of match merge.
Use phasing if you have too many components, and avoid sorting smaller datasets by using in-memory joins.
Use an Ab Initio layout instead of the database default to achieve parallel loads.
Change the AB_REPORT parameter to increase the monitoring duration, and use catalogs for reusability.
Use Sort after a partition component rather than before; partition the data as early as possible and departition it as late as possible.
Try to avoid the usage of Join with DB components.
Q. How does the force_error function work? If we set 'never abort' in a Reformat, will force_error stop the graph or will it continue to process the next set of records?
Answer: force_error, as the name suggests, forces an error when a stated condition is not met, and the function can be used as per the requirement. If you want to stop execution of the graph when a specific condition fails — say you have to reconcile the input and output record counts and the graph should fail if they differ — then set the reject-threshold to 'Abort on first reject' so that the graph stops.
Note: force_error directs all the records meeting the condition to the reject port, with the error message going to the error port.
In certain special circumstances you can also treat the reject port as an additional data flow path leaving the component. When using force_error to direct valid records to the reject port for separate processing, you must remember that invalid records will also be sent there.
Q. What are the most commonly used components in an Ab Initio graph? Can anybody give a practical example of a transformation of data — say, customer data in a credit card company — into meaningful output based on business rules?
Answer: The most commonly used components in any Ab Initio project are Input File / Output File, Input Table / Output Table, Lookup File, Reformat, Gather, Join, Run SQL, Join with DB, compression components, Sort, Trash, Partition by Expression, Partition by Key, and Concatenate.
Q. How do you work with parameterized graphs?
Answer: One of the main purposes of parameterized graphs is that if we need to run the same graph for n number of files, we set up graph parameters like $INPUT_FILE and $OUTPUT_FILE and supply the values for these in Edit > Parameters; the parameters are substituted at run time. We can set different types of parameters, such as positional, keyword, and local. The idea is that instead of maintaining different versions of the same graph, we maintain one version for different files.
Q. What is the use of the unused port in the Join component?
Answer: While joining two input flows, records which match the join condition go to the output port, and the records which do not meet the join condition can be collected at the unused ports.
Q. What is meant by a dedup sort with a null key?
Answer: If we don't use any key in the Sort component while using dedup sort, then the output depends on the keep parameter. The whole input is treated as one group: first – only the first record is kept; last – only the last record is kept; unique_only – there will be no records in the output file.
Q. Can anyone tell me what happens when a graph runs? The Co>Operating System is on the host and we run the graph from somewhere else — how does the Co>Operating System interact with the native OS?
Answer: The Co>Operating System is layered on top of the native OS. When a graph is executed, it has to be deployed with host settings and a connection method like rexec, telnet, rsh, or rlogin; this is how the graph interacts with the Co>Op. Whenever you press the Run button in the GDE, the GDE generates a script, and the generated script is transferred to the host specified in your GDE run settings. The Co>Operating System then interprets this script and executes it, on different machines if required, as sub-processes (threads). After completion, each sub-process returns a status code to the main process, and the main process in turn returns the error or success code of the job to the GDE.
Q. What is the difference between conventional loading and direct loading? When is each used in real time?
Answer: Conventional load: before loading the data, all the table constraints are checked against the data. Direct load (faster loading): all the constraints are disabled and the data is loaded directly; later the data is checked against the table constraints and the bad data is not indexed. API mode corresponds to conventional loading and utility mode to direct loading.
Q. Explain environment variables with an example.
Answer: Environment variables serve as global variables in the UNIX environment. They are used for passing values from one shell or process to another.
They are inherited by Ab Initio as sandbox variables/graph parameters such as AI_SORT_MAX_CORE, AI_HOME, AI_SERIAL, AI_MFS, etc. To see which variables exist in your UNIX shell, work out the naming convention and list the environment with a command such as env, filtering it with grep. You can then refer to the graph parameters/components to see how these variables are used inside Ab Initio.
Q. How do you find the number of arguments passed to a graph?
Answer: $* is the list of shell arguments, $# is the number of positional parameters, and $? is the exit status of the last executed command.
Q. How many inputs does the Join component support?
Answer: Join supports a maximum of 60 inputs and a minimum of 2 inputs.
Q. What is max-core? Which components use MAX_CORE?
Answer: The MAX_CORE parameter determines the maximum amount of memory, in bytes, that a specified component will use. If the component is running in parallel, the value of MAX_CORE represents the maximum memory usage per partition. If MAX_CORE is set too low, the component will run slower than expected; too high, and the component will use too many machine resources and slow things down dramatically. The max-core parameter can be defined in the following components: in-memory Scan, in-memory Rollup, in-memory Join, and Sort. Whenever these components are used with the parameter set to "In memory: inputs need not be sorted", a max-core value must be specified.
Q. What does dependency analysis mean in Ab Initio?
Answer: Dependency analysis analyses a project for the dependencies within and between graphs. The EME examines the project and develops a survey tracing how data is transformed and transferred, field by field, from component to component. Dependency analysis has two basic steps: translation and analysis.
Analysis level: in the check-in wizard's advanced options, the analysis level can be specified as one of the following:
None: no dependency analysis is performed during the check-in.
Translation only: the graph being checked in is translated to the datastore format but no error checking is done. This is the minimum requirement during check-in.
Translation with checking (default): along with the translation, errors which would interfere with dependency analysis are checked for. These include absolute paths, undefined parameters, DML syntax errors, parameter references to objects that cannot be resolved, and wrong substitution syntax in parameter definitions.
Full dependency analysis: full dependency analysis is done during check-in. It is not recommended, as it takes a long time and can in turn delay the check-in process.
What to analyse:
All files: analyse all files in the project.
All unanalysed files: analyse all files that have changed, or which depend on or are required by files that have changed, since the last time they were analysed.
Only my checked-in files: all files checked in by you are analysed if they have not been before.
Only the file specified: apply analysis to the specified file only.
Q. What is the difference between a .dbc and a .cfg file?
Answer: A .cfg file is for a remote connection and a .dbc file is for connecting to the database. A .cfg file contains the name of the remote machine, the username/password to be used while connecting to the database, the location of the operating system on the remote machine, and the connection method. A .dbc file contains the database name, the database version, the user id/password, the database character set, and some more details.
Q. What are graph parameters?
Answer: There are two types of graph parameters in Ab Initio: 1. local parameters and 2. formal parameters (parameters supplied at run time).
Q. How many types of joins are there in Ab Initio?
Answer: Join is based on a match key for its inputs, and the Join component has an out port, unused ports, reject ports, and a log port.
Inner joins: The most common case is when join-type is Inner Join. In this case, if each input port contains a record with the same value for the key fields, the transform function is called and an output record is produced. If some of the input flows have more than one record with that key value, the transform function is called multiple times, once for each possible combination of records, taken one from each input port. Whenever a particular key value does not have a matching record on every input port and Inner Join is specified, the transform function is not called and all incoming records with that key value are sent to the unused ports.
Full outer joins: Another common case is when join-type is Full Outer Join. If each input port has a record with a matching key value, Join does the same thing as it does for an inner join. If some input ports do not have records with matching key values, Join applies the transform function anyway, with NULL substituted for the missing records, which are in effect ignored. With an outer join, the transform function typically requires additional rules (compared to an inner join) to handle the possibility of NULL inputs.
Explicit joins: The final case is when join-type is Explicit. This setting allows you to specify True or False for the record-required-n parameter of each in-n port; the settings you choose determine when Join calls the transform function. (In the product documentation this is illustrated with Venn-style diagrams of the key values on the in0 and in1 ports: the shaded overlap represents the inputs for which Join calls the transform, and records with key values outside it go to the unused ports. A small sketch contrasting inner and full outer behaviour appears after this group of questions.)
Q. What is a semi-join?
Answer: A left semi-join on two input files connected to ports in0 and in1 is an inner join where the dedup0 parameter is set to 'Do not dedup this input' but dedup1 is set to 'Dedup this input before joining', so duplicates are removed only from the in1 port, that is, from input file 2. Semi-joins can be achieved by using the Join component with the Join Type parameter set to Explicit and the record-required0 / record-required1 parameters set one to true and the other to false, depending on whether you require left-outer or right-outer behaviour. So in Ab Initio there are three kinds of join: inner join, outer join, and semi-join. For an inner join the record-required-n parameter is true for all in ports; for an outer join it is false for all in ports; for a semi-join you set record-required-n to true for the required input and false for the others.
Q. How do we run sequences of jobs — for example, if the output of job A is the input to job B, how do we coordinate the jobs?
Answer: By writing wrapper scripts we can control the sequence of execution of more than one job.
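As mentioned in the join-type question above, here is a small illustration-only Python sketch of how inner and full outer joins treat unmatched keys: matched combinations go to the output, and under an inner join the unmatched records go to an "unused" list. It is not Ab Initio code; the record layout is invented, and each side is simplified to one record per key.

```python
# Sketch of inner vs. full outer join semantics over two keyed inputs.
def join(in0, in1, join_type="inner"):
    out, unused = [], []
    for k in set(in0) | set(in1):
        r0, r1 = in0.get(k), in1.get(k)
        if r0 is not None and r1 is not None:
            out.append((k, r0, r1))              # both sides matched: call the "transform"
        elif join_type == "full_outer":
            out.append((k, r0, r1))              # missing side substituted with None (NULL)
        else:
            unused.append((k, r0 if r0 is not None else r1))  # inner join: send to unused
    return out, unused

in0 = {1: "A", 2: "B"}            # key -> record on port in0
in1 = {2: "X", 3: "Y"}            # key -> record on port in1
print(join(in0, in1, "inner"))        # e.g. ([(2, 'B', 'X')], [(1, 'A'), (3, 'Y')]) — order may vary
print(join(in0, in1, "full_outer"))   # all three keys appear in the output, with None for the gaps
```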
Q. How would you do performance tuning for an already built graph? Can you give some examples?
Answer: A few examples: 1) if a Sort is used in front of a Merge component, the Sort is of no use, because the sort is built into Merge; 2) use a lookup instead of a Join or Merge component where possible; 3) if we want to join the data coming from two files and we don't want duplicates, we can use the union function instead of adding an additional duplicate-remover component.
Q. What is the relation between the EME, the GDE, and the Co>Operating System?
Answer: EME stands for Enterprise Metadata Environment, GDE for Graphical Development Environment, and the Co>Operating System can be thought of as the Ab Initio server. The relationship between them is as follows: the Co>Operating System is the Ab Initio server, installed on a particular OS platform called the native OS. The EME is rather like the repository in Informatica: it holds the metadata, transformations, DB config files, and source and target information. The GDE is the end-user environment where we develop the graphs (the equivalent of mappings in Informatica); the designer uses the GDE, designs the graphs, and saves them to the EME or to a sandbox. The GDE is on the user side, whereas the EME is on the server side.
Q. When do we use dynamic DML?
Answer: Dynamic DML is used when the input metadata can change. For example, input files received at different times for processing may have different DMLs; in that case we can use a flag in the DML — the flag is read first from the input file received, and according to the flag its corresponding DML is used.
Q. Explain the differences between Replicate and Broadcast.
Answer: Replicate takes the records from its input flow and gives a copy of that flow to each component connected to its output port. Broadcast is a partition component that copies each input record to every flow connected to its output port. Consider an example: if the input file contains 4 records and the level of parallelism is 3, Replicate gives 4 records to each component connected to its out port, whereas Broadcast gives 12 records to each component connected to its out port.
Q. How do you truncate a table?
Answer: From Ab Initio, run the Run SQL component with the DDL statement 'truncate table', or use the Truncate Table component in Ab Initio.
Q. How do you get the DML using utilities in UNIX?
Answer: By using the command m_db gendml -table.
Q. Explain the difference between Reformat and Redefine Format.
Answer: Reformat changes the record format by adding or deleting fields in the DML record, so the length of the record can change. Redefine Format copies its input flow to its out port without any transform; it is used to rename the fields in the DML, and the length of the record should not change.
Q. How do you work with parameterized graphs?
Answer: A parameterized graph specifies everything through parameters — data locations for input/output files, DMLs, and so on.
Q. What is the driving port? When do you use it?
Answer: When you set the sorted-input parameter of the Join component to "In memory: input need not be sorted", you can specify the driving port. The driving port is generally used to improve performance in a graph: the driving input is the largest input, and all other inputs are read into memory. For example, suppose the largest input to be joined is on the in1 port; you then specify 1 as the value of the driving parameter, and the component reads all the other inputs to the join — for example in0 and in2 — into memory. The default is 0, which specifies that the driving input is on port in0. Join improves performance by loading all records from all inputs except the driving input into main memory.
The driving port in Join supplies the data that drives the join; that means every record from the driving port is compared against the data from the non-driving ports. We set the driving port to the larger dataset so that the smaller, non-driving data can be kept in main memory to speed up the operation. Contact us for Ab Initio Online Training.
Q. How can we test Ab Initio graphs manually and through automation?
Answer: Running a graph through the GDE is a manual test; running a graph using the deployed script is an automated test.
Q. What is the difference between partitioning with key and round robin?
Answer: Partition by key (hash partition) is a partitioning technique used when the keys are diverse; if one key value is present in large volumes there can be a large data skew, but this method is used most often for parallel data processing. Round-robin partitioning is another technique that distributes the data uniformly across each of the destination partitions; the skew is zero when the number of records is divisible by the number of partitions. A real-life example is how a pack of 52 cards is distributed among 4 players in a round-robin manner.
Q. What are skew and skew measurement?
Answer: Skew is the measure of how unevenly data flows to the partitions. Suppose the input comes from 4 files totalling 1 GB (100 MB + 200 MB + 300 MB + 500 MB). The average per partition is 1000 MB / 4 = 250 MB, and the skew of a partition is (partition size − average size) / largest partition size. For the 100 MB partition that is (100 − 250) / 500 = −0.3; you can calculate the same for the 200, 300, and 500 MB partitions. The closer the skew is to zero, the better balanced the flow; skew is an indirect measure of the health of a graph.
Q. What is the error 'depth not equal'?
Answer: When two components are linked together and their layouts do not match, this problem can occur during compilation of the graph. A solution is to use a partitioning component in between where the layout changes.
Latest Ab Initio Interview Questions (PDF)
Q. Which is faster for processing: fixed-length DMLs or delimited DMLs, and why?
Answer: Fixed length, because for a delimited DML the engine has to check for the delimiter every time, whereas for a fixed-length DML the declared length is used directly.
Q. What kinds of layouts does Ab Initio support?
Answer: Ab Initio supports two kinds of layouts: serial layout and multi layout. In Ab Initio, the layout tells which component should run where and also gives the level of parallelism. For a serial layout the level of parallelism is 1; for a multi layout the level of parallelism depends on the data partition.
Q. How can you run a graph infinitely?
Answer: To run a graph infinitely, the end script of the graph should call the .ksh file of the graph. Thus, if the name of the graph is abc.mp, then in the end script of the graph there should be a call to abc.ksh; the graph will then run infinitely. Alternatively, run the deployed script in a loop.
Q. What are local and formal parameters?
Answer: Both are graph-level parameters, but for a local parameter you need to initialize the value at the time of declaration, whereas a formal parameter need not be initialized — its value is prompted for at the time of running the graph. A local parameter is like a local variable in the C language, whereas a formal parameter is like a command-line argument that we need to pass at run time.
Q. What are BROADCAST and REPLICATE?
Answer: Broadcast can do everything that Replicate does; it can also send a single file to an MFS without splitting it, making multiple copies of the single file across the multifile. Replicate receives a single flow and writes a copy of that flow to each of its output flows; it generates multiple straight flows as output, whereas Broadcast results in a single fan-out flow. Replicate improves component parallelism, whereas Broadcast improves data parallelism. Broadcast takes data from multiple inputs, combines it, and sends it to all the output ports: for example, if you have two incoming flows on a Broadcast component, one with 10 records and the other with 20 records, then every outgoing flow (however many there are) will have 10 + 20 = 30 records. Replicate copies the data of a particular partition to multiple out ports of the component while maintaining partition integrity: for example, if the incoming flow to Replicate has a data parallelism of 2, with one partition having 10 records and the other 20, and there are 3 output flows from Replicate, then each flow will have 2 data partitions with 10 and 20 records respectively.
Q. What is the importance of the EME in Ab Initio?
Answer: The EME is a repository in Ab Initio; it is used for check-in and check-out of graphs and also maintains graph versions.
Q. What is m_dump?
Answer: m_dump is a Co>Operating System command that we use to view data from the command prompt; it prints the data in a formatted way.
Q. What is the syntax of the m_dump command?
Answer: m_dump <record format (DML) file> <data file>.
Q. What are the differences between the different GDE versions (1.10, 1.11, 1.12, 1.13, and 1.15), and between different versions of the Co>Op?
Answer: 1.10 is a non-key version and the rest are key versions; a lot of components were added and revised in the later versions.
Q. How do you run a graph without the GDE?
Answer: In the run directory a graph can be deployed as a .ksh file. This .ksh file can then be run at the command prompt as: ksh <graph-name>.ksh.
Q. What is the difference between a DML expression and an XFR expression?
Answer: A DML expression means the Ab Initio DML is stored or saved in a file; the DML describes the data in terms of expressions that perform simple computations, contains transform functions that control data transformations, and describes the data in terms of keys that specify grouping or non-grouping — in other words, DML expressions are non-embedded record format files. An XFR is simply a non-embedded transform file; the transform function expresses business rules, local variables, statements, and the connections between these elements and the input and output fields.
Q. How does MAX_CORE work?
Answer: MAX_CORE is the temporary memory used, for example, to sort records. It is a value specified per component; whenever a component is executed it will use up to the amount of memory we specified for its execution. In short, MAX_CORE is the maximum memory that can be used by a component in its execution.
Q. What is $mpjret? Where is it used in Ab Initio?
Answer: $mpjret is the return value of the shell command "mp run" that executes an Ab Initio graph; it is generally treated as the graph execution status return value.
Q. What is the latest version available in Ab Initio?
Answer: The latest version of the GDE is 1.15, and of the Co>Operating System, 2.14.
Q. What is meant by the Co>Operating System, and why is it special for Ab Initio?
Answer: The name 'Co>Operating System' itself says a lot — it is not merely an engine or an interpreter. As the name suggests, it is an operating system which co-exists with another operating system. In layman's terms, Ab Initio, unlike other applications, does not just sit as a layer on top of an OS; it has quite a lot of operating-system-level capabilities of its own, such as multifiles and memory management, and in this way it integrates completely with the host OS and works jointly on the available hardware resources. This synergy with the OS optimizes the utilization of the available hardware. Unlike most other ETL tools, it does not simply interpret commands as a layer on top of the OS; that is the major difference from other ETL tools, and it is the reason why Ab Initio is much faster than any other ETL tool — and obviously much more costly as well.
Q. How do you take input data from an Excel sheet?
Answer: There is a Read Excel component that reads the Excel file either from the host or from the local drive; the DML will be a default one. Through the Read Excel component in $AB_HOME we can read Excel files directly.
Q. How will you test a .dbc file from the command prompt?
Answer: You can test a .dbc file from the command prompt (UNIX) using the m_db test command, which checks the database connection and reports the database version and user.
Q. Which is faster for processing: fixed-length DMLs or delimited DMLs, and why?
Answer: Fixed-length DMLs are faster because the data is read directly using the declared length without any comparisons, whereas with a delimited DML every character has to be compared against the delimiter, which causes delays.
Q. What are the continuous components in Ab Initio?
Answer: Continuous components are used to create graphs that produce useful output while running continuously, for example Continuous Rollup, Continuous Update, and Batch Subscribe.
Q. How can I calculate the total memory requirement of a graph?
Answer: You can roughly calculate the memory requirement as follows: each partition of a component uses about 8 MB plus its max-core (if any); add the size of the lookup files used in the phase (if multiple components use the same lookup, only count it once); multiply by the degree of parallelism; and add up all the components in a phase — that is how much memory is used in that phase. Then add the size of the input and output datasets. The total memory requirement of the graph is greater than that of the largest-memory phase in the graph. (A rough worked sketch of this estimate appears after this group of questions.)
Q. What is a multistage component?
Answer: Multistage components are transform components in which the records are transformed in five stages: input selection, temporary record initialization, processing, finalization, and output selection. Examples of multistage components are Rollup, Scan, Normalize, and Denormalize Sorted.
Q. What is the use of Aggregate when we have Rollup? We know the Rollup component in Ab Initio is used to summarize groups of data records — so where do we use Aggregate?
Answer: Rollup has better control over record selection, grouping, and aggregation compared to Aggregate; it is effectively an updated version of Aggregate, and in template mode it provides aggregation functions to use. So it is better to go for Rollup.
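As noted in the memory-requirement answer above, here is a rough, illustration-only sketch of that rule of thumb in Python. The 8 MB per-partition overhead and the shape of the calculation come from the answer itself; the function name, variable names, and sample figures are invented.

```python
# Rough sketch of the per-phase memory estimate:
#   sum over components of  parallelism * (8 MB + MAX_CORE)  +  lookup file sizes.

def phase_memory_mb(components, lookup_files_mb=0):
    """components: list of (degree_of_parallelism, max_core_mb) tuples for one phase."""
    total = 0
    for parallelism, max_core_mb in components:
        total += parallelism * (8 + max_core_mb)   # ~8 MB overhead + MAX_CORE per partition
    return total + lookup_files_mb                 # count each shared lookup file only once

# Example: a 4-way parallel phase with an in-memory Sort (MAX_CORE 100 MB),
# a Reformat (no MAX_CORE), and a 50 MB lookup file.
print(phase_memory_mb([(4, 100), (4, 0)], lookup_files_mb=50), "MB")   # 4*108 + 4*8 + 50 = 514 MB
```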
Q. Phase versus checkpoint?
Answer: Phases are used to break up a graph so that it does not use up all the available memory. A phase limits the number of active components, reducing how many run in parallel at once, and so makes better use of resources such as memory, disk space and CPU. When memory-hungry components sit in a straight flow and millions of records are flowing, separating that processing into its own phase gives it more CPU and shortens the overall run. Temporary files created during a phase are deleted when the phase completes. Avoid placing phase breaks immediately after components such as Replicate or Sort, or across all-to-all flows, as that only multiplies the temporary files.
Checkpoints, by contrast, exist for recovery: they are save points. If the graph fails unexpectedly it can be restarted from the last completed checkpoint's recovery file. At job start, output datasets are copied into temporary files, and when a checkpoint completes, the datasets and job state are saved, so a failed job can resume from the last committed checkpoint. Phase breaks that include checkpoints degrade performance somewhat but guarantee a restartable run.
The major difference is that phasing deletes the intermediate files at the end of each phase as soon as the next phase starts, while checkpointing keeps those intermediate files until the end of the graph, so they can be used to restart the process from the point of failure. We can have phases without checkpoints, but we cannot have checkpoints without phases.

Q. In Ab Initio, how can you display records 50 to 75?
Answer: If the input dataset has 100 records and you want records 50 to 75, you can use m_dump with -start 50 -end 75. Within a graph (serial or MFS) there are several options:
1. Filter by Expression with a condition such as next_in_sequence() >= 50 && next_in_sequence() <= 75
2. Use the Run Program component in the GDE with a command such as: `sed -n '50,75p' file1 > file2`

Q. What is the order of evaluation of parameters?
Answer: When you run a graph, parameters are evaluated in the following order: the host setup script is run; common (included) sandbox parameters are evaluated; sandbox parameters are evaluated; the project-start.ksh script is run; graph parameters are evaluated; the graph Start Script is run. Execution then proceeds: processes are started according to the components' layouts, lookup files are loaded, the graph metadata (DML) is checked for each component, the input and output file paths are checked, and the graph runs phase by phase (phase 0, phase 1, phase 2, ...).

Q. How do you convert a 4-way MFS to an 8-way MFS?
Answer: By repartitioning. Any partition method can be used: Partition by Round-robin, Broadcast, Partition by Key, Partition by Expression, Partition by Range, Partition by Percentage or Partition by Load Balance.

Q. For data parallelism we can use partition components and for component parallelism we can use Replicate. Which component(s) give pipeline parallelism?
Answer: Pipeline parallelism is when a connected sequence of components on the same branch of a graph executes concurrently.
Components like Reformat, where the input flow is distributed to multiple output flows using output_index based on some selection criterion and those output flows are processed simultaneously, create pipeline parallelism. Components like Sort, where the entire input must be read before a single record is written to the output, cannot achieve pipeline parallelism.

Q. What is meant by "fancing" in Ab Initio?
Answer: The word Ab Initio itself means "from the beginning". The question is probably about fencing (or possibly fan-in / fan-out); fencing is covered in a later answer.

Q. Which component is used to retrieve data from a database into the source?
Answer: To unload (retrieve) data from a database such as DB2, Informix or Oracle we have components like Input Table and Unload DB Table; using these two components we can unload data from the database.

Q. What is the relation between the EME, the GDE and the Co>Operating system?
Answer: EME stands for Enterprise Metadata Environment, GDE for Graphical Development Environment, and the Co>Operating system can be thought of as the Ab Initio server. The Co>Operating system is installed on a particular operating-system platform, called the native OS. The EME is a repository (much like the repository in Informatica): it holds metadata, transformations, database configuration files and source and target information. The GDE is the end-user environment where graphs (the equivalent of mappings in Informatica) are developed; the designer builds graphs in the GDE and saves them to the EME or to a sandbox. The GDE sits on the user side, the EME on the server side.

Q. What is the use of Aggregate when we have Rollup, given that Rollup summarises groups of data records?
Answer: Aggregate and Rollup can both summarise data, but Rollup is much more convenient to use and much more explicit about how a particular summarisation is performed. Rollup can also do things Aggregate cannot, such as input and output filtering of records.

Q. What kinds of layouts does Ab Initio support?
Answer: Basically there are serial and parallel layouts, and a graph can have both at the same time. The parallel layout depends on the degree of data parallelism: if the multifile system is 4-way parallel, a component can run 4-way parallel if its layout matches that degree of parallelism.

Q. How can you run a graph infinitely?
Answer: To run a graph infinitely, the end script of the graph should call the graph's own .ksh file. So if the graph is named abc.mp, the end script should call abc.ksh; the graph will then run indefinitely.

Q. How do you add default rules in the transformer?
Answer: Double-click the transform parameter on the Parameters tab of the component's properties to open the Transform Editor. In the Transform Editor, open the Edit menu and select Add Default Rules from the drop-down. It offers two options: 1) Match Names and 2) Wildcard.

Q. Do you know what a local lookup is?
Answer: If your lookup file is a multifile partitioned or sorted on a particular key, the lookup_local function can be used instead of the lookup function; it is local to the particular partition selected by that key. A lookup file consists of data records that can be held in main memory, which lets the transform function retrieve records much faster than reading them from disk, and allows the transform component to process records from multiple files quickly.
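As a rough illustration of the lookup functions just described, a Reformat transform might use something like the following. The lookup name CUSTOMER_LKP and the field names are invented for the example; the lookup file would be declared in the graph under that name with cust_id as its key.

out::reformat(in) =
begin
  out.cust_id   :: in.cust_id;
  /* fetch the matching record from the in-memory lookup file by key */
  out.cust_name :: lookup("CUSTOMER_LKP", in.cust_id).name;
  /* if CUSTOMER_LKP is a multifile partitioned on the same key as the
     incoming flow, lookup_local("CUSTOMER_LKP", in.cust_id) can be used
     instead; it searches only the current partition */
end;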
Q. What is the difference between a lookup file and a lookup, with an example?
Answer: A lookup file generally represents one or more serial (flat) files whose data is small enough to be held in memory; this lets transform functions retrieve records much more quickly than reading them from disk. A lookup is the component of an Ab Initio graph through which we store that data and retrieve it by a key parameter; the lookup file is the physical file where the data is stored.

Q. How do you handle a DML that changes dynamically?
Answer: If the DML changes dynamically, both the dml and the xfr have to be passed as graph-level parameters at run time. This can be done through parameterisation, a conditional record format, or metadata-driven approaches.

Q. Explain what a lookup is.
Answer: A lookup is a keyed dataset used to map values against the data present in a particular serial or multifile. The dataset can be static or dynamic (for instance, generated in an earlier phase and used as a lookup in the current phase). Sometimes hash joins can be replaced by a Reformat plus a lookup when one of the join inputs has few records with a slim record length. Ab Initio has built-in functions to retrieve values from a lookup by key.

Q. What is a ramp limit?
Answer: The limit parameter is an integer representing a number of reject events, and the ramp parameter is a real number representing a rate of reject events relative to the number of records processed. The number of bad records allowed = limit + (number of records x ramp); ramp is essentially a percentage expressed as a value from 0 to 1. Together they define the threshold of bad records the component tolerates.

Q. Have you worked with packages?
Answer: Multistage transform components use packages by default. A user can also create their own set of functions in a transform function and include them in other transform functions.

Q. Have you used the Rollup component? Describe how.
Answer: If you want to group records on particular field values, Rollup is the best way to do it. Rollup is a multistage transform and has the following mandatory functions: 1. initialize, 2. rollup, 3. finalize. You also need to declare a temporary variable if you want to count the records in each group. For each group, Rollup first calls initialize once, then calls rollup for each record in the group, and finally calls finalize once after the last rollup call.
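A rough sketch of such an expanded-mode rollup transform, counting the records in each group, might look like this; the field names key and count are invented for illustration and would have to exist in the output record format.

type temporary_type =
record
  decimal("") count;
end;

/* called once per group, before the first record */
temp::initialize(in) =
begin
  temp.count :: 0;
end;

/* called once per record in the group; returns the updated temporary */
temp::rollup(temp, in) =
begin
  temp.count :: temp.count + 1;
end;

/* called once per group, after the last record */
out::finalize(temp, in) =
begin
  out.key   :: in.key;
  out.count :: temp.count;
end;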
Q. How do you add default rules in the transformer?
Answer: In a Reformat, if the destination field names are the same as (or a subset of) the source fields, you need not write anything in the xfr at all, unless you want a real transform beyond reducing the set of fields or splitting the flow. To add default rules: 1) display the Transform Editor grid if it is not already shown; 2) click the Business Rules tab if it is not already displayed; 3) select Edit > Add Default Rules. This opens the Add Default Rules dialog with two options: Match Names, which generates a set of rules copying input fields to output fields with the same name, and Use Wildcard (.*) Rule, which generates a single wildcard rule that does the same.

Q. What is the difference between partitioning with key and round robin?
Answer: Partition by Key (hash partition) is used when the keys are diverse; if one key value occurs in very large volume there can be a large data skew, but the method is used often because it enables key-dependent parallel processing. Round-robin partitioning distributes the data uniformly across the destination partitions in block-size chunks; the skew is zero when the number of records is divisible by the number of partitions. A real-life analogy is dealing a pack of 52 cards to 4 players: dealt round-robin, every player gets the same number of cards, but if you take 30 cards at random and partition them by colour (the key), the number of cards per partition can vary widely. In short: Partition by Key distributes according to the key value; Partition by Round-robin deals a fixed number of records to each flow in turn, resuming from the first flow after the last, which gives an almost even distribution.

Q. How do you truncate a table? (Each candidate usually names only one of the several ways.)
Answer: There are several ways: run the DDL "truncate table" from a Run SQL component; use the Truncate Table component; or use Update Table or Run Program to achieve the same thing. The easiest is usually the Truncate Table component.

Q. Have you ever encountered an error called "depth not equal"? (This tends to come up when you build many graphs; it is a trick question.)
Answer: This occurs when two components whose layouts do not match are linked directly together; the problem shows up when the graph is compiled. For example, the component on the left may be linked to a serial dataset while the downstream component on the right is linked to a multifile; since layout is propagated from neighbours, the jump in depth cannot be achieved without a partitioning component in between. The solution is to place a partitioning component between them wherever the layout changes.

Q. What function would you use to transfer a string into a decimal?
Answer: No special function is needed if the string and the decimal are the same size; a decimal cast in the transform function is enough. For example, if the source field is string(8) and the destination is decimal(8): out.field :: (decimal(8)) in.field. If the destination is smaller than the input, use string_substring, for example with a decimal(5) destination: out.field :: (decimal(5)) string_lrtrim(string_substring(in.field, 1, 5)) /* string_lrtrim trims leading and trailing spaces */

Q. How many kinds of parallelism are there in Ab Initio? Give a definition of each.
Answer: There are three kinds of parallelism: 1) data parallelism, 2) component parallelism and 3) pipeline parallelism.
Data parallelism: the data is divided into partitions (small chunks) and the partitions are processed simultaneously. Component parallelism: different components work on different data sets at the same time. Pipeline parallelism: connected components in the same branch work on the same stream of data simultaneously, each processing different records.

Q. What is a multidirectory?
Answer: A multidirectory is a parallel directory composed of individual directories, typically on different disks or computers; the individual directories are the partitions of the multidirectory. Each multidirectory contains one control directory and one or more data directories. Multifiles are stored in multidirectories.

Q. What is a multifile?
Answer: A multifile is a parallel file composed of individual files, typically on different disks or computers; the individual files are the partitions of the multifile. Each multifile contains one control partition and one or more data partitions. Multifiles are stored in distributed directories called multidirectories. The data in a multifile is usually divided across the partitions by one of these methods: random or round-robin partitioning; partitioning based on ranges or functions; or replication (broadcast), in which each partition is an identical copy of the serial data.

Q. What do GDE and SDE mean, and what are they for?
Answer: GDE is the Graphical Development Environment, used for developing graphs. SDE is the Shell Development Environment, used for developing Korn shell scripts on the Co>Operating system.

Q. What is the difference between Rollup and Scan?
Answer: Rollup evaluates a group of input records that share the same key and then generates records that either summarise each group or select certain information from each group. Rollup can be used in two ways: 1. template mode, which uses built-in aggregation functions such as sum, min, max, count, avg, product, first and last; 2. expanded mode, which uses user-defined functions (a temporary type plus initialize, rollup and finalize) in the transform parameter instead of aggregation functions. Scan generates a series of cumulative summary records, such as successive year-to-date totals, for groups of data records; it produces intermediate summary records. In short, Rollup is for group-by aggregation and Scan is for successive (running) totals.

Q. What is the runtime behaviour of Rollup?
Answer: Rollup supports two modes. 1. Template mode evaluates using built-in aggregation functions such as sum, min, max, count, avg, product, first and last. 2. Expanded mode evaluates using user-defined functions (a temporary type plus initialize, rollup and finalize) in the transform parameter. Rollup also behaves differently depending on whether its input is sorted. When the sorted-input parameter is set to "Input must be sorted or grouped" (the default), Rollup requires records grouped according to the key parameter; if you need to group the records, use Sort with the same key specifier, and Rollup then produces sorted output.
When the sorted-input parameter is set to "In memory: Input need not be sorted", Rollup accepts ungrouped input and groups all records itself according to the key parameter; in this mode it does not produce sorted output.

Q. How do you roll back in Ab Initio?
Answer: Ab Initio has good recovery options for failures at run time and for interruptions at development time. At development time, a recovery graph file is produced if work is interrupted. At run time, a recovery file is produced if the graph fails during execution; the recovery file holds the last checkpoint information, so the run can be restarted from the last checkpoint onwards. You can also roll back a graph from the command line with m_rollback; with the -d option it deletes all intermediate files and checkpoints.

Q. What does the Co>Operating system do internally when a graph runs?
Answer: The Co>Operating system first checks that the GDE and Co>Operating system versions are compatible. If the graph uses lookup files, their layouts are checked. Then, in sequence, it checks the metadata (the DML for each component, including the data types used), the input files, the output files and each component's layout, and finally assigns the process flows. The graph's input and output file paths are verified before the phases run.

Q. What does dependency analysis mean in Ab Initio?
Answer: Dependency analysis answers questions about data lineage: where the data comes from, which applications produce it and which applications depend on it.

Q. What is meant by fencing in Ab Initio?
Answer: In the software world, fencing means controlling jobs on a priority basis. In Ab Initio it refers to customised phase breaking: a well-fenced graph will not end up in deadlocks no matter what the source data volume is, because fencing limits the number of simultaneous processes. Fencing is also used to stop a schedule for a time by changing the priority of a particular job.

Q. What is the function of the Fuse component?
Answer: Fuse combines multiple input flows into a single output flow by applying a transform function to the corresponding records of each flow. The first time the transform function executes it uses the first record of each flow, the second time the second record of each flow, and so on, and Fuse sends the result to the out port. The component tries to read from each of its input flows: if all input flows are finished, Fuse exits; otherwise it reads one record from each still-unfinished input port and a NULL from each finished input port.

Q. What is data skew? How can you reduce it when using Partition by Key?
Answer: The skew of a data or flow partition is the amount by which its size deviates from the average partition size, expressed as a percentage of the largest partition: skew = (partition size - average partition size) * 100 / (size of largest partition). With Partition by Key, skew is reduced by choosing a partitioning key whose values are numerous and evenly distributed, or by repartitioning (for example round-robin) where key-based grouping is not required.
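A quick worked example of the skew formula above (sizes invented for illustration): suppose a 4-way flow carries 100 MB, 200 MB, 300 MB and 400 MB.

Average partition size = (100 + 200 + 300 + 400) / 4 = 250 MB; largest partition = 400 MB.
Skew of the 100 MB partition = (100 - 250) * 100 / 400 = -37.5%
Skew of the 400 MB partition = (400 - 250) * 100 / 400 = +37.5%
A negative value means the partition is smaller than average and a positive value larger; values close to zero indicate a well-balanced flow.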
Q. What is $mpjret? Where is it used in Ab Initio?
Answer: $mpjret gives the status of a graph run. You can use it in the end script, for example:
if [ $mpjret -eq 0 ]; then
  echo "success"
else
  mailx -s "failed" mail_id
fi

Q. What are primary keys and foreign keys?
Answer: In an RDBMS the relationship between two tables is represented as a primary key / foreign key relationship. The primary-key table is the parent table and the foreign-key table is the child table; the requirement for the relationship is that the two tables have a matching column.

Q. What is an outer join?
Answer: An outer join is used when you want to select all the records from a port whether or not they satisfy the join criteria; that is, you see every record of one input independent of whether there is a matching record in the other input.

Q. What are Cartesian joins?
Answer: A Cartesian join joins two tables without a join key (the key is {}); it produces a Cartesian product, joining every row of one table to every row of the other (or every row of a table to every row of itself).

Q. What is the difference between a DB config (.dbc) and a .cfg file?
Answer: A .dbc file holds the information Ab Initio needs to connect to the database to extract or load tables or views. A .cfg file is the table configuration file created by db_config and used with components such as Load DB Table. Both are used for database connectivity; in practice, .cfg files are used for the Informix database while .dbc files are used for other databases such as Oracle or SQL Server.

Q. What is the difference between a Scan component and a Rollup component?
Answer: Rollup is for group-by aggregation and Scan is for successive (running) totals: when you need cumulative summaries use Scan, and when you need to aggregate a group into one record use Rollup.

Q. What are local and formal parameters?
Answer: Both are graph-level parameters, but a local parameter must be given its value when it is declared, whereas a formal parameter need not be initialised; the graph prompts for its value at run time.

Q. How will you test a dbc file from the command prompt?
Answer: Try m_db test myfile.dbc

Q. Explain the difference between the "truncate" and "delete" commands.
Answer: TRUNCATE is a DDL command used to empty tables or clusters; because it is DDL it auto-commits and cannot be rolled back, and it is faster than DELETE. DELETE is a DML command generally used to delete records (or the contents of tables or clusters); it can be rolled back to retrieve the deleted rows, and a COMMIT is needed to make the deletion permanent.

Q. How many components are there in your most complicated graph?
Answer: This is a tricky question: the number of components in a graph says nothing about a person's level of knowledge. On the contrary, a properly standardised, modular, parametric approach reduces the number of components to very few. In a well-thought-out modular and parametric design, most graphs will have three or four components, each doing a particular task and then calling another set of graphs to do the next, and so on.
This way the total number of distinct graphs comes down drastically, and support and maintenance become much simpler. The bottom line is that there is a lot more to plan than simply adding components.

Q. Do you know what a local lookup is?
Answer: lookup_local is similar to lookup; the difference is that it searches only the current partition and returns NULL when no record matches the values given in its arguments. If it finds a matching record it returns the complete record, with all its fields, for the key mentioned in the call. For example, lookup_local("LOOKUP_FILE", 81) returns NULL if the partition of the lookup file does not hold that key value. Local lookup files are small files that can be held in physical memory for use in transforms. Details such as country code/country, currency code/currency or forex rate/value can be kept in a lookup file and mapped during transformations. Lookup files are not connected to any component of the graph but are available to a Reformat for mapping.

Q. How do you create a surrogate key using Ab Initio?
Answer: A key is a field or set of fields that uniquely identifies a record in a file or table. A natural key is a key that is meaningful in some business or real-world sense, for example a social security number for a person or a serial number for a piece of equipment. A surrogate key is a field added to a record, either to replace the natural key or in addition to it, that has no business meaning. Surrogate keys are frequently added when populating a data warehouse, to isolate the warehouse records from changes made to the natural keys by outside processes.

Q. How do you improve the performance of graphs in Ab Initio? Give some examples or tips.
Answer: There are many ways to improve graph performance. A few practical ones: use an MFS with partitioning (for example Partition by Round-robin); use lookup_local rather than lookup where the data is large; remove unnecessary components such as Filter by Expression by folding their logic into a Reformat, Join or Rollup; use Gather instead of Concatenate; tune max-core for optimal performance; and avoid unnecessary phases. More generally: use a limited number of components in a phase; use optimum max-core values for Sort and Join components; minimise the number of Sort components; minimise sorted joins and replace them with in-memory (hash) joins where possible; carry only the required fields through Sort, Reformat and Join components; use phasing and flow buffers for merges and sorted joins; if both inputs are huge use a sorted join, otherwise use a hash join with the proper driving port; do not use Broadcast as a partitioner for large datasets; minimise regular-expression functions such as re_index in transform functions; and avoid repartitioning data unnecessarily.

Q. Describe the steps you would perform when defragmenting a data table containing mission-critical data.
Answer: There are several ways to do this. 1) Move the table within the same (or to another) tablespace and rebuild all its indexes: "alter table ... move" reclaims the fragmented space, and "analyze table table_name compute statistics" captures the updated statistics.
2) Alternatively, reorganise the table by taking a dump (export) of it, truncating it, and importing the dump back into the table.

Q. How do we handle a DML that changes dynamically?
Answer: There are several ways to handle DMLs that change dynamically within a single file; the most suitable are to use a conditional DML or to use the vector functionality when calling the DMLs.

Q. What are the graph parameters?
Answer: There are two types of graph parameters in Ab Initio: 1. local parameters and 2. formal parameters (which take their values at run time).

Q. What is a ramp limit?
Answer: For most graph components you can manually set the error threshold after which the graph exits. There are three settings: "Never Exit" and "Exit on First Occurrence", which are self-explanatory and represent the two extremes, and "Limit with Ramp". Limit sets a maximum number of rejects, while ramp is expressed relative to the number of records processed: the tolerated number of bad records is limit + (records processed x ramp), so a ramp of 0.05 roughly means the graph keeps running as long as fewer than about 5% of the processed records are rejected (a worked example follows this group of questions). Typically development starts with Never Exit, then Ramp, and production runs with Exit on First Occurrence; Ramp can be used in production case by case, but it is not the preferred approach.

Q. What is the difference between conventional loading and direct loading, and when is each used?
Answer: In a conventional load, all the table constraints are checked against the data before it is loaded (API mode). In a direct load, which is faster, the constraints are disabled, the data is loaded directly, and the data is then checked against the table constraints afterwards; bad data is not indexed (utility mode).

Q. How do you do unit testing in Ab Initio, how do you execute graphs, and how do you increase performance?
Answer: The Co>Operating system runs a graph as multiple processes running simultaneously, which is the primary source of performance. Beyond that: prefer the data separators "\307" and "\007" over characters such as "~", "," and other special characters, since Ab Initio predefines these separators; avoid repeated aggregation in graphs by calculating a required aggregate once, storing it in a file and passing the value around as a parameter; avoid an excessive number of components, and of max-core-heavy components, in a graph; do not write looping statements in the start script; and prefer flat files as sources where possible.
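A quick worked example of the reject threshold above (numbers invented): suppose a component has limit = 50 and ramp = 0.02, and it has processed 10,000 records so far.

Tolerated rejects = limit + (records processed x ramp) = 50 + 10,000 x 0.02 = 50 + 200 = 250.
If more than 250 of those 10,000 records have been rejected, the component aborts; otherwise it keeps running, and the threshold keeps growing as more records are processed.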
Q. How would you do performance tuning for an already built graph?
Answer: Steps for tuning an existing graph: understand the functionality of the graph; modularise it (check the dependencies among components); introduce phasing; check that the parallelism is correct; and check the database components, taking only the required data from the database rather than pulling everything, which wastes time and memory.

Q. What is .abinitiorc and what does it contain?
Answer: .abinitiorc is the configuration file for Ab Initio, found in the user's home directory and in $AB_HOME/Config. It contains the credentials to connect to the host, such as the host IP, user name and (encrypted) password, and it sets the Ab Initio home path, configuration variables (AB_WORK_DIR, AB_DATA_DIR and so on), login information and the login methods for execution hosts (such as the EME host).

Q. Why might you create a stored procedure with the 'with recompile' option?
Answer: Recompile is useful when the tables referenced by the stored procedure undergo a lot of modification, deletion or insertion of data. Heavy modification makes the cached execution plan outdated, and the stored procedure's performance degrades. If the procedure is created with the recompile option, SQL Server does not cache a plan for it, and it is recompiled every time it runs.

Q. What is the purpose of having stored procedures in a database?
Answer: The main purpose of stored procedures is to reduce network traffic; all the SQL statements execute on the server, so execution is fast. In Ab Initio we use Run SQL and Join with DB components to run stored procedures.

Q. What is the Co>Operating system and why is it special for Ab Initio?
Answer: The Co>Operating system is layered on top of the native operating system. It converts Ab Initio-specific code into a format that UNIX or Windows can understand and feeds it to the native operating system, which carries out the task.

Q. Which component is used to retrieve data from a database into the source?
Answer: To unload (retrieve) data from a database such as DB2, Informix or Oracle, use components like Input Table and Unload DB Table. The Input Table component takes the following parameters: 1) a db_config (.dbc) file containing the credentials for the database interface; 2) the database type; 3) a SQL file containing the queries that unload the data from the table(s).

Q. How do you execute a graph from start to end, and how do you run a graph on a non-Ab Initio system?
Answer: There are several ways: you can run the components phase by phase as you have defined them, or you can create ksh or sh scripts (the deployed graph script) and run those from the shell.
Q. What is Join with DB?
Answer: The Join with DB component joins records from the flow or flows connected to its in port with records read directly from a database, and outputs new records based on a transform function.

Q. How do you truncate a table?
Answer: Use the Truncate Table component to truncate a database table from Ab Initio. It takes the following parameters: 1) a db_config (.dbc) file containing the credentials for the database interface; 2) the database type; 3) a SQL file containing the queries that truncate the table(s).

Q. Can we load multiple files?
Answer: Yes, we can load multiple files in Ab Initio.

Q. What is the syntax of the m_dump command?
Answer: m_dump prints the data in a formatted way. The general syntax is m_dump <metadata> <data>, for example: m_dump emp.dml emp.dat -start 10 -end 20 prints records 10 to 20 of emp.dat.

Q. How do you create a surrogate key using Ab Initio?
Answer: A surrogate key is a substitute for the natural primary key: just a unique identifier or number for each record, like the ROWID of an Oracle table. Surrogate keys can be created using 1) next_in_sequence, 2) this_partition and 3) no_of_partitions (a sketch appears a few questions below).

Q. Can anyone give an example of a real-time start script in a graph?
Answer: A start script is a script that executes before the graph execution starts. If we want to export parameter values to the graph, we can set them in the start script; when the graph runs, those values are exported to it.

Q. What is the difference between a sandbox and the EME? Can we perform check-in and checkout through the sandbox? Can anybody explain check-in and checkout?
Answer: Sandboxes are work areas used to develop, test or run code associated with a given project; only one version of the code can be held in a sandbox at any time. The EME datastore contains all versions of the code that have been checked into it. A particular sandbox is associated with only one project, whereas a project can be checked out to a number of sandboxes.

Q. What are skew and skew measurement?
Answer: Skew is a measure of how unevenly data flows to the partitions. For example, if 1 GB of input is spread across 4 partitions as 100 MB, 200 MB, 300 MB and 400 MB, the average partition size is 1000 MB / 4 = 250 MB, and the skew of the smallest partition is (100 - 250) * 100 / 400 = -37.5%, a negative value because it is below average. Skew is an indirect measure of how well balanced a graph is; values close to zero are desirable.

Q. What is the difference between a DML expression and an XFR expression?
Answer: The main difference is that the DML represents the format (metadata) of the data, while the XFR represents the transform functions, which contain the business rules.

Q. What are the most commonly used components in an Ab Initio graph?
Answer: The most commonly used components in any Ab Initio project are Input File / Output File, Input Table / Output Table, Lookup File, Reformat, Gather, Join, Run SQL, Join with DB, the compress components, Sort, Trash, Partition by Expression, Partition by Key and Concatenate.
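A rough sketch of the usual surrogate-key pattern from the question above, giving each record a number that is unique across all partitions. This is hedged: the partition-count function is written here as number_of_partitions(), which the answer above abbreviates as no_of_partitions, and the output format is assumed to contain the input fields plus a surrogate_key field; check the names against your DML reference.

out::reformat(in) =
begin
  /* interleave the per-partition sequence numbers by partition index
     so that no two partitions ever produce the same value */
  out.surrogate_key :: (next_in_sequence() - 1) * number_of_partitions()
                       + this_partition() + 1;
  out.* :: in.*;
end;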
Q. Have you used the Rollup component? Describe how.
Answer: Rollup can be used in a number of ways; it basically acts on a group of records that share a key. The simplest application is counting the number of records in a file or table: in that case there is no key, so all records are treated as one group, and a temporary variable, say temp.count, is incremented for every record that flows through the transform (temp.count = temp.count + 1). Rollup can also be used to discard duplicates from a group, effectively acting as a Dedup component.

Q. What is the difference between partitioning with key and round robin?
Answer: Partition by Key: you specify the key on which the partitioning occurs; because it is key-based it keeps all records with the same key together and is useful for key-dependent parallelism, but how well balanced the data is depends on the key distribution. Partition by Round-robin: records are distributed sequentially and evenly, in block-size chunks, across the output partitions; it is not key-based, gives well-balanced data (especially with a block size of 1) and is useful for record-independent parallelism.

Q. How do you work with parameterised graphs?
Answer: One of the main purposes of parameterised graphs is to run the same graph many times for different files. We set up graph parameters such as $INPUT_FILE and $OUTPUT_FILE and supply their values under Edit > Parameters; the parameters are substituted at run time. Parameters can be positional, keyword, local and so on. The idea is that instead of maintaining different versions of the same graph, we maintain one version and vary the files through parameters.

Q. What does layout mean in Ab Initio?
Answer: Before a graph can run, you must specify layouts that tell the Co>Operating system the location of files, the number and locations of the partitions of multifiles, and the number and locations of the partitions in which program components execute. A layout is one of the following: a URL specifying the location of a serial file; a URL specifying the location of the control partition of a multifile; or a list of URLs specifying the locations of the partitions of an ad hoc multifile or the working directories of a program component. Every component in a graph, whether a dataset or a program component, has a layout. Some graphs use one layout throughout; others use several layouts and repartition the data as needed so that more or fewer processors work on it. During execution a graph writes various files in the layouts of some or all of its components. For example: an Intermediate File component writes all the data passing through it to disk; a phase break, checkpoint or watcher writes the data passing through it to disk in the layout of the downstream component; a buffered flow writes data to disk, in the layout of the downstream component, when its buffers overflow; many program components (Sort is one example) write, read back and remove temporary files in their layouts; and a checkpoint in a continuous graph writes files in the layout of every component as it moves through the graph.

Q. Can we load multiple files?
Answer: Loading multiple files here means writing to more than one file at a time.
If that is what is meant, Ab Initio provides a component called Write Multiple Files (in the Dataset component group) which can write several files at a time; the files it writes must, however, be local files, i.e. they must reside on the machine the component runs on. See the help file for more information on this component.

Q. How would you do performance tuning for an already built graph? Can you give some examples?
Answer: Some examples: remove redundant Sorts, for instance a Sort placed in front of a component whose inputs are already sorted adds nothing; use a lookup instead of a Join or Merge where one side is small; and when joining the data coming from two files where duplicates are not wanted, drop the duplicates in the same step rather than adding a separate duplicate-remover component.

Q. What function would you use to transfer a string into a decimal?
Answer: To convert a string to a decimal we typecast it: out.decimal_field :: (decimal(size_of_decimal)) string_field; This converts the string and populates the decimal field in the output.

Q. What are data mapping and data modelling?
Answer: Data mapping deals with the transformation of the extracted data at field level, i.e. the transformation from a source field to a target field is specified by the mapping defined on the target field, and it is specified while cleansing the data to be loaded. For example, with a source field string(35) name = "Siva Krishna " and a target field string("\001") nm = NULL("") (maximum length string(35)), the mapping might be a straight move that trims the leading and trailing spaces; that mapping specifies the transformation of the field nm.

Q. How do you add default rules in the transformer?
Answer: Click the transformer, open the Edit menu and choose Add Default Rules. Ab Initio also has a concept of rule priority, where you can assign priorities to the rules in a transformer. For example:
Output.var1 :1: input.var1 + 10;
Output.var1 :2: 100;
Here the output variable is assigned the input variable plus 10; if the input variable has no value, the default value 100 is assigned instead. The numbers 1 and 2 are the priorities.

Q. How do we run sequences of jobs, where the output of job A is the input to job B? How do we coordinate the jobs?
Answer: By writing wrapper scripts we can control the sequence in which the jobs execute (a minimal sketch appears below).

Q. When using multiple DML statements to perform a single unit of work, is it preferable to use implicit or explicit transactions, and why?
Answer: Implicit transactions are used for internal processing, whereas explicit transactions are opened by the user when the unit of work needs to be controlled explicitly.

Q. Can we merge two graphs?
Answer: You cannot merge two Ab Initio graphs. You can use the output of one graph as the input of another, and you can copy and paste contents between graphs.

Q. Explain the difference between API and utility mode.
Answer: API and utility are database interface modes. API mode uses SQL, and table constraints are checked against the data as it is loaded. Utility mode uses bulk loading: the table constraints are disabled first, the data is loaded, and then the constraints are checked against the data. Loading in utility mode is faster than in API mode, but if a crash occurs during the load we can commit or roll back in API mode, whereas in utility mode the whole load has to be redone.
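A minimal hedged sketch of such a wrapper (the script names and paths are invented placeholders): it runs the deployed script of job A and starts job B only if A succeeded.

#!/bin/ksh
# run job A (its output file becomes job B's input)
ksh $AI_RUN/job_a.ksh
if [ $? -ne 0 ]; then
    echo "job_a failed; not starting job_b" >&2
    exit 1
fi

# job A succeeded, so run job B
ksh $AI_RUN/job_b.ksh
exit $?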
Q. How do you schedule graphs in Ab Initio, like workflow scheduling in Informatica? And where must we use Unix shell scripting in Ab Initio?
Answer: We can use Autosys, Control-M or any other external scheduler to schedule graphs in Ab Initio. Dependencies can be handled in several ways: if scripts must run sequentially we can arrange that in the scheduler, or we can create a wrapper script containing several sequential commands (nohup command1.ksh; nohup command2.ksh; and so on), or we can even create a special graph in Ab Initio that executes individual scripts as needed.

Q. What is the Environment project in Ab Initio?
Answer: The Environment project is a special public project that exists in every Ab Initio environment. It contains all the environment parameters required by the private and public projects which together constitute the AI Standard Environment.

Q. What is component folding and what is it used for?
Answer: Component folding is a feature by which the Co>Operating system combines a group of components and runs them as a single process, which improves graph performance. The prerequisites are that the components must be foldable, they must be in the same phase and layout, and they must be connected by straight flows.

Q. How do you debug a graph if an error occurs while it runs?
Answer: There are several ways to debug a graph: the Debugger, a File Watcher, or an Intermediate File component used for debugging purposes.

Q. What do you mean by $RUN?
Answer: $RUN is a parameter variable containing the path of the project sandbox's run directory; it is the default sandbox run-directory parameter and is used instead of hard-coding the path. The sandbox directory structure looks like this:
fin -------> top-level directory ($AI_PROJECT)
|---- mp -------> second-level directory ($MP)
|---- xfr ------> second-level directory ($XFR)
|---- run ------> second-level directory ($RUN)
|---- dml ------> second-level directory ($DML)

Q. What is the importance of the EME in Ab Initio?
Answer: The EME is the source-code control system of the Ab Initio world. It is the repository where all project code, including graph versions, is maintained; we check graphs out of it, modify them and check them back in, and a lock is placed on an object once a user has it checked out.

Q. What is the difference between sandbox parameters and graph parameters?
Answer: Sandbox parameters are common parameters for the project and are accessible throughout the project. Graph parameters are used within a graph and cannot be accessed from other graphs; they are local parameters.

Q. How do you connect the EME to the Ab Initio server?
Answer: There are several ways of connecting to the EME: set AB_AIR_ROOT; connect to the EME datastore from the GDE; log in to the EME web interface; or use the air command.

Q. What is the use of the Co>Operating system between the GDE and the host?
Answer: The Co>Operating system is the heart of the GDE; it always refers to the host settings, environment variables and functions while graphs run through the GDE.
It interfaces the connection-setting information between the host and the GDE.

Q. What is a sandbox and what is it used for?
Answer: A sandbox is a directory structure in which each directory level is assigned a variable name; it is used to manage check-in and checkout of repository-based objects such as mp, run, dml, db, xfr and sql (graphs, graph ksh files, wrapper scripts, dml files, xfr files, dbc files and sql files). For example:
fin -------> top-level directory ($AI_PROJECT)
|---- mp -------> second-level directory ($AI_MP)
|---- xfr ------> second-level directory ($AI_XFR)
|---- run ------> second-level directory ($AI_RUN)
|---- dml ------> second-level directory ($AI_DML)
Each directory has a specific purpose: mp stores the graphs (the data mapping between sources and targets), with the extension .mp; xfr stores the transform files, with the extension .xfr; dml stores the record-format (metadata) files using Ab Initio-supported data types, with the extension .dml; and run contains the graphs' Korn shell scripts created when the graphs are deployed.

Q. What is the EME datastore and what is it used for in the enterprise?
Answer: The EME datastore (Enterprise Meta Environment datastore) is the enterprise repository. It contains any number of projects (sandboxes) which share metadata between them, and the sandbox project objects (mp, run, db, xfr, dml) are easily managed through check-in and checkout of the repository objects.
Mode: in the EME Datastore Mode box of the EME Datastore Settings dialog, choose one of the following. Source Code Control is the recommended setting: when a datastore is in this mode you must check out a project in order to work on it, which prevents multiple users from making conflicting changes. Full Access is strongly discouraged and is for advanced users only; it allows you to edit a project in the datastore without checking it out.
Save Script When Graph Saved to Sandbox: in the EME Datastore Settings dialog, select this option to have the GDE save the script it generates for a graph when you save the graph. The script lets you run the graph without the GDE if, for example, you relocate the project.
ArcSight ESM Interview Questions
Q. What does ArcSight ESM stand for, and what is its primary use?
Ans: ArcSight ESM stands for Enterprise Security Manager. As the name implies, the tool adds value to an organization's security policies: it helps organizations focus on threat detection, triage analysis and compliance management. All of this is done on a SIEM platform, which reduces the time taken to resolve a cybersecurity threat.
Q. What does SIEM stand for, and what is it about?
Ans: SIEM stands for Security Information and Event Management. It is a platform that gives a holistic view of the security processes implemented within an organization (the E is usually silent, so it is pronounced "SIM"). In this process the data is gathered into one secure repository, and the logs are used for later security analysis. The approach is widely used in the Payment Card Industry, where it is classified as part of the data security standard.
Q. What are the key features of ArcSight Enterprise Security Manager?
Ans: The key features of ArcSight Enterprise Security Manager are:
Enriched security event data
Powerful real-time data visualization and correlation
Automated workflows
Optimized security processes
ArcSight Enterprise Security Manager is compatible with ArcSight Data Platform and ArcSight Investigate.
Q. Explain how ArcSight ESM protects businesses across the globe.
Ans: ArcSight ESM protects a business in the following ways:
It can collect data from any type of log source
It greatly reduces response time, which also helps limit the damage
It can efficiently store information and retrieve it the way enterprise-level databases do
It provides role-relevant reports across the enterprise
Its architecture is scalable
It is easily customizable and maintains a high-performance system
Q. How does ArcSight ESM provide powerful real-time data correlation?
Ans: ArcSight ESM provides powerful real-time data correlation by processing a large number of events per second (around 75,000), and a more accurate outcome is proposed based on that analysis. Threats that violate the internal rules are escalated within the platform.
Q. What can be done using ArcSight ESM?
Ans: ArcSight ESM helps organizations and individuals to:
Collect, store and monitor all event data centrally
Produce user-friendly compliance reports that deliver the necessary data in an appropriate format at a single touch
Monitor and mitigate risk
Eliminate manual processes as far as possible
Save the valuable hours security analysts would otherwise spend on false alarms
Bring awareness to the team about the security processes in place and the countermeasures implemented
Q. Why do organizations need Security Information and Event Management systems?
Ans: Most small companies do not have enough manpower to make sure their security processes stay intact, and they cannot be proactive or warn the team of a possible attack because they have no automatic mechanism that flags one.
Q.Why Do Organizations Need Security Information And Event Management Systems? Ans: Most small companies don't have enough manpower to make sure that their security processes stay intact, and they cannot be proactive and warn the team about a possible attack because they have no automatic mechanism that triggers on a threat. So, to solve this real-time issue and also make sure the security checks are monitored and analyzed, we have a Security Information and Event Management system; ArcSight ESM is one such system. Basically, all the machine log data is analyzed so that the system understands the patterns of normal behavior vs. abnormal behavior. This makes it a perfect tool that can understand the security logs collected so far and, based on that analysis, trigger information that might prevent a bigger threat to the entire organization. Q.How Can ArcSight ESM Help Organizations In Terms Of Security Aspects? Ans: ArcSight ESM can help organizations build more enhanced use cases against APTs (Advanced Persistent Threats), which allows a faster and more targeted response in a timely fashion. Q.What Does ArcSight Logger Do? Ans: ArcSight Logger is a log management solution that is widely used in security practice. Using this solution, users can capture and analyze different types of log data and provide the necessary inputs to the individual teams so their questions are answered. Eventually, it can be expanded into an enterprise-level log management solution if needed. Using this solution, topics like compliance and risk management are taken into due consideration. The data can also be used for searching, indexing, reporting, analysis and retention. Q.What Is A SIEM Tool, Explain Briefly? Ans: In the field of information technology and computer security, products which provide or offer services like real-time analysis of security-generated alerts can be categorized as SIEM tools. Q.What Is A SOC Team? Ans: The term SOC stands for "Security Operations Center". Basically, this is the center from which all the websites, applications, databases, data centers, servers and networks are monitored, analyzed and defended. Q.Explain What Is The Core Offering Of ArcSight ESM? Ans: The core offering of ArcSight ESM is that it: Analyzes different threats to a database Checks against the logs that were captured Provides possible solutions or advice based on the risk level. Q.What Is The Main Purpose Of ArcSight Express? Ans: Basically, ArcSight Express provides the same functionality as ArcSight ESM but at a much smaller scale. ArcSight Express analyzes threats within a database and provides possible action items. Q.What Is The Main Use Of ArcSight Logger? Ans: The main use of ArcSight Logger is to capture or stream real-time data and categorize it into different buckets of specific logs. Q.What Are The Key Capabilities Of ArcSight Logger? Ans: The key capabilities of ArcSight Logger are: It collects logs from any sort of log-generating source. After collecting the data, it categorizes and registers events in the Common Event Format (CEF). These events can be searched through a simple interface. It can handle and store years' worth of log information. It is well suited to automated analysis, which can later be used for reporting, log/event intelligence for IT security purposes, and log analytics. Q.What Do ArcSight Connectors Mean? Ans: The main uses of ArcSight Connectors are listed below: With ArcSight connectors, the user can automate the process of collecting and managing logs irrespective of the device. All the data can be normalized into CEF, i.e. the Common Event Format. ArcSight connectors provide universal data collection from many different devices.
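Since both Logger and the connectors normalize events into the Common Event Format (CEF), a quick way to see what that means is to split a CEF line into its header fields and its key=value extension. The parser below is a simplified, assumed sketch (it ignores escaped pipe and equals characters and space-containing values) and is not part of any ArcSight SDK; the sample event is invented.

```python
def parse_cef(line):
    """Split a CEF record into its seven header fields plus extension key=value pairs.

    Simplified sketch: does not handle escaped '|' or '=' characters.
    """
    if not line.startswith("CEF:"):
        raise ValueError("not a CEF record")
    parts = line[4:].split("|", 7)   # 7 splits -> 8 pieces, the last one is the extension
    keys = ["version", "vendor", "product", "device_version",
            "signature_id", "name", "severity"]
    event = dict(zip(keys, parts[:7]))
    extension = parts[7] if len(parts) > 7 else ""
    ext = {}
    for token in extension.split():
        if "=" in token:
            k, v = token.split("=", 1)
            ext[k] = v
    event["extension"] = ext
    return event

sample = "CEF:0|ExampleVendor|ExampleIDS|1.0|100|Port scan detected|5|src=10.0.0.7 dst=10.0.0.9 dpt=22"
print(parse_cef(sample)["name"])        # Port scan detected
print(parse_cef(sample)["extension"])   # {'src': '10.0.0.7', 'dst': '10.0.0.9', 'dpt': '22'}
```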
Q.What Does ArcSight Manager Do, Explain In Brief? Ans: The use of ArcSight Manager is to put robust security parameters in place within the organization. It is a high-performance service engine that filters, manages and correlates all the security-related events collected by the IT systems. The main parts essential for ArcSight Manager to work appropriately are: ArcSight Console ACC CORR-Engine ArcSight SmartConnectors. The operational environment for ArcSight Manager is the underlying OS and the file system in place. Q.What Does IDS Stand For? Ans: IDS stands for "Intrusion Detection System". This is a main component when it comes to ArcSight ESM. Q.A Few Bullet Points On ArcSight ESM? Ans: The following are the important points about the ArcSight ESM tool: With this tool, administrators and analysts can detect more incidents and operate more efficiently. The same data set can be used for real-time correlation and by the log management application. Q.What Are The System Requirements For Implementing ArcSight ESM? Ans: Supported operating system: Red Hat Enterprise Linux version 6.2, 64-bit. Memory: 16-36 GB. Disk space: 2-4 TB (with an average compression of 10:1), SAS 15K RPM drives. Contact for more on ArcSight Online Training
Cognos Interview Questions
Q.What is a Data Warehouse? Ans: A Data Warehouse is a collection of data marts representing historical data from different operational data sources (OLTP). The data from these OLTP sources is structured and optimized for querying and analysis in a Data Warehouse. Q.Define Cognos ReportNet? Ans: Cognos ReportNet is the web-based, dynamic business intelligence reporting solution from Cognos. Q.What are the tiers of the ReportNet architecture? Ans: The ReportNet architecture can be separated into three tiers: Web server Applications Data Q.Define Business Intelligence Ans: Business Intelligence is a broad category of application programs and technology used for querying, reporting and analyzing the business multidimensionally. Q.What is a Data Mart? Ans: A Data Mart is a subset of a data warehouse that can provide data for reporting and analysis. Q.What is HPQS? Ans: Data Marts are sometimes also called HPQS (Higher Performance Query Structures). Q.What is multidimensional analysis? Ans: It is a technique for modifying the data so that it can be viewed at different levels of detail. Q.What are the responsibilities of a Cognos Administrator? Ans: A Cognos Administrator is assigned the following responsibilities: Installation and configuration in a distributed network environment. Creating the repository (Content Store). Performing backup and recovery of metadata. Developing user administration. Tuning the servers. Deployment. Q.Responsibility of a Cognos Architect? Ans: An Architect is responsible for designing the interface by fulfilling the user requirements. Once the interface has been designed, it should be rigorously tested before being distributed to the end-user population. Q.Roles of the Application Developer? Ans: Design the reports according to the report requirement templates. Test each report with the following types of tests: Unit Testing System Testing Performance Testing Q.What is OLAP? Ans: OLAP stands for Online Analytical Processing. It uses database tables (fact and dimension tables) to enable multidimensional viewing, analysis and querying of large amounts of data. Q.What are the types of OLAP? Ans: DOLAP: An OLAP tool that works with desktop databases is called DOLAP. Ex: FoxPro, Clipper, Dbase, Paradox. ROLAP: An OLAP tool that works with relational databases is called ROLAP. Ex: Oracle, SQL Server, Teradata, DB2. MOLAP: An OLAP tool that works with multidimensional databases is called MOLAP. Ex: ESSBASE, PowerCube. HOLAP: An OLAP tool that works with both relational and multidimensional databases is called HOLAP. Ex: Cognos, Business Objects. Q.What is OLTP? Ans: OLTP stands for Online Transaction Processing. Apart from data warehouse databases, the other databases are OLTPs. These OLTP databases are designed for recording the daily operations and transactions of a business. Q.What is an Interface? Ans: An Interface is a gateway between the user and the database. An Interface contains logical pointers which point to data in the Data Warehouse. An Interface isolates the complexity and diversity of databases. A good OLAP Interface writes efficient SQL and reads accurate data from the database. An Interface needs to be designed by fulfilling the report requirements. Q.What types of components are installed while loading the Cognos ReportNet software? Ans: When we install the Cognos ReportNet software, two types of components get installed: 1. Windows-based components. 2. Web-based components.
Q.What is the security module used in Cognos? Ans: The security module used in Cognos is Cognos Access Manager. Q.What is report bursting and where does it occur? Ans: A report can be divided into different parts and those parts can be sent to different users. It occurs in Cognos Connection. Q.What is the difference between drill through and drill down? Ans: Drill through allows you to navigate from summary to detailed information. Drill down is a similar kind of thing within one report; for example, if we drill down on year, the next level comes up: a year contains quarters, a quarter contains months, a month contains weeks, and a week contains days, so we can view all the levels through drill down. For example, I developed some reports on the employee package; now I have to develop the same reports against another package. Q.How can I change reports to another package? Ans: Open the report and choose Save As. Select the public folder, then select the package in which you want to save the report, and save it. Q.How to create users and permissions in Cognos? Ans: Users and permissions can be managed in Access Manager - Administration. Individual users can be created using their names or their IDs (if any) in Access Manager and then given permissions. This tool is loaded when you install Cognos 7.x on your machine. In Content Manager there is a component called the Cognos process manager; using the process manager we can integrate with third-party directories like LDAP or NTLM. We cannot create users in ReportNet directly; in real-time projects we use LDAP. Q.What is the difference between macros and prompts? Ans: A macro is a set of instructions to run a report. A prompt is like a dialog box which asks the user to select or enter the information needed. A macro is a run-time parameter which can be placed in any part of an SQL SELECT statement, whereas a prompt is used to supply conditions. Q.What are Cognos Visualizer and Cognos Scripting? Ans: Cognos Scripting is like a language with which we create macros; a macro is a set of instructions to run a report. Cognos Visualizer is a tool for creating charts (bar, pie, histogram, ...) and maps using data sources (data files like mdc, iqd, csv, Excel, etc.). Q.What is a query subject? Ans: A query subject is the basic building block in Framework Manager. A query subject is a set of query items that have an inherent relationship. Q.What are the ways to import data into a catalog? Ans: Create a catalog with a .cat file. Q.Report Studio has two SQL tabs, one native SQL and one Cognos SQL; which one gets more preference and which one should we consider? Ans: Cognos SQL. Q.When we save a report in Report Studio, with what extension does it save? Ans: It saves as .XML. Some say it saves as .CRR. When we save the report it saves our specifications; in Report Studio (Run menu, via the drop-down icon) you can choose HTML, XML, CSV or PDF format, and if you save the report as HTML you can also run it in a different format or language. It depends on the format in which the report is run; by default it is .html. The report can be made to run in formats like HTML, PDF, XLS, CSV (comma-separated) and XML, and based on that the report can be saved as .html, .pdf, .xls, .csv or .xml. When you are viewing the report in the package it is shown with the extension .crr. Q.How can I convert a list report/crosstab report in the Cognos EP series into a bar chart or pie chart etc.? Ans: This can be done in Impromptu Administrator. Q.How many cubes can we create on a single model? How can we navigate between those cubes?
Ans: Using a single model, we can create as many cubes as we want, by using dimension views etc. Regarding navigation: when we save cubes, they act as separate multidimensional databases; there won't be any relations between them, and navigating simply means opening the cubes separately. Q.What is the difference between drill down and drill through? Ans: Drill down means navigating from summary-level information to detailed information within a report. Drill through means navigating from summary to detail from one report to another report. Q.What is the difference between group and association? Ans: In Cognos, 'group' is used to suppress duplicate values, and 'association' is used to suppress values where there is a one-to-one relationship. Group: it eliminates the duplicate values from the report and breaks the columns; it has a one-to-many relationship. Association: it eliminates the duplicate values from the report and has a one-to-one relationship. Both group and association eliminate duplicates in a column, but we cannot use association alone; it should have a one-to-one relationship with a grouped column. Using association is a performance-enhancement technique. In short, a group is used to eliminate duplicates in the data, whereas an association is used to suppress values in a 1-to-1 relation. Q.What is cardinality? Ans: Cardinality is the relation between tables, such as one-to-one, one-to-many and many-to-many. Cardinality is the nature of the relationship between two query items. Q.How to test reports in Cognos? Ans: Go to PowerPlay Transformer and on the toolbar select the Run menu. In the Run menu there is an option called Test Build where you can test your report. You may also test the outcome of a report by writing and executing SQL queries and comparing the outcome with the report outcome, or by using system testing, unit testing and performance testing. Q.How do you apply security to reports? Ans: Basically there are two levels of security for any object: database-level security and row-level security. We can also provide LDAP security. By setting up the governors in Framework Manager we can additionally secure the data accessed through the report. Q.How to do a year calculation, i.e. show this year's revenue and last year's revenue, so that when you filter on this year's value the result displays both this year's and last year's results? Ans: By using expressions, i.e. conditional constructs, and through graphs we can view the revenue and the calculated data as well. Q.How to configure Cognos Configuration to work on a Windows 2000 Server machine? Ans: Go to Cognos Configuration and review the settings for Cognos on the explorer page; there you can set it up for the Windows 2000 server. Q.What are drill down and slicing and dicing, and what are the differences between them? Ans: Drill down expands summary-level information down to the lowest-level information. Slicing is cutting an edge of the cube, and dicing is viewing the cube in all possible directions. Drill down is the way to get more detailed data; slicing and dicing get the data according to a WHERE clause. Q.What are the enhancements in Cognos ReportNet?
Ans: Enhancements in Cognos reportnet is Cognos 1.0, 1.1, MR1.1, MR1.2 Cognos 1.0,1.1, MR1.1, MR1.2 AND cognos 8, 8.1.1MR, 8.1.2MR, now new version is 8.2 Q.Enhancement in cognos in the list report:- Ans: 1) Apply list column title styles 2) Apply list column body styles 3) Apply table styles Q.Explain the different stages in creating a report in cognos report net? Ans: Open cognos connection in that select Report studio or query studio it will displays there u have packages(ex: gosales and goretailr (defult)) click on that it display Report studio in that select Object pane select required query subject then click run it displays ur report on report viewer screen. Q.What is Report item? Ans: Report item is nothing but a query item when it is drag and drop into the work area. Example in Go sales and retailers package-> Product (is a query subject) and-> Product line is( a query item)->when PL is dropped into the work area then its a report item. Q.How to perform single sign on in Cognos by using URL? Ans: In cognos configuration under authentication allow anonymous access should be false.In cgi-bin properties (under iis) the enable anonymous access should be false. Q.What is loop in framework manager? Ans: Loop is closed path in reportnet it called as ambiguous relationship. That means a query subject contains multiple paths to retrieve the data. It is an exception to resolve to create a shortcut on that query subject otherwise it displays wrong results in reports and performance is degrades. An undefined join between two tables is known as loop.To resolve loop delete the joins, if these joins are necessary then we have to create shortcuts nothing but alias tables. Place the joins in alias tables. A LOOP IS A CLOSED PATH IN FRAME WORK MANAGER DUE TO JOINS. WE ILL RESOLVED IT BY CREATING ALLIAS AND CREATING SHORT CUTS. Q.How you migrate reports to cognos 8 from previous versions? Ans: Migration means report net reports to cognos8.rn Upgrade .bat/rs upgrade .sh is a standalone utility that upgrades a single report specification at a time out side of the cognos. Q.What are the filters in Framework Manager and Report Studio? Ans: Filters in framework manager are Standalone filters Embedded filters Report studio Filters are Detail filters Summary filters Q.How u migrate the reports from impromptu to reportnet??? Ans: It's possible to to migrate impromptu reports to reportnet using the migration tool(own by reportnet 1.1)Using this syntax: migratetocrn HERE crn----Cognos RoportNet Q.What is IQD? What is contained in IQD? Ans: IQD stands for Impromptu query definition. It is report save extension with .iqd it is use for creating a cube in power play transformer IQD is Impromptu Query Definition. It contains the SQL query to display the data from the database and SQL query is automatically created based on the columns selected from database. IQD MEANS IMPROMPTU QUERY DEFFINITION,IT CONTAINS MULTIDIMENTIONAL DATA STRUCTURE WITH ARRAY FORMAT FOR CREATING CUBES. THIS REPORTS CAN BE SAVED WITH .IQD FILES. Q.How we check the errors before running the report, Plz let me know the answer? Ans: Before u run a report. U have an option called 'valididate report' in run menu.. Then u can find the errors what u made. Q.What are the special features in COGNOS REPORTNET? Ans: CRN is web based tool. So it will very useful to view the reports very easily. So that they preferred CRN Q.How you create IQD In ReportNet Framework? 
Ans: Open Framework Manager, click on any query subject, open the Properties dialog box, find the Externalize Method property and change it to iqd. Q.How to pass multiple values from a pick-list prompt to a sub-report filter? Ans: Use #parameter1# in your filter where you pass multiple values. Q.What is a stitched query in ReportNet? Ans: Framework Manager generates a separate query for each fact table and joins the result sets; this is called a stitched query. When there is no join between two tables and we drag in columns from two different tables that have no join, Cognos automatically builds two or more SELECT clauses and stitches them together with a full outer join; this is called a stitched query. Q.What is a snapshot? Ans: A snapshot is a permanent local copy of the report. A snapshot is a static data source; it is saved as an .imr file and is suitable for a disconnected network. Q.What is the cube size? Ans: Up to around 2.0 GB; it depends on your project requirements. Q.What is a log in Cognos? Ans: While creating reports or creating models, the logs hold all information until the session is closed. Q.What is associated grouping, and how does it work in Cognos Impromptu? Ans: You can associate one or more data items with a grouped data item. An associated data item should have a one-to-one relationship with the grouped data item. For example, Order No. is a grouped data item; for each order number there is an Order Date, so Order No. and Order Date have a one-to-one relationship. Q.What are the limitations of Cognos ReportNet? Ans: In CRN we cannot view data multidimensionally, we cannot see a report in Excel, and we cannot format a report in CRN. ReportNet does not support drill through, bursting of reports is not possible, and it does not support dimensional analysis. Q.What is a loop in Framework Manager? Ans: A loop is a dangerous exception in Framework Manager; we can resolve a loop by creating alias tables. A loop displays wrong results in ReportNet. A loop is a closed-path circuit. Avoid loops by using shortcuts. The ambiguous relationship types are: hierarchical relationships, recursive relationships, and multi-valid relationships; to avoid these relationships, use shortcuts. Q.What are the different ways of adding data in Transformer? Ans: In Transformer you import metadata from Architect or a catalog to create a cube. We just import metadata; we don't add data to it. Q.What are slowly changing dimensions? Why do we use SCDs? Ans: Slowly Changing Dimensions are dimensions whose data is not fixed. SCD types: SCD type 1: historical data is not saved; the data keeps changing and only the current data is kept. SCD type 2: historical data is saved along with the current data. SCD type 3: like SCD type 2, but the historical data is kept separately: in one place the data is updated and in the other the historical data is retained. Which type is used depends on the data warehouse or database. Q.What are the names of the reports that you prepared? Ans: List report, crosstab report, pie charts, etc. Master-detail, drill-through and cascading reports; list, crosstab, chart and map reports are the basic reports of Cognos. Q.What is the importance of a dimension in Cognos? Ans: Without dimensions and facts you cannot make relations between tables; they are needed for joins and to retrieve the data in the form of reports in Cognos. A dimension is a major subject area through which we can analyze our business. Q.What is the exact catalog size?
Ans: There is no limit on catalog size; it may be 3 MB or 4 MB. Q.Give me some examples of lifecycle reporting; I mean, which lifecycle do we use for reporting? Ans: There is no specific reporting lifecycle. We can generate the reports from a data warehouse/data marts, or we can generate the reports directly from OLTP systems. Generating reports from OLTP systems means loading data into one system and generating the reports from it, but this is not recommended; it depends on the business. 1) Generate reports from the OLAP interface system, retrieving the data from the data warehouse to generate forms and reports. 2) Use the Business Intelligence project lifecycle. Q.How many levels can be used in drill-through reports? Ans: Two levels: summary and detail. Q.Where can you save the ReportNet documentation on your local system? Ans: Whenever we install the Cognos server we get its documentation along with it; otherwise we can store it in the Deployment folder of Cognos. Q.What is a model, and how do you create and test a model? Ans: Create a model in Framework Manager and then publish it; make sure the model meets the requirements and don't overload the model with unwanted query items. A model is essentially a project, and the project is created in Framework Manager, where we extract the metadata from various operational sources. We can test the model using TOAD. The model contains metadata related to business reporting information; we import this data from the database and use this metadata to create and publish packages. The metadata contains query subjects and namespaces. Q.How do you create cubes in Transformer? Ans: A cube can be created using different data sources (iqd, text files). For cube creation using iqd data sources, the iqd sources come from Impromptu. Q.How do you burst reports, and if your bursted report does not reach its destination, how do you identify that? When you import data into a catalog and have complex column names, how do you change the names of those columns? Ans: First click on the column header you want to change; then in the property pane you will find the name option, and there you change the column name. Q.How to create measures and dimensions? Ans: By using Framework Manager we can create measures and dimensions according to our business needs. Q.How to pass a parameter value into an HTML design page? (That is, you create an HTML page and use that page in your Cognos 8 report header; the question is how to use a parameter value in that HTML page.) Ans: By writing JavaScript code in the HTML item, we can insert the parameter values. Here is the URL to use to call a report from Java. The following URL will work; please note that prompt names are case-sensitive and we need to prefix 'p_' to the defined prompt names. Although this passes the prompt values, it does not suppress the prompt page, so in order to achieve that we include an additional expression (&run.prompt=false). http:///cognos8/cgi-bin/cognos.cgi?b_action=xts.run&m=portal/report-viewer.xts&ui.action=run&ui.object=%2fcontent%2fpackage%5b%40name%3d%27(optional)%27%5d%2freport%5b%40name%3d%27%27%5d&run.prompt=false&p_p_year=2007&p_p_month=09 Here p_year and p_month are the parameter names for the prompts.
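To make the prompt-passing URL above easier to adapt, the short sketch below assembles the same kind of query string programmatically. The host, package and report names are placeholders (the original answer leaves the host and report name blank), and the parameter keys follow the 'p_' prefixing convention described in the answer.

```python
from urllib.parse import urlencode, quote

# Placeholders: the original answer leaves the host and report name unspecified.
host = "<cognos-host>"
package = "GO Sales (query)"       # assumed package name
report = "Revenue by Year"         # assumed report name

search_path = f"/content/package[@name='{package}']/report[@name='{report}']"

params = {
    "b_action": "xts.run",
    "m": "portal/report-viewer.xts",
    "ui.action": "run",
    "ui.object": search_path,
    "run.prompt": "false",     # suppress the prompt page
    "p_p_year": "2007",        # 'p_' prefix + prompt parameter name p_year
    "p_p_month": "09",
}

url = f"http://{host}/cognos8/cgi-bin/cognos.cgi?" + urlencode(params, quote_via=quote)
print(url)
```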
Q.What is the difference between a list report and a crosstab report? Ans: A list report shows data in rows and columns, whereas a crosstab report shows data in a grid with dimensions in the rows and columns and measures in the cells, i.e. at the intersection points. A list report and a crosstab report both contain rows and columns, but the difference is that a list report is a detail report for analysis, whereas a crosstab report shows the intersection of data, i.e. summarized data, for analysis. Q.How to create a dynamic column name in Cognos? Ans: These are the steps: i. Create a calculated column which contains the information that the header is to contain, such as "Report for year 1999" (concatenated text and a date-to-string substring extraction). ii. Highlight the report, and then right-click. iii. Select Properties, and then click the Headers/Footers tab. Clear the Column Title Header check box; this removes the headers from your columns. iv. Reinsert the rest of the column headers (Insert Text will work). v. For the dynamic column, from the Insert menu click Data, select the calculated column you created, and insert it into the report. In Cognos 8.0: first create a calculated data item, select the list, associate it with the query in which the calculated data item is created, then click on Structure and then List Headers and Footers. Check List Header and set the box type of the column header to None. Unlock, and then drag the calculated data item into the required header, where it will look like a column header in the report. For making a column name dynamic, the only thing you have to do is insert a layout calculation from the Toolbox tab in Report Studio. I came across this answer because I wanted each column in a crosstab to have a text description different from the variable's (column's) name. Cognos's documentation says nothing about how to do this, and adding text isn't an allowed action in page design mode (the error message you get is unhelpful). To get the column titles, row footers, etc. you want: 1) Switch to page structure view in Report Studio. 2) Expand the crosstab by clicking the plus signs until crosstab row levels, crosstab level(variable_name), etc. are visible. 3) From the insertable objects, drag a Block to the level where you'd like to add the text (this takes some experimenting). 4) After the block has been added, add a Text Item from the insertable objects into the block. 5) A window will open and you can type in whatever text you want. It's no wonder Jim Goodnight is so harsh towards Cognos. Q.What migration tools are available in the market with respect to Cognos, such as for migrating Impromptu reports to Cognos ReportNet? Ans: The ReportNet 1.1 migration tool is available to migrate Impromptu reports to ReportNet. Q.Can ReportNet support cubes? Ans: No. Q.What are the versions of Cognos from the initial release to the latest in the market? Ans: Cognos EP7 series, Cognos ReportNet 1.1 MR1, MR2, MR3, MR4, Cognos 8.0, 8.1, 8.2, 8.3. Q.What is an IQD? What does it contain? How do you create an IQD in the ReportNet Framework? Ans: IQD is an Impromptu Query Definition; it is used as a source for a Transformer model. To use an IQD in the ReportNet Framework Manager, there is a process called externalization with which you can generate the IQD. The name itself indicates that it contains the SQL statement. To create the IQD in Framework Manager, create a new query subject with the required query items and set the Externalize Method property of that query subject to 'iqd'. While publishing the package, select 'Generate the files for externalized query subjects' and publish it to the local machine; the generated file can then be used as a data source for the Transformer model. Q.How do you provide security to reports? How do you provide security to packages?
Ans: Through Access Manager. Q) How to select multiple values from Type-in prompt? Ans: Example - I want to enter into type-in prompt in 'Product name--- Liza' display report data in 'product name is Leza' only OR I want to enter into type-in prompt in 'All', that time display report data in all are report data. Q) How is possible this Scenario using in type-in prompt.? Ans: Here we can enter one value, here it can't shows lovs. In catalog colomns we can select the value for which we can insert the pick list. Can you be more specific abut this. What i understand from the question, you have a prompt page with a text box prompt and the user types the value. When the user types the product name ' Ex: laptop' and clicks 'finish' the report page has to display all the records relevent to laptop. When the user types 'All' in the text box prompt and clicks 'Finish' the report has to display all the products. BY USING STRING VARIABLE WE CAN SELECT(OR)ENTER MULTIPLE VALUES By using the following condition as the filter for the textbox prompt we can get the data for the specified product as well as All products when we type ALL in textbox prompt Product name = ?P_ProductName? or 'ALL' in (?P_ProductName?) Q.What is Defect/Bug/Error Life Cycle? Ans: Defect--During requirement phase if we find its called as defect.A defect in requirement. Bug-----During testing phase if we find its called a BUG. Error Life Cycle- 1) Open 2) Assigned(assigned to dev) 3) Dev-passed(after fixing) 4) Under testing(when testing team receives the fix) 5) Closed (if working fine) else reopen Defect is the issue caught while Testing, When the as said is accepted by the developers then the as said is called as a Bug;Error always occurs in the coding during Unit Testing by the development team. (Issues regarding the coding are said Error's) A Mistake in coding is called an ERROR. This mistake detected by test engineer during testing called defect(or) issue. This defect (or)issue accepted by defect tracking team as code related defect called BUG. When we found a defect /Bug we will set the status New and set the severity and then assign to the junior senior people based on the severity they will set the priority and then they will assign to corresponding developer based on the severity dev fix it. And again tester again re-tested he will fix correctly close the bug, otherwise then again re assign to the corresponding dev. Framework Manager Q.What is Framework manager? Ans: Frame Work Manager is windows based metadata development or metadata modeling tool for Cognos Report Net. Q.What is Folder? Ans: A Folder is used to organize the Query Subjects. Q.Define Relationship? Ans: Relationship is a connection that explains how the data in one Query Subject relates to data in other Query Subjects. Q.What is a Package? Ans: A container for models, reports, and so on. Modelers create packages in Framework Manager to publish models to the ReportNet server. Q.What is Physical layer? Ans: The Physical layer provides the physical query layer and is made up primarily of data source and stored procedure query subjects. It acts as the foundation for the presentation layer. Q.What is Presentation layer? Ans: The Presentation layer makes it easier for report authors to find and understand their data. The Presentation layer is made up primarily of model query subjects that you create. Q.What is a Model? Ans: A model in Frame work manager is a business presentation of the structure of the data from one or more databases. 
Or: a model is a set of related query subjects and other objects. Q.What are Fact Tables? Ans: A fact table is a table that contains summarized numerical (fact) and historical data. A fact table has a foreign key - primary key relationship with the dimension tables. Q.What are the types of Facts? Ans: The types of facts are: Additive facts: a fact which can be summed up over any of the dimensions available in the fact table. Semi-additive facts: a fact which can be summed up over only some of the dimensions, not all of the dimensions, available in the fact table. Non-additive facts: a fact which cannot be summed up over any of the dimensions available in the fact table. Q.What are the types of SQL? Ans: SQL is the industry language for creating, updating and querying relational database management systems. Types of SQL: Cognos SQL, Native SQL, Pass-through SQL. Q.Define Cognos SQL? Ans: By default, Cognos Framework Manager uses Cognos SQL to create and edit query subjects. Advantages: it can contain metadata from multiple data sources, it has fewer database restrictions, and it interacts more effectively with Cognos applications. Disadvantage: you cannot enter non-standard SQL. Q.Define Native SQL? Ans: Native SQL is the SQL the data source uses, such as Oracle SQL, but you cannot use native SQL in a query subject that references more than one data source in the project. Advantages: performance is optimized across all related query subjects, and you can use SQL that is specific to your database. Disadvantages: you cannot use SQL that the data source does not support for subqueries, and the query subject may not work on a different database type. Q.Define Pass-Through SQL? Ans: Pass-through SQL lets you use native SQL without any of the restrictions the data source imposes on subqueries. Advantage: you can enter any SQL supported by the database. Disadvantages: 1. There is no opportunity for Framework Manager to automatically optimize performance. 2. The SQL may not work on a different data source. Q.What are the Query Processing Types? Ans: There are two types of query processing. Limited Local: the database server does as much of the SQL processing and execution as possible; however, some reports or report sections use local SQL processing. Database Only: the database server does all the SQL processing and execution, with the exception of tasks not supported by the database; an error appears if any reports or report sections require local SQL processing. Q.What is a Query Subject? Ans: A query subject maps to a table in the database. A query subject uses SQL to retrieve the data from the data source. A query subject is also known as a business view. Q.What are the sources for creating new query subjects? Ans: A new query subject can be created from the following sources: the model (query subjects and query items), data sources (tables and columns), and stored procedures. Q.What is Multi-Database Access in Cognos ReportNet? Ans: In Cognos ReportNet a project can be created from multiple databases; the databases can be either homogeneous or heterogeneous. Q.What are Parameter Maps? Ans: A parameter map is a set of key-value pairs used in the creation of conditional query subjects. A parameter map substitutes the values into the query at runtime. Contact for more on Cognos Online Training
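To make the additive vs. semi-additive distinction above concrete, the small pandas sketch below sums a sales amount (additive over both date and store) and an account balance (meaningful when summed across stores for one date, but not across dates, where the latest value per store is what you want). The table and column names are invented for illustration.

```python
import pandas as pd

# Invented fact data: sales_amount is additive; account_balance is semi-additive
# (it should not be summed across the date dimension).
fact = pd.DataFrame({
    "date":    ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02"],
    "store":   ["A", "A", "B", "B"],
    "sales_amount":    [100, 150, 200, 250],
    "account_balance": [1000, 1100, 2000, 2100],
})

# Additive fact: summing over every dimension is meaningful.
print("Total sales:", fact["sales_amount"].sum())

# Semi-additive fact: summing across stores for a single date is fine...
print(fact.groupby("date")["account_balance"].sum())

# ...but across dates we take the latest balance per store instead of a sum.
latest = fact.sort_values("date").groupby("store")["account_balance"].last()
print("Closing balances:")
print(latest)
```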
Data Modeling Interview Questions
Q.What is Data Warehousing? Ans: A data warehouse can be considered a storage area where interest-specific or relevant data is stored irrespective of the source. What is actually required to create a data warehouse can be considered Data Warehousing. Data warehousing merges data from multiple sources into an easy and complete form. Q.What are fact tables and dimension tables? Ans: As mentioned, data in a warehouse comes from transactions. A fact table in a data warehouse consists of facts and/or measures; the nature of the data in a fact table is usually numerical. On the other hand, a dimension table in a data warehouse contains fields used to describe the data in the fact tables. A dimension table provides additional, descriptive information (dimensions) for a field of a fact table. E.g. if I want to know the number of resources used for a task, my fact table will store the actual measure (of resources) while my dimension table will store the task and resource details. Hence, the relation between a fact and a dimension table is one-to-many. Q.What is the ETL process in data warehousing? Ans: ETL is Extract, Transform, Load. It is the process of fetching data from different sources, converting the data into a consistent and clean form, and loading it into the data warehouse. Different tools are available in the market to perform ETL jobs. Q.Explain the difference between data mining and data warehousing. Ans: Data warehousing is merely extracting data from different sources, cleaning the data and storing it in the warehouse, whereas data mining aims to examine or explore the data using queries. These queries can be fired on the data warehouse. Exploring the data in data mining helps in reporting, planning strategies, finding meaningful patterns, etc. E.g. a company's data warehouse stores all the relevant information about projects and employees; using data mining, one can use this data to generate different reports, such as profits generated. Q.What are OLTP and OLAP systems? Ans: OLTP: Online Transaction Processing helps and manages applications based on transactions involving high volumes of data. Typical examples of transactions are commonly observed in banks, air-ticket booking, etc. Because OLTP uses a client-server architecture, it supports transactions running across a network. OLAP: Online Analytical Processing performs analysis of business data and provides the ability to perform complex calculations, usually on lower volumes of data. OLAP helps the user gain insight into the data coming from different sources (multidimensional). Q.What is PDAP? Ans: A data cube stores data in a summarized version which helps in faster analysis of the data. The data is stored in such a way that it allows easy reporting. E.g. using a data cube, a user may want to analyze the weekly or monthly performance of an employee; here, month and week could be considered dimensions of the cube. Q.What is a snowflake schema design in a database? Ans: A snowflake schema in its simplest form is an arrangement of fact tables and dimension tables. The fact table is usually at the center, surrounded by the dimension tables. Normally, in a snowflake schema the dimension tables are further broken down into more dimension tables. E.g. dimension tables include employee, projects and status; the status table can be further broken into status_weekly and status_monthly. Q.What is Analysis Services? Ans: Analysis Services provides a combined view of the data used in OLAP or data mining. Services here refer to OLAP and data mining.
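As a concrete picture of the extract-transform-load flow described above, here is a minimal Python sketch that writes a tiny source file, reads it back, cleans a couple of columns, and loads the result into a SQLite table standing in for the warehouse. The file name, column names and table name are all invented for the example.

```python
import csv
import sqlite3

# Create a tiny example source file so the sketch is self-contained
# (in practice this would be an export from an operational system).
with open("orders.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["order_id", "customer", "amount", "order_date"])
    w.writerow([" 1001 ", "alice smith", "250.00", "2024-03-01T10:15:00"])
    w.writerow([" 1002 ", "BOB JONES", "", "2024-03-02T09:30:00"])

# Extract: read raw rows from the source file.
with open("orders.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean the data into a consistent form.
cleaned = []
for r in rows:
    cleaned.append((
        r["order_id"].strip(),
        r["customer"].strip().title(),   # normalize customer names
        float(r["amount"] or 0),         # empty amounts become 0
        r["order_date"][:10],            # keep YYYY-MM-DD only
    ))

# Load: write the cleaned rows into the warehouse table.
conn = sqlite3.connect("warehouse.db")
conn.execute("""CREATE TABLE IF NOT EXISTS fact_orders (
    order_id TEXT, customer TEXT, amount REAL, order_date TEXT)""")
conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", cleaned)
conn.commit()
conn.close()
```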
Q.Explain the sequence clustering algorithm. Ans: The sequence clustering algorithm collects similar or related paths, i.e. sequences of data containing events. E.g. a sequence clustering algorithm may help find the path for storing products of a 'similar' nature in a retail warehouse. Q.Explain discrete and continuous data in data mining. Ans: Discrete data can be considered defined or finite data, e.g. mobile numbers, gender. Continuous data can be considered data which changes continuously and in an ordered fashion, e.g. age. Q.Explain the time series algorithm in data mining. Ans: A time series algorithm can be used to predict continuous values of data. Once the algorithm is trained to predict a series of data, it can predict the outcome of other series. E.g. the performance of one employee can help forecast the profit. Q.What is XMLA? Ans: XMLA is XML for Analysis, which can be considered a standard for accessing data in OLAP, data mining or data sources on the internet. It is based on the Simple Object Access Protocol (SOAP). XMLA uses the Discover and Execute methods: Discover fetches information from the source, while Execute allows applications to execute against the data sources. Q.Explain the difference between data warehousing and business intelligence. Ans: Data warehousing helps you store the data, while business intelligence helps you control the data for decision making, forecasting, etc. Data warehousing, using ETL jobs, stores data in a meaningful form; however, in order to query the data for reporting and forecasting, business intelligence tools were born. Q.What is Dimensional Modeling? Ans: Dimensional modeling is often used in data warehousing. In simpler words, it is a rational and consistent design technique used to build a data warehouse. DM uses the facts and dimensions of a warehouse for its design; snowflake and star schemas are forms of dimensional data modeling. Q.What is a surrogate key? Explain it with an example. Ans: Data warehouses commonly use a surrogate key to uniquely identify an entity. A surrogate key is not generated by the user but by the system. A primary difference between a primary key and a surrogate key in some databases is that the PK uniquely identifies a record while the SK uniquely identifies an entity. E.g. an employee may be recruited before the year 2000 while another employee with the same name may be recruited after the year 2000. Here, the primary key will uniquely identify the record, while the surrogate key will be generated by the system (say, a serial number), since the SK is NOT derived from the data. Q.What is the purpose of a Factless Fact Table? Ans: Factless fact tables are so called because they simply contain keys which refer to the dimension tables. Hence, they don't really hold facts or any other information, but they are commonly used for tracking some information about an event, e.g. to find the number of leaves taken by an employee in a month. Q.What is the level of granularity of a fact table? Ans: A fact table is usually designed at a low level of granularity. This means that we need to find the lowest level of information that can be stored in a fact table. E.g. 'employee performance' is a very high level of granularity; employee_performance_daily and employee_performance_weekly are lower levels of granularity. Q.Explain the difference between star and snowflake schemas. Ans: A snowflake schema design is usually more complex than a star schema. In a star schema a fact table is surrounded by multiple dimension tables, and this is also how a snowflake schema starts out. However, in a snowflake schema the dimension tables can be further broken down into sub-dimensions; hence, data in a snowflake schema is more normalized and standardized than in a star schema. E.g. Star schema: the performance report is a fact table, and its dimension tables include performance_report_employee and performance_report_manager. Snowflake schema: the dimension tables can be further broken down into performance_report_employee_weekly, monthly, etc.
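A tiny schema sketch helps to see the star vs. snowflake difference in code. The SQLite DDL below builds a star schema (one fact table referencing denormalized dimensions) and then "snowflakes" one dimension by splitting it into a sub-dimension; all table and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Star schema: denormalized dimensions around one fact table.
conn.executescript("""
CREATE TABLE dim_employee (
    employee_key INTEGER PRIMARY KEY,   -- surrogate key
    name TEXT,
    department_name TEXT                -- denormalized into the dimension
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    full_date TEXT, month TEXT, year INTEGER
);
CREATE TABLE fact_performance (
    employee_key INTEGER REFERENCES dim_employee(employee_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    tasks_completed INTEGER,            -- additive measure
    hours_worked REAL
);
""")

# Snowflaking: the department attribute moves into its own sub-dimension.
conn.executescript("""
CREATE TABLE dim_department (
    department_key INTEGER PRIMARY KEY,
    department_name TEXT
);
CREATE TABLE dim_employee_snowflaked (
    employee_key INTEGER PRIMARY KEY,
    name TEXT,
    department_key INTEGER REFERENCES dim_department(department_key)
);
""")
print("Tables:", [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")])
```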
Q.What is the difference between a view and a materialized view? Ans: A view is created by combining data from different tables; hence, a view does not hold data itself. On the other hand, a materialized view, usually used in data warehousing, does hold data. This data helps in decision making, performing calculations, etc. The data is stored by calculating it beforehand using queries. When a view is created, the data is not stored in the database; it is produced when a query is fired on the view. In contrast, the data of a materialized view is stored. Q.What is a junk dimension? Ans: In scenarios where certain data may not be appropriate to store in the schema, this data (or these attributes) can be stored in a junk dimension. The nature of the data in a junk dimension is usually Boolean or flag values, e.g. whether the performance of the employee was up to the mark, or comments on performance. Q.What are the fundamental stages of Data Warehousing? Ans: The stages of a data warehouse help us understand how the data in the warehouse changes. At the initial stage of data warehousing, the transactional data is merely copied to another server; here, even if the copied data is processed for reporting, the source data's performance won't be affected. In the next, evolving stage, the data in the warehouse is updated regularly from the source data. In the real-time data warehouse stage, the data in the warehouse is updated for every transaction performed on the source data (e.g. booking a ticket). When the warehouse is at the integrated stage, it not only updates data as and when a transaction is performed but also generates transactions which are passed back to the source online data. Q.What is a Data Scheme? Ans: A data scheme is a diagrammatic representation that illustrates the data structures and the data relationships to each other in the relational database within the data warehouse. The data structures have their names defined along with their data types. Data schemes are handy guides for database and data warehouse implementation. The data scheme may or may not represent the real layout of the database; it is a structural representation of the physical database. Data schemes are useful in troubleshooting databases. Q.What is a Bitmapped Index? Ans: Bitmap indexes make use of bit arrays (bitmaps) to answer queries by performing bitwise logical operations. They work well with data that has lower cardinality, which means data that takes fewer distinct values. Bitmap indexes are useful in data warehousing applications; they have a significant space and performance advantage over other structures for such data. Tables that have a small number of insert or update operations are good candidates. The advantages of bitmap indexes are: they have a highly compressed structure, making them fast to read, and their structure makes it possible for the system to combine multiple indexes so that the underlying table can be accessed faster. The disadvantage of bitmap indexes is that the overhead of maintaining them is significant.
Q.What is a Bi-directional Extract? Ans: In hierarchical, networked or relational databases, the data can be extracted, cleansed and transferred in two directions; the ability of a system to do this is referred to as bidirectional extraction. This functionality is extremely useful in data warehousing projects. Data Extraction: The source systems the data is extracted from vary in many ways, from their structures and file formats to the department and the business segment they belong to. Common source formats include flat files and relational databases, as well as other non-relational database structures such as IMS, VSAM or ISAM. Data Transformation: The extracted data may undergo transformation, with the possible addition of metadata, before it is exported to another large storage area. In the transformation phase, various functions related to business needs, requirements, rules and policies are applied; during this process some values are even translated and encoded. Care is also taken to avoid redundancy of data. Data Cleansing: In data cleansing, incorrect or corrupted data is scrutinized and those inaccuracies are removed, ensuring data consistency. It involves activities like removing typographical errors and inconsistencies, and comparing and validating data entries against a list of entities. Data Loading: This is the last step of a bidirectional extract. The cleansed, transformed source data is loaded into the data warehouse. Advantages: Updates and data loading become very fast due to bidirectional extraction. As timely updates are received in a useful pattern, companies can make good use of this data to launch new products and formulate market strategies. Disadvantages: More investment in advanced and faster IT infrastructure is required. Without fault tolerance, a system failure may mean an unexpected stoppage of operations. Skilled data administrators need to be hired to manage the complex process. Q.What is Data Collection Frequency? Ans: Data collection frequency is the rate at which data is collected. However, the data is not just collected and stored; it goes through various stages of processing, like extraction from various sources, cleansing, transformation and then storage in useful patterns. It is important to keep a record of the rate at which data is collected, for various reasons: Companies can use these records to keep track of the transactions that have occurred, and based on these records a company can know whether any invalid transactions ever occurred. In scenarios where the market changes rapidly, companies need very frequently updated data so they can make decisions based on the state of the market and invest appropriately. A few companies keep launching new products and keep updating their records so that their customers can see them, which in turn increases their business. When data warehouses face technical problems, the logs as well as the data collection frequency can be used to determine the time and cause of the problem. Due to real-time data collection, database managers and data warehouse specialists can make more room for recording data collection frequency. Q.What is Data Cardinality? Ans: Cardinality is the term used in database relations to denote the occurrence of data on either side of the relation. There are 3 basic types of cardinality: High data cardinality: the values of a data column are very uncommon,
e.g. email IDs and user names. Normal data cardinality: the values of a data column are somewhat uncommon but not unique, e.g. a data column containing LAST_NAME (there may be several entries with the same last name). Low data cardinality: the values of a data column are very common, e.g. flag statuses: 0/1. Determining data cardinality is a substantial aspect of data modeling; it is used to determine the relationships. Types of cardinalities: The Link Cardinality - 0:0 relationship; The Sub-type Cardinality - 1:0 relationship; The Physical Segment Cardinality - 1:1 relationship; The Possession Cardinality - 0:M relationship; The Child Cardinality - 1:M mandatory relationship; The Characteristic Cardinality - 0:M relationship; The Paradox Cardinality - 1:M relationship. Q.What is Chained Data Replication? Ans: In chained data replication, the non-official data set distributed among many disks provides load balancing among the servers within the data warehouse. Blocks of data are spread across clusters, and each cluster can contain a complete set of replicated data. Every data block in every cluster is a unique permutation of the data in other clusters. When a disk fails, all the calls made to the data on that disk are redirected to the other disks where the data has been replicated. At times, replicas and disks are added online without having to move around the data in the existing copy or affect the arm movement of the disk. For load balancing, chained data replication lets multiple servers within the data warehouse share data-request processing, since the data already has replicas on each server's disk. Q.What are Critical Success Factors? Ans: Key areas of activity in which favorable results are necessary for a company to reach its goal. There are four basic types of CSFs: Industry CSFs, Strategy CSFs, Environmental CSFs and Temporal CSFs. A few CSFs are: money, your future, customer satisfaction, quality, product or service development, intellectual capital, strategic relationships, employee attraction and retention, and sustainability. The advantages of identifying CSFs are: they are simple to understand; they help focus attention on major concerns; they are easy to communicate to coworkers; they are easy to monitor; and they can be used in concert with strategic planning methodologies. Q.What is Virtual Data Warehousing? Ans: A virtual data warehouse provides a collective view of the completed data. A virtual data warehouse has no historical data; it can be considered a logical data model of the containing metadata. Q.Explain in brief the various fundamental stages of Data Warehousing. Ans: The stages of a data warehouse help us understand how the data in the warehouse changes. At the initial stage of data warehousing, the transactional data is merely copied to another server; here, even if the copied data is processed for reporting, the source data's performance won't be affected. In the next, evolving stage, the data in the warehouse is updated regularly from the source data. In the real-time data warehouse stage, the data in the warehouse is updated for every transaction performed on the source data (e.g. booking a ticket). When the warehouse is at the integrated stage, it not only updates data as and when a transaction is performed but also generates transactions which are passed back to the source online data. Q.What is active data warehousing? Ans: An active data warehouse represents a single state of the business. Active data warehousing considers the analytic perspectives of customers and suppliers. It helps to deliver the updated data through reports.
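One practical way to apply the high/normal/low cardinality classification above is to profile each column's ratio of distinct values to total rows. The pandas sketch below does exactly that with invented data; the cut-off thresholds are arbitrary assumptions, not a standard.

```python
import pandas as pd

df = pd.DataFrame({
    "email":       [f"user{i}@x.com" for i in range(10)],                # high cardinality
    "last_name":   ["Smith", "Jones", "Smith", "Brown", "Lee",
                    "Patel", "Kim", "Jones", "Garcia", "Chen"],          # normal cardinality
    "active_flag": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],                       # low cardinality
})

def classify_cardinality(series, high=0.9, low=0.2):
    """Classify a column by its distinct/total ratio (thresholds are assumed)."""
    ratio = series.nunique() / len(series)
    if ratio >= high:
        return "high"
    if ratio <= low:
        return "low"
    return "normal"

for col in df.columns:
    print(col, "->", classify_cardinality(df[col]))
```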
Q.What are data modeling and data mining? What are they used for? Ans: Data modeling is a technique used to define and analyze the requirements of the data that supports an organization's business processes. In simple terms, it is used for the analysis of data objects in order to identify the relationships among these data objects in any business. Data mining is a technique used to analyze datasets to derive useful insights/information. It is mainly used in retail, consumer goods, telecommunication and financial organizations that have a strong consumer orientation, in order to determine the impact on sales, customer satisfaction and profitability. Data mining is very helpful in determining the relationships among different business attributes. Q.Difference between ER modeling and dimensional modeling? Ans: The entity-relationship model is a method used to graphically represent the logical flow of entities/objects that in turn create a database. It has both a logical and a physical model, and it is good for reporting and point queries. The dimensional model is a method in which the data is stored in two types of tables, namely fact tables and dimension tables. It has only a physical model. It is good for ad hoc query analysis. Q.What is the difference between data warehousing and business intelligence? Ans: Data warehousing relates to all aspects of data management, starting from the development, implementation and operation of the data sets. It is a store of all data relevant to the business context, i.e. a way of storing data. Business intelligence is used to analyze the data from a business point of view to measure an organization's success. Factors like sales, profitability, marketing campaign effectiveness, market share and operational efficiency are analyzed using business intelligence tools like Cognos, Informatica, SAS, etc. Q.Describe dimensional modeling. Ans: The dimensional model is a method in which the data is stored in two types of tables, namely fact tables and dimension tables. The fact table comprises information to measure business success and the dimension table comprises the information on which the business success is calculated. It is mainly used by data warehouse designers to build data warehouses. It represents the data in a standard and sequential manner that enables high-performance access. Q.What is a snapshot with reference to a data warehouse? Ans: A snapshot refers to a complete visualization of data at the time of extraction. It occupies less space and can be used to back up and restore data quickly. Contact for more information on Data Modeling Online Training
Data Science Interview Questions
Q.What do you mean by the term Data Science? Ans: Data Science is the extraction of knowledge from large volumes of data that are structured or unstructured. It is a continuation of the fields of data mining and predictive analytics, and is also known as knowledge discovery and data mining. Q.Explain the term botnet? Ans: A botnet is a network of compromised machines, for example bots running on an IRC network that have been created with a Trojan. Q.What is Data Visualization? Ans: Data visualization is a general term that describes any effort to help people understand the significance of data by placing it in a visual context. Q.Why is data cleaning a critical part of the process? Ans: Cleaning up data to the point where you can work with it is a huge amount of work. If we're trying to reconcile a lot of sources of data that we don't control, it can take 80% of our time. Q.Point out 7 ways Data Scientists use statistics? Ans: 1. Design and interpret experiments to inform product decisions. 2. Build models that predict signal, not noise. 3. Turn big data into the big picture. 4. Understand user retention, engagement, conversion, and leads. 5. Give users what they want. 6. Estimate intelligently. 7. Tell the story with the data. Q.Differentiate between data modeling and database design? Ans: Data modeling: data modeling (or modeling) in software engineering is the process of creating a data model for an information system by applying formal data modeling techniques. Database design: database design is the process of producing a detailed data model of a database. The term database design can be used to describe many different parts of the design of an overall database system. Q.Describe in brief the Data Science process flowchart? Ans: 1. Data is collected from sensors in the environment. 2. Data is "cleaned", or processed, to produce a data set (typically a data table) usable for processing. 3. Exploratory data analysis and statistical modeling may be performed. 4. A data product is a program, such as one retailers use to recommend new purchases based on purchase history; it may also create data and feed it back into the environment. Q.What do you understand by the term hash table collision? Ans: A hash table (hash map) is a kind of data structure used to implement an associative array, i.e. a structure that can map keys to values. Ideally, the hash function assigns each key to a unique bucket, but sometimes two keys will generate an identical hash, causing both keys to point to the same bucket. This is known as a hash collision. Q.Compare and contrast R and SAS? Ans: SAS is commercial software, whereas R is open source and can be downloaded by anyone. SAS is easy to learn and provides an easy option for people who already know SQL, whereas R is a lower-level programming language, so simple procedures take longer code.
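To illustrate the hash-collision idea from the answer above, the snippet below builds a toy hash table with deliberately few buckets and resolves collisions by chaining (storing several key-value pairs in one bucket). This is a teaching sketch, not how Python's built-in dict is implemented.

```python
class ToyHashTable:
    """Tiny chained hash table to show how two keys can share a bucket."""

    def __init__(self, num_buckets=4):        # deliberately few buckets
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                       # update an existing key
                bucket[i] = (key, value)
                return
        bucket.append((key, value))            # collision -> chain in the same bucket

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = ToyHashTable()
for word in ["apple", "banana", "cherry", "date", "elderberry"]:
    table.put(word, len(word))

# With only 4 buckets and 5 keys, at least one bucket must hold 2+ entries.
print([len(b) for b in table.buckets])
print(table.get("cherry"))
```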
Q. What are the applied Machine Learning process steps?
Ans: 1. Problem Definition: understand and clearly describe the problem that is being solved. 2. Analyze Data: understand the information available that will be used to develop a model. 3. Prepare Data: define and expose the structure in the dataset. 4. Evaluate Algorithms: develop a robust test harness and baseline accuracy from which to improve, and spot-check algorithms. 5. Improve Results: refine results to develop more accurate models. 6. Present Results: detail the problem and solution so that they can be understood by third parties.
Q. Compare multivariate, univariate and bivariate analysis?
Ans: MULTIVARIATE: Multivariate analysis focuses on the results of observations of many different variables for a number of objects. UNIVARIATE: Univariate analysis is perhaps the simplest form of statistical analysis. Like other forms of statistics, it can be inferential or descriptive; the key fact is that only one variable is involved. BIVARIATE: Bivariate analysis is one of the simplest forms of quantitative (statistical) analysis. It involves the analysis of two variables (often denoted as X and Y) for the purpose of determining the empirical relationship between them.
Q. What is a hypothesis in Machine Learning?
Ans: The hypothesis space used by a machine learning system is the set of all hypotheses that might possibly be returned by it. It is typically defined by a hypothesis language, possibly in conjunction with a language bias.
Q. Differentiate between uniform and skewed distributions?
Ans: UNIFORM DISTRIBUTION: A uniform distribution, sometimes also known as a rectangular distribution, is a distribution that has constant probability over its range. SKEWED DISTRIBUTION: In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, negative, or even undefined, and the qualitative interpretation of the skew is complicated.
Q. What do you understand by the term transformation in data acquisition?
Ans: The transformation process allows you to consolidate, cleanse, and integrate data. We can semantically arrange the data from heterogeneous sources.
Q. What do you understand by the term normal distribution?
Ans: It is a function which shows the distribution of many random variables as a symmetrical bell-shaped graph.
Q. What is data acquisition?
Ans: It is the process of measuring an electrical or physical phenomenon such as voltage, current, temperature, pressure, or sound with a computer. A DAQ system comprises sensors, DAQ measurement hardware, and a computer with programmable software.
Q. What is data collection?
Ans: Data collection is the process of collecting and measuring information on variables of interest in a systematic fashion that enables one to answer the stated research questions, test hypotheses, and evaluate outcomes.
Q. What do you understand by the term use case?
Ans: A use case is a methodology used in system analysis to identify, clarify, and organize system requirements. A use case consists of a set of possible sequences of interactions between systems and users in a particular environment, related to a particular defined goal.
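A short Python sketch of the uniform-versus-skewed contrast discussed above, using NumPy samples and SciPy's skewness measure; the sample size and distribution parameters are arbitrary choices for the example:

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

uniform_sample = rng.uniform(0, 1, 10_000)     # constant probability over [0, 1)
skewed_sample = rng.exponential(1.0, 10_000)   # right-skewed: long tail of large values

# Skewness is roughly 0 for the uniform sample and clearly positive for the exponential one
print("uniform skewness:", round(skew(uniform_sample), 3))
print("exponential skewness:", round(skew(skewed_sample), 3))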
Q. What is sampling and what is a sampling distribution?
Ans: SAMPLING: Sampling is the process of choosing units (e.g. people, organizations) from a population of interest so that, by studying the sample, we can fairly generalize our results back to the population from which they were chosen. SAMPLING DISTRIBUTION: The sampling distribution of a statistic is the distribution of that statistic, considered as a random variable, when derived from a random sample of size n. It may be considered as the distribution of the statistic for all possible samples of a given size from the same population.
Q. What is linear regression?
Ans: In statistics, linear regression is a way of modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted by X. The case of one explanatory variable is known as simple linear regression.
Q. Differentiate between extrapolation and interpolation?
Ans: Extrapolation is an approximation of a value based on extending a known sequence of values or facts beyond the range that is certainly known. Interpolation is an estimation of a value between two known values in a list of values.
Q. How is expected value different from mean value?
Ans: There is no difference; these are two names for the same thing. They are mostly used in different contexts, though: we talk about the expected value of a random variable and the mean of a sample, population or probability distribution.
Q. Differentiate between systematic and cluster sampling?
Ans: SYSTEMATIC SAMPLING: Systematic sampling is a statistical methodology involving the selection of elements from an ordered sampling frame. The most common form of systematic sampling is an equal-probability method. CLUSTER SAMPLING: A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements.
Q. What are the advantages of systematic sampling?
Ans: 1. It is easier to perform in the field, especially if a proper frame is not available. 2. It regularly provides more information per unit cost than simple random sampling, in the sense of smaller variances.
Q. What do you understand by the term threshold limit value?
Ans: The threshold limit value (TLV) of a chemical substance is the level to which it is believed a worker can be exposed day after day for a working lifetime without adverse health effects.
Q. Differentiate between a validation set and a test set?
Ans: Validation set: a set of examples used to tune the parameters of a classifier, for example to choose the number of hidden units in a neural network. Test set: a set of examples used only to assess the performance of a fully specified classifier.
Q. How can R and Hadoop be used together?
Ans: The most common way to link R and Hadoop is to use HDFS (potentially managed by Hive or HBase) as the long-term store for all data, and use MapReduce jobs (potentially submitted from Hive, Pig, or Oozie) to encode, enrich, and sample data sets from HDFS into R. Data analysts can then perform complex modeling exercises on a subset of prepared data in R.
Q. What do you understand by the term RImpala?
Ans: The RImpala package contains the R functions required to connect to Impala, execute queries and retrieve the results. It uses the rJava package to create a JDBC connection to any of the Impala servers running on a Hadoop cluster.
Q. What is collaborative filtering?
Ans: Collaborative filtering (CF) is a method used by some recommender systems. It has two senses, a narrow one and a more general one. In general, collaborative filtering is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, and data sources.
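For the linear regression and interpolation/extrapolation questions above, here is a minimal NumPy sketch; the sample points are made up for illustration:

import numpy as np

# Simple linear regression: fit y = slope*x + intercept to a handful of points
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
slope, intercept = np.polyfit(x, y, 1)

def predict(x_new):
    return slope * x_new + intercept

print(predict(3.5))   # interpolation: 3.5 lies inside the observed range [1, 5]
print(predict(10.0))  # extrapolation: 10 lies outside that range, so it is less reliable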
Q. What are the challenges of collaborative filtering?
Ans: 1. Scalability 2. Data sparsity 3. Synonyms 4. Grey sheep 5. Shilling attacks 6. Diversity and the long tail
Q. What do you understand by big data?
Ans: Big data is a buzzword, or catch-phrase, that describes a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques.
Q. What do you understand by matrix factorization?
Ans: Matrix factorization is simply a mathematical tool for working with matrices, and is therefore applicable in many scenarios where one would like to find structure hidden in the data.
Q. What do you understand by the term singular value decomposition?
Ans: In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. It has many useful applications in signal processing and statistics.
Q. What do you mean by recommender systems?
Ans: Recommender systems or recommendation systems (sometimes replacing "system" with a synonym such as platform or engine) are a subclass of information filtering system that seek to predict the 'rating' or 'preference' that a user would give to an item.
Q. What are the applications of recommender systems?
Ans: Recommender systems have become extremely common in recent years and are applied in a variety of areas. The most popular ones are probably movies, music, news, books, research articles, search queries, social tags, and products in general.
Q. What are the two ways a recommender system works?
Ans: Recommender systems typically produce a list of recommendations in one of two ways: through collaborative or content-based filtering. Collaborative filtering approaches build a model from a user's past behavior (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. This model is then used to predict items (or ratings for items) that the user may have an interest in. Content-based filtering approaches utilize a series of discrete characteristics of an item in order to recommend additional items with similar properties.
Q. What factors are used to judge the most accurate recommendation algorithms?
Ans: 1. Diversity 2. Recommender persistence 3. Privacy 4. User demographics 5. Robustness 6. Serendipity 7. Trust 8. Labeling
Q. What is K-Nearest Neighbor?
Ans: k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.
Q. What is horizontal slicing?
Ans: In horizontal slicing, projects are broken up roughly along architectural lines. That is, there would be one team for the UI, one team for business logic and services (SOA), and another team for data.
Q. What are the advantages of vertical slicing?
Ans: The advantage of slicing vertically is that you are more efficient. You don't have the overhead and effort that come from trying to coordinate activities across multiple teams, and there is no need to negotiate for resources: you're all on the same team.
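Tying the SVD and recommender-system questions above together, here is a minimal NumPy sketch that factorizes a tiny user-by-item ratings matrix and rebuilds a low-rank approximation; the ratings are invented for the example and a real recommender would handle missing entries more carefully:

import numpy as np

# Rows = users, columns = items; entries are ratings (0 = not rated, treated naively here)
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
])

# Full SVD, then keep only the top-2 singular values for a low-rank approximation
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The reconstructed scores can be read as rough predicted affinities for unrated items
print(np.round(approx, 2))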
Q. What is a null hypothesis?
Ans: In inferential statistics, the null hypothesis usually refers to a general statement or default position that there is no relationship between two measured phenomena, or no difference among groups.
Q. What is a statistical hypothesis?
Ans: In statistical hypothesis testing, the alternative hypothesis (or maintained hypothesis or research hypothesis) and the null hypothesis are the two rival hypotheses which are compared by a statistical hypothesis test.
Q. What is a performance measure?
Ans: Performance measurement is the method of collecting, analyzing and/or reporting information regarding the performance of an individual, group, organization, system or component.
Q. What is the use of the tree command?
Ans: This command is used to list the contents of directories in a tree-like format.
Q. What is the use of the uniq command?
Ans: This command is used to report or omit repeated lines.
Q. Which command is used to translate or delete characters?
Ans: The tr command is used to translate or delete characters.
Q. What is the use of the tapkee command?
Ans: This command is used to reduce the dimensionality of a data set using various algorithms.
Q. Which command is used to sort the lines of text files?
Ans: The sort command is used to sort the lines of text files.
Data Science Interview Questions and Answers in R Programming
Q. How can you merge two data frames in R language?
Ans: Data frames in R can be merged manually using the cbind() function or by using the merge() function on common rows or columns.
Q. Explain data import in R language.
Ans: R Commander is used to import data in R. To start the R Commander GUI, the user must type the command Rcmdr into the console. There are 3 different ways in which data can be imported in R: users can select the data set in the dialog box or enter the name of the data set (if they know it); data can be entered directly using the editor of R Commander via Data -> New Data Set, which works well when the data set is not too large; and data can be imported from a URL, from a plain text file (ASCII), from any other statistical package, or from the clipboard.
Q. Two vectors X and Y are defined as follows – X
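For readers more familiar with Python, the cbind()/merge() question above has a rough pandas analogue. This is only an analogy, not R code, and the frames and column names (left, right, id) are invented for the example:

import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [10, 20, 30]})

# Column-wise concatenation (similar in spirit to R's cbind) -- only sensible when rows align
side_by_side = pd.concat([left, right], axis=1)

# Key-based merge on a common column (similar in spirit to R's merge on common columns)
joined = left.merge(right, on="id", how="inner")
print(joined)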
DataStage Interview Questions
Q1. Define Data Stage? Ans: A data stage is basically a tool that is used to design, develop and execute various applications to fill multiple tables in data warehouse or data marts. It is a program for Windows servers that extracts data from databases and change them into data warehouses. It has become an essential part of IBM WebSphere Data Integration suite. Q2. Explain how a source file is populated? Ans: We can populate a source file in many ways such as by creating a SQL query in Oracle, or by using row generator extract tool etc. Q3. Name the command line functions to import and export the DS jobs? Ans: To import the DS jobs, dsimport.exe is used and to export the DS jobs, dsexport.exe is used. Q4. What is the difference between Datastage 7.5 and 7.0? Ans: In Datastage 7.5 many new stages are added for more robustness and smooth performance, such as Procedure Stage, Command Stage, Generate Report etc. Q5. In Datastage, how you can fix the truncated data error? Ans: The truncated data error can be fixed by using ENVIRONMENT VARIABLE ‘ IMPORT_REJECT_STRING_FIELD_OVERRUN’. Q6. Define Merge? Ans: Merge means to join two or more tables. The two tables are joined on the basis of Primary key columns in both the tables. Q7. Differentiate between data file and descriptor file? Ans: As the name implies, data files contains the data and the descriptor file contains the description/information about the data in the data files. Q8. Differentiate between datastage and informatica? Ans: In datastage, there is a concept of partition, parallelism for node configuration. While, there is no concept of partition and parallelism in informatica for node configuration. Also, Informatica is more scalable than Datastage. Datastage is more user-friendly as compared to Informatica. Q9. Define Routines and their types? Ans: Routines are basically collection of functions that is defined by DS manager. It can be called via transformer stage. There are three types of routines such as, parallel routines, main frame routines and server routines. Q10. How can you write parallel routines in datastage PX? Ans: We can write parallel routines in C or C++ compiler. Such routines are also created in DS manager and can be called from transformer stage. Q11. What is the method of removing duplicates, without the remove duplicate stage? Ans: Duplicates can be removed by using Sort stage. We can use the option, as allow duplicate = false. Q12. What steps should be taken to improve Datastage jobs? Ans: In order to improve performance of Datastage jobs, we have to first establish the baselines. Secondly, we should not use only one flow for performance testing. Thirdly, we should work in increment. Then, we should evaluate data skews. Then we should isolate and solve the problems, one by one. After that, we should distribute the file systems to remove bottlenecks, if any. Also, we should not include RDBMS in start of testing phase. Last but not the least, we should understand and assess the available tuning knobs. Q13. Differentiate between Join, Merge and Lookup stage? Ans: All the three concepts are different from each other in the way they use the memory storage, compare input requirements and how they treat various records. Join and Merge needs less memory as compared to the Lookup stage. Q14. Explain Quality stage? Ans: Quality stage is also known as Integrity stage. It assists in integrating different types of data from various sources. Q15. Define Job control? 
Ans: Job control can best be performed by using Job Control Language (JCL). This tool is used to execute multiple jobs simultaneously, without using any kind of loop.
Q16. Differentiate between Symmetric Multiprocessing and Massive Parallel Processing?
Ans: In Symmetric Multiprocessing, the hardware resources are shared by the processors. The processors run one operating system and communicate through shared memory. In Massive Parallel Processing, each processor accesses the hardware resources exclusively. This type of processing is also known as Shared Nothing, since nothing is shared, and it is faster than Symmetric Multiprocessing.
Q17. What are the steps required to kill a job in Datastage?
Ans: To kill a job in Datastage, we have to kill the respective processing ID.
Q18. Differentiate between validated and compiled in Datastage?
Ans: In Datastage, validating a job means executing it; while validating, the Datastage engine verifies whether all the required properties are provided or not. In the other case, while compiling a job, the Datastage engine verifies whether all the given properties are valid or not.
Q19. How do you manage date conversion in Datastage?
Ans: We can use the date conversion functions for this purpose, i.e. Oconv(Iconv(Fieldname,"Existing Date Format"),"Another Date Format").
Q20. Why do we use exception activity in Datastage?
Ans: All the stages after the exception activity in Datastage are executed in case any unknown error occurs while executing the job sequencer.
Q21. Define APT_CONFIG in Datastage?
Ans: It is the environment variable that is used to identify the *.apt file in Datastage. It is also used to store the node information, disk storage information and scratch information.
Q22. Name the different types of lookups in Datastage?
Ans: There are two types of lookups in Datastage, i.e. Normal lkp and Sparse lkp. In a Normal lkp, the data is loaded into memory first and then the lookup is performed. In a Sparse lkp, the data is accessed directly in the database. Therefore, the Sparse lkp is faster than the Normal lkp.
Q23. How can a server job be converted to a parallel job?
Ans: We can convert a server job into a parallel job by using an IPC stage and a Link Collector.
Q24. Define Repository tables in Datastage?
Ans: In Datastage, the Repository is another name for a data warehouse. It can be centralized as well as distributed.
Q25. Define the OConv () and IConv () functions in Datastage?
Ans: In Datastage, the OConv () and IConv () functions are used to convert formats from one format to another, i.e. conversions of roman numerals, time, date, radix, numeral ASCII etc. IConv () is basically used to convert formats for the system to understand, while OConv () is used to convert formats for users to understand.
Q26. Explain Usage Analysis in Datastage?
Ans: In Datastage, Usage Analysis is performed within a few clicks: launch Datastage Manager, right-click the job, then select Usage Analysis, and that's it.
Q27. How do you find the number of rows in a sequential file?
Ans: To find rows in a sequential file, we can use the system variable @INROWNUM.
Q28. Differentiate between a Hash file and a Sequential file?
Ans: The only difference between the Hash file and the Sequential file is that the Hash file saves data using a hash algorithm and a hash key value, while a sequential file doesn't have any key value to save the data. Based on this hash key feature, searching in a Hash file is faster than in a sequential file.
Q29. How to clean the Datastage repository?
Ans: We can clean the Datastage repository by using the Clean Up Resources functionality in the Datastage Manager. Q30. How a routine is called in Datastage job? Ans: In Datastage, routines are of two types i.e. Before Sub Routines and After Sub Routines. We can call a routine from the transformer stage in Datastage. Q31. Differentiate between Operational Datastage (ODS) and Data warehouse? Ans: We can say, ODS is a mini data warehouse. An ODS doesn’t contain information for more than 1 year while a data warehouse contains detailed information regarding the entire business. Q32. NLS stands for what in Datastage? Ans: NLS means National Language Support. It can be used to incorporate other languages such as French, German, and Spanish etc. in the data, required for processing by data warehouse. These languages have same scripts as English language. Q33. Can you explain how could anyone drop the index before loading the data in target in Datastage? Ans: In Datastage, we can drop the index before loading the data in target by using the Direct Load functionality of SQL Loaded Utility. Q34. How can one implement the slowly changing dimensions in Datastage? Ans: Slowly changing dimensions is not a concept related to Datastage. Datastage is used for ETL purpose and not for slowly changing dimensions. Q35. How can one find bugs in job sequence? Ans: We can find bugs in job sequence by using DataStage Director. Q36. How complex jobs are implemented in Datstage to improve performance? Ans: In order to improve performance in Datastage, it is recommended, not to use more than 20 stages in every job. If you need to use more than 20 stages then it is better to use another job for those stages. Q37. Name the third party tools that can be used in Datastage? Ans: The third party tools that can be used in Datastage, are Autosys, TNG and Event Co-ordinator. I have worked with these tools and possess hands on experience of working with these third party tools. Q38. Define Project in Datastage? Ans: Whenever we launch the Datastage client, we are asked to connect to a Datastage project. A Datastage project contains Datastage jobs, built-in components and Datastage Designer or User-Defined components. Q39. How many types of hash files are there? Ans: There are two types of hash files in DataStage i.e. Static Hash File and Dynamic Hash File. The static hash file is used when limited amount of data is to be loaded in the target database. The dynamic hash file is used when we don’t know the amount of data from the source file. Q40. Define Meta Stage? Ans: In Datastage, MetaStage is used to save metadata that is helpful for data lineage and data analysis. Q41. Have you have ever worked in UNIX environment and why it is useful in Datastage? Ans: Yes, I have worked in UNIX environment. This knowledge is useful in Datastage because sometimes one has to write UNIX programs such as batch programs to invoke batch processing etc. Q42. Differentiate between Datastage and Datastage TX? Ans: Datastage is a tool from ETL (Extract, Transform and Load) and Datastage TX is a tool from EAI (Enterprise Application Integration). Q43. What is size of a transaction and an array means in a Datastage? Ans: Transaction size means the number of row written before committing the records in a table. An array size means the number of rows written/read to or from the table respectively. Q44. How many types of views are there in a Datastage Director? Ans: There are three types of views in a Datastage Director i.e. 
Job View, Log View and Status View.
Q45. Why do we use a surrogate key?
Ans: In Datastage, we use a surrogate key instead of a unique natural key. A surrogate key is mostly used for retrieving data faster; it uses an index to perform the retrieval operation.
Q46. How are rejected rows managed in Datastage?
Ans: In Datastage, rejected rows are managed through constraints in the transformer. We can either place the rejected rows in the properties of a transformer or create temporary storage for rejected rows with the help of the REJECTED command.
Q47. Differentiate between the ODBC and DRS stages?
Ans: The DRS stage is faster than the ODBC stage because it uses native database drivers for connectivity.
Q48. Define the Orabulk and BCP stages?
Ans: The Orabulk stage is used to load a large amount of data into one target table of an Oracle database. The BCP stage is used to load a large amount of data into one target table of Microsoft SQL Server.
Q49. Define DS Designer?
Ans: The DS Designer is used to design the work area and add various links to it.
Q50. Why do we use Link Partitioner and Link Collector in Datastage?
Ans: In Datastage, the Link Partitioner is used to divide data into different parts through certain partitioning methods. The Link Collector is used to gather data from various partitions/segments into a single stream and save it in the target table.
More questions
Q51. How did you handle reject data?
Ans: Typically a reject link is defined and the rejected data is loaded back into the data warehouse, so a reject link has to be defined on every output link for which you wish to collect rejected data. Rejected data is typically bad data like duplicate primary keys or null rows where data is expected.
Q52. If you worked with DS6.0 and later versions, what are Link-Partitioner and Link-Collector used for?
Ans: Link Partitioner - used for partitioning the data. Link Collector - used for collecting the partitioned data.
Q53. What are Routines and where/how are they written, and have you written any routines before?
Ans: Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit them. The following are the different types of routines: 1) Transform functions 2) Before-after job subroutines 3) Job control routines
Q54. What are the OConv () and IConv () functions and where are they used?
Ans: IConv() converts a string to an internal storage format; OConv() converts an expression to an output format.
Q55. How did you connect to DB2 in your last project?
Ans: Using DB2 ODBC drivers.
Q56. Explain METASTAGE?
Ans: MetaStage is used to handle the metadata, which will be very useful for data lineage and data analysis later on. Metadata defines the type of data we are handling. These data definitions are stored in the repository and can be accessed with the use of MetaStage.
Q57. Do you know about the INTEGRITY/QUALITY stage?
Ans: Quality Stage can be integrated with DataStage. In Quality Stage we have many stages like Investigate, Match and Survivorship, so that we can do quality-related work; to integrate it with DataStage we need the Quality Stage plugin.
Q58. Explain the differences between Oracle 8i/9i?
Ans: Oracle 8i does not support the pseudo column sysdate but 9i supports it. In Oracle 8i we can create 256 columns in a table, but in 9i we can have up to 1000 columns (fields).
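The IConv/OConv idiom above (parse into an internal form, then format back out) has a close analogue in Python's datetime module. This is only an analogy for readers without a DataStage environment, not DataStage code; the formats chosen are the mm/dd/yyyy to yyyy-dd-mm conversion asked about later in this list:

from datetime import datetime

def convert_date(value, in_fmt="%m/%d/%Y", out_fmt="%Y-%d-%m"):
    # Parse into an internal representation (like IConv), then format back out (like OConv)
    return datetime.strptime(value, in_fmt).strftime(out_fmt)

print(convert_date("03/25/2024"))  # -> '2024-25-03'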
Q59. How do you merge two files in DS?
Ans: Either use the Copy command as a Before-job subroutine if the metadata of the 2 files is the same, or create a job to concatenate the 2 files into one if the metadata is different.
Q60. What is DS Designer used for?
Ans: You use the Designer to build jobs by creating a visual design that models the flow and transformation of data from the data source through to the target warehouse. The Designer graphical interface lets you select stage icons, drop them onto the Designer work area, and add links.
Q61. What is DS Administrator used for?
Ans: The Administrator enables you to set up DataStage users, control the purging of the Repository and, if National Language Support (NLS) is enabled, install and manage maps and locales.
Q62. What is DS Director used for?
Ans: DataStage Director is used to run and validate the jobs. We can go to DataStage Director from DataStage Designer itself.
Q63. What is DS Manager used for?
Ans: The Manager is a graphical tool that enables you to view and manage the contents of the DataStage Repository.
Q64. What are Static Hash files and Dynamic Hash files?
Ans: As the names themselves suggest what they mean. In general we use Type-30 dynamic hash files. The data file has a default size of 2 GB and the overflow file is used if the data exceeds the 2 GB size.
Q65. What is the Hash file stage and what is it used for?
Ans: It is used for look-ups. It is like a reference table. It is also used in place of ODBC or OCI tables for better performance.
Q66. How are the dimension tables designed?
Ans: Find where the data for this dimension is located. Figure out how to extract this data. Determine how to maintain changes to this dimension. Change the fact table and DW population routines accordingly.
Q67. Does the selection of 'Clear the table and Insert rows' in the ODBC stage send a Truncate statement to the DB, or does it do some kind of Delete logic?
Ans: There is no TRUNCATE on ODBC stages. 'Clear the table' issues a DELETE FROM statement. On an OCI stage such as Oracle, you do have both Clear and Truncate options. They are radically different in permissions (Truncate requires you to have ALTER TABLE permissions, whereas Delete doesn't).
Q68. Tell me one situation from your last project where you faced a problem, and how did you solve it?
Ans: A. The jobs in which data is read directly from OCI stages were running extremely slowly. I had to stage the data before sending it to the transformer to make the jobs run faster. B. The job aborts in the middle of loading some 500,000 rows. You have the option of either cleaning/deleting the loaded data and then running the fixed job, or running the job again from the row at which it aborted. To make sure the load was proper, we opted for the former.
Q69. Why do we have to load the dimension tables first, then the fact tables?
Ans: As we load the dimension tables the (primary) keys are generated, and these keys are foreign keys in the fact tables.
Q70. How will you determine the sequence of jobs to load into the data warehouse?
Ans: First we execute the jobs that load the data into the dimension tables, then the fact tables, then we load the aggregator tables (if any).
Q71. What are the command line functions that import and export the DS jobs?
Ans: A. dsimport.exe imports the DataStage components. B. dsexport.exe exports the DataStage components.
Q72. What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director?
Ans: Use the crontab utility along with the dsexecute() function, with proper parameters passed.
Q73. How would you call an external Java function which is not supported by DataStage?
Ans: Starting from DS 6.0 we have the ability to call external Java functions using a Java package from Ascential. In this case we can even use the command line to invoke the Java function, write the return values from the Java program (if any) to a file, and use that file as a source in a DataStage job.
Q74. What will you do in a situation where somebody wants to send you a file, use that file as an input or reference, and then run the job?
Ans: A. Under Windows: use the 'WaitForFileActivity' under the Sequencers and then run the job. You may be able to schedule the sequencer around the time the file is expected to arrive. B. Under UNIX: poll for the file. Once the file has arrived, start the job or sequencer depending on the file.
Q75. Read the string functions in DS.
Ans: Functions like '[ ]' (the sub-string function) and ':' (the concatenation operator). Syntax: string [ start, length ]
Q76. How did you connect with DB2 in your last project?
Ans: Most of the time the data was sent to us in the form of flat files; the data is dumped and sent to us. In some cases where we needed to connect to DB2 for look-ups, we used ODBC drivers to connect to DB2 (or) DB2-UDB depending on the situation and availability. Certainly DB2-UDB is better in terms of performance, as native drivers are always better than ODBC drivers. 'iSeries Access ODBC Driver 9.00.02.02' - ODBC drivers to connect to AS400/DB2.
Q77. What are Sequencers?
Ans: Sequencers are job control programs that execute other jobs with preset job parameters.
Q78. Differentiate Primary Key and Partition Key?
Ans: A Primary Key is a combination of unique and not null. It can be a collection of key values called a composite primary key. A Partition Key is just a part of the Primary Key.
Q79. How did you handle an 'Aborted' sequencer?
Ans: In almost all cases we have to delete the data inserted by this from the DB manually, fix the job and then run the job again.
Q80. What versions of DS have you worked with?
Ans: DS 7.0.2/6.0/5.2
Q81. If you worked with DS6.0 and later versions, what are Link-Partitioner and Link-Collector used for?
Ans: Link Partitioner - used for partitioning the data. Link Collector - used for collecting the partitioned data.
Q82. How do you rename all of the jobs to support your new file-naming conventions?
Ans: Create an Excel spreadsheet with new and old names. Export the whole project as a dsx. Write a Perl program which can do a simple rename of the strings by looking up the Excel file.
Q83. Explain the types of parallel processing?
Ans: Parallel processing is broadly classified into 2 types: a) SMP - Symmetrical Multi Processing, b) MPP - Massive Parallel Processing.
Q84. Does the selection of 'Clear the table and Insert rows' in the ODBC stage send a Truncate statement to the DB, or does it do some kind of Delete logic?
Ans: There is no TRUNCATE on ODBC stages. 'Clear the table' issues a DELETE FROM statement. On an OCI stage such as Oracle, you do have both Clear and Truncate options.
Q85. When should we use an ODS?
Ans: DWHs are typically read-only and batch updated on a schedule; ODSs are maintained in more real time, trickle-fed constantly.
Q86. What is the default cache size? How do you change the cache size if needed?
Ans: The default cache size is 256 MB. We can increase it by going into Datastage Administrator, selecting the Tunable tab and specifying the cache size there.
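Q82 above describes an Excel-plus-Perl rename over an exported .dsx. Below is a rough Python sketch of the same idea, assuming a plain-text export (project.dsx) and a two-column CSV (renames.csv) of old and new job names; the file names and the simple string replacement are assumptions for illustration, not a documented DataStage procedure:

import csv

# Load old-name -> new-name pairs from a simple two-column CSV
with open("renames.csv", newline="") as f:
    mapping = {old: new for old, new in csv.reader(f)}

# Read the exported project, apply straight string replacements, and write a new copy
with open("project.dsx", encoding="utf-8", errors="replace") as f:
    text = f.read()

for old, new in mapping.items():
    text = text.replace(old, new)

with open("project_renamed.dsx", "w", encoding="utf-8") as f:
    f.write(text)

print(f"Applied {len(mapping)} rename(s); review and re-import the new dsx into a test project.")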
Q87. What are the types of Parallel Processing?
Ans: Parallel processing is broadly classified into 2 types: a) SMP - Symmetrical Multi Processing, b) MPP - Massive Parallel Processing.
Q88. How do you handle date conversions in Datastage? Convert a mm/dd/yyyy format to yyyy-dd-mm?
Ans: We use a) the "Iconv" function - internal conversion, and b) the "Oconv" function - external conversion. The function to convert mm/dd/yyyy format to yyyy-dd-mm is Oconv(Iconv(Fieldname,"D/M
Q89. Differentiate Primary Key and Partition Key?
Ans: A Primary Key is a combination of unique and not null. It can be a collection of key values called a composite primary key. A Partition Key is just a part of the Primary Key.
Q90. Is it possible to calculate a hash total for an EBCDIC file and have the hash total stored as EBCDIC using Datastage?
Ans: Currently, the total is converted to ASCII, even though the individual records are stored as EBCDIC.
Q91. How do you merge two files in DS?
Ans: Either use the Copy command as a Before-job subroutine if the metadata of the 2 files is the same, or create a job to concatenate the 2 files into one if the metadata is different.
Q92. How did you connect to DB2 in your last project?
Ans: Using DB2 ODBC drivers.
Q93. What is the default cache size? How do you change the cache size if needed?
Ans: The default cache size is 256 MB. We can increase it by going into Datastage Administrator, selecting the Tunable tab and specifying the cache size there.
Q94. What are Sequencers?
Ans: Sequencers are job control programs that execute other jobs with preset job parameters.
Q95. How do you execute a Datastage job from the command line prompt?
Ans: Using the "dsjob" command as follows: dsjob -run -jobstatus projectname jobname
Q96. How do you rename all of the jobs to support your new file-naming conventions?
Ans: Create an Excel spreadsheet with new and old names. Export the whole project as a dsx. Write a Perl program which can do a simple rename of the strings by looking up the Excel file. Then import the new dsx file, probably into a new project for testing. Recompile all jobs. Be cautious that the names of the jobs have also been changed in your job control jobs or Sequencer jobs, so you have to make the necessary changes to these Sequencers.
contact for more on DataStage Online Training
Dell Boomi Interview Questions
Q.How Do I Access Atmosphere? Ans: Because the AtmoSphere is an online service, there is no appliance or software to buy, install or maintain. Just point your browser to the login page at http://www.boomi.com and login. Q.How Do I Sign Up For Atmosphere? Can I Download A Demo? Ans: You can sign up for a free trial under the ’30 Day Free Trial’ section of the Boomi website. Q.What Applications Can I Integrate Using Atmosphere? Ans: An up-to-date list of supported applications can be found on our website. Q. Is Any Training Required To Learn To Use Atmosphere? Ans: AtmoSphere is designed to be user-friendly and anyone with basic IT skills and knowledge of the applications they plan to integrate should be able to build integration processes easily. Our customers have reported that using AtomSphere is similar to using other web-based software. However, Boomi’s Support Team offers weekly training sessions via webinar. Q.How Do I Contact Customer Support? Ans: All of the available options for support are listed in the above question. Your easiest and quickest path to access support is through the Live Chat embedded into Boomi AtomSphere. Q.What Sort Of Skill Set Is Required To Configure Atmosphere? Ans: We aim for our service to be a visual, configuration based approach to integration. You do not need to be a developer to utilize the service; you simply need to understand where the data resides in the source system and where the data needs to be integrated in the destination system. The typical roles that utilize AtomSphere would include Systems Analyst, Application Administrator, or Business Process Engineer. Q.What Platforms Do I Need To Have In Order To Run Atmosphere? Ans: Since Boomi hosts the application, all you need is a computer or an alternative device that can run a Web browser. It doesn’t matter what type of hardware or operating system you’re running. Q.What Involvement Is Required From My Company’s It Department To Set Up My Integration Processes? Ans: Very minimal involvement from your IT department is typically needed. Typical involvement from the IT department would include allowing you access to the source/destination applications or allowing you to install a Boomi Atom to gain access to your on-premise application. Q. Can Boomi’s Customer Support Team Help Me Set Up My Integrations? Ans: We have designed AtomSphere to be largely self-service and our website contains a number of resources to help you including documentation, videos, webinars, and training courses that are free. You also have access to Boomi forums and “chat” support from within AtomSphere. Your support level will determine the availability of these services and specific response times. Consulting services are also always available from our professional services team for a fee. Q.What’s An Atom? Ans: An Atom™ is a lightweight, dynamic runtime engine created with patent-pending technology, Boomi Atoms contain all the components required to execute an integration process. There is a full-featured dashboard to monitor the status and health of all Atoms and integration processes whether they are deployed in the cloud or on-premise. Q.Where Are Atoms Hosted? Ans: Boomi Atoms are completely self-contained and autonomous and can be run on virtually any server. They can be deployed “in the cloud” for SaaS to SaaS integration (e.g. Boomi’s data center, an ISVs data center or a third-party data center such as Amazon) or behind a company’s firewall for SaaS to On-Premise integration. Q.What Is An Integration Process? 
Ans: The main component in a Boomi integration is the Process. A Process represents a business process- or transaction-level interface between two or more systems. Examples of a Process might be “Salesforce Account Synchronization to accounting system” or “Sales Orders from Company ABC to QuickBooks.”Processes contain a left-to-right series of Shapes connected together like a flow chart to illustrate the steps required to transform, route, and otherwise manipulate the data from source to destination. Q.What Is A Connector? Ans: Connectors get and send data in and out of Processes. They enable communication with the applications or data sources between which data needs to move or, in other words, the “endpoints” of the Process. Those applications and data sources can range from traditional on-premise applications like SAP and QuickBooks to Web-based applications like Salesforce.com and NetSuite to data repositories like an FTP directory, a commercial database, or even an email server. Q.How Does Boomi Differ From An Application Programming Interface (API)? Ans: An API opens up secure access to data in an application but it does not accomplish the integration itself. An API is like an electrical socket – until something is plugged into it, it just sits there. Boom integration Connectors are like “plugs.” Boomi Connectors plug into and API and abstract the technical details of the API and transportation protocols used to communicate with various applications and data sources, allowing you to focus on the business data and logic of the integration. A Connector is really a combination of two Components: A Connection and an Operation. Think of the Connection as the where and the Operation as the how. These two components determine the type of data source, how to physically connect to it, and what type of data records to exchange. Q.Are There Any Limitations To The Kind/amount Of Information Being Integrated? Ans: No, we have benchmarked the Boomi Atom to be able to handle very large volumes, upwards of 1,000,000 records an hour. Q.How Often Would We Need To Run The Integration? How Close To Real Time Information Can I Get? Ans: We support both real-time event-based and schedule-driven executions. We have a scheduler built into Boomi AtomSphere. You can schedule an integration to run based on intervals you define (up to every 1 minute) or on an advanced schedule (more flexible). We also have an external API that will allow you to call an integration to be run in real-time from an external source or application. Q. Are The Integrations Manageable By Either Event Or Specific Dates? Ans: Yes, our system will allow you to schedule your integration process to run at specific dates/intervals, up to every one minute. We also provide an API that will allow you to include event-driven integration into your integration process. Q.Does Atmosphere Integration With Shopping Carts & E-commerce Functionality? Ans: Yes, please refer to our website for a full list of supported applications. Q.If Boomi’s Platform Is Hosted In “the Cloud”, How Can I Integrate My On-premise Data And Legacy Applications? Ans: We offer the ability to deploy a Boomi Atom behind your firewall. This Boomi Atom is the run time engine that gives you secure access to your on-premise application without having to make any changes to your firewall. Q.How Do You Ensure The Data Is Secure During The Integration Process? Ans: Boomi AtomSphere Connectors go through application-specific security reviews where applicable. 
All data that is passed between the Boomi Atom onsite and our data center is sent over a secure HTTPS channel with 128-bit encryption. Learn more about AtomSphere's security.
Q. How Is Error Handling Managed?
Ans: Error handling is managed via the 'management' tab, where users can see the integration process, its executions and all associated log and status notifications. Boomi AtomSphere also includes retry capabilities to ensure messages that had an error during transit are delivered; an Atom also tracks its state to ensure that only unique data is processed. Finally, decision logic can be configured to query destination applications to ensure duplicate data is not sent to the application.
Q. If I Have On-premise Sources, How Do I Test My Integration Process In The Hosted Environment? Do I Have To Deploy An Atom To Do My Testing?
Ans: Yes, the Boomi Atom would reside onsite, allowing you access to the on-premise application through Boomi AtomSphere.
Q. Does The Internet And/or AtomSphere Need To Be Up For My Atom To Run?
Ans: Yes, because the Boomi Atom that resides onsite has no GUI, it must be in fairly constant contact with the data center. One important design aspect of AtomSphere is that, much like the Internet itself, it is a distributed architecture, eliminating single points of failure. It is important to note that even during planned maintenance of the platform, deployed Atoms continue to run and process normally.
Q. Do You Have Rollbacks For Changes To An Integration Process?
Ans: Yes, we offer version control for our integration processes, allowing you to roll back to a previous integration process should the need arise.
Q. Is Test Mode An Actual Test Of The Process Flow Of The Integration, And Is The Destination Getting Updated/Changed?
Ans: Yes, test mode actually executes the integration process as designed, so the source and destination will get updated. Boomi AtomSphere provides the concept of 'Environments' for those that wish to have the same integration process pointed to different locations (i.e. Test, QA, Production).
Contact for more On Dell Boomi Online Training
ETL Testing Interview Questions
Q. What is ETL?
Ans: ETL - extract, transform, and load: extracting data from outside source systems, transforming raw data to make it fit for use by different departments, and loading the transformed data into target systems like a data mart or data warehouse.
Q. Why is ETL testing required?
Ans: • To verify the correctness of data transformation against the signed-off business requirements and rules. • To verify that the expected data is loaded into the data mart or data warehouse without loss of any data. • To validate the accuracy of reconciliation reports (if any, e.g. in the case of comparing reports of transactions made via a bank ATM – ATM report vs. bank account report). • To make sure the complete process meets performance and scalability requirements. • Data security is also sometimes part of ETL testing. • To evaluate the reporting efficiency.
Q. What is a Data warehouse?
Ans: A data warehouse is a database used for reporting and data analysis.
Q. What are the characteristics of a Data Warehouse?
Ans: Subject-oriented, integrated, time-variant and non-volatile.
Q. What is the difference between Data Mining and Data Warehousing?
Ans: Data mining - analyzing data from different perspectives and condensing it into useful decision-making information. It can be used to increase revenue, cut costs, increase productivity or improve any business process. There are a lot of tools available in the market for various industries to do data mining. Basically, it is all about finding correlations or patterns in large relational databases. Data warehousing comes before data mining. It is the process of compiling and organizing data into one database from various source systems, whereas data mining is the process of extracting meaningful data from that database (data warehouse).
Q. What Is The Difference Between ETL Tools And OLAP Tools?
Answer: An ETL tool is meant for extracting data from legacy systems and loading it into a specified database with some process of cleansing the data, e.g. Informatica, DataStage etc. OLAP tools are meant for reporting purposes; in OLAP, data is available in a multidimensional model, so you can write simple queries to extract data from the database, e.g. Business Objects, Cognos etc.
Q. Can We Lookup A Table From A Source Qualifier Transformation, i.e. An Unconnected Lookup?
Answer: You cannot lookup from a source qualifier directly. However, you can override the SQL in the source qualifier to join with the lookup table to perform the lookup.
Q. What Is ODS (Operational Data Store)?
Answer: ODS - Operational Data Store. The ODS comes between the staging area and the Data Warehouse. The data in the ODS will be at a low level of granularity. Once data is populated in the ODS, aggregated data will be loaded into the EDW through the ODS.
Q. Where Do We Use Connected And Unconnected Lookups?
Answer: If there is only one return port then we can go for an unconnected lookup; more than one return port is not possible with an unconnected lookup. If there is more than one return port then go for a connected lookup. If you require a dynamic cache, i.e. where your data will change dynamically, then you can go for a connected lookup. If your data is static, where your data won't change when the session loads, you can go for an unconnected lookup.
Q. What are the main stages of Business Intelligence?
Ans: Data Sourcing –> Data Analysis –> Situation Awareness –> Risk Assessment –> Decision Support
Q. What tools have you used for ETL testing?
Ans: 1. Data access tools e.g. TOAD, WinSQL, AQT etc. (used to analyze the content of tables). 2. ETL tools e.g. Informatica, DataStage. 3. Test management tools e.g. Test Director, Quality Center etc. (to maintain requirements, test cases, defects and the traceability matrix).
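One of the checks listed above - verifying that data lands in the target without loss - is often automated as a simple source-versus-target reconciliation. Below is a minimal Python sketch of that idea using two SQLite connections as stand-ins; the database files, table name and amount column are assumptions for illustration, not part of any specific ETL tool:

import sqlite3

# Stand-in connections; in a real test these would point at the source and target databases
source = sqlite3.connect("source.db")
target = sqlite3.connect("warehouse.db")

def reconcile(table, amount_column):
    # Compare row counts and a simple aggregate "checksum" between source and target
    src_count, src_sum = source.execute(
        f"SELECT COUNT(*), COALESCE(SUM({amount_column}), 0) FROM {table}").fetchone()
    tgt_count, tgt_sum = target.execute(
        f"SELECT COUNT(*), COALESCE(SUM({amount_column}), 0) FROM {table}").fetchone()
    assert src_count == tgt_count, f"{table}: row count mismatch {src_count} vs {tgt_count}"
    assert abs(src_sum - tgt_sum) < 1e-6, f"{table}: amount total mismatch"
    print(f"{table}: OK ({src_count} rows, total {src_sum})")

reconcile("sales", "amount")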
Benefits of ETL Testing: production reconciliation, IT developer productivity, data integrity.
Q. What is a Data Warehouse?
Ans: A Data Warehouse is a collection of data marts representing historical data from different operational data sources (OLTP). The data from these OLTP systems is structured and optimized for querying and data analysis in a Data Warehouse.
Q. What is a Data mart?
Ans: A Data Mart is a subset of a data warehouse that can provide data for reporting and analysis on a section, unit or department like the Sales Dept, HR Dept, etc. Data Marts are sometimes also called HPQS (Higher Performance Query Structure).
Q. What is OLAP?
Ans: OLAP stands for Online Analytical Processing. It uses database tables (fact and dimension tables) to enable multidimensional viewing, analysis and querying of large amounts of data.
Q. What is OLTP?
Ans: OLTP stands for Online Transaction Processing. Except for data warehouse databases, the other databases are OLTP systems. These OLTP systems use a normalized schema structure and are designed for recording the daily operations and transactions of a business.
Q. What are Dimensions?
Ans: Dimensions are categories by which summarized data can be viewed. For example, a profit fact table can be viewed by a time dimension.
Q. Give Some ETL Tool Functionalities?
Answer: While the selection of a database and a hardware platform is a must, the selection of an ETL tool is highly recommended, but it's not a must. When you evaluate ETL tools, it pays to look for the following characteristics:
Functional capability: This includes both the 'transformation' piece and the 'cleansing' piece. In general, the typical ETL tools are either geared towards having strong transformation capabilities or having strong cleansing capabilities, but they are seldom very strong in both. As a result, if you know your data is going to be dirty coming in, make sure your ETL tool has strong cleansing capabilities. If you know there are going to be a lot of different data transformations, it then makes sense to pick a tool that is strong in transformation.
Ability to read directly from your data source: For each organization, there is a different set of data sources. Make sure the ETL tool you select can connect directly to your source data.
Metadata support: The ETL tool plays a key role in your metadata because it maps the source data to the destination, which is an important piece of the metadata. In fact, some organizations have come to rely on the documentation of their ETL tool as their metadata source. As a result, it is very important to select an ETL tool that works with your overall metadata strategy.
contact for more on Etl Testing Online Training
Hadoop Cluster Interview Questions
Q. Explain About The Hadoop-core Configuration Files?
Ans: Hadoop core is specified by two resources. It is configured by two well-written xml files which are loaded from the classpath:
Hadoop-default.xml - read-only defaults for Hadoop, suitable for a single machine instance.
Hadoop-site.xml - specifies the site configuration for the Hadoop distribution. The cluster-specific information is also provided by the Hadoop administrator.
Q. Explain In Brief The Three Modes In Which Hadoop Can Be Run?
Ans: The three modes in which Hadoop can be run are:
Standalone (local) mode - no Hadoop daemons running; everything runs in a single Java Virtual Machine only.
Pseudo-distributed mode - daemons run on the local machine, thereby simulating a cluster on a smaller scale.
Fully distributed mode - runs on a cluster of machines.
Q. Explain What Are The Features Of Standalone (local) Mode?
Ans: In stand-alone or local mode there are no Hadoop daemons running, and everything runs in a single Java process. Hence, we don't get the benefit of distributing the code across a cluster of machines. Since it has no DFS, it utilizes the local file system. This mode is suitable only for running MapReduce programs by developers during various stages of development. It's the best environment for learning and good for debugging purposes.
Q. What Are The Features Of Fully Distributed Mode?
Ans: In fully distributed mode, clusters range from a few nodes to 'n' number of nodes. It is used in production environments, where we have thousands of machines in the Hadoop cluster. The daemons of Hadoop run on these clusters. We have to configure separate masters and separate slaves in this distribution, and the implementation of this is quite complex. In this configuration, the Namenode and Datanode run on different hosts, and there are nodes on which the task tracker runs. The root of the distribution is referred to as HADOOP_HOME.
Q. Explain What Are The Main Features Of Pseudo Mode?
Ans: In pseudo-distributed mode, each Hadoop daemon runs in a separate Java process; as such it simulates a cluster, though on a small scale. This mode is used both for development and QA environments. Here, we need to make the configuration changes.
Q. What Are The Hadoop Configuration Files At Present?
Ans: There are 3 configuration files in Hadoop:
conf/core-site.xml: fs.default.name = hdfs://localhost:9000
conf/hdfs-site.xml: dfs.replication = 1
conf/mapred-site.xml: mapred.job.tracker = localhost:9001
Q. Can You Name Some Companies That Are Using Hadoop?
Ans: Numerous companies are using Hadoop, from large software companies and MNCs to small organizations. Yahoo is the top contributor, with many open source Hadoop software projects and frameworks. Social media companies like Facebook and Twitter have been using it for a long time now for storing their mammoth data. Apart from that, Netflix, IBM, Adobe and e-commerce websites like Amazon and eBay are also using multiple Hadoop technologies.
Q. Which Is The Directory Where Hadoop Is Installed?
Ans: Cloudera and Apache have the same directory structure. Hadoop is installed in cd /usr/lib/hadoop-0.20/.
Q. What Are The Port Numbers Of Name Node, Job Tracker And Task Tracker?
Ans: The web UI port number for the Namenode is '50070', for the job tracker it is '50030' and for the task tracker it is '50060'.
Q. Tell Us What Is A Spill Factor With Respect To The Ram?
Ans: The spill factor is the size after which your files move to the temp file; the Hadoop-temp directory is used for this. The default value for io.sort.spill.percent is 0.80. A value less than 0.5 is not recommended.
Q. Is fs.mapr.working.dir A Single Directory?
Ans: Yes, fs.mapr.working.dir is just one directory.
Q. Which Are The Three Main Hdfs-site.xml Properties?
Ans: The three main hdfs-site.xml properties are:
dfs.name.dir, which gives you the location where the Namenode metadata will be stored and whether DFS is located on disk or on a remote location.
dfs.data.dir, which gives you the location where the data is going to be stored.
fs.checkpoint.dir, which is for the secondary Namenode.
Q. How To Come Out Of The Insert Mode?
Ans: To come out of insert mode, press ESC, then type :q (if you have not written anything) or :wq (if you have written anything in the file) and then press ENTER.
Q. Tell Us What Cloudera Is And Why It Is Used In Big Data?
Ans: Cloudera is the leading Hadoop distribution vendor in the Big Data market. It's termed as the next-generation data management software that is required for business-critical data challenges, including access, storage, management, business analytics, systems security, and search.
Q. We Are Using The Ubuntu Operating System With Cloudera, But From Where Can We Download Hadoop, Or Does It Come By Default With Ubuntu?
Ans: This is a default configuration of Hadoop that you have to download from Cloudera or from Eureka's Dropbox and then run on your systems. You can also proceed with your own configuration, but you need a Linux box, be it Ubuntu or Red Hat. There are installation steps present at the Cloudera location or in Eureka's Dropbox. You can go either way.
Q. What Is The Main Function Of The 'jps' Command?
Ans: The 'jps' command checks whether the Datanode, Namenode, tasktracker, jobtracker, and other components are working or not in Hadoop. One thing to remember is that if you have started Hadoop services with sudo then you need to run jps with sudo privileges, else the status will not be shown.
Q. How Can I Restart Namenode?
Ans: Run stop-all.sh and then run start-all.sh, OR write sudo hdfs (press enter), su-hdfs (press enter), /etc/init.d/ha (press enter) and then /etc/init.d/hadoop-0.20-namenode start (press enter).
Q. How Can We Check Whether Namenode Is Working Or Not?
Ans: To check whether the Namenode is working or not, use the command /etc/init.d/hadoop-0.20-namenode status, or simply run 'jps'.
Q. What Is "fsck" And What Is Its Use?
Ans: "fsck" is File System Check. FSCK is used to check the health of a Hadoop file system. It generates a summarized report of the overall health of the filesystem. Usage: hadoop fsck /
Q. At Times You Get A 'Connection Refused Java Exception' When You Run The File System Check Command Hadoop Fsck /?
Ans: The most probable reason is that the Namenode is not working on your VM.
Q. What Is The Use Of The Property Mapred.job.tracker?
Ans: The mapred.job.tracker property specifies which host and port the MapReduce job tracker runs at. If it is "local", then jobs are run in-process as a single map and reduce task.
Q. What Does /etc/init.d Do?
Ans: /etc/init.d specifies where daemons (services) are placed, or where to see the status of these daemons. It is very Linux specific, and has nothing to do with Hadoop.
Q. How Can We Look For The Namenode In The Browser?
Ans: If you have to look for the Namenode in the browser, you don't have to give localhost:8021; the port number to look for the Namenode in the browser is 50070.
Q. How To Change From Su To Cloudera?
Ans: To change from SU to Cloudera just type exit.
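A minimal Python sketch of scripting the health checks mentioned above (jps and hadoop fsck), assuming a classic Hadoop 1.x installation with the jps and hadoop commands on the PATH; the 'HEALTHY' string check mirrors the summary line fsck normally prints:

import subprocess

def run(cmd):
    # Run a command and return its combined output as text
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout + result.stderr

# Which Hadoop daemons are running on this node?
print(run(["jps"]))

# Summarized filesystem health report
fsck_report = run(["hadoop", "fsck", "/"])
print("Filesystem healthy?", "HEALTHY" in fsck_report)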
Ans : Slaves and Masters are used by the startup and the shutdown commands. Q.What Do Masters And Slaves Consist Of? Ans : Masters contain a list of hosts, one per line, that are to host secondary namenode servers. Slaves consist of a list of hosts, one per line, that host datanode and task tracker servers. Q.What Is The Function Of Hadoop-env.sh? Where Is It Present? Ans : This file contains some environment variable settings used by Hadoop; it provides the environment for Hadoop to run. The path of JAVA_HOME is set here for it to run properly. Hadoop-env.sh file is present in the conf/hadoop-env.sh location. You can also create your own custom configuration file conf/hadoop-user-env.sh, which will allow you to override the default Hadoop settings. Q.Can We Have Multiple Entries In The Master Files? Ans : Yes, we can have multiple entries in the Master files. Q.In Hadoop_pid_dir, What Does Pid Stands For? Ans : PID stands for ‘Process ID’. Q.What Does Hadoop-metrics? Properties File Do? Ans : Hadoop-metrics Properties is used for ‘Reporting‘purposes. It controls the reporting for hadoop. The default status is ‘not to report‘. Q.What Are The Network Requirements For Hadoop? Ans : The Hadoop core uses Shell (SSH) to launch the server processes on the slave nodes. It requires password-less SSH connection between the master and all the slaves and the Secondary machines. Q.Why Do We Need A Password-less Ssh In Fully Distributed Environment? Ans : We need a password-less SSH in a Fully-Distributed environment because when the cluster is LIVE and running in Fully Distributed environment, the communication is too frequent. The job tracker should be able to send a task to task tracker quickly. Q.What Will Happen If A Namenode Has No Data? Ans : If a Namenode has no data it cannot be considered as a Namenode. In practical terms, Namenode needs to have some data. Q.What Happens To Job Tracker When Namenode Is Down? Ans : Namenode is the main point which keeps all the metadata, keep tracks of failure of datanode with the help of heart beats. As such when a namenode is down, your cluster will be completely down, because Namenode is the single point of failure in a Hadoop Installation. Q.Explain What Do You Mean By Formatting Of The Dfs? Ans : Like we do in Windows, DFS is formatted for proper structuring of data. It is not usually recommended to do as it format the Namenode too in the process, which is not desired. Q.We Use Unix Variants For Hadoop. Can We Use Microsoft Windows For The Same? Ans : In practicality, Ubuntu and Red Hat Linux are the best Operating Systems for Hadoop. On the other hand, Windows can be used but it is not used frequently for installing Hadoop as there are many support problems related to it. The frequency of crashes and the subsequent restarts makes it unattractive. As such, Windows is not recommended as a preferred environment for Hadoop Installation, though users can give it a try for learning purposes in the initial stage. Q.Which One Decides The Input Split - Hdfs Client Or Namenode? Ans : The HDFS Client does not decide. It is already specified in one of the configurations through which input split is already configured. Q.Let’s Take A Scenario, Let’s Say We Have Already Cloudera In A Cluster, Now If We Want To Form A Cluster On Ubuntu Can We Do It. Explain In Brief? Ans : Yes, we can definitely do it. We have all the useful installation steps for creating a new cluster. 
The only thing that needs to be done is to uninstall the present cluster and install the new cluster in the targeted environment. Q.Can You Tell Me If We Can Create A Hadoop Cluster From Scratch? Ans : Yes, we can definitely do that. Once we become familiar with the Apache Hadoop environment, we can create a cluster from scratch. Q.Explain The Significance Of Ssh? What Is The Port On Which Port Does Ssh Work? Why Do We Need Password In Ssh Local Host? Ans : SSH is a secure shell communication, is a secure protocol and the most common way of administering remote servers safely, relatively very simple and inexpensive to implement. A single SSH connection can host multiple channels and hence can transfer data in both directions. SSH works on Port No. 22, and it is the default port number. However, it can be configured to point to a new port number, but its not recommended. In local host, password is required in SSH for security and in a situation where password less communication is not set. Q.What Is Ssh? Explain In Detail About Ssh Communication Between Masters And The Slaves? Ans : Secure Socket Shell or SSH is a password-less secure communication that provides administrators with a secure way to access a remote computer and data packets are sent across the slave. This network protocol also has some format into which data is sent across. SSH communication is not only between masters and slaves but also between two hosts in a network. SSH appeared in 1995 with the introduction of SSH - 1. Now SSH 2 is in use, with the vulnerabilities coming to the fore when Edward Snowden leaked information by decrypting some SSH traffic. Q.Can You Tell Is What Will Happen To A Namenode, When Job Tracker Is Not Up And Running? Ans : When the job tracker is down, it will not be in functional mode, all running jobs will be halted because it is a single point of failure. Your whole cluster will be down but still Namenode will be present. As such the cluster will still be accessible if Namenode is working, even if the job tracker is not up and running. But you cannot run your Hadoop job.
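As a quick illustration of the point above that the cluster stays reachable as long as the Namenode is up, here is a minimal Java sketch that checks whether HDFS answers at all. It assumes the default Namenode address hdfs://localhost:9000 used earlier in this article; a "Connection refused" IOException here usually means the Namenode is down.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamenodeCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000"); // assumed address, adjust for your cluster
        try {
            FileSystem fs = FileSystem.get(conf);
            // A simple metadata call; it only succeeds if the Namenode is running.
            boolean rootExists = fs.exists(new Path("/"));
            System.out.println("Namenode reachable, / exists: " + rootExists);
        } catch (IOException e) {
            System.out.println("Namenode not reachable: " + e.getMessage());
        }
    }
}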
Hadoop Interview Questions
Q.What is Big Data?
Ans: Big Data is an assortment of data so huge and complex that it becomes very tedious to capture, store, process, retrieve and analyze it with on-hand database management tools or traditional data processing techniques.
Q.What is the Hadoop framework?
Ans: Hadoop is an open-source framework written in Java by the Apache Software Foundation. It is used to write software applications that need to process vast amounts of data (it can handle multiple terabytes of data). It works in parallel on large clusters, which can have thousands of computers (nodes), and it processes data in a reliable and fault-tolerant manner.
Q.On what concept does the Hadoop framework work?
Ans: It works on MapReduce, which was devised by Google.
Q.What is MapReduce?
Ans: MapReduce is a programming model for processing huge amounts of data quickly. As the name suggests, it is divided into Map and Reduce. Map task: the MapReduce job splits the input data set into independent chunks (big data sets into multiple small data sets) that are processed in parallel. Reduce task: the output of the map tasks becomes the input of the reduce tasks, which produce the final result. Your business logic is written in the map task and the reduce task. Typically both the input and the output of the job are stored in a file system (not a database). The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks.
Q.What are compute and storage nodes?
Ans: Compute node: the computer or machine where your actual business logic is executed. Storage node: the computer or machine where the file system resides to store the data being processed. In most cases the compute node and the storage node are the same machine.
Q.How does the master-slave architecture work in Hadoop?
Ans: The MapReduce framework consists of a single master Job Tracker and multiple slaves; each cluster node has one Task Tracker. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing failed tasks. The slaves execute the tasks as directed by the master.
Q.What are the basic components of a Hadoop application?
Ans: Minimally, a Hadoop application has the following components: an input location for the data, an output location for the processed data, a map task, a reduce task and the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and the configuration to the Job Tracker, which assumes responsibility for distributing the software/configuration to the slaves, scheduling tasks, monitoring them, and providing status and diagnostic information to the job client.
Q.Explain the input and output data format of the Hadoop framework?
Ans: The MapReduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. The flow is: (input) <k1, v1> -> map -> <k2, v2> -> combine/sort -> <k2, list(v2)> -> reduce -> <k3, v3> (output).
Q.What are the restrictions on the key and value classes?
Ans: The key and value classes have to be serialized by the framework. To make them serializable, Hadoop provides the Writable interface. As in Java itself, the key of a map has to be comparable, so the key class has to implement one more interface, WritableComparable.
Q.Explain the Word Count implementation via the Hadoop framework?
Ans: We count the words in all the input files as follows (a code sketch of this word count is given after this set of questions).
Input: assume there are two files, each containing the sentence "Hello World Hello World".
Mapper: there is one mapper per file. For the given sample input, the first map outputs <Hello, 1>, <World, 1>, <Hello, 1>, <World, 1>, and the second map outputs the same pairs.
Combiner/Sorting (done for each individual map): the output of the first map becomes <Hello, 2>, <World, 2>; the output of the second map is also <Hello, 2>, <World, 2>.
Reducer: it sums up the above outputs and generates <Hello, 4>, <World, 4>.
Final output: Hello 4 times, World 4 times.
Q.Which interface needs to be implemented to create a Mapper and Reducer for Hadoop?
Ans: org.apache.hadoop.mapreduce.Mapper and org.apache.hadoop.mapreduce.Reducer (in the new API these are base classes that your own Mapper and Reducer extend).
Q.What does the Mapper do?
Ans: Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records, and a given input pair may map to zero or many output pairs.
Q.What is the Input Split in MapReduce?
Ans: An Input Split is a logical representation of a unit (a chunk) of input work for a map task; for example, a filename and a byte range within that file to process, or a row set in a text file.
Q.What is the Input Format?
Ans: The Input Format is responsible for enumerating (itemizing) the Input Splits and producing a Record Reader, which turns those logical work units into actual physical input records.
Q.Where do you specify the Mapper implementation?
Ans: Generally, the mapper implementation is specified in the Job itself.
Q.How is the Mapper instantiated in a running job?
Ans: The Mapper is instantiated in the running job and is passed a Context object, which it can use to configure itself.
Q.Which are the methods in the Mapper class?
Ans: The Mapper contains the run() method, which calls its own setup() method only once, then calls the map() method for each input record and finally calls the cleanup() method. All of these methods can be overridden in your code.
Q.What happens if you don't override the Mapper methods and keep them as they are?
Ans: If you do not override any methods (leaving even map() as-is), it acts as the identity function, emitting each input record as a separate output.
Q.What is the use of the Context object?
Ans: The Context object allows the mapper to interact with the rest of the Hadoop system. It includes the configuration data for the job as well as interfaces that allow it to emit output.
Q.How can you add arbitrary key-value pairs in your mapper?
Ans: You can set arbitrary (key, value) pairs of configuration data in your Job, e.g. with Job.getConfiguration().set("myKey", "myVal"), and then retrieve this data in your mapper with context.getConfiguration().get("myKey"). This is typically done in the Mapper's setup() method.
Q.How does the Mapper's run() method work?
Ans: The Mapper.run() method calls map(KeyInType, ValInType, Context) for each key/value pair in the Input Split for that task.
Q.Which object can be used to get the progress of a particular job?
Ans: The Context object.
Q.What is the next step after the Mapper or MapTask?
Ans: The output of the Mapper is sorted and partitions are created for the output; the number of partitions depends on the number of reducers.
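Below is the word-count sketch promised above, a minimal implementation assuming the new org.apache.hadoop.mapreduce API discussed in this section. The class names (WordCount, WordCountMapper, WordCountReducer) and the whitespace tokenization are illustrative choices, not a fixed part of Hadoop.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Emits <word, 1> for every token in the input line.
    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Sums the counts per word; also usable as a combiner because addition is
    // commutative and associative.
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // optional local aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}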
Q.How can we control which key goes to which reducer?
Ans: Users can control which keys (and hence which records) go to which Reducer by implementing a custom Partitioner.
Q.What is the use of a Combiner?
Ans: It is an optional component or class, specified via Job.setCombinerClass(ClassName), that performs local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
Q.How many maps are there in a particular job?
Ans: The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files. Generally it is around 10-100 maps per node. Task setup takes a while, so it is best if each map takes at least a minute to execute. For example, if you expect 10 TB of input data and have a block size of 128 MB, you will end up with about 82,000 maps. You can influence this with the mapreduce.job.maps parameter (which only provides a hint to the framework); ultimately, the number of map tasks is controlled by the number of splits returned by the InputFormat.getSplits() method (which you can override).
Q.What is the Reducer used for?
Ans: The Reducer reduces a set of intermediate values which share a key to a (usually smaller) set of values. The number of reduces for the job is set by the user via Job.setNumReduceTasks(int).
Q.Explain the core methods of the Reducer?
Ans: The API of the Reducer is very similar to that of the Mapper: there is a run() method that receives a Context containing the job's configuration as well as interfaces that return data from the reducer back to the framework. The run() method calls setup() once, reduce() once for each key associated with the reduce task, and cleanup() once at the end. Each of these methods can access the job's configuration data by using Context.getConfiguration(). As in the Mapper, any or all of these methods can be overridden with custom implementations; if none of them are overridden, the default reducer operation is the identity function, and values are passed through without further processing. The heart of the Reducer is its reduce() method, which is called once per key; the second argument is an Iterable that returns all the values associated with that key.
Q.What are the primary phases of the Reducer?
Ans: Shuffle, Sort and Reduce.
Q.Explain the shuffle?
Ans: The input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP.
Q.Explain the Reducer's sort phase?
Ans: In this stage the framework groups the Reducer inputs by key (since different mappers may have output the same key). The shuffle and sort phases occur simultaneously; while map outputs are being fetched they are merged (similar to a merge sort).
Q.Explain the Reducer's reduce phase?
Ans: In this phase the reduce(MapOutKeyType, Iterable<MapOutValType>, Context) method is called for each <key, (collection of values)> pair in the grouped inputs. The output of the reduce task is typically written to the file system via Context.write(ReduceOutKeyType, ReduceOutValType). Applications can use the Context to report progress, set application-level status messages and update counters, or just to indicate that they are alive. The output of the Reducer is not sorted.
Q.How many Reducers should be configured?
Ans: The right number of reduces seems to be 0.95 or 1.75 multiplied by (<number of nodes> * mapreduce.tasktracker.reduce.tasks.maximum). With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish.
With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing. Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures. Q.It can be possible that a Job has 0 reducers? Ans: It is legal to set the number of reduce-tasks to zero if no reduction is desired. Q.What happens if number of reducers are 0? Ans: In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath (Path). The framework does not sort the map-outputs before writing them out to the FileSystem. Q.How many instances of Job Tracker can run on a Hadoop Cluster? Ans: Only one Q.What is the Job Tracker and what it performs in a Hadoop Cluster? Ans: Job Tracker is a daemon service which submits and tracks the MapReduce tasks to the Hadoop cluster. It runs its own JVM process. And usually it run on a separate machine and each slave node is configured with job tracker node location. The Job Tracker is single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted. Job Tracker in Hadoop performs following actions Client applications submit jobs to the Job tracker. The Job Tracker talks to the Name Node to determine the location of the data The Job Tracker locates Task Tracker nodes with available slots at or near the data TheJob Tracker submits the work to the chosen Task Tracker nodes. The Task Tracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different Task Tracker. A Task Tracker will notify the Job Tracker when a task fails. The Job Tracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the Task Tracker as unreliable. When the work is completed, the Job Tracker updates its status. Client applications can poll the Job Tracker for information. Q.How a task is scheduled by a Job Tracker? Ans: The Task Trackers send out heartbeat messages to the Job Tracker, usually every few minutes, to reassure the Job Tracker that it is still alive. These messages also inform the Job Tracker of the number of available slots, so the Job Tracker can stay up to date with where in the cluster work can be delegated. When the Job Tracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the Data Node containing the data, and if not, it looks for an empty slot on a machine in the same rack. Q.How many instances of Task tracker run on a Hadoop cluster? Ans: There is one Daemon Task tracker process for each slave node in the Hadoop cluster. Q.What are the two main parts of the Hadoop framework? Ans: Hadoop consists of two main parts Hadoop distributed file system, a distributed file system with high throughput, Hadoop MapReduce, a software framework for processing large data sets. Q.Explain the use of Task Tracker in the Hadoop cluster? Ans: A Task tracker is a slave node in the cluster which that accepts the tasks from Job Tracker like Map, Reduce or shuffle operation. Task tracker also runs in its own JVM Process. Every Task Tracker is configured with a set of slots; these indicate the number of tasks that it can accept. 
The Task Tracker starts a separate JVM processes to do the actual work (called as Task Instance) this is to ensure that process failure does not take down the task tracker. The Task tracker monitors these task instances, capturing the output and exit codes. When the Task instances finish, successfully or not, the task tracker notifies the Job Tracker. The Task Trackers also send out heartbeat messages to the Job Tracker, usually every few minutes, to reassure the Job Tracker that it is still alive. These messages also inform the Job Tracker of the number of available slots, so the Job Tracker can stay up to date with where in the cluster work can be delegated. Q.What do you mean by Task Instance? Ans: Task instances are the actual MapReduce jobs which run on each slave node. The Task Tracker starts a separate JVM processes to do the actual work (called as Task Instance) this is to ensure that process failure does not take down the entire task tracker. Each Task Instance runs on its own JVM process. There can be multiple processes of task instance running on a slave node. This is based on the number of slots configured on task tracker. By default a new task instance JVM process is spawned for a task. Q.How many daemon processes run on a Hadoop cluster? Ans: Hadoop is comprised of five separate daemons. Each of these daemons runs in its own JVM. Following 3 Daemons run on Master Nodes.NameNode - This daemon stores and maintains the metadata for HDFS. Secondary Name Node - Performs housekeeping functions for the Name Node. Job Tracker - Manages MapReduce jobs, distributes individual tasks to machines running the Task Tracker. Following 2 Daemons run on each Slave nodes Data Node – Stores actual HDFS data blocks. Task Tracker – It is Responsible for instantiating and monitoring individual Map and Reduce tasks. Q.How many maximum JVM can run on a slave node? Ans: One or Multiple instances of Task Instance can run on each slave node. Each task instance is run as a separate JVM process. The number of Task instances can be controlled by configuration. Typically a high end machine is configured to run more task instances. Q.What is NAS? Ans: It is one kind of file system where data can reside on one centralized machine and all the cluster member will read write data from that shared database, which would not be as efficient as HDFS. Q.How HDFA differs with NFS? Ans: Following are differences between HDFS and NAS In HDFS Data Blocks are distributed across local drives of all machines in a cluster. Whereas in NAS data is stored on dedicated hardware. HDFS is designed to work with MapReduce System, since computation is moved to data. NAS is not suitable for MapReduce since data is stored separately from the computations HDFS runs on a cluster of machines and provides redundancy using replication protocol. Whereas NAS is provided by a single machine therefore does not provide data redundancy. Q.How does a Name Node handle the failure of the data nodes? Ans: HDFS has master/slave architecture. An HDFS cluster consists of a single Name Node, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. The Name Node and Data Node are pieces of software designed to run on commodity machines. Name Node periodically receives a Heartbeat and a Block report from each of the DataNodes in the cluster. 
Receipt of a Heartbeat implies that the Data Node is functioning properly. A Block report contains a list of all blocks on a Data Node. When Name Node notices that it has not received a heartbeat message from a data node after a certain amount of time, the data node is marked as dead. Since blocks will be under replicated the system begins replicating the blocks that were stored on the dead Data Node. The Name Node orchestrates the replication of data blocks from one Data Node to another. The replication data transfer happens directly between Data Node and the data never passes through the Name Node. Q.Can Reducer talk with each other? Ans: No, Reducer runs in isolation. Q.Where the Mapper’s Intermediate data will be stored? Ans: The mapper output (intermediate data) is stored on the Local file system (NOT HDFS) of each individual mapper nodes. This is typically a temporary directory location which can be setup in config by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop Job completes. Q.What is the use of Combiners in the Hadoop framework? Ans: Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on individual mapper outputs. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of combiner is not guaranteed; Hadoop may or may not execute a combiner. Also, if required it may execute it more than 1 times. Therefore your MapReduce jobs should not depend on the combiners’ execution. Q.What is the Hadoop MapReduce API contract for a key and value Class? Ans: ◦The Key must implement the org.apache.hadoop.io.WritableComparable interface. ◦The value must implement the org.apache.hadoop.io.Writable interface. Q.What is Identity Mapper and Identity Reducer in MapReduce? Ans: ◦ org.apache.hadoop.mapred.lib.IdentityMapper: Implements the identity function, mapping inputs directly to outputs. If MapReduce programmer does not set the Mapper Class using JobConf.setMapperClass then IdentityMapper.class is used as a default value. ◦org.apache.hadoop.mapred.lib.IdentityReducer: Performs no reduction, writing all input values directly to the output. If MapReduce programmer does not set the Reducer Class using JobConf.setReducerClass then IdentityReducer.class is used as a default value. Q.What is the meaning of speculative execution in Hadoop? Why is it important? Ans: Speculative execution is a way of coping with individual Machine performance. In large clusters where hundreds or thousands of machines are involved there may be machines which are not performing as fast as others. This may result in delays in a full job due to only one machine not performing well. To avoid this, speculative execution in Hadoop can run multiple copies of same map or reduce task on different slave nodes. The results from first node to finish are used Q.When the reducers are started in a MapReduce job? Ans: In a MapReduce job reducers do not start executing the reduce method until the all Map jobs have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The programmer defined reduce method is called only after all the mappers have finished. If reducers do not start before all mappers finish then why does the progress on MapReduce job shows something like Map (50%) Reduce (10%)? 
Why reducer’s progress percentage is displayed when mapper is not finished yet? Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes in account the processing of data transfer which is done by reduce process, therefore the reduce progress starts showing up as soon as any intermediate key-value pair for a mapper is available to be transferred to reducer. Though the reducer progress is updated still the programmer defined reduce method is called only after all the mappers have finished. Q.What is HDFS? How it is different from traditional file systems? Ans: HDFS, the Hadoop Distributed File System, is responsible for storing huge data on the cluster. This is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. ◦HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. ◦HDFS provides high throughput access to application data and is suitable for applications that have large data sets. ◦HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. Q.What is HDFS Block size? How is it different from traditional file system block size? Ans: In HDFS data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64Mb or 128Mb in size. Each block is replicated multiple times. Default is to replicate each block three times. Replicas are stored on different nodes. HDFS utilizes the local file system to store each HDFS block as a separate file. HDFS Block size cannot be compared with the traditional file system block size. Q.What is a Name Node? How many instances of Name Node run on a Hadoop Cluster? Ans: The Name Node is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself. There is only One Name Node process run on any Hadoop cluster. Name Node runs on its own JVM process. In a typical production cluster its run on a separate machine. The Name Node is a Single Point of Failure for the HDFS Cluster. When the Name Node goes down, the file system goes offline. Client applications talk to the Name Node whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The Name Node responds the successful requests by returning a list of relevant Data Node servers where the data lives. Q.What is a Data Node? How many instances of Data Node run on a Hadoop Cluster? Ans: A Data Node stores data in the Hadoop File System HDFS. There is only One Data Node process run on any Hadoop slave node. Data Node runs on its own JVM process. On startup, a Data Node connects to the Name Node. Data Node instances can talk to each other, this is mostly during replicating data. Q.How the Client communicates with HDFS? Ans: The Client communication to HDFS happens to be using Hadoop HDFS API. Client applications talk to the Name Node whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. 
The Name Node responds the successful requests by returning a list of relevant Data Node servers where the data lives. Client applications can talk directly to a Data Node, once the Name Node has provided the location of the data. Q.How the HDFS Blocks are replicated? Ans: HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are writing-once and have strictly one writer at any time. The Name Node makes all decisions regarding replication of blocks. HDFS uses rack-aware replica placement policy. In default configurations there are total 3 copies of a data block on HDFS, 2 copies are stored on DataNodes on same rack and 3rd copy on a different rack. Hadoop Interview Questions Hadoop Interview Question and Answers Hyperion MapReduce Interview Questions Q.What is MapReduce? Ans: It is a framework or a programming model that is used for processing large data sets over clusters of computers using distributed programming. Q.What are ‘maps’ and ‘reduces’? Ans: ‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS. ‘Map’ is responsible to read data from input location, and based on the input type, it will generate a key value pair, that is, an intermediate output in local machine. ’Reducer’ is responsible to process the intermediate output received from the mapper and generate the final output. Q.What are the four basic parameters of a mapper? Ans: The four basic parameters of a mapper are LongWritable, text, text and IntWritable. The first two represent input parameters and the second two represent intermediate output parameters. Q.What are the four basic parameters of a reducer? Ans: The four basic parameters of a reducer are text, IntWritable, text, IntWritable. The first two represent intermediate output parameters and the second two represent final output parameters. Q.What do the master class and the output class do? Ans: Master is defined to update the Master or the job tracker and the output class is defined to write data onto the output location. Q.What is the input type/format in MapReduce by default? Ans: By default the type input type in MapReduce is ‘text’. Q.Is it mandatory to set input and output type/format in MapReduce? Ans: No, it is not mandatory to set the input and output type/format in MapReduce. By default, the cluster takes the input and the output type as ‘text’. Q.What does the text input format do? Ans: In text input format, each line will create a line object, that is an hexa-decimal number. Key is considered as a line object and value is considered as a whole line text. This is how the data gets processed by a mapper. The mapper will receive the ‘key’ as a ‘LongWritable‘ parameter and value as a ‘text‘ parameter. Q.What does job conf class do? Ans: MapReduce needs to logically separate different jobs running on the same cluster. ‘Job conf class‘ helps to do job level settings such as declaring a job in real environment. It is recommended that Job name should be descriptive and represent the type of job that is being executed. Q.What does conf.setMapper Class do? 
Ans: Conf.setMapper class sets the mapper class and all the stuff related to map job such as reading a data and generating a key-value pair out of the mapper. Q.What do sorting and shuffling do? Ans: Sorting and shuffling are responsible for creating a unique key and a list of values. Making similar keys at one location is known as Sorting. And the process by which the intermediate output of the mapper is sorted and sent across to the reducers is known as Shuffling. Q.What does a split do? Ans: Before transferring the data from hard disk location to map method, there is a phase or method called the ‘Split Method‘. Split method pulls a block of data from HDFS to the framework. The Split class does not write anything, but reads data from the block and pass it to the mapper. Be default, Split is taken care by the framework. Split method is equal to the block size and is used to divide block into bunch of splits. Q.How can we change the split size if our commodity hardware has less storage space? Ans: If our commodity hardware has less storage space, we can change the split size by writing the ‘custom splitter‘. There is a feature of customization in Hadoop which can be called from the main method. Q.What does a MapReduce partitioner do? Ans: A MapReduce partitioner makes sure that all the value of a single key goes to the same reducer, thus allows evenly distribution of the map output over the reducers. It redirects the mapper output to the reducer by determining which reducer is responsible for a particular key. Q.How is Hadoop different from other data processing tools? Ans: In Hadoop, based upon your requirements, you can increase or decrease the number of mappers without bothering about the volume of data to be processed. this is the beauty of parallel processing in contrast to the other data processing tools available. Q.Can we rename the output file? Ans: Yes we can rename the output file by implementing multiple format output class. Q.Why we cannot do aggregation (addition) in a mapper? Why we require reducer for that? Ans: We cannot do aggregation (addition) in a mapper because, sorting is not done in a mapper. Sorting happens only on the reducer side. Mapper method initialization depends upon each input split. While doing aggregation, we will lose the value of the previous instance. For each row, a new mapper will get initialized. For each row, input split again gets divided into mapper, thus we do not have a track of the previous row value. Q.What is Streaming? Ans: Streaming is a feature with Hadoop framework that allows us to do programming using MapReduce in any programming language which can accept standard input and can produce standard output. It could be Perl, Python, Ruby and not necessarily be Java. However, customization in MapReduce can only be done using Java and not any other programming language. Q.What is a Combiner? Ans: A ‘Combiner’ is a mini reducer that performs the local reduce task. It receives the input from the mapper on a particular node and sends the output to the reducer. Combiners help in enhancing the efficiency of MapReduce by reducing the quantum of data that is required to be sent to the reducers. Q.What is the difference between an HDFS Block and Input Split? Ans: HDFS Block is the physical division of the data and Input Split is the logical division of the data. Q.What happens in a textinputformat? Ans: In textinputformat, each line in the text file is a record. Key is the byte offset of the line and value is the content of the line. 
For instance, the key is a LongWritable and the value is Text.
Q.What do you know about KeyValueTextInputFormat?
Ans: In KeyValueTextInputFormat, each line in the text file is a record. The first separator character divides each line: everything before the separator is the key and everything after the separator is the value. For instance, key: Text, value: Text.
Q.What do you know about SequenceFileInputFormat?
Ans: SequenceFileInputFormat is an input format for reading sequence files. Key and value are user defined. It is a specific compressed binary file format that is optimized for passing data from the output of one MapReduce job to the input of another MapReduce job.
Q.What do you know about NLineInputFormat?
Ans: NLineInputFormat splits 'N' lines of input into one split, so each mapper receives N lines of input.
Hadoop Hive Interview Questions and Answers
Q.What is the Hive Metastore?
Ans: The Hive Metastore is a central repository that stores Hive metadata in an external database.
Q.Are multiline comments supported in Hive?
Ans: No.
Q.What is the ObjectInspector functionality?
Ans: An ObjectInspector is used to analyze the structure of individual columns and the internal structure of the row objects. ObjectInspectors in Hive provide access to complex objects that can be stored in multiple formats.
Q.Explain the different types of join in Hive.
Ans: HiveQL has four different types of joins:
JOIN - similar to an inner join in SQL; only rows that satisfy the join condition on both sides are returned.
FULL OUTER JOIN - combines all records of both the left and the right table, filling in NULLs where the join condition is not met.
LEFT OUTER JOIN - all the rows from the left table are returned even if there are no matches in the right table.
RIGHT OUTER JOIN - all the rows from the right table are returned even if there are no matches in the left table.
Q.How can you configure remote metastore mode in Hive?
Ans: To configure a remote metastore in Hive, the hive-site.xml file has to be configured with the hive.metastore.uris property, for example thrift://<metastore host or IP>:9083, which gives the IP address and port of the metastore host.
Q.Explain the SMB join in Hive.
Ans: In an SMB (Sort Merge Bucket) join in Hive, each mapper reads a bucket from the first table and the corresponding bucket from the second table, and then a merge-sort join is performed. SMB join is used mainly because there is no limit on file, partition or table size, and it works best when the tables are large. In an SMB join the columns are bucketed and sorted on the join columns, and all tables must have the same number of buckets.
Q.Is it possible to change the default location of managed tables in Hive, and if so, how?
Ans: Yes, we can change the default location of managed tables using the LOCATION keyword while creating the managed table. The user has to specify the storage path of the managed table as the value of the LOCATION keyword.
Q.How does data transfer happen from HDFS to Hive?
Ans: If the data is already present in HDFS, the user need not use LOAD DATA, which moves the files to /user/hive/warehouse/. The user just has to define the table using the EXTERNAL keyword, which creates the table definition in the Hive metastore:
CREATE EXTERNAL TABLE table_name (id INT, myfields STRING) LOCATION '/my/location/in/hdfs';
Q.How can you connect an application, if you run Hive as a server?
Ans: When running Hive as a server, the application can be connected in one of the 3 ways ODBC DriverThis supports the ODBC protocol JDBC Driver This supports the JDBC protocol Thrift Client This client can be used to make calls to all hive commands using different programming language like PHP, Python, Java, C++ and Ruby. Q.What does the overwrite keyword denote in Hive load statement? Ans: Overwrite keyword in Hive load statement deletes the contents of the target table and replaces them with the files referred by the file path i.e. the files that are referred by the file path will be added to the table when using the overwrite keyword. Q.What is SerDe in Hive? How can you write your own custom SerDe? Ans: SerDe is a Serializer DeSerializer. Hive uses SerDe to read and write data from tables. Generally, users prefer to write a Deserializer instead of a SerDe as they want to read their own data format rather than writing to it. If the SerDe supports DDL i.e. basically SerDe with parameterized columns and different column types, the users can implement a Protocol based DynamicSerDe rather than writing the SerDe from scratch. Q.In case of embedded Hive, can the same metastore be used by multiple users? Ans: We cannot use metastore in sharing mode. It is suggested to use standalone real database like PostGreSQL and MySQL. Hadoop Pig Interview Questions and Answers Q.What do you mean by a bag in Pig? Ans: Collection of tuples is referred as a bag in Apache Pig Q.Does Pig support multiline commands? Ans: Yes Q.What are different modes of execution in Apache Pig? Ans: Apache Pig runs in 2 modes one is the “Pig (Local Mode) Command Mode” and the other is the “Hadoop MapReduce (Java) Command Mode”. Local Mode requires access to only a single machine where all files are installed and executed on a local host whereas MapReduce requires accessing the Hadoop cluster. Q.Explain the need for MapReduce while programming in Apache Pig. Ans: Apache Pig programs are written in a query language known as Pig Latin that is similar to the SQL query language. To execute the query, there is need for an execution engine. The Pig engine converts the queries into MapReduce jobs and thus MapReduce acts as the execution engine and is needed to run the programs. Q.Explain about cogroup in Pig. Ans: COGROUP operator in Pig is used to work with multiple tuples. COGROUP operator is applied on statements that contain or involve two or more relations. The COGROUP operator can be applied on up to 127 relations at a time. When using the COGROUP operator on two tables at oncePig first groups both the tables and after that joins the two tables on the grouped columns. Q.Explain about the BloomMapFile. Ans: BloomMapFile is a class that extends the MapFile class. It is used n HBase table format to provide quick membership test for the keys using dynamic bloom filters. Q.Differentiate between Hadoop MapReduce and Pig Ans: Pig provides higher level of abstraction whereas MapReduce provides low level of abstraction. MapReduce requires the developers to write more lines of code when compared to Apache Pig. Pig coding approach is comparatively slower than the fully tuned MapReduce coding approach. Q.What is the usage of foreach operation in Pig scripts? Ans: FOREACH operation in Apache Pig is used to apply transformation to each element in the data bag so that respective action is performed to generate new data items. Syntax FOREACH data_bagname GENERATE exp1, exp2 Q.Explain about the different complex data types in Pig. 
Ans: Apache Pig supports 3 complex data types Maps These are key, value stores joined together using #. Tuples Just similar to the row in a table where different items are separated by a comma. Tuples can have multiple attributes. Bags Unordered collection of tuples. Bag allows multiple duplicate tuples. Q.What does Flatten do in Pig? Ans: Sometimes there is data in a tuple or bag and if we want to remove the level of nesting from that data then Flatten modifier in Pig can be used. Flatten unnests bags and tuples. For tuples, the Flatten operator will substitute the fields of a tuple in place of a tuple whereas unnesting bags is a little complex because it requires creating new tuples. Hadoop Zookeeper Interview Questions and Answers Q.Can Apache Kafka be used without Zookeeper? Ans: It is not possible to use Apache Kafka without Zookeeper because if the Zookeeper is down Kafka cannot serve client request. Q.Name a few companies that use Zookeeper. Ans: Yahoo, Solr, Helprace, Neo4j, Rackspace Q.What is the role of Zookeeper in HBase architecture? Ans: In HBase architecture, ZooKeeper is the monitoring server that provides different services like –tracking server failure and network partitions, maintaining the configuration information, establishing communication between the clients and region servers, usability of ephemeral nodes to identify the available servers in the cluster. Q.Explain about ZooKeeper in Kafka Ans: Apache Kafka uses ZooKeeper to be a highly distributed and scalable system. Zookeeper is used by Kafka to store various configurations and use them across the hadoop cluster in a distributed manner. To achieve distributedness, configurations are distributed and replicated throughout the leader and follower nodes in the ZooKeeper ensemble. We cannot directly connect to Kafka by bye passing ZooKeeper because if the ZooKeeper is down it will not be able to serve the client request. Q.Explain how Zookeeper works ZooKeeper is referred to as the King of Coordination and distributed applications use ZooKeeper to store and facilitate important configuration information updates. ZooKeeper works by coordinating the processes of distributed applications. ZooKeeper is a robust replicated synchronization service with eventual consistency. A set of nodes is known as an ensemble and persisted data is distributed between multiple nodes. 3 or more independent servers collectively form a ZooKeeper cluster and elect a master. One client connects to any of the specific server and migrates if a particular node fails. The ensemble of ZooKeeper nodes is alive till the majority of nods are working. The master node in ZooKeeper is dynamically selected by the consensus within the ensemble so if the master node fails then the role of master node will migrate to another node which is selected dynamically. Writes are linear and reads are concurrent in ZooKeeper. Q.List some examples of Zookeeper use cases. Ans: Found by Elastic uses Zookeeper comprehensively for resource allocation, leader election, high priority notifications and discovery. The entire service of Found built up of various systems that read and write to Zookeeper. Apache Kafka that depends on ZooKeeper is used by LinkedIn Storm that relies on ZooKeeper is used by popular companies like Groupon and Twitter. Q.How to use Apache Zookeeper command line interface? Ans: ZooKeeper has a command line client support for interactive use. The command line interface of ZooKeeper is similar to the file and shell system of UNIX. 
Data in ZooKeeper is stored in a hierarchy of Znodes where each znode can contain data just similar to a file. Each znode can also have children just like directories in the UNIX file system. Zookeeperclient command is used to launch the command line client. If the initial prompt is hidden by the log messages after entering the command, users can just hit ENTER to view the prompt. Q.What are the different types of Znodes? Ans: There are 2 types of Znodes namely Ephemeral and Sequential Znodes. The Znodes that get destroyed as soon as the client that created it disconnects are referred to as Ephemeral Znodes. Sequential Znode is the one in which sequential number is chosen by the ZooKeeper ensemble and is prefixed when the client assigns name to the znode. Q.What are watches? Ans: Client disconnection might be troublesome problem especially when we need to keep a track on the state of Znodes at regular intervals. ZooKeeper has an event system referred to as watch which can be set on Znode to trigger an event whenever it is removed, altered or any new children are created below it. Q.What problems can be addressed by using Zookeeper? Ans: In the development of distributed systems, creating own protocols for coordinating the hadoop cluster results in failure and frustration for the developers. The architecture of a distributed system can be prone to deadlocks, inconsistency and race conditions. This leads to various difficulties in making the hadoop cluster fast, reliable and scalable. To address all such problems, Apache ZooKeeper can be used as a coordination service to write correct distributed applications without having to reinvent the wheel from the beginning. Hadoop Flume Interview Questions and Answers Q.Explain about the core components of Flume. Ans: The core components of Flume are – Event- The single log entry or unit of data that is transported. Source- This is the component through which data enters Flume workflows. Sink-It is responsible for transporting data to the desired destination. Channel- it is the duct between the Sink and Source. Agent- Any JVM that runs Flume. Client- The component that transmits event to the source that operates with the agent. Q.Does Flume provide 100% reliability to the data flow? Ans: Yes, Apache Flume provides end to end reliability because of its transactional approach in data flow. Q.How can Flume be used with HBase? Ans: Apache Flume can be used with HBase using one of the two HBase sinks – HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters and also the novel HBase IPC that was introduced in the version HBase 0.96. AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) has better performance than HBase sink as it can easily make non-blocking calls to HBase. Working of the HBaseSink – In HBaseSink, a Flume Event is converted into HBase Increments or Puts. Serializer implements the HBaseEventSerializer which is then instantiated when the sink starts. For every event, sink calls the initialize method in the serializer which then translates the Flume Event into HBase increments and puts to be sent to HBase cluster. Working of the AsyncHBaseSink- AsyncHBaseSink implements the AsyncHBaseEventSerializer. The initialize method is called only once by the sink when it starts. Sink invokes the setEvent method and then makes calls to the getIncrements and getActions methods just similar to HBase sink. When the sink stops, the cleanUp method is called by the serializer. 
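To make the Znode and watch discussion above more concrete, here is a minimal Java sketch using the standard ZooKeeper client API. The connect string localhost:2181, the /worker-1 path and the 5-second session timeout are illustrative assumptions, not values mandated by ZooKeeper.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralZnodeExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble; the watcher passed here is notified of session events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000,
                (WatchedEvent event) -> System.out.println("Session event: " + event.getState()));

        // Create an ephemeral znode; it disappears automatically when this client's session ends.
        String path = zk.create("/worker-1", "alive".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Set a one-time watch on the znode; the callback fires when it is changed or deleted.
        zk.exists(path, (WatchedEvent event) ->
                System.out.println("Watch fired: " + event.getType() + " on " + event.getPath()));

        Thread.sleep(10000);  // keep the session (and therefore the ephemeral znode) alive briefly
        zk.close();           // closing the session removes the ephemeral znode
    }
}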
Q.Explain about the different channel types in Flume. Which channel type is faster? Ans: The 3 different built in channel types available in Flume are- MEMORY Channel – Events are read from the source into memory and passed to the sink. JDBC Channel – JDBC Channel stores the events in an embedded Derby database. FILE Channel –File Channel writes the contents to a file on the file system after reading the event from a source. The file is deleted only after the contents are successfully delivered to the sink. MEMORY Channel is the fastest channel among the three however has the risk of data loss. The channel that you choose completely depends on the nature of the big data application and the value of each event. Q.Which is the reliable channel in Flume to ensure that there is no data loss? Ans: FILE Channel is the most reliable channel among the 3 channels JDBC, FILE and MEMORY. Q.Explain about the replication and multiplexing selectors in Flume. Ans: Channel Selectors are used to handle multiple channels. Based on the Flume header value, an event can be written just to a single channel or to multiple channels. If a channel selector is not specified to the source then by default it is the Replicating selector. Using the replicating selector, the same event is written to all the channels in the source’s channels list. Multiplexing channel selector is used when the application has to send different events to different channels. Q.How multi-hop agent can be setup in Flume? Ans: Avro RPC Bridge mechanism is used to setup Multi-hop agent in Apache Flume. Q.Does Apache Flume provide support for third party plug-ins? Ans: Most of the data analysts use Apache Flume has plug-in based architecture as it can load data from external sources and transfer it to external destinations. Q.Is it possible to leverage real time analysis on the big data collected by Flume directly? If yes, then explain how. Ans: Data from Flume can be extracted, transformed and loaded in real-time into Apache Solr servers usingMorphlineSolrSink Q.Differentiate between FileSink and FileRollSink Ans: The major difference between HDFS FileSink and FileRollSink is that HDFS File Sink writes the events into the Hadoop Distributed File System (HDFS) whereas File Roll Sink stores the events into the local file system. Hadoop Sqoop Interview Questions and Answers Q.Explain about some important Sqoop commands other than import and export. Ans: Create Job (--create) Here we are creating a job with the name my job, which can import the table data from RDBMS table to HDFS. The following command is used to create a job that is importing data from the employee table in the db database to the HDFS file. $ Sqoop job --create myjob \ --import \ --connect jdbc:mysql://localhost/db \ --username root \ --table employee --m 1 Verify Job (--list) ‘--list’ argument is used to verify the saved jobs. The following command is used to verify the list of saved Sqoop jobs. $ Sqoop job --list Inspect Job (--show) ‘--show’ argument is used to inspect or verify particular jobs and their details. The following command and sample output is used to verify a job called myjob. $ Sqoop job --show myjob Execute Job (--exec) ‘--exec’ option is used to execute a saved job. The following command is used to execute a saved job called myjob. $ Sqoop job --exec myjob Q.How Sqoop can be used in a Java program? Ans: The Sqoop jar in classpath should be included in the java code. After this the method Sqoop.runTool () method must be invoked. 
The necessary parameters should be created to Sqoop programmatically just like for command line. Q.What is the process to perform an incremental data load in Sqoop? Ans: The process to perform incremental data load in Sqoop is to synchronize the modified or updated data (often referred as delta data) from RDBMS to Hadoop. The delta data can be facilitated through the incremental load command in Sqoop. Incremental load can be performed by using Sqoop import command or by loading the data into hive without overwriting it. The different attributes that need to be specified during incremental load in Sqoop are- 1)Mode (incremental) –The mode defines how Sqoop will determine what the new rows are. The mode can have value as Append or Last Modified. 2)Col (Check-column) –This attribute specifies the column that should be examined to find out the rows to be imported. 3)Value (last-value) –This denotes the maximum value of the check column from the previous import operation. Q.Is it possible to do an incremental import using Sqoop? Ans: Yes, Sqoop supports two types of incremental imports- 1)Append 2)Last Modified To insert only rows Append should be used in import command and for inserting the rows and also updating Last-Modified should be used in the import command. Q.What is the standard location or path for Hadoop Sqoop scripts? /usr/bin/Hadoop Sqoop Q.How can you check all the tables present in a single database using Sqoop? Ans: The command to check the list of all tables present in a single database using Sqoop is as follows- Sqoop list-tables –connect jdbc: mysql: //localhost/user; Q.How are large objects handled in Sqoop? Ans: Sqoop provides the capability to store large sized data into a single field based on the type of data. Sqoop supports the ability to store- 1)CLOB ‘s – Character Large Objects 2)BLOB’s –Binary Large Objects Large objects in Sqoop are handled by importing the large objects into a file referred as “LobFile” i.e. Large Object File. The LobFile has the ability to store records of huge size, thus each record in the LobFile is a large object. Q.Can free form SQL queries be used with Sqoop import command? If yes, then how can they be used? Ans: Sqoop allows us to use free form SQL queries with the import command. The import command should be used with the –e and – query options to execute free form SQL queries. When using the –e and –query options with the import command the –target dir value must be specified. Q.Differentiate between Sqoop and distCP. Ans: DistCP utility can be used to transfer data between clusters whereas Sqoop can be used to transfer data only between Hadoop and RDBMS. Q.What are the limitations of importing RDBMS tables into Hcatalog directly? Ans: There is an option to import RDBMS tables into Hcatalog directly by making use of –hcatalog –database option with the –hcatalog –table but the limitation to it is that there are several arguments like –as-avrofile , -direct, -as-sequencefile, -target-dir , -export-dir are not supported. Hadoop HBase Interview Questions and Answers Q.When should you use HBase and what are the key components of HBase? Ans: HBase should be used when the big data application has – 1)A variable schema 2)When data is stored in the form of collections 3)If the application demands key based access to data while retrieving. Key components of HBase are – Region- This component contains memory data store and Hfile. Region Server-This monitors the Region. HBase Master-It is responsible for monitoring the region server. 
Zookeeper – It takes care of the coordination between the HBase Master component and the client.
Catalog Tables – The two important catalog tables are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.
Q.What are the different operational commands in HBase at record level and table level?
Ans: Record-level operational commands in HBase are put, get, increment, scan and delete. Table-level operational commands in HBase are describe, list, drop, disable and scan.
Q.What is a Row Key?
Ans: Every row in an HBase table has a unique identifier known as the RowKey. It is used for grouping cells logically and it ensures that all cells with the same RowKey are co-located on the same server. The RowKey is internally regarded as a byte array.
Q.Explain the difference between the RDBMS data model and the HBase data model.
Ans: RDBMS is a schema-based database whereas HBase is a schema-less data model. RDBMS does not have support for in-built partitioning whereas in HBase there is automated partitioning. RDBMS stores normalized data whereas HBase stores de-normalized data.
Q.Explain about the different catalog tables in HBase.
Ans: The two important catalog tables in HBase are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.
Q.What are column families? What happens if you alter the block size of a ColumnFamily on an already populated database?
Ans: The logical division of data is represented through a key known as the column family. Column families consist of the basic unit of physical storage, on which compression features can be applied. In an already populated database, when the block size of a column family is altered, the old data remains within the old block size whereas the new data that comes in takes the new block size. When compaction takes place, the old data takes the new block size so that the existing data is read correctly.
Q.Explain the difference between HBase and Hive.
Ans: HBase and Hive are completely different Hadoop-based technologies: Hive is a data warehouse infrastructure on top of Hadoop whereas HBase is a NoSQL key-value store that runs on top of Hadoop. Hive helps SQL-savvy people run MapReduce jobs whereas HBase supports 4 primary operations: put, get, scan and delete. HBase is ideal for real-time querying of big data whereas Hive is an ideal choice for analytical querying of data collected over a period of time.
Q.Explain the process of row deletion in HBase.
Ans: On issuing a delete command in HBase through the HBase client, data is not actually deleted from the cells; rather, the cells are made invisible by setting a tombstone marker. The deleted cells are removed at regular intervals during compaction.
Q.What are the different types of tombstone markers in HBase for deletion?
Ans: There are 3 different types of tombstone markers in HBase for deletion:
1) Family Delete Marker – This marker marks all columns for a column family.
2) Version Delete Marker – This marker marks a single version of a column.
3) Column Delete Marker – This marker marks all the versions of a column.
Q.Explain about HLog and WAL in HBase.
Ans: All edits in the HStore are stored in the HLog. Every region server has one HLog. The HLog contains entries for edits of all regions performed by a particular Region Server. WAL stands for Write Ahead Log, in which all the HLog edits are written immediately. WAL edits remain in memory until the flush period in the case of deferred log flush.
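The record-level commands above (put, get, scan, delete) map directly onto the HBase Java client API. A minimal sketch, assuming an HBase 1.x/2.x client on the classpath and a reachable cluster; the table name "customer" and column family "cf" are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrudExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("customer"))) {

            // put: insert a cell, addressed by its RowKey
            Put put = new Put(Bytes.toBytes("cust001"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // get: read the row back by its RowKey
            Result result = table.get(new Get(Bytes.toBytes("cust001")));
            System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))));

            // scan: iterate over the rows of the table
            try (ResultScanner scanner = table.getScanner(new Scan())) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }

            // delete: sets a tombstone marker; the cell is physically removed later, during compaction
            table.delete(new Delete(Bytes.toBytes("cust001")));
        }
    }
}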
Contact for more on Hadoop Online Training.
Informatica Interview Questions
INFORMATICA INTERVIEW QUESTIONS & ANSWERS Q.What Can We Do To Improve The Performance Of Informatica Aggregator Transformation? Ans: Aggregator performance improves dramatically if records are sorted before passing to the aggregator and "sorted input" option under aggregator properties is checked. The record set should be sorted on those columns that are used in Group By operation. It is often a good idea to sort the record set in database level e.g. inside a source qualifier transformation, unless there is a chance that already sorted records from source qualifier can again become unsorted before reaching aggregator. Q.How To Delete Duplicate Row Using Informatica? Ans: Assuming that the source system is a Relational Database, to eliminate duplicate records, we can check the Distinct option of the Source Qualifier of the source table and load the target accordingly. Q.What Are The Different Lookup Cache? Ans: Lookups can be cached or uncached (No cache). Cached lookup can be either static or dynamic. A static cache is one which does not modify the cache once it is built and it remains same during the session run. On the other hand, A dynamic cache is refreshed during the session run by inserting or updating the records in cache based on the incoming source data. A lookup cache can also be divided as persistent or non-persistent based on whether Informatica retains the cache even after session run is complete or not respectively. Q.How Can We Update A Record In Target Table Without Using Update Strategy? Ans: A target table can be updated without using 'Update Strategy'. For this, we need to define the key in the target table in Informatica level and then we need to connect the key and the field we want to update in the mapping Target. In the session level, we should set the target property as "Update as Update" and check the "Update" check-box. Let's assume we have a target table "Customer" with fields as "Customer ID", "Customer Name" and "Customer Address". Suppose we want to update "Customer Address" without an Update Strategy. Then we have to define "Customer ID" as primary key in Informatica level and we will have to connect Customer ID and Customer Address fields in the mapping. If the session properties are set correctly as described above, then the mapping will only update the customer address field for all matching customer IDs. Q. What Type Of Repositories Can Be Created Using Informatica Repository Manager? Ans: Informatica PowerCenter includeds following type of repositories : Standalone Repository : A repository that functions individually and this is unrelated to any other repositories. Global Repository : This is a centralized repository in a domain. This repository can contain shared objects across the repositories in a domain. The objects are shared through global shortcuts. Local Repository : Local repository is within a domain and it’s not a global repository. Local repository can connect to a global repository using global shortcuts and can use objects in it’s shared folders. Versioned Repository : This can either be local or global repository but it allows version control for the repository. A versioned repository can store multiple copies, or versions of an object. This features allows to efficiently develop, test and deploy metadata in the production environment. Q. What Is A Code Page? Ans: A code page contains encoding to specify characters in a set of one or more languages. The code page is selected based on source of the data. 
For example, if the source contains Japanese text then the code page should be selected to support Japanese text. When a code page is chosen, the program or application for which the code page is set refers to a specific set of data that describes the characters the application recognizes. This influences the way that the application stores, receives, and sends character data.
Q. Which All Databases Can PowerCenter Server On Windows Connect To?
Ans: PowerCenter Server on Windows can connect to the following databases: IBM DB2, Informix, Microsoft Access, Microsoft Excel, Microsoft SQL Server, Oracle, Sybase, Teradata.
Q. Which All Databases Can PowerCenter Server On Unix Connect To?
Ans: PowerCenter Server on UNIX can connect to the following databases: IBM DB2, Informix, Oracle, Sybase, Teradata.
Q. Explain Use Of Update Strategy Transformation?
Ans: It is used to flag source records as INSERT, DELETE, UPDATE or REJECT for the target database. The default flag is Insert. This is a must for incremental data loading. It is an important transformation, used to maintain history data or just the most recent changes in the target table. We can set or flag the records at these two levels.
Within a session: When you configure the session, you can instruct the Informatica server to treat all the records in the same way.
Within a mapping: Within a mapping we use the Update Strategy transformation to flag the records as insert, update, delete or reject.
Q. What Are The Measure Objects?
Ans: Aggregate calculations like sum, avg, max and min are the measure objects.
Q. Discuss The Advantages & Disadvantages Of Star & Snowflake Schema?
Ans: In a STAR schema there is no relation between any two dimension tables, whereas in a SNOWFLAKE schema there can be relations between the dimension tables. In a star schema all dimensions are de-normalized, so more table space is used but query performance is better because fewer joins are needed. In a snowflake schema the dimensions are normalized, so table space is reduced, but maintenance cost is higher and query performance degrades because of the additional joins.
Q. What Is The Method Of Loading 5 Flat Files Having The Same Structure To A Single Target, And Which Transformations Can I Use?
Ans: Two methods:
Write all files in one directory, then use the file repository concept (don't forget to set the source filetype as Indirect in the session).
Use a Union transformation to combine multiple input files into a single target.
Q. Compare Data Warehousing Top-down Approach With Bottom-up Approach.
Ans:
Top down: ODS --> ETL --> Data warehouse --> Data mart --> OLAP
Bottom up: ODS --> ETL --> Data mart --> Data warehouse --> OLAP
Q. Why Do We Use Partitioning The Session In Informatica?
Ans: Performance can be improved by processing data in parallel in a single session by creating multiple partitions of the pipeline. The Informatica server can achieve high performance by partitioning the pipeline and performing the extract, transformation, and load for each partition in parallel.
Q. What Is The Router Transformation?
Ans: A Router transformation is similar to a Filter transformation because both transformations allow you to use a condition to test data. However, a Filter transformation tests data for one condition and drops the rows of data that do not meet the condition. A Router transformation tests data for one or more conditions and gives you the option to route rows of data that do not meet any of the conditions to a default output group.
If you need to test the same input data based on multiple conditions, use a Router transformation in a mapping instead of creating multiple Filter transformations to perform the same task.
Q. How Do You Create A Mapping Using Multiple Lookup Transformations?
Ans: Use an unconnected Lookup if the same lookup repeats multiple times.
Q. Difference Between Summary Filter And Details Filter?
Ans: Summary Filter – we can apply it to records grouped on columns that contain common values. Detail Filter – we can apply it to each and every record in a database.
Q. Can Informatica Be Used As A Cleansing Tool? If Yes, Give Example Of Transformations That Can Implement A Data Cleansing Routine.
Ans: Yes, we can use Informatica for cleansing data. Sometimes we use staging tables to cleanse the data; depending on performance, we can also use an Expression transformation to cleanse data. For example, if a field X has some rows with NULL values and is mapped to a target field that is a NOT NULL column, inside an Expression we can assign a space or some constant value to avoid session failure. If the input data is in one format and the target is in another format, we can change the format in an Expression. We can also assign default values in the target to represent a complete set of data in the target.
Q. How Can You Create Or Import A Flat File Definition Into The Warehouse Designer?
Ans: You cannot create or import a flat file definition into the Warehouse Designer directly. Instead, you must analyze the file in the Source Analyzer, then drag it into the Warehouse Designer. When you drag the flat file source definition into the Warehouse Designer workspace, the Warehouse Designer creates a relational target definition, not a file definition. If you want to load to a file, configure the session to write to a flat file. When the Informatica server runs the session, it creates and loads the flat file.
Q. Is A Fact Table Normalized Or De-normalized?
Ans: A fact table is always a DE-NORMALIZED table. It consists of data from the dimension tables (primary keys); a fact table has foreign keys and measures.
Q. How Will You Create Header And Footer In Target Using Informatica?
Ans: If your focus is on flat files, then this can be set in the file properties while creating a mapping, or at the session level in the session properties.
Q. Explain About Informatica Server Architecture?
Ans: The Informatica server, load manager, data transfer manager, reader, temp server and writer are the components of the Informatica server. First, the load manager sends a request to the reader; when the reader is ready, it reads the data from the source and dumps it into the temp server; the data transfer manager manages the load and sends the request to the writer on a first-in, first-out basis; and the writer takes the data from the temp server and loads it into the target.
Q. What Are The Basic Needs To Join Two Sources In A Source Qualifier?
Ans: The basic needs to join two sources using a Source Qualifier are:
Both sources should be in the same database.
They should have at least one column in common with the same data type.
Q. How Do You Configure A Mapping In Informatica?
Ans: Import the source from the database. Check if the target table already exists in the database; if it exists, make sure you delete the data from it and import it into the Designer, or else create it with the Create Target wizard. Now you can drag the needed transformations into the workspace and use them according to your purpose. For improved performance follow these tips:
Use a Sorter before the Aggregator.
If a Filter is used, keep it as near to the source as possible.
If possible use an extra expression tr before target to make corrections in future. Enable sorted input option if sorter is used before agg tr. If more filters are needed use router tr. you can use source filter option of SQ if filter tr is immediately after source. In case of router if not needed do not connect default group to any target. Q. How Many Types Of Dimensions Are Available In Informatica? Ans: The types of dimensions available are: Junk dimension Degenerative Dimension Conformed Dimension Q. What Is The Difference Between Filter And Lookup Transformation? Ans: Filter transformation is an Active transformation and Lookup is a Passive transformation. Filter transformation is used to Filter rows based on condition and Lookup is used to look up data in a flat file or a relational table, view, or synonym. Q. What Are The Joiner Caches? Ans: Specifies the directory used to cache master records and the index to these records. By default, the cached files are created in a directory specified by the server variable $PMCacheDir. If you override the directory, make sure the directory exists and contains enough disk space for the cache files. The directory can be a mapped or mounted drive. There are 2-types of cache in the joiner: Data cache Index Cache Q. What Is The Difference Between Informatica 7.0 And 8.0 ? Ans: The basic difference between informatica 8.0 and informatica 7.0 is that in 8.0 series informatica corp has introduces powerexchnage concept. Q. Which Is Better Among Connected Lookup And Unconnected Lookup Transformations In Informatica Or Any Other Etl Tool? Ans: If you are having defined source you can use connected, source is not well defined or from different database you can go for unconnected. Connected and unconnected lookup depends on scenarios and performance If you are looking for a single value for look up and the value is like 1 in 1000 then you should go for unconnected lookup. Performance wise its better as we are not frequently using the transformation. If multiple columns are returned as lookup value then one should go for connected lookup. Q. How To Read Rejected Data Or Bad Data From Bad File And Reload It To Target? Ans: Correction the rejected data and send to target relational tables using load order utility. Find out the rejected data by using column indicator and row indicator. Q. Which Tasks Can Be Performed On Port Level(using One Specific Port)? Ans: I think unconnected Lookup or expression transformation can be used for single port for a row. Q. Differences Between Normalizer And Normalizer Transformation. Ans: Normalizer : It is a transformation mainly using for cobol sources. It change the rows into columns and columns into rows. Normalization : To remove the redundancy and inconsistency. Normalizer Transformation : can be used to obtain multiple columns from a single row. Q. In A Joiner Transformation, You Should Specify The Source With Fewer Rows As The Master Source. Why? Ans: Joiner transformation compares each row of the master source against the detail source. The fewer unique rows in the master, the fewer iterations of the join comparison occur, which speeds the join process. Joiner Transformation will cache Master table's data hence it is advised to define table with less #of rows as master. Q. How To Import Oracle Sequence Into Informatica? Ans: Create one procedure and declare the sequence inside the procedure, finally call the procedure in informatica with the help of stored procedure transformation. Q. 
How To Get The First 100 Rows From The Flat File Into The Target?
Ans: One common approach is to assign a running row number to each row (for example with a Sequence Generator transformation, or a variable port in an Expression transformation) and then use a Filter transformation with a condition such as NEXTVAL <= 100, so that only the first 100 rows are passed to the target.
Q. How To Load The Time Dimension?
Ans: We can use SCD Type 1/2/3 to load any dimension based on the requirement. We can also use a procedure to populate the Time dimension.
Q. What Is The Difference Between Informatica 7.x And 8.x, And What Is The Latest Version?
Ans: The Java transformation is available in the 8.x version and is not available in the 7.x version.
Q. How Do You Handle Decimal Places While Importing A Flat File Into Informatica?
Ans: While importing the flat file definition, just specify the scale for a numeric data type in the mapping; the flat file source supports only the number data type (no decimal and integer). In the Source Qualifier associated with that source, the data type will be decimal for that number port of the source: source -> number datatype port -> SQ -> decimal datatype. Integer is not supported, hence decimal is taken care of. Alternatively, import the field as a string and then use an Expression to convert it, so that we can avoid truncation of decimal places in the source itself.
Q. What Is A Hash Table In Informatica?
Ans: In hash partitioning, the Informatica Server uses a hash function to group rows of data among partitions. The Informatica Server groups the data based on a partition key. Use hash partitioning when you want the Informatica Server to distribute rows to the partitions by group. For example, you need to sort items by item ID, but you do not know how many items have a particular ID number.
Q. What Is The Use Of Incremental Aggregation? Explain Briefly With An Example.
Ans: It is a session option. When the Informatica server performs incremental aggregation, it passes new source data through the mapping and uses historical cache data to perform the new aggregation calculations incrementally; we use it for performance. When using incremental aggregation, you apply captured changes in the source to aggregate calculations in a session. If the source changes incrementally and you can capture the changes, you can configure the session to process those changes. This allows the Integration Service to update the target incrementally, rather than forcing it to process the entire source and recalculate the same data each time you run the session.
Q. How Can I Get Distinct Values While Mapping In Informatica During Insertion?
Ans: You can add an Aggregator before the insert and group by the fields that need to be distinct.
Q. At The Most, How Many Transformations Can Be Used In A Mapping?
Ans: In a mapping we can use any number of transformations, depending on the project and the transformations required for it.
Q. What Is The Target Load Order?
Ans: A target load order group is the collection of source qualifiers, transformations, and targets linked together in a mapping. You specify the target load order based on the source qualifiers in a mapping. If you have multiple source qualifiers connected to multiple targets, you can designate the order in which the Informatica server loads data into the targets.
Q. How To Recover Sessions In Concurrent Batches?
Ans: If multiple sessions in a concurrent batch fail, you might want to truncate all targets and run the batch again. However, if a session in a concurrent batch fails and the rest of the sessions complete successfully, you can recover the session as a standalone session.
To recover a session in a concurrent batch: Copy the failed session using Operations-Copy Session. Drag the copied session outside the batch to be a standalone session. Follow the steps to recover a standalone session. Delete the standalone copy. Q. What Is Difference Between Maplet And Reusable Transformation? Ans: Maplet : one or more transformations. set of transformations that are reusable. Reusable transformation : only one transformation Single transformation which is reusable. Q. What Is The Difference Between Stop And Abort? Ans: stop : If the session u want to stop is a part of batch you must stop the batch, if the batch is part of nested batch, Stop the outer most batch. Abort : You can issue the abort command, it is similar to stop command except it has 60 second time out. Q. How Can You Complete Unrecoverable Sessions? Ans: Under certain circumstances, when a session does not complete, you need to truncate the target tables and run the session from the beginning. Run the session from the beginning when the Informatica Server cannot run recovery or when running recovery might result in inconsistent data. If there is no recovery mode on in session and workflow failed in mid of execution then Don’t truncate table immediately. If there is large volume of data is performing by the load and more than 25% data has loaded then-if same workflow has multiple session then check particular session which caused to be failed and fire the delete command only to delete particular session data which has loaded and copy the session into new workflow and run only that session or dependent others. Q. What Are Partition Points? Ans: Partition points mark the thread boundaries in a source pipeline and divide the pipeline into stages. Q. What Is A Source Qualifier? Ans: When you add a relational or a flat file source definition to a mapping, you need to connect it to a Source Qualifier transformation. The Source Qualifier represents the rows that the Informatica Server reads when it executes a session. Q. How Do We Estimate The Number Of Partitions That A Mapping Really Requires? Is It Dependent On The Machine Configuration? Ans: It depends upon the informatica version we are using. suppose if we are using informatica 6 it supports only 32 partitions where as informatica 7 supports 64 partitions. Q. How Can You Improve Session Performance In Aggregator Transformation? Ans: One way is supplying the sorted input to aggregator transformation. In situations where sorted input cannot be supplied, we need to configure data cache and index cache at session/transformation level to allocate more space to support aggregation. Q. What Is The Difference Between Summary Filter And Detail Filter? Ans: Summary filter can be applieid on a group of rows that contain a common value. whereas detail filters can be applied on each and every rec of the data base. Q. What Is The Default Source Option For Update Strategy Transformation? Ans: Default option for update strategy transformation is dd_insert or we can put '0' in session level data driven. Q. What Is Meant By Aggregate Fact Table And Where Is It Used? Ans: Basically fact tables are two kinds. Aggregated fact table and Factless fact table. Aggregated fact table has aggregarted columns. for eg. Total-Sal, Dep-Sal. where as in factless fact table will not have aggregated columns and it only has FK to the Dimension tables. Q. Can You Start A Batches With In A Batch? Ans: You cannot. 
If you want to start a batch that resides within another batch, create a new independent batch and copy the necessary sessions into the new batch.
Q. What Is The Default Join That Source Qualifier Provides?
Ans: Inner equi-join.
Q. What Are The Differences Between Joiner Transformation And Source Qualifier Transformation?
Ans: You can join heterogeneous data sources in a Joiner transformation, which we cannot achieve in a Source Qualifier transformation. You need matching keys to join two relational sources in a Source Qualifier transformation, whereas you do not need matching keys to join two sources in a Joiner. In a Source Qualifier the two relational sources should come from the same data source, whereas in a Joiner you can also join relational sources coming from different data sources.
Q. How Does The Informatica Server Increase Session Performance Through Partitioning The Source?
Ans: For relational sources the Informatica server creates multiple connections, one for each partition of a single source, and extracts a separate range of data for each connection. The Informatica server reads multiple partitions of a single source concurrently. Similarly, for loading, the Informatica server creates multiple connections to the target and loads partitions of data concurrently. For XML and file sources, the Informatica server reads multiple files concurrently. For loading the data, the Informatica server creates a separate file for each partition (of a source file). You can choose to merge the targets.
Q. How Can We Use The pmcmd Command In A Workflow Or To Run A Session?
Ans: By using a Command task: the Command task lets us call pmcmd, where we can write the appropriate pmcmd command to run a workflow.
Q. Doubts Regarding Rank Transformation: Can We Do Ranking Using Two Ports? Can We Rank All The Rows Coming From Source, And How?
Ans: When the ETL loads data from the source, we can rank the incoming data by passing it through a Rank transformation. We cannot declare two rank ports on a single source's data. We rank the rows by declaring the Rank transformation and designating the rank port.
Q. Define The Informatica Repository?
Ans: The Informatica repository is at the center of the Informatica suite. You create a set of metadata tables within the repository database that the Informatica application and tools access. The Informatica client and server access the repository to save and retrieve metadata.
Q. What Are The Methods For Creating Reusable Transformations?
Ans: There are two methods for creating reusable transformations:
Using the Transformation Developer tool.
Converting a non-reusable transformation into a reusable transformation in a mapping.
Q. What Is The Difference Between Normal Load And Bulk Load?
Ans: Normal load: Normal load writes information to the database log file, so that if any recovery is needed it will be helpful. When the source file is a text file and we are loading data into a table, in such cases we should use normal load only, or else the session will fail.
Bulk load: Bulk load does not write information to the database log file, so if any recovery is needed we cannot do anything in such cases. Comparatively, bulk load is much faster than normal load.
Q. What Is The Difference Between The Informatica PowerCenter Server, Repository Server And Repository?
Ans: The repository is a database in which all Informatica components are stored in the form of tables. The repository server controls the repository and maintains data integrity and consistency across the repository when multiple users use Informatica.
Powercenter Server/Infa Server is responsible for execution of the components (sessions) stored in the repository. Q. How Can You Access The Remote Source Into Your Session? Ans: Relational source : To access relational source which is situated in a remote place , you need to configure database connection to the datasource. FileSource : To access the remote source file you must configure the FTP connection to the host machine before you create the session. Heterogeneous : When you are mapping contains more than one source type, the server manager creates a heterogeneous session that displays source options for all types. Q. What Is The Difference Between Connected And Unconnected Stored Procedures? Ans: Unconnected: The unconnected Stored Procedure transformation is not connected directly to the flow of the mapping. It either runs before or after the session, or is called by an expression in another transformation in the mapping. connected: The flow of data through a mapping in connected mode also passes through the Stored Procedure transformation. All data entering the transformation through the input ports affects the stored procedure. You should use a connected Stored Procedure transformation when you need data from an input port sent as an input parameter to the stored procedure, or the results of a stored procedure sent as an output parameter to another transformation. Q. What Is Source Qualifier Transformation? Ans: When you add a relational or a flat file source definition to a mapping need to connect it to a source qualifier transformation. The source qualifier transformation represents the records that the informatica server reads when it runs a session. Q. To Provide Support For Mainframes Source Data,which Files Are Used As A Source Definitions? Ans: COBOL Copy-book files. Q. How Many Types Of Facts Are There And What Are They? Ans: There are three types of facts Additive fact: a fact which can be summarized by any one of dimension or all dimensions EX: QTY, REVENUE Semi additive fact: a fact which can be summarized for few dimensions not for all dimensions. ex: current balance Non additive fact: a fact which cannot be summarized by any of dimensions. ex: percentage of profit Q. Can We Run A Group Of Sessions Without Using Workflow Manager. Ans: It is possible two run two session only (by precession, post session) using pmcmd without using workflow. Not more than two. Q. If You Want To Create Indexes After The Load Process Which Transformation You Choose? Ans: Its usually not done in the mapping (transformation) level. Its done in session level. Create a command task which will execute a shell script (if Unix) or any other scripts which contains the create index command. Use this command task in the workflow after the session or else, You can create it with a post session command. Q. How Do I Import Vsam Files From Source To Target. Do I Need A Special Plugin Ans: As far my knowledge by using power exchange tool convert VSAM file to oracle tables then do mapping as usual to the target table. Q. How Can We Partition A Session In Informatica? Ans: The Informatica PowerCenter Partitioning option optimizes parallel processing on multiprocessor hardware by providing a thread-based architecture and built-in data partitioning. GUI-based tools reduce the development effort necessary to create data partitions and streamline ongoing troubleshooting and performance tuning tasks, while ensuring data integrity throughout the execution process. 
As the amount of data within an organization expands and real-time demand for information grows, the PowerCenter Partitioning option enables hardware and applications to provide outstanding performance and jointly scale to handle large volumes of data and users.
Q. What Is The Procedure To Load The Fact Table? Give It In Detail.
Ans: Based on the requirements for your fact table, choose the sources and data and transform them based on your business needs. For the fact table you need a primary key, so use a Sequence Generator transformation to generate a unique key and pipe it to the target (fact) table along with the foreign keys from the source tables.
Q. What Are The Differences Between Informatica PowerCenter Versions 6.2 And 7.1, And Also Between Versions 6.2 And 5.1?
Ans: The main difference between Informatica 5.1 and 6.1 is that in 6.1 they introduced the repository server and, in place of the Server Manager (5.1), they introduced the Workflow Manager and Workflow Monitor.
Q. Why Do We Use The Stored Procedure Transformation?
Ans: A Stored Procedure transformation is an important tool for populating and maintaining databases. Database administrators create stored procedures to automate time-consuming tasks that are too complicated for standard SQL statements.
Q. What Is The Difference Between Partitioning Of Relational Targets And Partitioning Of File Targets?
Ans: Partitioning can be done on both relational and flat file targets. Informatica supports the following partition types: database partitioning, round-robin, pass-through, hash-key partitioning and key-range partitioning. All of these are applicable for relational targets; for flat files, only database partitioning is not applicable. Informatica supports n-way partitioning: you can just specify the name of the target file and create the partitions, and the rest will be taken care of by the Informatica session.
Q. What Are The Two Modes Of Data Movement In The Informatica Server?
Ans: The data movement mode depends on whether the Informatica Server should process single-byte or multi-byte character data. This mode selection can affect the enforcement of code page relationships and code page validation in the Informatica Client and Server.
Unicode – the IS allows 2 bytes for each character and uses an additional byte for each non-ASCII character (such as Japanese characters).
ASCII – the IS holds all data in a single byte.
The IS data movement mode can be changed in the Informatica Server configuration parameters. This comes into effect once you restart the Informatica Server.
Q. What Are The Main Advantages And Purpose Of Using The Normalizer Transformation In Informatica?
Ans: The Normalizer transformation is used mainly with COBOL sources, where most of the time data is stored in de-normalized format. The Normalizer transformation can also be used to create multiple rows from a single row of data. It reads data from COBOL sources and supports horizontal pivoting, i.e. processing a single input row into multiple output rows.
Q. What Is Change Data Capture?
Ans: Change data capture (CDC) is a set of software design patterns used to determine the data that has changed in a database so that action can be taken using the changed data.
Q. How To Delete Duplicate Rows In Flat File Sources - Is There Any Option In Informatica?
Ans: Use a Sorter transformation; it has a "Distinct" option, so make use of it.
Q. What Is The Update Strategy Transformation?
Ans: The model you choose constitutes your update strategy, i.e. how to handle changes to existing rows.
In PowerCenter and PowerMart, you set your update strategy at two different levels: Within a session : When you configure a session, you can instruct the Informatica Server to either treat all rows in the same way (for example, treat all rows as inserts), or use instructions coded into the session mapping to flag rows for different database operations. Within a mapping : Within a mapping, you use the Update Strategy transformation to flag rows for insert, delete, update, or reject. update strategy is used to update the target Q. Discuss Which Is Better Among Incremental Load, Normal Load And Bulk Load. Ans: It depends on the requirement. Otherwise Incremental load which can be better as it takes one that data which is not available previously on the target. According to performance bulk is better than normal. But both having some conditions in source data. Conditions are like: Does not contain any constraint in data. Dont use the double datatype if necessary to use then use it as last row of the table. It does not support the CHECK CONSTRAINT. Q. What Are The Unsupported Repository Objects For A Mapplet? Ans: COBOL source definition Joiner transformations Normalizer transformations Non reusable sequence generator transformations. Pre or post session stored procedures Target definitions Power mart 3.5 style Look Up functions XML source definitions IBM MQ source defintions. Q. What Is Data Merging, Data Cleansing, Sampling? Ans: Cleansing: TO identify and remove the redundancy and inconsistency. sampling: just sample the data through send the data from source to target. Data merging: It is a process of combining the data with similar structures in to a single output. Data Cleansing: It is a process of identifying and rectifying the inconsistent and inaccurate data into consistent and accurate data. Data Sampling: It is the process of sample by sending the data from source to target. Q. How The Informatica Server Sorts The String Values In Ranktransformation? Ans: When Informatica Server runs in UNICODE data movement mode ,then it uses the sort order configured in session properties. We can run informatica server either in UNICODE data moment mode or ASCII data moment mode. Unicode mode: In this mode informatica server sorts the data as per the sorted order in session. ASCII Mode: In this mode informatica server sorts the date as per the binary order. Q. Difference Between Static And Dynamic Cache And Explain With One Example? Ans: Static - Once the data is cached , it will not change, example unconnected lookup uses static cache. Dynamic - The cache is updated as to reflect the update in the table (or source) for which it is reffering to.(ex. connected lookup). Q. How To Join Two Tables Without Using The Joiner Transformation? Ans: Its possible to join the two or more tables by using source qualifier. But provided the tables should have relationship. When you drag and drop the tables you will getting the source qualifier for each table. Delete all the source qualifiers. Add a common source qualifier for all. Right click on the source qualifier you will find EDIT click on it. Click on the properties tab, you will find sql query in that you can write your sqls. You can also do it using Session --- mapping---source there you have an option called User Defined Join there you can write your SQL. Q. What Is Difference Between Informatica 7.1 And Abinitio? Ans: There is a lot of difference between informatica and AbInitio: In AbInitio we are using 3 parallelism but Informatica using 1 parallelism. 
Ab Initio has no built-in scheduling option; we schedule manually or through a script, whereas Informatica contains 4 scheduling options.
Ab Initio includes the Co>Operating System, whereas Informatica does not.
Ramp-up time is much quicker in Ab Initio compared to Informatica.
Ab Initio is more user friendly than Informatica.
Q. What Is Meant By Direct And Indirect Loading Options In Sessions?
Ans: Direct loading is used for a single source file, whereas indirect loading is used for multiple source files (a file list). With direct loading we can perform the recovery process, but with indirect loading we cannot.
Q. When We Create A Target As A Flat File And The Source As Oracle, How Can I Specify The First Row As Column Names In The Flat File?
Ans: Use a pre-SQL statement, but this is a hard-coding method: if you change the column names or put extra columns in the flat file, you will have to change the insert statement. You can also achieve this by changing the setting in the Informatica Repository Manager to display the column headings. The only disadvantage of this is that it will be applied to all the files generated by this server.
Contact for more on Informatica Online Training.
MSBI Interview Questions
SSIS – SQL Server Integration Services
Q: What is SSIS? How is it related to SQL Server?
Ans: SQL Server Integration Services (SSIS) is a component of SQL Server which can be used to perform a wide range of data migration and ETL operations. SSIS is a component of the MSBI suite in SQL Server. It is a platform for integration and workflow applications, known for its fast and flexible OLTP and OLAP extensions used for data extraction, transformation, and loading (ETL). The tool may also be used to automate maintenance of SQL Server databases and multidimensional data sets.
Q.What are the tools associated with SSIS?
Ans: We use Business Intelligence Development Studio (BIDS) and SQL Server Management Studio (SSMS) to work with the development of SSIS projects. We use SSMS to manage SSIS packages and projects.
Q.What are the differences between DTS and SSIS?
Ans:
Data Transformation Services (DTS): limited error handling; message boxes in ActiveX scripts; no deployment wizard; a limited set of transformations; no BI functionality.
SQL Server Integration Services (SSIS): complex and powerful error handling; message boxes in .NET scripting; an interactive deployment wizard; a good number of transformations; complete BI integration.
Q.What is a workflow in SSIS 2014?
Ans: A workflow is a set of instructions that tells the program executor how to execute the tasks and containers within SSIS packages.
Q.What is the control flow?
Ans: A control flow consists of one or more tasks and containers that execute when the package runs. To control the order or define the conditions for running the next task or container in the package control flow, we use precedence constraints to connect the tasks and containers in a package. A subset of tasks and containers can also be grouped and run repeatedly as a unit within the package control flow. SQL Server Integration Services (SSIS) provides three different types of control flow elements: containers that provide structure in packages, tasks that provide functionality, and precedence constraints that connect the executables, containers, and tasks into an ordered control flow.
Q.What is the data flow?
Ans: A data flow consists of the sources and destinations that extract and load data, the transformations that modify and extend data, and the paths that link sources, transformations, and destinations. The Data Flow task is the executable within the SSIS package that creates, orders, and runs the data flow. A separate instance of the data flow engine is opened for each Data Flow task in a package. Data sources, transformations, and data destinations are the three important categories in the data flow.
Q.How does error handling work in SSIS?
Ans: When a data flow component applies a transformation to column data, extracts data from sources, or loads data into destinations, errors can occur. Errors frequently occur because of unexpected data values. The typical types of errors in SSIS are:
Data connection errors, which occur in case the connection manager cannot be initialized with the connection string. This applies to both data sources and data destinations, along with control flows that use connection strings.
Data transformation errors, which occur while data is being transformed over a data pipeline from source to destination.
Expression evaluation errors, which occur if expressions that are evaluated at run time perform invalid operations or become syntactically incorrect.
Q.What is an environment variable in SSIS?
Ans: An environment variable configuration sets a package property equal to the value in an environment variable.
Environment variable configurations are useful for configuring properties that depend on the computer that is executing the package.
Q.What are the transformations available in SSIS?
Ans:
AGGREGATE – Applies aggregate functions to record sets to produce new output records from aggregated values.
AUDIT – Adds package- and task-level metadata such as machine name, execution instance, package name, package ID, etc.
CHARACTER MAP – Performs SQL Server column-level string operations, such as changing data from lower case to upper case.
CONDITIONAL SPLIT – Separates the available input into separate output pipelines based on Boolean expressions configured for each output.
COPY COLUMN – Adds a copy of a column to the output; we can later transform the copy, keeping the original for auditing.
DATA CONVERSION – Converts column data types from one type to another. It stands for explicit column conversion.
DATA MINING QUERY – Used to perform data mining queries against Analysis Services and manage predictions, graphs and controls.
DERIVED COLUMN – Creates a new (computed) column from given expressions.
EXPORT COLUMN – Used to export an image-specific column from the database to a flat file.
FUZZY GROUPING – Used for data cleansing by finding rows that are likely duplicates.
FUZZY LOOKUP – Used for pattern matching and ranking based on fuzzy logic.
IMPORT COLUMN – Reads an image-specific column from the database onto a flat file.
LOOKUP – Performs the lookup (searching) of a given reference object set against a data source. It is used for exact matches only.
MERGE – Merges two sorted data sets into a single data set in a single data flow.
MERGE JOIN – Merges two data sets into a single data set using a join condition.
MULTICAST – Sends a copy of the supplied data source onto multiple destinations.
ROW COUNT – Stores the resulting row count from the data flow / transformation into a variable.
ROW SAMPLING – Captures sample data by using a row count of the total rows in the data flow, specified by rows or percentage.
UNION ALL – Merges multiple data sets into a single data set.
PIVOT – Used for normalization of data sources to reduce anomalies by converting rows into columns.
UNPIVOT – Used for de-normalizing the data structure by converting columns into rows, for example when building data warehouses.
Q.How to log SSIS executions?
Ans: SSIS includes logging features that write log entries when run-time events occur and can also write custom messages. This is not enabled by default. Integration Services supports a diverse set of log providers and gives you the ability to create custom log providers. The Integration Services log providers can write log entries to text files, SQL Server Profiler, SQL Server, the Windows Event Log, or XML files. Logs are associated with packages and are configured at the package level. Each task or container in a package can log information to any package log. The tasks and containers in a package can be enabled for logging even if the package itself is not.
Q.How do you deploy SSIS packages?
Ans: An SSIS project BUILD provides a deployment manifest file. We need to run the manifest file and decide whether to deploy onto the file system or onto SQL Server. SQL Server deployment is faster and more secure than file system deployment. Alternatively, we can also import the package from SSMS, from the file system or SQL Server.
Q.What are variables and what is variable scope?
Ans: Variables store values that an SSIS package and its containers, tasks, and event handlers can use at run time. The scripts in the Script task and the Script component can also use variables.
The precedence constraints that sequence tasks and containers into a workflow can use variables when their constraint definitions include expressions. Integration Services supports two types of variables: user-defined variables and system variables. User-defined variables are defined by package developers, and system variables are defined by Integration Services. You can create as many user-defined variables as a package requires, but you cannot create additional system variables.
Q.Can you name five of the Perfmon counters for SSIS and the value they provide?
Ans:
SQLServer:SSIS Service – SSIS Package Instances.
SQLServer:SSIS Pipeline – BLOB bytes read, BLOB bytes written, BLOB files in use, Buffer memory, Buffers in use, Buffers spooled, Flat buffer memory, Flat buffers in use, Private buffer memory, Private buffers in use, Rows read, Rows written.
SSAS – SQL Server Analysis Services
Q.What is Analysis Services? List out the features.
Ans: Microsoft SQL Server 2014 Analysis Services (SSAS) delivers online analytical processing (OLAP) and data mining functionality for business intelligence applications. Analysis Services supports OLAP by letting us design, create, and manage multidimensional structures that contain data aggregated from other data sources, such as relational databases. For data mining applications, Analysis Services lets us design, create, and visualize data mining models that are constructed from other data sources by using a wide variety of industry-standard data mining algorithms.
Analysis Services is a middle-tier server for analytical processing, OLAP, and data mining. It manages multidimensional cubes of data and provides access to heaps of information, including aggregation of data. One can create data mining models from data sources and use them for business intelligence, including reporting features. Analysis Services provides a combined view of the data used in OLAP or data mining. Services here refer to OLAP and data mining. Analysis Services assists in creating, designing and managing multidimensional structures containing data from varied sources. It provides a wide array of data mining algorithms for specific trends and needs.
Some of the key features are:
Ease of use with a lot of wizards and designers.
Flexible data model creation and management.
Scalable architecture to handle OLAP.
Integration of administration tools, data sources, security, caching, reporting, etc.
Extensive support for custom applications.
Q.What is UDM? What is its significance in SSAS?
Ans: The role of a Unified Dimensional Model (UDM) is to provide a bridge between the user and the data sources. A UDM is constructed over one or more physical data sources, and then the end user issues queries against the UDM using one of a variety of client tools, such as Microsoft Excel. At a minimum, when the UDM is constructed merely as a thin layer over the data source, the advantages to the end user are a simpler, more readily understood model of the data, isolation from heterogeneous back-end data sources, and improved performance for summary-type queries. In some scenarios a simple UDM like this is constructed totally automatically. With greater investment in the construction of the UDM, additional benefits accrue from the richness of metadata that the model can provide. The UDM provides the following benefits:
Allows the user model to be greatly enriched.
Provides high-performance queries supporting interactive analysis, even over huge data volumes.
Allows business rules to be captured in the model to support richer analysis.
Q.What is the need for the SSAS component?
Ans: Analysis Services is the only component in SQL Server with which we can perform analysis and forecast operations. SSAS is very easy to use and interactive. It provides faster analysis and troubleshooting, the ability to create and manage data warehouses, and the ability to apply efficient security principles.
Q.Explain the two-tier architecture of SSAS.
Ans: SSAS uses both server and client components to supply OLAP and data mining functionality for BI applications. The server component is implemented as a Microsoft Windows service; each instance of Analysis Services is implemented as a separate instance of the Windows service. Clients communicate with Analysis Services using the standard XMLA (XML for Analysis) protocol for issuing commands and receiving responses, exposed as a web service.
Q.What are the components of SSAS?
Ans: An OLAP engine is used for enabling fast ad hoc queries by end users. A user can interactively explore data by drilling, slicing or pivoting. Drilling refers to the process of exploring details of the data. Slicing refers to the process of placing data in rows and columns. Pivoting refers to switching categories of data between rows and columns. In OLAP, we will be using what are called dimensional databases.
Q.What is FASMI?
Ans: A database is called an OLAP database if it satisfies the FASMI rules:
Fast – the system delivers most responses to users within about five seconds.
Analysis – the system can cope with the business logic and statistical analysis relevant to the application.
Shared – the system must support access to data by many users, with appropriate handling of sensitivity and write-backs.
Multidimensional – the data inside the OLAP database must be multidimensional in structure.
Information – the OLAP database must support large volumes of data.
Q.What languages are used in SSAS?
Ans:
Structured Query Language (SQL).
Multidimensional Expressions (MDX) – an industry-standard query language oriented towards analysis.
Data Mining Extensions (DMX) – an industry-standard query language oriented towards data mining.
Analysis Services Scripting Language (ASSL) – used to manage Analysis Services database objects.
Q.How are cubes implemented in SSAS?
Ans: Cubes are multidimensional models that store data from one or more sources. Cubes can also store aggregations. SSAS cubes are created using the Cube Wizard. We also build dimensions when creating cubes. Cubes see only the DSV (logical view).
Q.While creating a new calculated member in a cube, what is the use of the property called non-empty behavior?
Ans: Non-empty behavior is an important property for ratio calculations. If the denominator is empty, an MDX expression will return an error just as it would if the denominator were equal to zero. By selecting one or more measures for the NonEmpty Behavior property, we are establishing a requirement that each selected measure first be evaluated before the calculation expression is evaluated. If each selected measure is empty, then the expression is also treated as empty and no error is returned.
Q.What is a RAGGED hierarchy?
Ans: Under normal circumstances, each level in a hierarchy in Microsoft SQL Server Analysis Services (SSAS) has the same number of members above it as any other member at the same level. In a ragged hierarchy, the logical parent member of at least one member is not in the level immediately above the member. When this occurs, the hierarchy descends to different levels for different drill-down paths.
Expanding through every level for every drill-down path is then unnecessarily complicated.
Q.What are the roles of an Analysis Services information worker?
Ans: The role of an Analysis Services information worker is the traditional "domain expert" role in business intelligence (BI) – someone who understands the data employed by a solution and is able to translate the data into business information. The role of an Analysis Services information worker often has one of the following job titles: Business Analyst (Report Consumer), Manager (Report Consumer), Technical Trainer, Help Desk/Operations, or Network Administrator.
Q.What are the different ways of creating aggregations?
Ans: We can create aggregations for faster MDX statements using the Aggregation Wizard or through UBO – Usage-Based Optimization. Always prefer the UBO method in real-time performance troubleshooting.
Q.What is WriteBack? What are the preconditions?
Ans: The Enable/Disable Writeback dialog box enables or disables writeback for a measure group in a cube. Enabling writeback on a measure group defines a writeback partition and creates a writeback table for that measure group. Disabling writeback on a measure group removes the writeback partition but does not delete the writeback table, to avoid unanticipated data loss.
Q.What is processing?
Ans: Processing is a critical and resource-intensive operation in the data warehouse lifecycle and needs to be carefully optimized and executed. Analysis Services offers a high-performance and scalable processing architecture with a comprehensive set of controls for database administrators. We can process an OLAP database, an individual cube, a dimension or a specific partition in a cube.
Q.Name a few Business Analysis Enhancements for SSAS.
Ans: The following list shows the business intelligence enhancements that are available in Microsoft SQL Server Analysis Services (SSAS), the type of each enhancement, the cube or dimension to which it applies, and whether it can be applied to an object that was created without using a data source and for which no schema has been generated (Enhancement / Type / Applied to / No data source):
Time Intelligence / Cube / Cube / No
Account Intelligence / Dimension / Dimension or cube / No
Dimension Intelligence / Dimension / Dimension or cube / Yes
Custom Aggregation / Dimension / Dimension (unary operator) or cube / No
Semi-additive Behavior / Cube / Cube / Yes
Custom Member Formula / Dimension / Dimension or cube / No
Custom Sorting and Uniqueness Settings / Dimension / Dimension or cube / Yes
Dimension Writeback / Dimension / Dimension or cube / Yes
Q.What MDX functions do you most commonly use?
Ans: This is a great question because you only know this answer by experience. If you ask me this question, the answer practically rushes out of me: "CrossJoin, Descendants, and NonEmpty, in addition to Sum, Count, and Aggregate. My personal favorite is CrossJoin because it allows me to identify non-contiguous slices of the cube and aggregate even though those cube cells don't roll up to a natural ancestor." Indeed, CrossJoin has easily been my bread and butter.
Q.Where do you put calculated members?
Ans: The reflexive answer is "in the Measures dimension", but this is the obvious answer. So I always follow up with another question: "If you want to create a calculated member that intersects all measures, where do you put it?" A high percentage of candidates can't answer this question, and the answer is "In a dimension other than Measures." If they can answer it, I immediately ask them why.
The answer is "Because a member in a dimension cannot intersect its own relatives in that dimension."
Q.How do I find the bottom 10 customers with the lowest sales in 2003 that were not null? Ans: Simply using BottomCount will return customers with null sales. You will have to combine it with NONEMPTY or FILTER. For example, against the Adventure Works sample cube (adjust the member names to your own cube):
SELECT { [Measures].[Internet Sales Amount] } ON COLUMNS,
BOTTOMCOUNT(
NONEMPTY( DESCENDANTS( [Customer].[Customer Geography].[All Customers], [Customer].[Customer Geography].[Customer] ),
( [Measures].[Internet Sales Amount] ) ),
10,
( [Measures].[Internet Sales Amount] ) ) ON ROWS
FROM [Adventure Works]
WHERE ( [Date].[Calendar Year].&[2003] );
Q.How in an MDX query can I get the top 3 sales years based on order quantity? Ans: By default Analysis Services returns members in an order specified during attribute design. Attribute properties that define ordering are "OrderBy" and "OrderByAttribute". Let's say we want to see order counts for each year. In Adventure Works the MDX query would be:
SELECT {[Measures].[Order Quantity]} ON 0,
[Date].[Calendar Year].[Calendar Year].Members ON 1
FROM [Adventure Works];
Same query using TopCount:
SELECT {[Measures].[Order Quantity]} ON 0,
TopCount([Date].[Calendar Year].[Calendar Year].Members, 3, [Measures].[Order Quantity]) ON 1
FROM [Adventure Works];
Q.How do you extract the first tuple from a set? Ans: You could use the function Set.Item(0). Example:
SELECT {{[Date].[Calendar Year].[Calendar Year].Members}.Item(0)} ON 0
FROM [Adventure Works];
Q.How can I set up a default dimension member in the calculation script? Ans: You can use the ALTER CUBE statement. Syntax:
ALTER CUBE CurrentCube | YourCubeName UPDATE DIMENSION <dimension name>, DEFAULT_MEMBER='<default member>';
SSRS – SQL Server Reporting Services
Q.What is SSRS? Ans: SQL Server Reporting Services is a server-based report generation software system developed by Microsoft. It is used for preparing and delivering a variety of interactive and printed reports. It is administered through a web-based interface. Reporting Services utilizes a web service interface to support the development of customized reporting applications. It competes with Crystal Reports and other business intelligence tools.
Q.Explain SSRS Architecture? Ans: The Reporting Services architecture comprises integrated components. It is multi-tiered, including application, server and data layers. This architecture is scalable and modular. A single installation can be used across multiple computers. It includes the following components: Report Manager, Report Designer, browser types supported by Reporting Services, Report Server, Report Server command line utilities, the Report Server database, Reporting Services extensibility, and the data sources supported by Reporting Services.
Q.Explain the Reporting Life Cycle? Ans: The reporting lifecycle includes: Report designing – the designing is done in the Visual Studio Report Designer. It generates a class which embodies the Report Definition. Report processing – the processing includes binding the report definition with data from the report data source. It performs all grouping, sorting and filtering calculations. The expressions are evaluated, except for the page header, footer and section items. Later it fires the Binding event and Bound event. As a result of the processing, it produces a Report Instance. The report instance may be persisted and stored so that it can be rendered at a later point in time. Report rendering – report rendering starts by passing the Report Instance to a specific rendering extension (HTML or PDF formats). The report instance is paged if paging is supported by the output format. The expressions of items are evaluated in the page header and footer sections for every page. As a final step, the report is rendered to the specific output document.
Q.How to fine-tune reports? Ans: To tune up Reporting Services, follow the ways mentioned below: Expand the server or utilize the reporting services of another database server.
For better embedding of report contents, the report application's logic and characteristics can keep a duplicate copy of the data, replicating the data continuously. Using NOLOCK, locking issues can be resolved and query performance improved; this amounts to using dirty reads when an up-to-date duplicate of the data is unavailable.
Q.What are Data Driven Subscriptions? Ans: Reporting Services provides data-driven subscriptions so that you can customize the distribution of a report based on dynamic subscriber data. Data-driven subscriptions are intended for the following kinds of scenarios: Distributing reports to a large recipient pool whose membership may change from one distribution to the next, for example distributing a monthly report to all current customers. Distributing reports to a specific group of recipients based on predefined criteria, for example sending a sales performance report to the top ten sales managers in an organization.
Q.Difference between a Logical Page and a Physical Page in SSRS. Ans: Logical page breaks are page breaks that you insert before or after report items or groups. Page breaks help to determine how the content is fitted to a report page for optimal viewing when rendering or exporting the report. The following rules apply when rendering logical page breaks: Logical page breaks are ignored for report items that are constantly hidden and for report items where the visibility is controlled by clicking another report item. Logical page breaks are applied on conditionally visible items if they are currently visible at the time the report is rendered. Space is preserved between the report item with the logical page break and its peer report items. Logical page breaks that are inserted before a report item push the report item down to the next page. The report item is rendered at the top of the next page. Logical page breaks defined on items in table or matrix cells are not kept. This does not apply to items in lists.
Q.When to use a Null data-driven subscription? Ans: Create a data-driven subscription that uses the Null Delivery Provider. When you specify the Null Delivery Provider as the method of delivery in the subscription, the report server targets the report server database as the delivery destination and uses a specialized rendering extension called the null rendering extension. In contrast with other delivery extensions, the Null Delivery Provider does not have delivery settings that you can configure through a subscription definition.
Q.How does the Report Manager work in SSRS? Ans: Report Manager is a web application. In SSRS it is accessed by a URL. The interface of the Report Manager depends on the permissions of the user. This means that to access any functionality or perform any task, the user must be assigned a role. A user with a role of full permissions can access all the features and menus of the report. To configure the Report Manager, a URL needs to be defined.
Q.What are the Reporting Services components? Ans: Reporting Services components assist in development. These processing components include some tools that are used to create, manage and view reports. A report designer is used to create the reports, a report server is used to execute and distribute reports, and a report manager is used to manage the report server.
Q.SQL Server Reporting Services vs Crystal Reports. Ans: Crystal Reports are processed by IIS while SSRS has a report server. Caching in Crystal Reports is available through a cache server.
On the other hand, caching in SSRS is available for Report history snapshots. Crystal reports have standards and user defined field labels. SSRS allows only user defined field labels. Q.What is Report Builder? Ans: Report Builder is a businessuser, adhoc report design client that allows users to design reports based on the business terms (Report Builder model) they are familiar with, but without needing to understand database schemas or how to write SQL or MDX queries. Report Builder works with both SQL Server and Analysis Services data sources. Q.How does Report Builder support Analysis Services cubes? Ans: Report Builder supports relational SQL and Analysis Services data sources in SQL Server. To create a model for Analysis Services cube, go to Report Manager or Management Studio, create a data source for your Analysis Services database, and then select the Generate Model option to create the model. Q.How do users use Report Builder with SQL Server data sources? Ans: While models that provide access to SQL Server Analysis Services are automatically generated on the report server, the Report Builder Model Designer can be used to generate or modify the models that are built on top of SQL Server relational databases. These modelbuilding projects are a new type of project within a Visual Studio–based development shell. Q.How do I get Report Builder to generate a parameter that can be set by users viewing the report? Ans: In the filter dialog box, click the name of the criteria that you would like to prompt the user for when viewing the report. For example, for the criteria Order Year=2000, click Order Year. Select the Prompt option in the dropdown list. Q.What new data source types were added in SSRS 2014? Ans: In addition to the data source types available in SSRS (SQL Server, Oracle, ODBC, OLE DB), the following have been added in SSRS 2012: SQL Server Analysis Services SQL Server Integration Services SQL Server Report Builder Models XML (through URL and Web services) Q.How can I add Reporting Services reports to my application? Ans: Visual Studio / SSDT / BI Data Tools (Standard and Enterprise editions) contains a set of freely redistributable Report Viewer controls that make it easy to embed Reporting Services functionality into custom applications. Two versions of the Report Viewer exist, one for rich Windows client applications and one for ASP.NET applications. Q.Do I need a report server to run reports in my application? Ans: In addition to publishing reports to a report server, you can build reports using the Report Designer that is directly integrated with Visual Studio language projects. You can embed reports directly in any Windows Forms or ASP.NET Web application without access to a report server. The data access in embedded reports is a natural extension of the Visual Studio data facilities. Not only can you use traditional databases as a source of data for your reports, you can use object collections as well. Q.Can you import Microsoft Excel data to SSRS? Ans: Reporting Services does not import data. It only queries data in whatever format it is stored in their native storage system. I will assume that you're asking whether you can create reports and use Excel spreadsheets as data sources. The answer is Yes, Reporting Services supports a wide variety of data sources, including Excel files. 
You'll get the best performance with the builtin native .NET providers but you should be able to connect to any ODBC or OLEDB data source, whether it comes from Microsoft or a thirdparty company. Q.Can we deploy SSRS reports on our personal website? Ans: Your reports can only be deployed on a reporting services site. Your only option for viewing them from other sites is an HTTP link. Some tools, like SharePoint offer controls allowing you to view reports in the context of the other websites, but the report is still deployed to and hosted from reporting services. Q.Can we use datagrids for our report in SSRS? Ans: We have an ASP.NET project that populates a datagrid. Using datagrid as my datasource for my report using SQL Server Reporting Services. Is this possible? The simple answer is no. However, nothing's ever simple. A set of reporting controls was added in Visual Studio 2010 allowing you to report in a dataset, on data that was supplied by you. So, if you retrieved your data into a dataset, bound the datagrid to the dataset so it had data to display, you could then use that dataset as the datasource for the reporting controls. These are then clientside reports, not server reports though. Q.What are the drawbacks of reporting in SSRS? Ans: For many years, Microsoft had no direct solution for reporting with the SQL Server besides Crystal Reports. Now, they have SQL Server Reporting Services, but it does have several drawbacks. It is still complex to understand the complete functionality and structure of this new component, and many users are still relying on the reporting application they are more familiar with, which is Crystal Reports. Also, components in SSRS like Report Builder and Report Designer are meant for different users for different aspects of the report process, yet complete understanding and exposure to both is important to utilize both functions fully and extensively. There are also issues when exporting very large reports to Microsoft Excel, as it can lead to a loss of data. Q.Will running SSRS on Windows XP limit the number of users? Ans: Yes, but not because of SSRS. The Internet Information Services (IIS) component of Windows XP only allows a small number of users to connect to the website at once. As SSRS runs via IIS, this would prevent more than a few people from using SSRS at once. Also, the only edition of SSRS that will install on Windows XP is the Developer Edition. This edition can not be used for production use. You need Standard or Enterprise Edition for production use, which requires a Server OS to install on (Windows 2003 Standard, Windows 2008 Standard, etc). Q.Are there issues when exporting SSRS reports into Microsoft Excel? Ans: When my users are trying to export a SSRS report into Microsoft Excel, one or two columns in the report appear to merge together. Why might this be? Exporting from SSRS is not always perfect, even if you stay within the Microsoft range of products. If you have extra resources, you could splurge for an addon that offers much better control over exporting to Excel, such as OfficeWriter. From my experience, though, it is usually headers or footers that cause exporting issues. If any of these headers or footers overlap with data columns in your report, you will find that the exported version of the report has merged cells. Also, check columns next to each other to make sure that there is no overlap, as well. Q.How to send a SSRS report from SSIS? 
Ans: Often there is a requirement to be able to send an SSRS report in Excel, PDF or another format to different users from an SSIS package once it has finished performing a data load. In order to do this, you first need to create a subscription to the report. You can create an SSRS report subscription from Report Manager. In the report subscription you can specify the report format and the email address of the recipient. When you create a schedule for the SSRS report, a SQL Server Agent job will be created. From SSIS, by using sp_start_job and passing the relevant job name, you can execute the SSRS report subscription.
contact for more information on MSBI Online Training
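A minimal T-SQL sketch of that last step, as it might be issued from an Execute SQL Task in the SSIS package; the job name below is a placeholder (subscription jobs are typically named with a GUID that you can look up in msdb):

-- Start the SQL Server Agent job that backs the report subscription
EXEC msdb.dbo.sp_start_job @job_name = N'ReportSubscriptionJobName';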
MySQL DBA Interview Questions
Q. What Is Mysql? Ans: MySQL is a multithreaded, multi-user SQL database management system which has more than 11 million installations. This is the world's second most popular and widely used open source database. Q. In Which Language Mysql Is Written? Ans: MySQL is written in C and C++ and its SQL parser is written in yacc. Q. What Are The Technical Specification Of Mysql? Ans: MySQL has the following technical specifications - Flexible structure High performance Manageable and easy to use Replication and high availability Security and storage management Q. What Is The Difference Between Mysql And SQL? Ans: SQL is known as a standard query language. It is used to interact with the database like MySQL. MySQL is a database that stores various types of data and keeps it safe. A PHP script is required to store and retrieve the values inside the database. Q. What Is The Difference Between Database And Table? Ans: There is a major difference between a database and a table. The differences are as follows: Tables are a way to represent the division of data in a database while, database is a collection of tables and data. Tables are used to group the data in relation with each other and create a dataset. This dataset will be used in the database. The data which are stored in the table in any form is a part of the database, but the reverse is not true. Q. Why Do We Use Mysql Database Server? Ans: The MySQL database server is very fast, reliable and easy to use. You can easily use and modify the software. MySQL software can be downloaded free of cost from the internet. Q. What Are The Different Tables Present In Mysql? Ans: There are many tables that remain present by default. But, MyISAM is the default database engine used in MySQL. There are five types of tables that are present: MyISAM Heap Merge INNO DB ISAM Q. What Is The Difference Between Char And Varchar? Ans: A list of differences between CHAR and VARCHAR: CHAR and VARCHAR types are different in storage and retrieval. CHAR column length is fixed to the length that is declared while creating table. The length value ranges from 1 and 255. When CHAR values are stored then they are right padded using spaces to specific length. Trailing spaces are removed when CHAR values are retrieved. Q. What Is The Difference Between Truncate And Delete In Mysql? Ans: The DELETE command is used to delete data from a table. It only deletes the rows of data from the table while, truncate is very dangerous command and should be used carefully because it deletes every row permanently from a table. Q. How Many Triggers Are Possible In Mysql? Ans: There are only six Triggers allowed to use in MySQL database. Before Insert After Insert Before Update After Update Before Delete After Delete Q. What Is Heap Table? Ans: Tables that are present in memory is known as HEAP tables. When you create a heap table in MySQL, you should need to specify the TYPE as HEAP. These tables are commonly known as memory tables. They are used for high speed storage on temporary basis. They do not allow BLOB or TEXT fields. Q. What Is Blob And Text In Mysql? Ans: BLOB is an acronym stands for a binary large object. It is used to hold a variable amount of data. There are four types of BLOB. TINYBLOB BLOB MEDIUMBLOB LONGBLOB The differences among all these are the maximum length of values they can hold. TEXT is a case-insensitive BLOB. TEXT values are non-binary strings (character string). 
They have a character set and values are stored and compared based on the collation of the character set. There are four types of TEXT. TINYTEXT TEXT MEDIUMTEXT LONGTEXT Q. What Is A Trigger In Mysql? Ans: A trigger is a set of codes that executes in response to some events. Q. What Is The Difference Between Heap Table And Temporary Table? Ans: Heap tables: Heap tables are found in memory. They are used for high speed storage on temporary basis. They do not allow BLOB or TEXT fields. Heap tables do not support AUTO_INCREMENT. Indexes should be NOT NULL. Temporary tables: The temporary tables are used to keep the temporary data. Sometimes it is very useful in cases to keep temporary data. Temporary table is deleted after current client session terminates. Main differences: The heap tables are shared among clients while temporary tables are not shared. Heap tables are just another storage engine, while for temporary tables you need a special privilege (create temporary table). Q. What Is The Difference Between Float And Double? Ans: FLOAT stores floating point numbers with accuracy up to 8 places and allocates 4 bytes, on the other hand DOUBLE stores floating point numbers with accuracy up to 18 places and allocates 8 bytes. Q. What Are The Advantages Of Mysql In Comparison To Oracle? Ans: MySQL is a free, fast, reliable, open source relational database while Oracle is expensive, although they have provided Oracle free edition to attract MySQL users. MySQL uses only just under 1 MB of RAM on your laptop while Oracle 9i installation uses 128 MB. MySQL is great for database enabled websites while Oracle is made for enterprises. MySQL is portable. Q. What Are The Disadvantages Of Mysql? Ans: MySQL is not so efficient for large scale databases. It does not support COMMIT and STORED PROCEDURES functions version less than 5.0. Transactions are not handled very efficiently. Q. What Is The Difference Between Mysql_connect And Mysql_pconnect? Ans: Mysql_connect: It opens a new connection to the database. Every time you need to open and close database connection, depending on the request. Opens page every time when it loaded. Mysql_pconnect: In Mysql_pconnect, "p" stands for persistent connection so it opens the persistent connection. the database connection can not be closed. it is more useful if your site has more traffic because there is no need to open and close connection frequently and every time when page is loaded. Q. What Does " I_am_a_dummy Flag" Do In Mysql? Ans: The " i_am_a_dummy flag" enables MySQL engine to refuse any UPDATE or DELETE statement to execute if the WHERE clause is not present. Q. How To Get The Current Date In Mysql? Ans: To get current date, use the following syntax: SELECT CURRENT_DATE(); Q. What Are The Security Alerts While Using Mysql? Ans: Install antivirus and configure the operating system's firewall. Never use the MySQL Server as the UNIX root user. Change root username and password Restrict or disable remote access. Q. How To Change A Password For An Existing User Via Mysqladmin? Ans: Mysqladmin -u root -p password "newpassword". Q. What Is The Difference Between Unix Timestamps And Mysql Timestamps? Ans: Actually both Unix timestamp and MySQL timestamp are stored as 32-bit integers but MySQL timestamp is represented in readable format of YYYY-MM-DD HH:MM:SS format. Q. How To Display Nth Highest Salary From A Table In A Mysql Query? Ans: Let us take a table named employee. To find Nth highest salary is: 1. 
select distinct(salary) from employee order by salary desc limit n-1,1
If you want to find the 3rd largest salary:
select distinct(salary) from employee order by salary desc limit 2,1
Q. What Is Mysql Default Port Number? Ans: The MySQL default port number is 3306.
Q. What Is Regexp? Ans: REGEXP is a pattern match using regular expressions. A regular expression is a powerful way of specifying a pattern for a complex search.
Q. How Many Columns Can You Create For An Index? Ans: You can create a maximum of 16 indexed columns for a standard table.
Q. What Is The Difference Between Now() And Current_date()? Ans: The NOW() command is used to show the current year, month and date with hours, minutes and seconds, while CURRENT_DATE() shows the current year with month and date only.
Q. Which Command Is Used To View The Content Of The Table In Mysql? Ans: The SELECT command is used to view the content of a table in MySQL.
Q. What Is The Usage Of I-am-a-dummy Flag In Mysql? Ans: In MySQL, the i-am-a-dummy flag makes the MySQL engine deny UPDATE and DELETE commands unless the WHERE clause is present.
Q. What Is The Usage Of Regular Expressions In Mysql? Ans: In MySQL, regular expressions are used in queries for searching a pattern in a string. * matches 0 or more instances of the string preceding it. + matches 1 or more instances of the string preceding it. ? matches 0 or 1 instances of the string preceding it. . matches a single character. [abz] matches a, b or z. | separates strings. ^ anchors the match at the start. "." can be used to match any single character. "|" can be used to match either of two strings. REGEXP can be used to match the input characters with the database. Example: the following statement retrieves all rows where the column employee_name contains the text 1000 (example salary): Select employee_name From employee Where employee_name REGEXP '1000' Order by employee_name
Q. How Do You Determine The Location Of Mysql Data Directory? Ans: The default location of the MySQL data directory on Windows is C:\mysql\data or C:\Program Files\MySQL\MySQL Server 5.0\data.
Q. What Is Mysql Data Directory? Ans: The MySQL data directory is the place where MySQL stores its data. Each subdirectory under this data directory represents a MySQL database. By default, the information managed by the MySQL server (mysqld) is stored in the data directory.
Q. What Is The Use Of Mysql_close()? Ans: Mysql_close() cannot be used to close a persistent connection, though it can be used to close a connection opened by mysql_connect().
Q. How Is Myisam Table Stored? Ans: A MyISAM table is stored on disk in three files: the '.frm' file, storing the table definition; the '.MYD' (MYData) data file; and the '.MYI' (MYIndex) index file.
Q. What Is The Usage Of Enums In Mysql? Ans: ENUMs are used to limit the possible values that go in the table. For example: CREATE TABLE months (month ENUM('January', 'February', 'March')); INSERT INTO months VALUES ('April');
Q. What Are The Advantages Of Myisam Over Innodb? Ans: MyISAM follows a conservative approach to disk space management and stores each MyISAM table in a separate file, which can be further compressed, if required. On the other hand, InnoDB stores the tables in a tablespace. Its further optimization is difficult.
Q. What Are The Differences Between Mysql_fetch_array(), Mysql_fetch_object(), Mysql_fetch_row()? Ans: Mysql_fetch_object is used to retrieve the result from the database as an object while mysql_fetch_array returns the result as an array. This will allow access to the data by the field names.
For example: using mysql_fetch_object, a field can be accessed as $result->name. Using mysql_fetch_array, a field can be accessed as $result['name']. Use mysql_fetch_row($result) where $result is the result resource returned from a successful query executed using the mysql_query() function. Example:
$result = mysql_query("SELECT * FROM students");
while ($row = mysql_fetch_row($result)) {
    // some statement
}
Q. How Do You Backup A Database In Mysql? Ans: It is easy to back up data with phpMyAdmin. Select the database you want to back up by clicking the database name in the left-hand navigation bar. Then click the export button and make sure that all the tables you want to back up are highlighted. Then specify the options you want under export and save the output.
Q. What Is Sqlyog? Ans: SQLyog is the most popular GUI tool for MySQL management and administration. It combines the features of MySQL Administrator, phpMyAdmin and other MySQL front ends and MySQL GUI tools.
Contact for more on Mysql DBA Online Training
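To make the six trigger types listed earlier concrete, here is a minimal MySQL sketch of a BEFORE INSERT trigger; the employee table and its created_at column are hypothetical:

DELIMITER //
CREATE TRIGGER before_employee_insert
BEFORE INSERT ON employee
FOR EACH ROW
BEGIN
    -- stamp each new row before it is written (hypothetical audit column)
    SET NEW.created_at = NOW();
END//
DELIMITER ;

The other five trigger types follow the same pattern, swapping BEFORE/AFTER and INSERT/UPDATE/DELETE in the definition.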
MySQL Interview Questions
Q.How do you start and stop MySQL on Windows? Ans: net start MySQL, net stop MySQL Q.How do you start MySQL on Linux? Ans: /etc/init.d/mysql start Q.Explain the difference between mysql and mysqli interfaces in PHP? Ans: mysqli is the object-oriented version of mysql library functions. Q.What’s the default port for MySQL Server? Ans: 3306 Q.What does tee command do in MySQL? Ans: tee followed by a filename turns on MySQL logging to a specified file. It can be stopped by command notee. Q.Can you save your connection settings to a conf file? Ans: Yes, and name it ~/.my.conf. You might want to change the permissions on the file to 600, so that it’s not readable by others. Q.How do you change a password for an existing user via mysqladmin? Ans: mysqladmin -u root -p password “newpassword” Q.Use mysqldump to create a copy of the database? Ans: mysqldump -h mysqlhost -u username -p mydatabasename > dbdump.sql Q.What are some good ideas regarding user security in MySQL? Ans: There is no user without a password. There is no user without a user name. There is no user whose Host column contains % (which here indicates that the user can log in from anywhere in the network or the Internet). There are as few users as possible (in the ideal case only root) who have unrestricted access. Q.Explain the difference between MyISAM Static and MyISAM Dynamic. Ans: In MyISAM static all the fields have fixed width. The Dynamic MyISAM table would include fields such as TEXT, BLOB, etc. to accommodate the data types with various lengths. MyISAM Static would be easier to restore in case of corruption, since even though you might lose some data, you know exactly where to look for the beginning of the next record. Q.What does myisamchk do? Ans: It compressed the MyISAM tables, which reduces their disk usage. Q.Explain advantages of InnoDB over MyISAM? Ans: Row-level locking, transactions, foreign key constraints and crash recovery. Q.Explain advantages of MyISAM over InnoDB? Ans: Much more conservative approach to disk space management – each MyISAM table is stored in a separate file, which could be compressed then with myisamchk if needed. With InnoDB the tables are stored in tablespace, and not much further optimization is possible. All data except for TEXT and BLOB can occupy 8,000 bytes at most. No full text indexing is available for InnoDB. TRhe COUNT(*)s execute slower than in MyISAM due to tablespace complexity. Q.What are HEAP tables in MySQL? Ans: HEAP tables are in-memory. They are usually used for high-speed temporary storage. No TEXT or BLOB fields are allowed within HEAP tables. You can only use the comparison operators = and . HEAP tables do not support AUTO_INCREMENT. Indexes must be NOT NULL. Q.How do you control the max size of a HEAP table? Ans: MySQL config variable max_heap_table_size. Q.What are CSV tables? Ans: Those are the special tables, data for which is saved into comma-separated values files. They cannot be indexed. Q.Explain federated tables. Ans: Introduced in MySQL 5.0, federated tables allow access to the tables located on other databases on other servers. Q.What is SERIAL data type in MySQL? Ans: BIGINT NOT NULL PRIMARY KEY AUTO_INCREMENT Q.What happens when the column is set to AUTO INCREMENT and you reach the maximum value for that table? Ans: It stops incrementing. It does not overflow to 0 to prevent data losses, but further inserts are going to produce an error, since the key has been used already. Q.Explain the difference between BOOL, TINYINT and BIT. 
Ans: Prior to MySQL 5.0.3 those are all synonyms. After MySQL 5.0.3 the BIT data type can store 8 bytes of data and should be used for binary data.
Q.Explain the difference between FLOAT, DOUBLE and REAL. Ans: FLOATs store floating point numbers with 8-place accuracy and take up 4 bytes. DOUBLEs store floating point numbers with 16-place accuracy and take up 8 bytes. REAL is a synonym of FLOAT for now.
Q.If you specify the data type as DECIMAL(5,2), what's the range of values that can go in this table? Ans: 999.99 to -99.99. Note that with the negative number the minus sign is considered one of the digits.
Q.What happens if a table has one column defined as TIMESTAMP? Ans: That field gets the current timestamp whenever the row gets altered.
Q.But what if you really want to store the timestamp data, such as the publication date of the article? Ans: Create two columns of type TIMESTAMP and use the second one for your real data.
Q.Explain the data type TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP. Ans: The column exhibits the same behavior as a single timestamp column in a table with no other timestamp columns.
Q.What does the TIMESTAMP ON UPDATE CURRENT_TIMESTAMP data type do? Ans: On initialization it places a zero in that column; on future updates it puts in the current value of the timestamp.
Q.Explain TIMESTAMP DEFAULT '2006:09:02 17:38:44' ON UPDATE CURRENT_TIMESTAMP. Ans: The default value is used on initialization; a current timestamp is inserted on update of the row.
Q.If I created a column with data type VARCHAR(3), what would I expect to see in the MySQL table? Ans: CHAR(3), since MySQL automatically adjusted the data type.
contact for more on Mysql Online Training
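A minimal sketch of the two-TIMESTAMP-column approach mentioned above, keeping a "real" publication date alongside an auto-maintained one; the article table and its column names are illustrative only:

CREATE TABLE article (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(200),
    -- the "real" data: you set it yourself and it is not touched on update
    published_at TIMESTAMP DEFAULT '2006-09-02 17:38:44',
    -- auto-maintained: set on insert and refreshed on every update
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);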
Power BI Interview Questions
Q.Explain Various Parts Of Microsoft Self-service Bi Solution? Ans: There are two main parts of Microsoft self-service Business Intelligence solution. They are: Excel BI Toolkit Power BI Define Microsoft Power BI Q.Define Excel Bi Toolkit? Ans: Allowing users to create interactive report by means of importing data from a wide range of sources and model data acc to requirement. Q.Define Power Bi? Ans: A cloud-based data sharing environment, Power BI allows anyone to analyze and visualize data with greater efficiency, speed and understanding. Besides, it helps in connecting users to a wide range of data with the help of easy-to-operate dashboards, compelling visualizations, and interactive reports bringing data to life. Q.Differentiate Between Power Bi And Power Bi Pro? Ans: Power BI offers various kinds of features to help you get started searching data in a complete new way. On the other hand, Power BI Pro caters with some additional features like scheduling data refresh more often than on daily basis, more storage capacity, live data sources along with complete interactivity, and much more. Q.What Is The Cost For Power Bi? Ans: Both Power BI Desktop and Power BI are free of cost. While for Power BI Pro, a user is required to pay $9.99 per month after completing 60-day free trial. Q.What The Term Power Bi Desktop Means? Ans: Can be installed on your computer, Power BI Desktop is a free app that works in cohesion with the Power BI service by offering advanced data exploration, modelling, shaping, and report creation with the use of highly interactive visualizations. Moreover, it allows you to save your all work to a file for publishing your reports and data to Power BI site for sharing with others. Q.What Are The Basics Needed For Using Power Bi? Ans: To use Power BI, all you need is a web browser and a work email address. Please be informed that work email addresses finishing in .mil and .gov are currently not supported. Q.What Is The Need Of Signing Up With Work Email? Ans: Power BI is not supporting email addresses given by telecommunications providers and consumer email services, thus there is a need of signing up with work email. Q.Name The Work Email Addresses That Are Currently Supported? Ans: Work email addresses that are finishing with .org and .edu are currently supported. Q.Which Pricing Is Available For Power Bi From Academic, Government And Non-profit? Ans: Non-profit pricing is currently available for Power BI only when availing it directly from Microsoft. While Academic and Government pricing for Power BI is provided via the EA, MSOP/Direct and Open licensing programs. Besides, government pricing can also be availed in syndication. Q.Is There Are Support Available For Mobile Devices By Power Bi? Ans: Yes, Power BI supports mobile devices. It has some native apps for iOS devices, Android smartphones, and Windows 10 devices. You can download and install Power BI mobile apps from the following app stores: Google Play Apple App Store Windows Store Q.What Data Sources Can Be Connected To For Power Bi? Ans: There is a wide list of data sources you can connect to for Power BI. They are groups as mentioned below: Connectors to databases and some other datasets like Azure SQL Data from Power BI Desktop files and Excel Content packs for reports, services, and datasets. Moreover, along with establishing a data connection, Power BI offers pre-built reports and dashboards for each of these services. Q. Define Content Packs? 
Ans: These are pre-built solutions used for popular services as a major part of the Power BI experience. Q.Various Excel Bi Add Ins? Ans: Power Query to find, edit and load external data Power View for designing interactive and visual reports Power Pivot to analyse data modeling Power Map for displaying insights in 3D Map Q.What Must Be Installed To Use Power Bi? Ans: For using the service of Power BI for free, one just need a web browser and work email. With this, you can explore data as well as create reports in Power Bi Desktop. To get Power BI mobile apps, you can head to their respective stores. Google Pay, App Store and Windows Store. Q.How One Can Get Started With Power Bi? Ans: There are some resources to get assistance and get started with Power BI. They are as follows: Webinars Power BI Blog You can get started with an article on Power BI You can get started with a video on YouTube Last but not the least, joining a related community and get answered. Q.What Is The Procedure For Buying Power Bi Pro? Ans: Power BI licenses can be purchased at www.powerbi.com . Besides, you can also get assistance from a Microsoft partner to aid you with the implementation of Power BI. Q.Is The Power Bi Service Accessible On-premises? Ans: No, you can not avail the service as private or internal cloud service. However, with the use of Power BI and Power BI Desktop, one can connect securely to their own on-premises data sources. Q.Which Language Is Used To Calculate Calculated/column Field In Power Pivot? Ans: DAX (Data Analysis expression) is used for calculating calculated/column field in Power Pivot. Q.What Is Dax? Ans: DAX is a formula language used for computing calculated field and column. For doing basic calculation and data analysis, it is used on data in power pivot. DAX supports column values. It is not capable of inserting or modifying data. It cannot be used to calculate rows, though you can calculate measures and calculated columns. Q.Explain Power Pivot Data Model? Ans: It is a model that is made up of data types, tables, columns, and table relations. These data tables are typically constructed for holding data for a business entity. Q.Define Power Query? Ans: It is an ETL tool used for shaping, cleaning and transforming data with the help of intuitive interfaces without using code. With this, You can import data from various sources like from files to databases. Append and join data from a wide range of sources. You can shape data as needed by adding and removing it. Q.Name The Language That Is Used In Power Query. Ans: M-code which is a new programming language is used in Power Query. This language is easy to use and is quite similar to other languages. Also, it is case sensitive. Q.Name The Data Destinations For Power Query. Ans: Two destinations are there for output we receive from power query. They are: Load to Excel Data Model Load to table in a worksheet Q.Explain Power Bi Designer? Ans: Power BI Designer is a standalone app that can be used for making Power BI reports and then uploading it to Powerbi.com. Moreover, it does not need excel. All in all, it is a combination of Power Pivot, Power View and Power Query. Q.Is There Any Process To Refresh Power Bi Reports One Uploaded To Cloud? Ans: Yes, Power Bi reports can be refreshed using Data Management Gateway and Power BI Personal Gateway. Q.Explain The Difference Between Power Bi Personal Gateway And Data Management Gateway? Ans: Power BI Personal Gateway is used for reports that are deployed in Powerbi.com. 
On the other hand, Data management gateway is an app installed in source data machines in order to help reports to be deployed on SharePoint and can be scheduled to automatic refresh. Q.Name All The Platforms For Which Power Bi App Is Available? Ans: Power BI app is available for: Android iPhone and iPad Windows tablets and Windows Desktops Coming for Windows phone soon Q.Differentiate Between Older And Newer Power Bi? Ans: There are new designing tool used in newer Power BI known as Power BI Desktop. It is a standalone designer which include Power Pivot, Power View and Power Query in back end. On the other hand, Older Power Bi has add-ins for excel. In newer Power Bi, there are more graphs available such as line area chart, combo chart, tree map, water fall, etc. Q.Is It Possible To Have Over One Active Relationship Between Two Tables In Power Pivot Data Model? Ans: No, it is not possible. There can’t be over one active relationship between two tables in power pivot data model. If you want, then it is only possible to have one active relationship and other many inactive. Contact for more on Power BI Online Training
QlikView Interview Questions
Q.Explain the QlikView architecture? Ans: QlikView deployments have three main infrastructure components: 1. QlikView Developer: a Windows-based desktop tool that is used by designers and developers to create a) a data extract and transformation model and b) the graphical user interface (or presentation layer). 2. QlikView Server (QVS): handles the communication between clients and the QlikView applications. It loads QlikView applications into memory and calculates and presents user selections in real time. 3. QlikView Publisher: loads data from different data sources (OLE DB/ODBC, XML, XLS), reduces the QlikView application and distributes it to a QVS. Because QlikView Server and Publisher have different roles and handle CPU and memory differently, it's considered a best practice to separate these two components on different servers.
Q.What is Set Analysis in QlikView? Ans: It is used for sets of grouped data, mostly in aggregation functions like Sum(), for example to compare sales of the current year vs. the last year.
Q.What is a Synthetic key and how do you avoid it in QlikView? Ans: It is undesirable to have multiple common keys across multiple tables in a QlikView data structure. This may cause QlikView to use complex keys (a.k.a. synthetic keys) to generate the connections in the data structure. Synthetic keys are generally resource heavy and may slow down calculations and, in extreme cases, overload an application. They also make a document harder to understand and maintain. There are some cases where synthetic keys cannot be avoided (e.g. Interval Match tables), but, in general, synthetic keys should always be eliminated, if possible. To avoid them: comment the fields in the load script, rename the fields in the load script, or rename the fields using the UNQUALIFY operator.
Q.Difference between Keep and Joins? Ans: Left Keep and Left Join give the same output. The only difference is that Left Keep keeps the tables separate in the data model, whereas Left Join merges the tables that are joined.
Q.Difference between a Straight table and a Pivot table? Ans: Pivot Table – 1) A pivot table is better at grouping: you can easily see which group a specific row belongs to, and a group can have a subtotal. 2) You can also display a pivot table like a cross table (one or several horizontal dimensions). 3) But when you sort a pivot table, you have to sort it first according to the first dimension, then according to the next, etc. You cannot sort it any way you want. Straight Table – a straight table is better at sorting than a pivot table: you can sort it according to any column. But it is not so good at grouping. Subtotals are not possible, for instance.
Q.Which graph would you use for a two-year sales difference? Ans: A bar graph.
Q.What is Incremental Load in QlikView? Ans: As BI apps are expected to deal with larger and larger amounts of data, the amount of time that it takes to retrieve that data becomes a serious issue. This could be due to the sheer volume of data or the need for frequent refreshes. Either way, you do not want to be pulling all of the data all of the time. What you want to be able to do is just pull the data that has changed, append to that the data that you stored away previously and then get back to the business of analyzing. This will reduce load on the source database, the network infrastructure and your QlikView server.
Q.What is an Inline load in QlikView? Ans: It is used to create a table or add fields to a table directly in the load script.
Q.What are Set and Let in QlikView and the difference between them? Ans: A SET or a LET statement is often used to define a variable.
The SET statement is used when you want a variableto hold the string or numeric value that is to the right of the Equal (=) sign. The LET statement is used when you need to evaluate what is to the right of the Equal sign e.g set myVar=5*2 the result is “5*2″ Let myVar=5*2 the result is “10″ Q.Explain QlikView Resident Load? Ans: Create a new logical table in QlikView, based on a previously loaded (resident) table. Q.What is Apply Map (Mapping Tables)? Ans: Sometimes you need to add an extra field to a table to use a combination of fields from different tables, or you want to add a field to clean up the data structure. Qlik- View has an effective way to add single fields to a table called mapping tables. syntax — mapping ( load statement | select statement ) applymap( ‘mapname’, expr, ) Q.What is Dimensions ( What is difference between Expression and Dimension)? Ans: Each data warehouse consists of dimensions and measures. Dimensions allow data analysis from various perspectives. For example, time dimension could show you the breakdown of sales by year, quarter, month, day and hour. Product dimension could help you see which products bring in the most revenue. Supplier dimension could help you choose those business partners who always deliver their goods on time. Customer dimension could help you pick the strategic set of consumers to whom you’d like to extend your very special offers. Q.Explain about Normalized Data? Ans: Well Structured Form of Data, which doesnt have any repetition or redundancy of data. Its a kind of Relational data. Its mainly used in OLTP kind of stuffs Denormalized Data – Its a whole bunch of data without any relationship among themselves, with redundancy of data. Its mainly used in OLAP kind of stuffs. Q.What Is Star Sechma ? Ans: A star schema is the simplest form of dimensional model, in which data is organized into facts and dimensions. A fact is an event that is counted or measured, such as a sale or login. A dimension contains reference information about the fact, such as date, product, or customer. A star schema is diagramed by surrounding each fact table with its associated dimensions table. The output diagram resembles a star. Star Schema Definition : A means of aggregating data based on a set of known dimensions. It stores data multidimensionality in a two dimensional Relational Database Management System (RDBMS), such as Oracle. Q.What is Snowflaking Schema ? Ans: Snowflake Schema: An extension of the star schema by means of applying additional dimensions to the Dimensions of a star schema in a relational environment. Snowflaking is a form of dimensional modeling; dimensions are stored in multiple relational dimension tables. A snowflake schema is a variation of the star schema. Snowflaking is used to improve the performance of specific queries. The schema is diagramed with each fact surrounded by its associated dimensions as in a star schema, and those dimensions are further related to other dimensions, branching out into a snowflake pattern. Q.What is Central Link Table? Ans: In the event of multiple fact tables QlikView In-Memory Technology allows us to create a central link table that only contains the existing data combinations. Instead of Joining the tables the event dimensions can be merged (CONCATENATED) in to one central Link table. This link table can then be linked back to the event measures one side and the dimension tables on the other. Q.What is binary load ? Ans: Binary load is loading data from another QV file. 
For example, you have application A.qvw. You can create another application B.qvw with script binary A.qvw. binary file where: file ::= filename Examples: Binary customer.qvw; Binary c:\qv\customer.qvw; The path is the path to the file, either absolute, or relative to the .qvw file containing this script line. Q.What is Container ? Ans: A container object can be used to hold multiple charts. You can use a container object to put multiple charts in the same box. All charts will appear in the same window but only one chart will appear active at a given time. You can click the chart title to switch or toggle between charts. A word of advice: Use containers with caution. They form linked objects and might affect the properties of all linked objects. Q.What is a synthetic key? Ans: It is a field that contains all possible combinations of common fields among tables. Q.What kind of chart we use in Qlikview? Ans: We generally uses bar chart, line chart, combo chart, scatter chart, grid chart, etc. Q.Explain Set analysis in qlikview ?? Ans: It is used for set of groups. Mostly used in arrgeted function like sum (year)etc. Q.Define Trellis chart? Ans: In Trellis chart we can create array of chart based on first dimension. Bitmap chart are also made of trellis display. Q.Explain Mini Chart?.What do you mean by sub reports and how we can create them? Ans: With the help of Mini Chart we can set type of modes instead of values in table mode. We can also change the colors. Q.What is Pivot Table? Ans: Pivot Table: A pivot table is better at the time of grouping. We can also show pivot table like a cross table which is a beneficial feature. But there is one disadvantage of it which is if we have to sort a pivot table than we have to sort it first according to the first dimension then to the next. Q.Which graph we will use for two years difference sale ? Ans: BAR Graph we will use. Q.What is Straight Table? Ans: A straight table is much better at the time of sorting as compared to the pivot table as we can sort it according to any column as per our choice. But it is not good for grouping purpose. Q.How many dimensions we can use in Bar chart? Ans: We can use only two dimension Q.Which Qlikview object has only expression and no dimension? Ans: Gauge chart and list box have only expression and no dimension. Q.How we can use Macros in our application? Ans: We can use macros for various purposes like for reloading the application and to create a object. Q.What do you understand by layers in Qlikview? Ans: The layer are basically set on the sheet object properties layout where bottom, top, normal respective to the number -1,0 and 1. Q.What is Dimensions? Ans: Dimensions allow data examination from various perspectives. Q.Explain about Normalized Data? Ans: Well Structured Form of Data, which doesnt have any repetition or redundancy of data. Its a kind of Relational data. Its mainly used in OLTP kind of stuffs Denormalized Data – Its a whole bunch of data without any relationship among themselves, with redundancy of data. Its mainly used in OLAP kind of stuffs. Q.What Is Star Sechma ? Ans: The simplest form of dimensional model, in which data is prearranged into facts and dimensions is known as Star schema. Q.What is Snowflaking Schema ? Ans: A snowflake schema is a difference of the star schema. Snowflaking is used to improve the presentation of particular queries. Q.Explain interval match? 
Ans: The internal match is prefixes with the load statement which is used for connecting different numeric values to one or more numeric interval. Q.Explain internal match function()? Ans: Internal match fuction is used to generate data bucket of different sizes. Q.What is Container ? Ans: A container object is used to keep multiple charts. We can use a container object to keep many charts in the same box. Q.What do you understand by extended interval match function()? Ans: Extended interval match function() is used for slowly changing the dimensions. Q.what are the new features in QV 11? Ans: Container Object;Granular Chart Dimension Control; Actions like, clear filed; meta data,etc are the new features in QV 11. Q.Explain joins and its types? Ans: Join is used to convert the given data and whenever we are using joins for converting data its is known ad Data Merging. It has many types: Left join Right join Inner join, etc Q.What is Left Join? Ans: Left join specifies that the join between the two tables should be left join, it uses before the word join. The resulting table only contain the combination among two tables with the full data set from the first table. Q.Define right join? Ans: Right join specifies that the join between the two tables should be right join, it uses before the word join. The resulting table only contain the combination among two tables with the full data set from the second table. Q.Explain Inner Join? Ans: Inner join specifies that the join between the two tables should be inner join. The resulting table should contain the full data set from both the sides. Q.What are modifiers? Ans: Modifiesr deals with the Fields name. For example: sum({$}Sales) Returns the sales for current selection, but with the selection in “Region” is removed. Q.Explain Identifiers Syntax? 0- Represents the empty set 1- Represents the full sets of records $-Represents the record of current selection $1-Represents the previous selection $_1-Represents the next selection Bookmark01-Represents the Bookmark name Q.Explain 3-tier architecture of Qlikview Application? Ans: 1-tier: Raw data is loaded and we create QVD 2-tier: QVD is converted in business login and the requirement of business and data model is created. 3-tier: Reading all QVD from 2-tier and we make a single QVW. Q.How does Qlikview stores the data internally? Ans: Qlikview stores the data in QVD as QVD has data compression capability. Qlikview has better performance than other BI because of its memory analytics approach. Q.Explain the restrictions of Binary load? Ans: Binary Load can be used for only one application means we can only read the data from one QVW application and moreover set scripts is also a restriction. Q.Differentiate betwwen subset Ratio and Information Density. Ans: Subset Ratio: It is used for easily spot problem in key field association.it is only relevant for key fields since they are present in multiple tables and do not share the same value. Information Density: It is the field which contain the percentage of row which contain the non-null value. Q.what is the use of Optimized Load? Ans: Optimized load is much faster and preferable especially for large set of data. It is possible if n o transformation are made at the time of load and no filtering is done. Q.Differentiate between keep and joins? Ans: Keep and joins do the same functions but in keep creates the two tables whereas join only creates the one table. Keep is used before the load or select statements. Q.Define synthetic Key? 
Ans: A synthetic key is the key created when two or more tables have more than one common column between them.
Q.What is incremental load in QlikView? Ans: Incremental load is nothing but loading new or changed records from the database. With the help of QVD files we can use incremental load.
Q.Differentiate between the set and let options in QlikView? Ans: Set: it assigns the variable without evaluating the expression. Let: it assigns the variable after evaluating the expression.
Q.Define QlikView Resident Load. Ans: Resident load is a part of loading data into a QlikView application. It is used for loading data from tables already loaded in the QlikView application.
Q.How can we optimize a QV application? Ans: It can be optimized by creating the data into QVDs. When the complete QVW application is changed into QVD, this QVD will be stored in RAM.
Q.What is mapping load? Ans: Mapping load is used to create a mapping table that can be used for replacing field values and field names.
Q.Define ApplyMap. Ans: ApplyMap is used to add fields to tables with the help of other tables. It can be used like a join.
Q.What is concatenation? Ans: It means a sequence of interconnected things, i.e. any columns or rows which are related to each other can be connected through concatenation.
Q.Define NoConcatenate. Ans: The NoConcatenate prefix is used to force identical tables to be treated as two separate internal tables.
Q.Define the connect statement. Ans: It is used to establish a connection to the database with the help of an ODBC or OLE DB interface.
Q.What do you understand by a Fact Constellation Schema? Ans: It is a logical database structure of a data warehouse. It is designed with the help of de-normalized fact tables.
Q.What do you mean by RDBMS? Ans: It stands for Relational Database Management System. It arranges the data into respective columns and rows.
Q.What do you understand by the term CAL in QlikView? Ans: Every client needs a CAL to get connected with QlikView Server. The CALs are taken up with the QlikView Server and tied to the server serial number.
Q.Differentiate between QV Server and Publisher? Ans: QV Server is a program that is installed on a computer with various CALs which allow users to access QV files on the server. Publisher is a program which provides centralized control over our QV files and manages how and when they are loaded and distributed.
Q.What do you understand by a snapshot view of the table? Ans: With this option we can see the number of tables and related associations.
Q.How can we bring data into QV? Ans: We use data connections such as ODBC, OLE DB and SAP connectors.
Q.How can we handle Early Arriving Facts? Ans: We can load data from ODBC, OLE DB and SAP connectors, by select statements, and we can also load files like Excel, Word, etc. by using the table-files syntax.
Q.What type of data do we generally use? Ans: We use flat files, Excel files, QVDs, etc. as data.
Q.Explain about QlikView? Ans: QlikView is the Business Intelligence tool used by the University of St Andrews. Data from different University systems is combined and presented in a single dashboard in an easy and understandable way.
QlikView dashboards at the University of St Andrews are built on the following principles: Dashboards must be effective to use Dashboards must support users in carrying out their tasks Dashboards must provide the right kind of functionality It must be easy to learn how to use a dashboard It must be easy to remember how to use a dashboard To use QlikView, you do not need to have technical expertise in information systems, just a willingness to learn how it can support you. Q.What are the benefits of using QlikView? Ans: As the name suggests, QlikView is a combination of quick and click and these features make it intuitive and easy to use. Users can visualize data, search multiple data sets, create ad hoc reports, and view patterns and trends in data that may not have been visible in other reports. QlikView is Flexible – dashboards are web based and accessible from desktop computers and mobile devices Interactive – users are able to drill down and select particular data within charts or tables Usable – users can see large amounts of data effectively and efficiently Scalable – useful for multiple business processes at analytical, operational and strategic levels Q.How is QlikView 11 different from QlikView 10? Ans: QlikView 11 brings new levels of capability and manageability to the QlikView Business Discovery platform. In this release, we focused our investments on five value propositions: Improve collaborative decision making with Social Business Discovery Gain new insights into opportunities and threats and relative business performance with comparative analysis Expand QlikView usage to additional devices, including smartphones, with mobile Business Discovery Enable a broad spectrum of users to jointly develop QlikView apps with QlikView’s rapid analytic app platform capabilities Improve the manageability and performance of QlikView with new enterprise platform capabilities. Q.What is QlikView comparative analysis? Ans: Business users can quickly gain new kinds of insight when analyzing information in QlikView, with new comparative analysis options. App developers can now create multiple selection states in a QlikView app; they can create graphs, tables, or sheets based on different selection sets. Q.What mobile device platforms does QlikView 11 support? Ans: QlikView 11 delivers mobile functionality for Apple iOS and Android tablets and smartphones. QlikView supports Android tablets when the following conditions are met: QlikView Server version 10 SR3 or later The native browser, not a downloaded one Currently our HTML5 web apps support only Apple and Android handhelds. Because many Black Berry are older devices that don’t fully support HTML5 (and many are non-touch), we don’t have a web-based solution for them at this time. Q.What is document-level auditing in QlikView 11? Ans: New optional settings within QlikView Management Console enable administrators to more effectively audit user interactions. Administrators can audit QlikView usage not only at the system level (the entire QlikView Server), but down to the document level. Q.What are the key differences between QlikView and any other standard statistical software package (SAS, SPSS)? Ans: Key difference is in terms of the database used. QlikView offers a quite simple visualization that matches the MS excel filtering. SAS is useful in case of Meta data while SPSS is good for analysis. In comparison of the above three, QlikView is most user friendly and fast in terms of generating diverse dashboards/templates. 
In terms of calculations, advanced statistics options are limited in QlikView. For market research and analysis SPSS has direct facility algorithms. Q.What are QlikView annotations? Ans: With the new annotations collaboration object QlikView users can engage in threaded discussions about QlikView content. A user can create notes associated with any QlikView object. Other users can then add their own commentary to create a threaded discussion. Users can capture snapshots of their selections and include them in the discussion so others can get back to the same place in the analysis when reviewing notes and comments. QlikView captures the state of the object (current selections), as well as who made each note and comment and when, for a lasting record of how a decision was made. Q.What are the main features of QlikView? Ans: QlikView offers the following features: Dynamic BI Ecosystem Data visualization Interacting with dynamic apps, dashboards and analytics Searching across all data Secure, real-time collaboration Q.What are the differences among QlikView Server editions? Ans: The differences are: QlikView Server Enterprise Edition (EE) is available for customers looking to support a large number of users and integrate into enterprise environments. It includes features such as: Unlimited documents Integration with third party security systems Server clustering Small Business Edition (SBE) is designed to be used in smaller deployments. It has the following limitations: For use only with Named and Document CALs Limited to 25 Named User CALs Limited to 100 Document CALs No support for additional servers Only supports Windows Active Directory to handle security and access control Information Access Server (IAS) is an edition of QlikView Server designed to power public Internet sites. This edition: Includes the add-on QlikView Real Time Server Is licensed for uncapped number of user but limited to one QlikView document Must be set to anonymous mode only and authentication must be off Requires that the QlikView server be on the public Internet and publicly accessible Requires that the URL for accessing the site powered by the QlikView Server be publicly accessible Requires that no QlikView client (e.g., QlikView Desktop, Internet Explorer plug-in, Ajax) can access the QlikView Server (all user interfaces must be built by the customer manually or with QlikView Workbench) QlikView Extranet Server (QES) is an edition of QlikView Server designed to extend QlikView functionality to external users via an extranet. QES: Requires authentication. Users must be external to the purchasing organization (customers, partners, etc.). Restricts server access to the Ajax client and mobile clients Provides the option to customize the QlikView application via the included QlikView Workbench Supports a maximum of 3 QlikView documents Supports session CALs and usage CALs only contact for more on Qlikview Online Training
Spark Interview Questions
Q.What is the difference between Spark and Hadoop? Answer: 1.Speed: Apache Spark: Spark is lightning fast cluster computing tool. Apache Spark runs applications up to 100x faster in memory and 10x faster on disk than Hadoop. Because of reducing the number of read/write cycle to disk and storing intermediate data in-memory Spark makes it possible. Hadoop MapReduce – MapReduce reads and writes from disk, as a result, it slows down the processing speed. 2.Difficulty: Apache Spark – Spark is easy to program as it has tons of high-level operators with RDD – Resilient Distributed Dataset. Hadoop MapReduce – In MapReduce, developers need to hand code each and every operation which makes it very difficult to work. 3.Easy to Manage Apache Spark – Spark is capable of performing batch, interactive and Machine Learning and Streaming all in the same cluster. As a result makes it a complete data analytics engine. Thus, no need to manage different component for each need. Installing Spark on a cluster will be enough to handle all the requirements. Hadoop MapReduce – As MapReduce only provides the batch engine. Hence, we are dependent on different engines. For example- Storm, Giraph, Impala, etc. for other requirements. So, it is very difficult to manage many components. 4.Real-time analysis Apache Spark – It can process real time data i.e. data coming from the real-time event streams at the rate of millions of events per second, e.g. Twitter data for instance or Facebook sharing/posting. Spark’s strength is the ability to process live streams efficiently. Hadoop MapReduce – MapReduce fails when it comes to real-time data processing as it was designed to perform batch processing on voluminous amounts of data. 5.Fault tolerance Apache Spark – Spark is fault-tolerant. As a result, there is no need to restart the application from scratch in case of any failure. Hadoop MapReduce – Like Apache Spark, MapReduce is also fault-tolerant, so there is no need to restart the application from scratch in case of any failure .6.Security Apache Spark – Spark is little less secure in comparison to MapReduce because it supports the only authentication through shared secret password authentication. Hadoop MapReduce – Apache Hadoop MapReduce is more secure because of Kerberos and it also supports Access Control Lists (ACLs) which are a traditional file permission model. Q.Why is Parquet used for Spark SQL? Answer: Parquet is a columnar format, supported by many data processing systems.Apache Parquet as a file format has garnered significant attention recently. Let’s say you have a table with 100 columns, most of the time you are going to access 3-10 columns. In Row oriented format all columns are scanned whether you need them or not. The advantages of having a columnar storage are as follows − Columnar storage limits IO operations. Columnar storage can fetch specific columns that you need to access. Columnar storage consumes less space. Columnar storage gives better-summarized data and follows type-specific encoding. Spark SQL provides support for both reading and writing parquet files that automatically capture the schema of the original data. Like JSON datasets, parquet files follow the same procedure. Apache Parquet saves data in column oriented fashion, so if you need 3 columns, only data of those 3 columns get loaded. Another benefit is that since all data in a given column is the same datatype (obviously), compression quality is far superior. Q.Why Spark, even Hadoop exists? Answer: Below are few reasons. 
Iterative algorithms: Generally, MapReduce is not good at processing iterative algorithms such as machine learning and graph processing. Graph and machine learning algorithms are iterative by nature; this type of algorithm needs its data in memory to run the algorithm steps again and again, and fewer writes to disk and fewer transfers over the network mean better performance.
In-memory processing: MapReduce uses disk storage for intermediate results and also reads them back from disk, which is not good for fast processing. Spark keeps data in memory (configurable), which saves a lot of time by not reading and writing data to disk the way Hadoop does.
Near real-time data processing: Spark also supports near real-time streaming workloads via the Spark Streaming framework.
Q. Why are both Spark and Hadoop needed?
Answer: Spark is often called a cluster computing engine or simply an execution engine, and it uses many concepts from Hadoop MapReduce; the two work well together. Spark with HDFS and YARN gives better performance and also simplifies work distribution on the cluster, with HDFS as the storage engine for huge volumes of data and Spark as the processing engine (in-memory and more efficient data processing). HDFS: used as a storage engine for Spark as well as Hadoop. YARN: a framework to manage the cluster using a pluggable scheduler. Run more than MapReduce: with Spark you can run MapReduce-style algorithms as well as other higher-level operators, for instance map(), filter(), reduceByKey(), groupByKey(), etc.
Q. How can you use a machine learning library written in Python, such as the SciKit library, with the Spark engine?
Answer: A machine learning tool written in Python, e.g. the SciKit library, can be used through the Pipeline API in Spark MLlib or by calling pipe().
Q. Why is Spark good at low-latency iterative workloads, e.g. graphs and machine learning?
Answer: Machine learning algorithms such as logistic regression require many iterations before producing an optimal model, and similarly graph algorithms traverse all the nodes and edges repeatedly. Any algorithm that needs many iterations before producing a result can gain performance when the intermediate partial results are stored in memory or on very fast solid state drives. Spark can cache intermediate data in memory for faster model building and training, and when graph algorithms are processed it traverses the graph one connection per iteration with the partial result in memory. Less disk access and less network traffic make a huge difference when you need to process lots of data.
Q. What are the ways to configure Spark properties, ordered from least important to most important?
Answer: There are the following ways to set up properties for Spark and user programs, from least to most important: conf/spark-defaults.conf (the defaults file), --conf (the command line option used by spark-shell and spark-submit), and SparkConf in the application code (see the configuration sketch below).
Q. What is the default level of parallelism in Spark?
Answer: The default level of parallelism is the number of partitions used when a number is not specified explicitly by the user.
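As a rough illustration of the precedence just described (spark-defaults.conf overridden by --conf, overridden by SparkConf in code), here is a small sketch; it is not from the original article, and the property value chosen is an example only.
import org.apache.spark.{SparkConf, SparkContext}

object ConfPrecedenceExample {
  def main(args: Array[String]): Unit = {
    // Properties set here on SparkConf win over --conf flags passed to
    // spark-submit, which in turn win over conf/spark-defaults.conf.
    val conf = new SparkConf()
      .setAppName("ConfPrecedenceExample")
      .setMaster("local[*]")
      .set("spark.executor.memory", "2g") // illustrative value, not a recommendation

    val sc = new SparkContext(conf)
    println(sc.getConf.get("spark.executor.memory")) // prints 2g
    sc.stop()
  }
}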
Q. What are the differences between functional and imperative languages, and why is functional programming important?
Answer: The following features of Scala make it uniquely suitable for Spark:
Immutability - Immutable means that you cannot change your variables; you mark them as final in Java or use the val keyword in Scala.
Higher-order functions - Functions that take other functions as parameters, or whose result is a function. For example, a function apply that takes another function f and a value v and applies f to v: def apply(f: Int => String, v: Int) = f(v)
Lazy evaluation - A lazy val is evaluated only when it is accessed for the first time; otherwise it is never executed.
Pattern matching - Scala has a built-in general pattern matching mechanism that allows matching on any sort of data with a first-match policy.
Currying - If we turn a function into a function object that we can assign or pass around, its signature becomes a chain of one-parameter functions, e.g. val sizeConstraintFn: IntPairPred => Int => Email => Boolean = sizeConstraint _ . Such a chain of one-parameter functions is called a curried function.
Partial application - When applying the function, you do not pass in arguments for all of the parameters defined by the function, only for some of them, leaving the remaining ones blank. What you get back is a new function whose parameter list contains only the parameters that were left blank.
Monads - Most Scala collections are monadic; operating on them using map and flatMap, or using for-comprehensions, is referred to as monadic style.
Q. Is it possible to have multiple SparkContexts in a single JVM?
Answer: Yes, if spark.driver.allowMultipleContexts is set to true (default: false). When it is true, Spark logs warnings instead of throwing exceptions when multiple SparkContexts are active in the same JVM; the check happens when an instance of SparkContext is created.
Q. Can an RDD be shared between SparkContexts?
Answer: No. When an RDD is created, it belongs to and is completely owned by the SparkContext it originated from; RDDs cannot be shared between SparkContexts.
Q. In spark-shell, which contexts are available by default?
Answer: SparkContext and SQLContext.
Q. Give a few examples of how an RDD can be created using SparkContext.
Answer: SparkContext allows you to create many different RDDs from input sources such as: Scala collections, e.g. sc.parallelize(0 to 100); local or remote filesystems, e.g. sc.textFile("README.md"); any Hadoop InputSource, using sc.newAPIHadoopFile.
Q. How would you broadcast a collection of values to the Spark executors?
Answer: sc.broadcast("hello")
Q. What is the advantage of broadcasting values across a Spark cluster?
Answer: Spark transfers the value to the executors once, and tasks can share it without incurring repeated network transmissions when it is requested multiple times.
Q. Can we broadcast an RDD?
Answer: You should not broadcast an RDD for use in tasks; Spark will warn you if you do, although it will not stop you.
Q. How can we distribute JARs to workers?
Answer: The JAR you specify with SparkContext.addJar will be copied to all the worker nodes.
Q. How can you stop a SparkContext, and what is the impact of stopping it?
Answer: You can stop a SparkContext using the SparkContext.stop() method. Stopping it stops the Spark Runtime Environment and effectively shuts down the entire Spark application.
Q. Which scheduler is used by SparkContext by default?
Answer: By default, SparkContext uses DAGScheduler, but you can develop your own custom DAGScheduler implementation.
Q. How would you set the amount of memory to allocate to each executor?
Answer: SPARK_EXECUTOR_MEMORY sets the amount of memory to allocate to each executor.
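Building on the sc.broadcast("hello") answer above, here is a minimal sketch of broadcasting a small lookup map and reading it inside tasks; the map contents and names are made up for illustration.
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BroadcastExample").setMaster("local[*]"))

    // Broadcast a small lookup table once; each executor caches a read-only copy.
    val countryNames = sc.broadcast(Map("US" -> "United States", "IN" -> "India"))

    val codes = sc.parallelize(Seq("US", "IN", "US"))
    // Tasks read the broadcast value instead of shipping the map with every task.
    val resolved = codes.map(code => countryNames.value.getOrElse(code, "Unknown"))
    resolved.collect().foreach(println)

    sc.stop()
  }
}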
Q. How do you define an RDD?
Answer: A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. RDDs are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. Resilient: fault-tolerant, able to recompute missing or damaged partitions on node failure with the help of the RDD lineage graph. Distributed: spread across the nodes of a cluster. Dataset: a collection of partitioned data.
Q. What does "lazily evaluated RDD" mean?
Answer: Lazily evaluated means the data inside the RDD is not available or transformed until an action is executed that triggers the computation.
Q. How would you control the number of partitions of an RDD?
Answer: You can control the number of partitions of an RDD using the repartition or coalesce operations.
Q. What are the possible operations on an RDD?
Answer: RDDs support two kinds of operations: transformations, which are lazy operations that return another RDD, and actions, which trigger computation and return values.
Q. How do RDDs help parallel job processing?
Answer: Spark runs jobs in parallel, and RDDs are split into partitions that are processed and written in parallel. Inside a partition, data is processed sequentially.
Q. What is a transformation?
Answer: A transformation is a lazy operation on an RDD that returns another RDD, such as map, flatMap, filter, reduceByKey, join, cogroup, etc. Transformations are not executed immediately, but only after an action has been executed.
Q. How do you define actions?
Answer: An action is an operation that triggers execution of the RDD transformations and returns a value to the Spark driver (the user program). Simply put, an action evaluates the RDD lineage graph. You can think of actions as a valve: until an action is fired, the data to be processed is not even in the pipes, i.e. the transformations. Only actions can materialize the entire processing pipeline with real data.
Q. How can you create an RDD from a text file?
Answer: SparkContext.textFile
Q. What are preferred locations?
Answer: A preferred location (aka locality preference or placement preference) is a block location for an HDFS file where each partition should be computed. def getPreferredLocations(split: Partition): Seq[String] specifies the placement preferences for a partition in an RDD.
Q. What is an RDD lineage graph?
Answer: An RDD lineage graph (aka RDD operator graph) is a graph of the parent RDDs of an RDD. It is built as a result of applying transformations to the RDD, and is hence a graph of the transformations that need to be executed after an action has been called.
Q. How does execution start and end on an RDD or Spark job?
Answer: The execution plan starts with the earliest RDDs (those with no dependencies on other RDDs, or those that reference cached data) and ends with the RDD that produces the result of the action that was called.
Q. Give examples of transformations that do trigger jobs.
Answer: There are a couple of transformations that do trigger jobs, e.g. sortBy, zipWithIndex, etc.
Q. How many types of transformations exist?
Answer: There are two kinds of transformations: narrow transformations and wide transformations.
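Before looking at narrow and wide transformations in detail, here is a minimal sketch (not from the original article) of the lazy transformation/action behaviour described above: nothing executes until the action is called.
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazyEvaluationExample").setMaster("local[*]"))

    val numbers = sc.parallelize(1 to 100)    // RDD creation
    val evens   = numbers.filter(_ % 2 == 0)  // transformation: lazy, returns another RDD
    val doubled = evens.map(_ * 2)            // transformation: still nothing has executed

    // The action triggers evaluation of the whole lineage and returns a value to the driver.
    println(doubled.count())                  // prints 50

    sc.stop()
  }
}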
Q. What are narrow transformations?
Answer: Narrow transformations are the result of operations such as map and filter, where the data comes from a single partition only, i.e. it is self-sustained. An output RDD has partitions with records that originate from a single partition in the parent RDD, and only a limited subset of partitions is used to calculate the result. Spark groups narrow transformations into a stage.
Q. What are wide transformations?
Answer: Wide transformations are the result of operations such as groupByKey and reduceByKey. The data required to compute the records in a single partition may reside in many partitions of the parent RDD. All of the tuples with the same key must end up in the same partition, processed by the same task. To satisfy these operations, Spark must execute an RDD shuffle, which transfers data across the cluster and results in a new stage with a new set of partitions.
Q. Data is spread across all the nodes of the cluster; how does Spark try to process this data?
Answer: By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed, partitioned data, it creates partitions to hold the data chunks in order to optimize transformation operations.
Q. How would you hint at a minimum number of partitions for a transformation?
Answer: You can request a minimum number of partitions using the second input parameter of many transformations, e.g. scala> sc.parallelize(1 to 100, 2).count. The preferred way to set the number of partitions for an RDD is to pass it directly as the second input parameter in the call, as in rdd = sc.textFile("hdfs://… /file.txt", 400), where 400 is the number of partitions. In this case the partitioning makes for 400 splits done by Hadoop's TextInputFormat, not by Spark, and it works much faster. The code also spawns 400 concurrent tasks to try to load file.txt directly into 400 partitions.
Q. How many concurrent tasks can Spark run for an RDD partition?
Answer: Spark can run only one concurrent task for every partition of an RDD, up to the number of cores in your cluster. So if you have a cluster with 50 cores, you want your RDDs to have at least 50 partitions (and probably 2-3x that). As far as choosing a "good" number of partitions, you generally want at least as many as the number of executors, for parallelism. You can get this computed value by calling sc.defaultParallelism.
Q. What limits the maximum size of a partition?
Answer: The maximum size of a partition is ultimately limited by the available memory of an executor.
Q. When Spark works with file.txt.gz, how many partitions can be created?
Answer: When using textFile with compressed files (file.txt.gz rather than file.txt or similar), Spark disables splitting, which makes for an RDD with only one partition (reads against gzipped files cannot be parallelized). In this case, to change the number of partitions you should repartition, for example: rdd = sc.textFile('demo.gz') followed by rdd = rdd.repartition(100). With these lines, you end up with an rdd of exactly 100 partitions of roughly equal size.
Q. What is the coalesce transformation?
Answer: The coalesce transformation is used to change the number of partitions. It can trigger an RDD shuffle depending on its boolean shuffle input parameter (which defaults to false).
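To illustrate the narrow vs. wide distinction above, here is a hedged sketch in which map and filter stay within their partitions while reduceByKey forces a shuffle; the data and partition counts are arbitrary examples.
import org.apache.spark.{SparkConf, SparkContext}

object NarrowVsWideExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("NarrowVsWideExample").setMaster("local[*]"))

    // Ask for 4 partitions at creation time (second parameter).
    val words = sc.parallelize(Seq("spark", "hadoop", "spark", "hive", "spark", "hive"), 4)
    println(words.getNumPartitions) // 4

    // Narrow transformations: each output partition depends on a single parent partition.
    val pairs = words.filter(_.nonEmpty).map(word => (word, 1))

    // Wide transformation: all identical keys must land in the same partition,
    // so Spark shuffles data across the cluster and starts a new stage.
    val counts = pairs.reduceByKey(_ + _)
    counts.collect().foreach(println)

    // coalesce reduces the number of partitions, avoiding a full shuffle by default.
    println(counts.coalesce(2).getNumPartitions) // 2

    sc.stop()
  }
}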
Q.What is the difference between cache() and persist() method of RDD Answer: RDDs can be cached (using RDD’s cache() operation) or persisted (using RDD’s persist(newLevel: StorageLevel) operation). The cache() operation is a synonym of persist() that uses the default storage level MEMORY_ONLY . Q.You have RDD storage level defined as MEMORY_ONLY_2 , what does _2 means ? Answer: number _2 in the name denotes 2 replicas Q.What is Shuffling? Answer: Shuffling is a process of repartitioning (redistributing) data across partitions and may cause moving it across JVMs or even network when it is redistributed among executors. Avoid shuffling at all cost. Think about ways to leverage existing partitions. Leverage partial aggregation to reduce data transfer. Q.Does shuffling change the number of partitions? Answer: No, By default, shuffling doesn’t change the number of partitions, but their content Q.What is the difference between groupByKey and use reduceByKey ? Answer : Avoid groupByKey and use reduceByKey or combineByKey instead. groupByKey shuffles all the data, which is slow. reduceByKey shuffles only the results of sub-aggregations in each partition of the data. Q.When you call join operation on two pair RDDs e.g. (K, V) and (K, W), what is the result? Answer: When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key Q.What is checkpointing? Answer: Checkpointing is a process of truncating RDD lineage graph and saving it to a reliable distributed (HDFS) or local file system. RDD checkpointing that saves the actual intermediate RDD data to a reliable distributed file system. You mark an RDD for checkpointing by calling RDD.checkpoint() . The RDD will be saved to a file inside the checkpoint directory and all references to its parent RDDs will be removed. This function has to be called before any job has been executed on this RDD. Q.What do you mean by Dependencies in RDD lineage graph? Answer: Dependency is a connection between RDDs after applying a transformation. Q.Which script will you use Spark Application, using spark-shell ? Answer: You use spark-submit script to launch a Spark application, i.e. submit the application to a Spark deployment environment. Q.Define Spark architecture Answer: Spark uses a master/worker architecture. There is a driver that talks to a single coordinator called master that manages workers in which executors run. The driver and the executors run in their own Java processes. Q.What is the purpose of Driver in Spark Architecture? Answer: A Spark driver is the process that creates and owns an instance of SparkContext. It is your Spark application that launches the main method in which the instance of SparkContext is created. Drive splits a Spark application into tasks and schedules them to run on executors. A driver is where the task scheduler lives and spawns tasks across workers. A driver coordinates workers and overall execution of tasks. Q.Can you define the purpose of master in Spark architecture? Answer: A master is a running Spark instance that connects to a cluster manager for resources. The master acquires cluster nodes to run executors. Q.What are the workers? Answer: Workers or slaves are running Spark instances where executors live to execute tasks. They are the compute nodes in Spark. A worker receives serialized/marshalled tasks that it runs in a thread pool. Q.Please explain, how worker’s work, when a new Job submitted to them? 
Answer: When SparkContext is created, each worker starts one executor. This is a separate java process or you can say new JVM, and it loads application jar in this JVM. Now executors connect back to your driver program and driver send them commands, like, foreach, filter, map etc. As soon as the driver quits, the executors shut down Q.Please define executors in detail? Answer: Executors are distributed agents responsible for executing tasks. Executors provide in- memory storage for RDDs that are cached in Spark applications. When executors are started they register themselves with the driver and communicate directly to execute tasks. Q.What is DAGSchedular and how it performs? Answer: DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling, i.e. after an RDD action has been called it becomes a job that is then transformed into a set of stages that are submitted as TaskSets for execution. DAGScheduler uses an event queue architecture in which a thread can post DAGSchedulerEvent events, e.g. a new job or stage being submitted, that DAGScheduler reads and executes sequentially. Q.What is stage, with regards to Spark Job execution? Answer: A stage is a set of parallel tasks, one per partition of an RDD, that compute partial results of a function executed as part of a Spark job. Q.What is Task, with regards to Spark Job execution? Answer: Task is an individual unit of work for executors to run. It is an individual unit of physical execution (computation) that runs on a single machine for parts of your Spark application on a data. All tasks in a stage should be completed before moving on to another stage. A task can also be considered a computation in a stage on a partition in a given job attempt. A Task belongs to a single stage and operates on a single partition (a part of an RDD). Tasks are spawned one by one for each stage and data partition. Q.What is Speculative Execution of a tasks? Answer: Speculative tasks or task strugglers are tasks that run slower than most of the all tasks in a job. Speculative execution of tasks is a health-check procedure that checks for tasks to be speculated, i.e. running slower in a stage than the median of all successfully completed tasks in a taskset . Such slow tasks will be re-launched in another worker. It will not stop the slow tasks, but run a new copy in parallel. Q.Which all cluster manager can be used with Spark? Answer: Apache Mesos, Hadoop YARN, Spark standalone and Spark local: Local node or on single JVM. Drivers and executor runs in same JVM. In this case same node will be used for execution. Q.What is a BlockManager? Answer: Block Manager is a key-value store for blocks that acts as a cache. It runs on every node, i.e. a driver and executors, in a Spark runtime environment. It provides interfaces for putting and retrieving blocks both locally and remotely into various stores, i.e. memory, disk, and offheap. A BlockManager manages the storage for most of the data in Spark, i.e. block that represent a cached RDD partition, intermediate shuffle data, and broadcast data. Q.What is Data locality / placement? Answer: Spark relies on data locality or data placement or proximity to data source, that makes Spark jobs sensitive to where the data is located. It is therefore important to have Spark running on Hadoop YARN cluster if the data comes from HDFS. 
With HDFS, the Spark driver contacts the NameNode about the DataNodes (ideally local) containing the various blocks of a file or directory, as well as their locations (represented as InputSplits), and then schedules the work to the Spark workers. Spark's compute nodes / workers should be running on the storage nodes.
Q. What is the master URL in local mode?
Answer: You can run Spark in local mode using local, local[n] or the most general local[*]. The URL says how many threads can be used in total: local uses 1 thread only; local[n] uses n threads; local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to find that number).
Q. Define the components of YARN.
Answer: The YARN components are: ResourceManager - runs as a master daemon and manages ApplicationMasters and NodeManagers. ApplicationMaster - a lightweight process that coordinates the execution of the tasks of an application and asks the ResourceManager for resource containers for those tasks; it monitors tasks, restarts failed ones, etc., and can run any type of task, be it a MapReduce, Giraph or Spark task. NodeManager - offers resources (memory and CPU) as resource containers. Container - can run tasks, including ApplicationMasters.
Q. What is a broadcast variable?
Answer: Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
Q. How can you define Spark accumulators?
Answer: They are similar to counters in the Hadoop MapReduce framework and provide information about the completion of tasks, how much data has been processed, and so on.
Q. What data sources can Spark process?
Answer: Hadoop File System (HDFS), Cassandra (NoSQL database), HBase (NoSQL database), S3 (Amazon Web Services storage: AWS cloud).
Q. What is the Apache Parquet format?
Answer: Apache Parquet is a columnar storage format.
Q. What is Apache Spark Streaming?
Answer: Spark Streaming helps to process live stream data. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. contact for more on Spark Online Training
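To round off the Spark questions, here is a small sketch tying together the Parquet answers above: Spark SQL writes the schema into the Parquet files and, on read, scans only the selected columns. The file path and column names are made up for illustration.
import org.apache.spark.sql.SparkSession

object ParquetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ParquetExample").master("local[*]").getOrCreate()
    import spark.implicits._

    // A small sample dataset; in practice this would be a wide table with many columns.
    val sales = Seq((1, "US", 100.0), (2, "UK", 250.0)).toDF("id", "country", "amount")

    // Writing preserves the schema inside the Parquet files.
    sales.write.mode("overwrite").parquet("/tmp/sales_parquet")

    // Selecting only two columns: a columnar format lets Spark read just
    // those columns instead of scanning whole rows.
    spark.read.parquet("/tmp/sales_parquet").select("country", "amount").show()

    spark.stop()
  }
}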
Splunk Interview Questions
Q. What is Splunk?
Ans: Splunk is "Google for your machine data". It is a software engine that can be used for searching, visualizing, monitoring and reporting on your enterprise data. Splunk takes valuable machine data and turns it into powerful operational intelligence by providing real-time insight into your data through charts, alerts, reports, etc.
Q. What are the components of Splunk / the Splunk architecture?
Ans: The main components of Splunk are: Search head - provides the GUI for searching. Indexer - indexes machine data. Forwarder - forwards logs to the indexer. Deployment server - manages Splunk components in a distributed environment.
Q. Which is the latest Splunk version in use?
Ans: Splunk 6.3.
Q. What is a Splunk indexer? What are the stages of Splunk indexing?
Ans: The indexer is the Splunk Enterprise component that creates and manages indexes. The primary functions of an indexer are indexing incoming data and searching the indexed data.
Q. What is a Splunk forwarder, and what are the types of Splunk forwarder?
Ans: There are two types of Splunk forwarder: Universal forwarder (UF) - a Splunk agent installed on a non-Splunk system to gather data locally; it cannot parse or index data. Heavyweight forwarder (HWF) - a full instance of Splunk with advanced functionality; it generally works as a remote collector, intermediate forwarder and possible data filter. Because they parse data, HWFs are not recommended for production systems.
Q. What are the types of Splunk licenses?
Ans: Enterprise license, Free license, Forwarder license, Beta license, licenses for search heads (for distributed search), and licenses for cluster members (for index replication).
Q. What is a Splunk app?
Ans: A Splunk app is a container/directory of configurations, searches, dashboards, etc. in Splunk.
Q. Where is the Splunk default configuration stored?
Ans: $SPLUNK_HOME/etc/system/default
Q. What features are not available in Splunk Free?
Ans: Splunk Free lacks these features: authentication and scheduled searches/alerting, distributed search, forwarding over TCP/HTTP (to non-Splunk systems), and deployment management.
Q. What happens if the license master is unreachable?
Ans: The license slave starts a 24-hour timer, after which search is blocked on the license slave (though indexing continues). Users will not be able to search data on that slave until it can reach the license master again.
Q. What is the summary index in Splunk?
Ans: The summary index is the default summary index (the index that Splunk Enterprise uses if you do not indicate another one). If you plan to run a variety of summary index reports, you may need to create additional summary indexes.
Q. What is Splunk DB Connect?
Ans: Splunk DB Connect is a generic SQL database plugin for Splunk that allows you to easily integrate database information with Splunk queries and reports.
Q. Can you write down a general regular expression for extracting an IP address from logs?
Ans: There are multiple ways to extract an IP address from logs. Below are two examples using rex (the capture-group name "ip" is illustrative): rex field=_raw "(?<ip>\d+\.\d+\.\d+\.\d+)" OR rex field=_raw "(?<ip>(\d{1,3}\.){3}\d{1,3})"
Q. What is the difference between the stats and transaction commands?
Ans: The transaction command is most useful in two specific cases. The first is when a unique id (from one or more fields) alone is not sufficient to discriminate between two transactions; this is the case when the identifier is reused, for example web sessions identified by cookie/client IP. In this case, time spans or pauses are also used to segment the data into transactions.
In other cases where an identifier is reused, say in DHCP logs, a particular message may identify the beginning or end of a transaction. The second case is when it is desirable to see the raw text of the events combined, rather than an analysis of the constituent fields of the events. In other cases it is usually better to use stats, as its performance is higher, especially in a distributed search environment; often there is a unique id and stats can be used.
Q. How do you troubleshoot Splunk performance issues?
Ans: Check splunkd.log for any errors. Check for server performance issues, i.e. CPU/memory usage, disk I/O, etc. Install the SOS (Splunk on Splunk) app and check for warnings and errors in its dashboards. Check the number of saved searches currently running and their system resource consumption. Install Firebug, a Firefox extension; after it is installed and enabled, log into Splunk (using Firefox), open Firebug's panels and switch to the 'Net' panel (you will have to enable it). The Net panel shows the HTTP requests and responses along with the time spent in each, which quickly tells you which requests are hanging Splunk for a few seconds and which are blameless.
Q. What are buckets? Explain the Splunk bucket lifecycle.
Ans: Splunk places indexed data in directories called "buckets". A bucket is physically a directory containing events of a certain period. A bucket moves through several stages as it ages: Hot - contains newly indexed data and is open for writing; there are one or more hot buckets for each index. Warm - data rolled from hot; there are many warm buckets. Cold - data rolled from warm; there are many cold buckets. Frozen - data rolled from cold; the indexer deletes frozen data by default, but you can also archive it. Archived data can later be thawed (data in frozen buckets is not searchable). By default, your buckets are located in $SPLUNK_HOME/var/lib/splunk/defaultdb/db; you should see the hot-db there, plus any warm buckets you have. By default, Splunk sets the bucket size to 10 GB on 64-bit systems and 750 MB on 32-bit systems.
Q. What is the difference between the stats and eventstats commands?
Ans: The stats command generates summary statistics of all existing fields in your search results and saves them as values in new fields. Eventstats is similar, except that the aggregation results are added inline to each event, and only if the aggregation is pertinent to that event; eventstats computes the requested statistics like stats, but aggregates them back onto the original raw data.
Q. Who are the biggest direct competitors to Splunk?
Ans: Logstash, Loggly, LogLogic, Sumo Logic, etc.
Q. What do Splunk licenses specify?
Ans: How much data you can index per calendar day.
Q. How does Splunk determine one day, from a licensing perspective?
Ans: Midnight to midnight on the clock of the license master.
Q. How are forwarder licenses purchased?
Ans: They are included with Splunk; there is no need to purchase them separately.
Q. What is the command for restarting just the Splunk web server?
Ans: splunk start splunkweb
Q. What is the command for restarting just the Splunk daemon?
Ans: splunk start splunkd
Q. What is the command to check for running Splunk processes on Unix/Linux?
Ans: ps aux | grep splunk
Q. What is the command to enable Splunk to start at boot?
Ans: $SPLUNK_HOME/bin/splunk enable boot-start
Q. How do you disable Splunk boot-start?
Ans: $SPLUNK_HOME/bin/splunk disable boot-start
Q. What is sourcetype in Splunk?
Ans: Sourcetype is Splunk's way of identifying the type and format of the data.
Q. How do you reset the Splunk admin password?
Ans: To reset the password, log in to the server on which Splunk is installed, rename the passwd file at the location below, and then restart Splunk. After the restart you can log in using the default username admin and password changeme. $SPLUNK_HOME/etc/passwd
Q. How do you disable the Splunk launch message?
Ans: Set the value OFFENSIVE=Less in splunk_launch.conf.
Q. How do you clear the Splunk search history?
Ans: Delete the following file on the Splunk server: $splunk_home/var/log/splunk/searches.log
Q. What is btool, or how will you troubleshoot Splunk configuration files?
Ans: Splunk btool is a command line tool that helps troubleshoot configuration file issues, or simply shows which values are being used by your Splunk Enterprise installation in the existing environment.
Q. What is the difference between a Splunk app and a Splunk add-on?
Ans: Both contain preconfigured configurations, reports, etc., but a Splunk add-on does not have a visual app, whereas a Splunk app does.
Q. What is the .conf file precedence in Splunk?
Ans: File precedence is as follows: system local directory (highest priority), app local directories, app default directories, system default directory (lowest priority).
Q. What is the fishbucket, or what is the fishbucket index?
Ans: It is a directory or index at the default location /opt/splunk/var/lib/splunk. It contains seek pointers and CRCs for the files you are indexing, so splunkd can tell whether it has read them already. We can access it through the GUI by searching for "index=_thefishbucket".
Q. How do I exclude some events from being indexed by Splunk?
Ans: This can be done by defining a regex to match the necessary event(s) and sending everything else to the nullQueue. Here is a basic example that drops everything except events containing the string "login" (the [setnull] and [setparsing] stanza names follow from the TRANSFORMS-set line; place the TRANSFORMS-set setting under the appropriate source or sourcetype stanza).
In props.conf:
# Transforms must be applied in this order
# to make sure events are dropped on the
# floor prior to making their way to the
# index processor
TRANSFORMS-set = setnull,setparsing
In transforms.conf:
[setnull]
REGEX = .
DEST_KEY = queue
FORMAT = nullQueue
[setparsing]
REGEX = login
DEST_KEY = queue
FORMAT = indexQueue
Q. How can I tell when Splunk is finished indexing a log file?
Ans: By watching data from Splunk's metrics log in real time (substitute the sourcetype you are interested in for the placeholder):
index="_internal" source="*metrics.log" group="per_sourcetype_thruput" series="<your_sourcetype>" | eval MB=kb/1024 | chart sum(MB)
or, to watch everything happening split by sourcetype:
index="_internal" source="*metrics.log" group="per_sourcetype_thruput" | eval MB=kb/1024 | chart sum(MB) avg(eps) over series
This is also useful if you are having trouble with a data input and want a way to troubleshoot it, particularly if your whitelist/blacklist rules are not working the way you expect.
Q. What is the dispatch directory?
Ans: $SPLUNK_HOME/var/run/splunk/dispatch contains a directory for each search that is running or has completed. For example, a directory named 1434308943.358 will contain a CSV file of its search results, a search.log with details about the search execution, and other artifacts. Using the defaults (which you can override in limits.conf), these directories are deleted 10 minutes after the search completes, unless the user saves the search results, in which case the results are deleted after 7 days.
Q. What is the difference between search head pooling and search head clustering?
Ans: Both are features provided splunk for high availability of splunk search head in case any one search head goes down.Search head cluster is newly introduced and search head pooling will be removed in next upcoming versions.Search head cluster is managed by captain and captain controls its slaves.Search head cluster is more reliable and efficient than search head pooling. Q.If I Want Add/onboard Folder Access Logs From A Windows Machine To Splunk How Can I Add Same? Ans: Below are steps to add folder access logs to splunk: Enable Object Access Audit through group policy on windows machine on which folder is located Enable auditing on specific folder for which you want to monitor logs Install splunk universal forwarder on windows machine Configure universal forwarder to send security logs to splunk indexer Q.How Would You Handle/troubleshoot Splunk License Violation Warning Error? Ans: License violation warning means splunk has indexed more data than our purchased license quota.We have to identify which index/sourcetype has received more data recently than usual daily data volume.We can check on splunk license master pool wise available quota and identify the pool for which violation is occurring.Once we know the pool for which we are receiving more data then we have to identify top sourcetype for which we are receiving more data than usual data.Once sourcetype is identified then we have to find out source machine which is sending huge number of logs and root cause for the same and troubleshoot accordingly. Q.What Is Mapreduce Algorithm? Ans: Mapreduce algorithm is secret behind Splunk fast data searching speed.It’s an algorithm typically used for batch based large scale parallelization. It’s inspired by functional programming’s map() and reduce () functions. Q.How Splunk Avoids Duplicate Indexing Of Logs ? Ans: At indexer splunk keeps track of indexed events in a directory called fish buckets (default location /opt/splunk/var/lib/splunk). It contains seek pointers and CRCs for the files you are indexing, so splunkd can tell if it has read them already. Q.What Is Difference Between Splunk Sdk And Splunk Framework? Ans: Splunk SDKs are designed to allow you to develop applications from the ground up and not require Splunk Web or any components from the Splunk App Framework. These are separately licensed to you from the Splunk Software and do not alter the Splunk Software.Splunk App Framework resides within Splunk’s web server and permits you to customize the Splunk Web UI that comes with the product and develop Splunk apps using the Splunk web server. It is an important part of the features and functionalities of Splunk Software , which does not license users to modify anything in the Splunk Software. Contact for More On Splunk Online Training
Tableau Interview Questions
Q.What is Data Visualization? Answer: Data visualization is, quite simply, the process of describing information through visual rendering. Humans have used visualizations to explain the world around them for millions of years. Data visualization allows for universal and immediate insight by tapping into our mind’s powerful visual processing system. Q.Why is Data Visualization Important? Answer: Technological advances have made data visualization more prevalent and powerful than ever before, increasing the importance of business intelligence. Tableau leads the world in making the data visualization process available to business users of every background and industry. Businesses around the globe realize that the ability to visualize data effectively leads directly to better understanding, insight and better business decisions. Tableau Software enables businesses to keep pace with the evolving technology landscape and outperform competitors through an adaptive and intuitive means of visualizing their data. Q. What is Tableau? Answer: Tableau is business intelligence software that allows anyone to connect to data in a few clicks, then visualize and create interactive, sharable dashboards with a few more. It's easy enough that any Excel user can learn it, but powerful enough to satisfy even the most complex analytical problems. Securely sharing your findings with others only takes seconds. The result is BI software that you can trust to actually deliver answers to the people that need them Q.What is Tableau Desktop? Answer: Tableau Software provides software applications for fast analytical and rapid fire business intelligence. Tableau Desktop is a data visualization application that lets you analyze virtually any type of structured data and produce highly interactive, beautiful graphs, dashboards, and reports in just minutes. After a quick installation, you can connect to virtually any data source from spreadsheets to data warehouses and display information in multiple graphic perspectives. Designed to be easy to use, you’ll be working faster than ever before. Tableau Server is a business intelligence solution that provides browser-based visual analytics anyone can use at just a fraction of the cost of typical BI software. With just a few clicks, you can publish or embed live, interactive graphs, dashboards and reports with current data automatically customized to the needs of everyone across your organization. It deploys in minutes and users can produce thousands of reports without the need of IT services — all within your IT infrastructure. Tableau Reader is a free viewing application that lets anyone read and interact with packaged workbooks created by Tableau Desktop. Q.How do you create dashboard? Can you explain the life cycle? Answer:A dashboard is a collection of several worksheets and supporting information shown in a single place so you can compare and monitor a variety of data simultaneously. For example, you may have a set of views that you review every day. Rather than flipping through each worksheet, you can create a dashboard that displays all the views at once. You can create a dashboard in much the same way you create a new worksheet. Select Dashboard > New Dashboard. Alternatively, click the New Dashboard tab along the bottom of the workbook. A new tab for the dashboard is added along the bottom of the workbook. Switch to the new dashboard to add views and objects. When you open a dashboard the Dashboard window replaces the Data window on the left side of the workbook. 
The Dashboard window lists the worksheets that are currently in the workbook. As you create new worksheets, the Dashboard window updates so you always have all worksheets available when adding to a dashboard it. After a view is added to the dashboard, the worksheet is marked with a check mark in the Dashboard window. Also, any legends or quick filters that are turned on for the sheet are automatically added to the dashboard.By default, dashboards use a Tiled layout, which means that each view and object is arranged into a single layered grid. You can change the layout to Floating to allow views and objects to overlap. See Organizing Dashboards to learn more about these layouts. Q. How can you schedule the Reports in tableau? Explain briefly? Answer: Schedules when you publish workbooks that connect to extracts you can schedule the extracts to be refreshed automatically. That way you don't have to republish the workbook every time the underlying data has updated and you can still get the performance of a data extract. For example, let's say you have a workbook that connects to a large data warehouse that is updated weekly. Instead of publishing a workbook that queries the live data, you can create an extract including just the data necessary. This increases performance and avoids queries to the live database. Then you can add that workbook to a schedule so that the extract is refreshed at regular intervals with updated data from the data warehouse.Schedules are created and managed on the server by an administrator. However, an administrator can allow you to add a workbook to a schedule when you are publishing from Tableau Desktop. As you are publishing a workbook, in the Publish Workbook to Tableau Server dialog box, click Scheduling & Authentication. In the Scheduling & Authentication dialog box, select a schedule for the workbook: All data sources that require authentication must have an embedded password so that the extract can be refreshed. This includes data sources that are not extracts. Q.When export a worksheet into tableau server how to give a connection to database to run that report in server? Answer:When you publish workbooks that connect to extracts you can schedule the extracts to be refreshed automatically. That way you don't have to republish the workbook every time the underlying data has updated and you can still get the performance of a data extract. For example, let's say you have a workbook that connects to a large data warehouse that is updated weekly. Instead of publishing a workbook that queries the live data, you can create an extract including just the data necessary. This increases performance and avoids queries to the live database. Then you can add that workbook to a schedule so that the extract is refreshed at regular intervals with updated data from the data warehouse. Schedules are created and managed on the server by an administrator. However, an administrator can allow you to add a workbook to a schedule when you are publishing from . Q.What is the major difference between 7.0 and 8.0 in tableau? And latest? Answer: New visualizations are introduced like tree map, bubble chart and box and whisker plot We can copy worksheet directly from one workbook to another workbook Introduced R script Q.What are parameters and when do you use it? Answer:Parameters are dynamic values that can replace constant values in calculations. We can create parameters in 3 ways: Filters Reference lines Calculate Field . 
Q. What are the possible reasons for slow performance in Tableau?
Answer: One reason is that filters may not be defined appropriately at the report level, so the entire data set is pulled by the query when a smaller set would do. Some other reasons:
Creating a query that returns a large number of records from the underlying table(s), when a smaller number of aggregated records would suffice. You can check this by looking in the lower-left corner of the Tableau Desktop workspace at the number of marks; if this number is very large, you are potentially pulling a large amount of data from the database.
Use native drivers: Tableau will recommend or require you to create a data extract to continue working with a particular driver. Using a native driver instead of an ODBC connection will generally provide better performance.
Test with another tool: a good way to determine whether a slow workbook is caused by a slow query is to test the same query in another tool, such as Microsoft Access or Microsoft Excel. To find the query being run, look in your My Tableau Repository\Logs folder and find a file titled log.txt. Open this file and scroll up from the bottom until you find the section between the begin-query and end-query tags; that section is the query that was passed to the database. You can copy this text and run it from a tool like Access or Excel. If it takes a similar time to return as in Tableau, the problem is likely with the query, not the tools.
Use extracts: create a Tableau extract if you are having performance issues. Extract files can include performance-oriented features such as pre-aggregated data for hierarchies and pre-calculated calculated fields (reducing the amount of work required to render and display the visualization).
For the DBA:
1) Tune your indexes: make certain you have indexes on all columns that are part of table joins and on any column used in a filter; explicitly define primary keys and foreign key relationships; for large data sets, use table partitioning; define columns as NOT NULL where possible.
2) Use statistics: database engines collect statistical information about indexes and column data stored in the database; these statistics are used by the query optimizer to choose the most efficient plan for retrieving or updating data.
3) Optimize the data model: it is beneficial to create summary tables if most of your queries only need aggregated data, not base-level detail records.
Q. What is the difference between Connect live, Import all data and Import some data?
Answer: Connect live creates a direct connection to your data; the speed of your data source determines performance. Import all data imports the entire data source into Tableau's fast data engine as an extract; the extract is saved with the workbook. Import some data imports a subset of your data into Tableau's fast data engine as an extract; this option requires you to specify what data you want to extract using filters.
Q. What are ad-hoc reports in Tableau? Explain.
Answer: Ad-hoc reports are reports created on the spot, based on a client requirement, by connecting to the live environment.
Q. What is the difference between a quick filter and a normal filter in Tableau?
Answer: Quick filter is used to view the filtering options and can be used to select the option.Normal filer is something you can limit the options from the list or use some conditions to limit the data by filed or value. Q.Does Tableau Public work on a Mac? Answer: Macintosh users can view Tableau Public content in their browser. Tableau Desktop Public Edition used for authoring content is a Windows application only. If you are using a Macintosh computer that has an Intel processor, you can use virtualization software such as VMware Fusion or Parallels Desktop to install Windows and run Tableau Desktop Public Edition. Alternatively, you can use a built-in utility called Boot Camp to install Windows and run the Tableau software. Q.How do I automate reports using Tableau software? You need to publish report to tableau server, while publishing you will find one option to schedule reports.You just need to select the time when you want to refresh data. Q. How does Tableau perform with huge data sets? Answer: Due to VizSQL Q.Name the components of dashboard? Answer: Horizontal Vertical Text Images etc . Q How is Tableau so fast when working with databases? Answer: Tableau compiles the elements of your visual canvas into a SQL or MDX query for the remote database to process. Since a database typically runs on more powerful hardware than the laptops / workstations used by analysts, you should generally expect the database to handle queries much faster than most in-memory BI applications limited by end-user hardware. Tableau's ability to push computation (queries) close to the data is increasingly important for large data sets, which may reside on a fast cluster and may be too large to bring in-memory. Another factor in performance relates to data transfer, or in Tableau's case result set transfer. Since Tableau visualizations are designed for human consumption, they are tailored to the capabilities and limits of the human perception system. This generally means that the amount of data in a query result set is small relative to the size of the underlying data, and visualizations focus on aggregation and filtering to identify trends and out liers. The small result sets require little network bandwidth, so Tableau is able to fetch and render the result set very quickly. And, as Ross mentioned, Tableau will cache query results for fast reuse. The last factor as mentioned by Eriglen involves Tableau's ability to use in-memory acceleration as needed (for example, when working with very slow databases, text files, etc.). Tableau's Data Engine uses memory-mapped I/O, so while it takes advantage of in-memory acceleration it can easily work with large data sets which cannot fit in memory. The Data Engine will work only with the subsets of data on disk which are needed for a given query, and the data subsets are mapped into memory as needed. Q.What is Tableau Desktop? Answer: Tableau Desktop is a data visualization application that lets you analyze virtually any type of structured data and produce highly interactive, beautiful graphs, dashboards, and reports in just minutes. After a quick installation, you can connect to virtually any data source from spreadsheets to data warehouses and display information in multiple graphic perspectives. Designed to be easy to use, you’ll be working faster than ever before. Q.What is Tableau Reader? Answer: Tableau Reader is a free viewing application that lets anyone read and interact with packaged workbooks created by Tableau Desktop. Q.How Does Tableau Work? 
Answer: While Tableau lets you analyze databases and spreadsheets like never before, you don’t need to know anything about databases to use Tableau. In fact, Tableau is designed to allow business people with no technical training to analyze their data efficiently. Tableau is based on three simple concepts: Connect - Connect Tableau to any database that you want to analyze. Note that Tableau does not import the data. Instead it queries to the database directly. Analyze - Analyzing data means viewing it, filtering it, sorting it, performing calculations on it, reorganizing it, summarizing it, and so on. Using Tableau you can do all of these things by simply arranging fields of your data source on a Tableau worksheet. When you drop a field on a worksheet, Tableau queries the data using standard drivers and query languages (like SQL and MDX) and presents a visual analysis of the data. Share - You can share results with others either by sharing workbooks with other Tableau users, by pasting results into applications such as Microsoft Office, printing to PDF or by using Tableau Server to publish or embed your views across your organization. Q.When do you use horizontal and vertical components? Answer: We can use these when we want to have all sheets or filter to move in single shot.. however we can still create the dashboard without this also.. this allows us to make our work simple Q.Can you explain about table calculations? Answer: These are inbuilt calculations in tableau which we normally use to calculate Percentage from or for YTD and other calculations like the measure across table, below table and etc.. Q.How we can find the tableau Report Rendering Time.? Answer: Report rendering time=Network time(request from URL to Report server) +Query execution time + Network time(response from SQL Server)+calculations(table column)+time taken to display the report in desired format(HTML/ pdf/ excel) Q. VizQL is a visual query language? Answer: VizQL is a visual query language that translates drag-and-drop actions into data queries and then expresses that data visually. VizQL delivers dramatic gains in people’s ability to see and understand data by abstracting the underlying complexities of query and analysis. The result is an intuitive user experience that lets people answer questions as fast as they can think of them. We believe that VizQL represents a foundational advancement in the area of data analysis and visualization. Q. Why should you use tableau? Answer: There are many reasons why one should use tableau they are It is very easy to use: You don’t need to know programming of any sort, all you need is some data and tableau to create reports that are visually enchanting and which tells a story which you need to tell your managers or impress your professor in class. With its revolutionary drag and drop feature u can easily create stories or reports using just your mouse and a little imagination. All this is possible due to the revolutionary VizQL a visual query language. Q.How many types of filters are there in Tableau.? Answer: In Tableau, there are three types of filters. More explicitly, there are three different ways to limit the data that is displayed by your graph. Each of these has its own strengths and weaknesses, and we will look at them one at a time. These types are Custom SQL "Filters" Context Filters Traditional Filters. Custom SQL Filters: Custom SQL "Filter" is a WHERE clause that is placed in the SQL that queries the data to be used in the workbook. 
"Filter" is a Tableau term that technically applies only to Context and Traditional Filters; however, the Custom SQL "Filter" emulates the behavior of a global Context Filter, so we will refer to it as such. By construction, Custom SQL "Filters" are always global. The most common reason for using a Custom SQL "Filter" is to limit the size of a data extract. The smaller your data extract, the more quickly your charts will load. In other words, you can make more complex charts without sacrificing efficiency. One of the ways to create a Custom SQL "Filter" is during the Server Connection process. Context Filters: a Context Filter is a filter in Tableau that affects the data that is transferred to each individual worksheet. Context Filters are great when you want to limit the data seen by the worksheet. When a worksheet queries the data source, it creates a temporary, flat table that is uses to compute the chart. This temporary table includes all values that are not filtered out by either the Custom SQL or the Context Filter. Just like with Custom SQL "Filters", your goal is to make this temporary table as small as possible. Context Filters have a few advantages over Traditional Filters. First, they execute more quickly than Traditional Filters. They are also executed before Traditional Filters and can be executed all at once, which further improves efficiency. However, they do have one drawback. It takes time for the filter to be placed into context. A rule of thumb, from Tableau's Knowledge Base, is to only place a filter into context if it reduces the data by at least 10%. A Context Filter is created by dragging a field onto the "Filters" Shelf and editing the filter. Then, you can Right-Click the field on the shelf and select "Add to Context." If you have multiple context filters, you can CTRL-Select them all and add them to context in a batch. This will improve the efficiency of your filter. Traditional Filters: Traditional Filter is exactly what most people think of when they think of filters. When Tableau is creating the visualization, it will check to see if a value is filtered out by a Traditional Filter. Since this is not performed at the table level, it is the slowest of all filter types. However, it does have the advantage of being performed after the Context Filters. This is a necessity if you are dealing with complex "Top N" filters. A Traditional Filter can be created by simply dragging a field onto the "Filters" Shelf. Tableau Interview Questions Tableau Interview Questions and Answers Q. How to Improve Performance in Tableau.? Answer: Use an extract. There is nothing else that comes close to the efficiency gained using an extract. If you don't absolutely need live data, extracting is the best bet. Limit your dashboard to fully answering only one scenario. At it's simplest, a dashboard should be able to fully explore a single scenario. If your dashboard has six sheets, five actions, and 3 quick filters, you might not be looking at only one scenario. Remember, no matter how elegant and comprehensive your solution is, if it doesn't run as quickly as the user would like it to, he or she will not use it. I would not recommend butchering your dashboard so heavily that it cannot fully handle a scenario. If the user has to go somewhere else to find the answer, why did they use your dashboard at all? Limit the data being introduced to each worksheet. If you are not planning on using a set of rows, you should filter them out of the data set as early as possible. 
If your table contains all sales, and you only want to look at US sales, create a Custom SQL query that filters them out. If the filter is worksheet dependent, try using a Context Filter. For more information on filtering, see the discussion of the types of filters in Tableau above. You can also click the down arrow beside the word "Dimensions" and select "Hide All Unused Fields" to hide any fields you are not using in any of your worksheets. This may or may not speed up queries, but less data should generally improve performance. Remove components that add no value. While aesthetics are very important to building a usable dashboard, unimportant objects aren't worth losing efficiency over. In fact, you would be better off adding more functionality than adding a purely aesthetic object. Eliminate any non-essential components from the visualization. This refers to values that would appear on the Pages, Filters, and Level of Detail shelves. If they are purely there for the user to see when they hover over a point, then they aren't adding any value to the initial glance. However, leave this as the last step because it should be a last resort. In most cases, a little forethought can save you a lot of heartache when you are creating dashboards. Decide exactly what story you want to tell, and tell only that. It is much easier to add functionality to a small dashboard than to butcher a large one. Q.What is Tableau Public? Answer: Tableau Public is free and meant for everyone. This includes writers, bloggers, students, professors, hobbyists, journeymen, critics, citizens and more. It's also meant for organizations, but only as an introductory service. If your organization wants to put data online for the public, you are welcome to use this as an introductory service. If you like what you see, contact us at info-public@tableausoftware.com to discuss a commercial relationship. Q.How does Tableau Public work? Answer: Tableau Public includes a free desktop product that you can download and use to publish interactive data visualizations to the web. The Tableau Public desktop saves work to the Tableau Public web servers; nothing is saved locally on your computer. All data saved to Tableau Public will be accessible by everyone on the internet, so be sure to work only with publicly available (and appropriate) data. Q.I have my own blog or website. Can I use Tableau Public to share data there? Answer: Yes. Use Tableau Public to share data and insights with your community. Embed the content in your blog or website, or share it via links on web pages or in emails. Use our website to find out how. Learn more about Sharing Views in the Knowledge Base. Q.Is there a limit on storage space for the data? Answer: Yes, there is a 1 gigabyte limit on storage space for data. For the vast majority of users, we expect that 1 GB will be more space than needed. Learn more about Data Requirements and Limitations in the Knowledge Base. Q.Do I need to be a programmer to use this? Answer: No programming skills of any kind are required. It is a simple drag and drop process that anyone can easily learn. Q.I work for an organization that has lots of data to share with the public. Can we use Tableau Public? Answer: Yes, as long as you and everyone at your organization together use less than 50 megabytes of space.
If your organization wants to put data online for the public, please contact us at info-public@tableausoftware.com to discuss a commercial relationship. Q.Do I need the free desktop product if I already own a commercial version of Tableau Desktop? Answer: No. Tableau Desktop comes in three editions: Professional Edition ($1,999), Personal Edition ($999) and Public Edition (free). If you already have Professional or Personal Edition, you will find that your latest upgrade includes the ability to publish to the Tableau Public servers. There is nothing in Public Edition that is not already included in the latest versions of the paid products. Q.What type of data limitations does Tableau Public have? Answer: Tableau Public can connect to Microsoft Excel, Microsoft Access, and multiple text file formats. It has a limit of 1,000,000 rows of data allowed in any single file. Learn more about Data Requirements and Limitations in the Knowledge Base. Q.Can I set permissions or protect the data I save to Tableau Public? Answer: All content saved to Tableau Public is accessible to everyone on the internet. As the author, you are the only one who can delete your own content, but anyone on the internet can view it. In addition to viewing it, anyone can download a copy of your workbook (including the underlying data), which will let them work with and build upon your original. Q.If I publish my data on Tableau Public, is my data now public? Answer: Yes, your data on Tableau Public is now accessible on the web and is downloadable by anyone. We strongly suggest you only publish data you are willing to share with anyone. Q.What kind of technology is a Tableau Viz? Answer: It is a thin, AJAX-based JavaScript application. Q.How do people find the visualizations I save to Tableau Public? Answer: Once you save your work to Tableau Public, it may be shared by emailing a link or embedding the work in your blog, wiki, or website. If you embed the work on a web page, anyone who visits the page will see the live interactive view. If you email a link, just clicking the link will open a browser page with the view loaded. Learn more about Sharing Views in the Knowledge Base. Q.What is the "Download" link on the Viz? Answer: Any Tableau Public viz can be downloaded by pressing the download link in the lower right corner. It saves to your computer as a TWBX file. Anyone with Tableau Desktop (Professional, Personal or Public Edition) can open the file and review or extend the work that was behind the original posting. Learn more about Downloading Tableau Public Workbooks in the Knowledge Base. Q.Is there a plug-in required to see the Tableau Viz? Answer: No plug-ins are required. You just need a browser with JavaScript enabled. Q.What happens if I delete a workbook from Tableau Public and there are links to it in blogs or other web locations? Answer: Once a workbook or view is deleted from Tableau Public, it cannot be retrieved from Tableau Public by anyone. All links or other references to it that may have been shared will not be able to load the viz and will display an error message on the page. Q.What do you mean by "Data In. Brilliance Out."? Answer: This is our vision for Tableau Public. It captures our twin goals of making Tableau incredibly easy to use and spectacularly powerful. Q.What is KPI in Tableau? Answer: We can easily create a view that shows Key Performance Indicators (KPIs). To do this, you complete the following tasks: Create the base view with the fields you want to measure.
Build a calculated field that establishes the figure from which you measure progress for the data you’re measuring. Use shapes that Tableau provides that are designed specifically for KPIs. This example shows how to build a KPI view that shows a green check mark for any sales figure over $125,000, and a red X for any sales figure under $125,000. Preparing data for Tableau. Cleanup dimensions and measure names. Set attribute aliases. Set default colors Set default measure aggregations. Create calculated fields Q.Is Parameter have it's dropdown list ? Answer: Yes, But it will be called as Compact list. Q.What is the criteria to blend the data from multiple data sources.? Ans: There should be a common dimension to blend the data source into single worksheet. For example, when blending Actual and Target sales data, the two data sources may have a Date field in common. The Date field must be used on the sheet. Then when you switch to the secondary data source in the Data window, Tableau automatically links fields that have the same name. If they don’t have the same name, you can define a custom relationship that creates the correct mapping between fields. Q.Can we use Groups and Sets in calculation field.? Ans: Groups: No, we can not use Groups in calculation fields. Sets: Yes, we can use Sets in calculation fields. Q.Difference between Grouping and Sets.? Ans: Groups – Combine dimension members into higher level categories. Sets – Create a custom field based on existing dimensions that can be used to encode the view with multiple dimension members across varying dimension levels. Q.What is context filter.? Ans: If you are applying filters to a large data source, you can improve performance by setting up context filters. A context filter is applied to the data source first, and then the other filters are applied only to the resulting records. This sequence avoids applying each filter to each record in the data source. You may create a context filter to: Improve performance – If you set a lot of filters or have a large data source, the queries can be slow. You can set one or more context filters to improve performance. Create a dependent numerical or top N filter – You can set a context filter to include only the data of interest, and then set a numerical or a top N filter. Q.What is Dual Axis.? Ans: You can compare multiple measures using dual axes, which are two independent axes that are layered on top of each other. Dual axes are useful when you have two measures that have different scales. For example, the view below shows Dow Jones and NASDAQ close values over time. To add the measure as dual axis drag the field to the right side of the view and drop it when you see a black dashed line. You can also select Dual Axis on the field menu for the measure. The two axes are independent scales but the marks are layered in the same pane. Q.What is page self..? Ans: The Pages shelf is a powerful part of Tableau that you can use to control the display of output as well as the printed result of that output. Q.Is there any new features implemented in tableau 8.0 regarding the tableau server performance improvement? 
Ans: Use an extract Limit your dashboard to fully answering only one scenario Limit the data being introduced to each worksheet Remove components that add no value Eliminate any non-essential components from the visualization Q.What are the other settings I need to reconfigure to get better performance as I am using 7.0 tableau server and planning to upgrade to latest versions?Suggest best configurations based on the provided server details? Ans: Tableau 8,8.1 and 8.2 also supported for 4GB ram and core processors. Q.How many viz SQL process should I run? Ans: Depending on Data Capacity Q.How many extracts (extract type) can be used on a single server(without effecting server performance like memory) ? Ans: Better 10 Q.How to check the performance step by step manner(DB, Report side, Network) in tableau report ? Ans: Go to help menu and select performance tuning option Q.How to improve the tableau report performance? Ans: If you are not planning on using a set of rows, you should filter them out of the data set as early as possible. If your table contains all sales, and you only want to look at US sales, create a Custom SQL query that filters it out. If the filter is worksheet dependent, try using a Context Filter. For more information on filtering, check out my other post Types of Filters in Tableau. You can also click the Down Arrow beside the word "Dimension" and Select "Hide All Unused Fields" to hide any fields you are not using in any of your worksheets. I'm not sure if this improves efficiency; but I'd have to imagine that it does, less data should always improve performance. Q.I have one scenario like Year in integer and week in String and wanted to calculate the YTD.. how to do this.? Ans: In Tableau, the relative date filter enables flexible analysis of time periods. Sometimes, however, you might want to see both year-to-date (YTD) and month-to-date (MTD) values for a particular measure on the same view. To accomplish this task, you can create date calculations. Create a calculated column which replaces week from string to integer and make use this in another calculation for YTD. YTD: MTD: Q.What kind of join do you see in data blending? Ans: There won't be any joins as such but we will just give the column references like primary and foreign key relation. Q.What is data blending..? When do you use this.? Ans: Data blending is when you blend data from multiple data sources on a single worksheet. The data is joined on common dimensions. Data Blending does not create row level joins and is not a way to add new dimensions or rows to your data. We use this when we want to fetch data from different sources and make use in single worksheet. Q.Can we have multiple value selection in parameter? Ans: No Q.What is Tableau Server? Ans: Tableau Server is a business intelligence solution that provides browser-based visual analytics anyone can use at just a fraction of the cost of typical BI software. With just a few clicks, you can publish or embed live, interactive graphs, dashboards and reports with current data automatically customized to the needs of everyone across your organization. It deploys in minutes and users can produce thousands of reports without the need of IT services — all within your IT infrastructure. Q.What is the Difference between connect live and import all data and Import some data.? Ans: Connect live – Creates a direct connect to your data. The speed of your data source will determine performance. 
Import all data – Imports the entire data source into Tableau’s fast data engine as an extract. The extract is saved with the workbook. Import some data – Imports a subset of your data into Tableau’s fast data engine as an extract. This option requires you to specify what data you want to extract using filters. Q.What does Tableau do? Ans: Our goal is to help people see and understand data. Our software products put the power of data into the hands of everyday people, allowing a broad population of business users to engage with their data, ask questions, solve problems and create value. Q.Who are Tableau’s customers? Ans: Our products are used by people of diverse skill levels across all kinds of organizations, including Fortune 500 corporations, small and medium-sized businesses, government agencies, universities, research institutions and non-profits. Organizations employ our products in a broad range of use cases such as increasing sales, streamlining operations, improving customer service, managing investments, assessing quality and safety, studying and treating diseases, completing academic research, addressing environmental problems and improving education. Q.When was Tableau founded? Ans: Tableau was founded in 2003 by Christopher Stolte, Patrick Hanrahan and Christian Chabot. Q.When did Tableau go public and where does its stock trade? Ans: Tableau Software is traded on NYSE under the ticker symbol DATA. The company went public on May 17, 2013 at an initial public offering price of $31 per share. Q.What are the main features of Tableau Software? Ans: Tableau Software offers the following features: Reporting & Dashboard Data Analytics Analytics Business Intelligence Ad Hock Query Data Hierarchy Data Mining Multi-Dimensional Analytics ODBC OLAP Analytics on Share Point, the iPad, Android tablets Q.What are the main benefits of using Tableau Software ? Ans: Tableau Software offers the following benefits: Tableau Software is faster than other solutions. Tableau Software is an intuitive, drag-and-drop tool that lets you see every change as you make it. With Tableau Software you can build smart, fit and beautiful dashboards. In Tableau you can connect directly to databases, cubes, data warehouses, files and spreadsheets. contact for more on Tableau Online Training
Teradata Interview Questions
Q.How do you generate a sequence at the time of display? Answer: By using CSUM. Q.How do you generate a sequence in Teradata? Answer: By using an Identity column. 1 - for storing purposes use an identity column. 2 - for display purposes use CSUM. Q.How do you load multiple files to a table using FastLoad scripts? Answer: Remove the End Loading statement in the script, replace the file one by one in the script until the last file, and submit each time so that the data is appended at the AMP level. For the last file, specify the End Loading statement in the script and run it, so that the data is applied from the AMPs to the table. Q.What is the difference between FastLoad and MultiLoad? Answer: FastLoad uses multiple sessions to quickly load a large amount of data into an empty table. MultiLoad is used for high-volume maintenance on tables and views. It works with non-empty tables also. A maximum of 5 tables can be used in MultiLoad. Q.Which is faster? Answer: FastLoad. Q.Difference between inner join and outer join? Answer: An inner join gets data from both tables where the specified data exists in both tables. An outer join gets data from the source table at all times, and returns data from the outer joined table ONLY if it matches the criteria. Q.What is multi insert? Answer: Inserting data records into a table using multiple insert statements. Putting a semicolon in front of the keyword INSERT in the next statement, rather than terminating the first statement with a semicolon, achieves it. Insert into Sales select * from customer ;Insert into Loan select * from customer; Q.Is multi insert ANSI standard? Answer: No. Q.How do you create a table with the structure of an existing table, with data and with no data? Answer: Create table Customerdummy as Customer with data / with no data; Q.What is the opening step in a Basic Teradata Query (BTEQ) script? Answer: Logon tdpid/username, password. Q.You are calling a BTEQ script which drops a table and creates a table; it will throw an error if the table does not exist. How can you do it without throwing the error? Answer: You can do it by setting the error level to zero before dropping and resetting the error level to 8 after dropping, like this: ERRORLEVEL (3807) SEVERITY 0; DROP TABLE EMPLOYEE; ERRORLEVEL (3807) SEVERITY 8; Q.Can you FastExport a field which is a primary key by putting equality on that key? Answer: No. Q.Did you write stored procedures in Teradata? Answer: No, because they become a single-AMP operation and my company didn't encourage that. Q.What is the use of having indexes on a table? Answer: For faster record search. Q.Did you use Queryman or SQL Assistant? Answer: SQL Assistant 6.1. Q.I am updating a table in BTEQ. It has to update a large number of rows, so it's really slow. What do you suggest? Answer: In Teradata it is not recommended to update more than 1 million rows due to journal space problems; if it is less than that and it's slow in BTEQ, you might want to add a collect statistics statement before the update statement. Q.Is it necessary to add a QUIT statement after a BTEQ query when I am calling it in a Unix environment? Answer: Not necessary, but it is good to add a QUIT statement after a query. Q.There is a column with a date in it. If I want to get just the month, how can it be done? Can I use substring? Answer: Substring is used with CHAR fields, so it cannot be used. To extract the month from a date column, use for example select extract(month from date_column). The same works for year or day, or for hour or minute if it's a timestamp (select extract(minute from column_name)).
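As a small illustration (the table and column names here are made up), EXTRACT can pull individual date or time parts straight out of DATE and TIMESTAMP columns:
-- Month and year from a DATE column, minute from a TIMESTAMP column
SELECT EXTRACT(MONTH  FROM order_date) AS order_month,
       EXTRACT(YEAR   FROM order_date) AS order_year,
       EXTRACT(MINUTE FROM load_ts)    AS load_minute
FROM   sales_orders;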
Q.What's the syntax of SUBSTRING? Answer: SUBSTRING(string_expression FROM start_position [FOR length]); the shorter SUBSTR(string_expression, start_position [, length]) form is also supported. Q.Did you use the CASE WHEN statement? Can you tell us a little about it? Answer: Yes. It is used when a case has to be selected depending upon the value of an expression. Q.While creating a table my DBA has FALLBACK or NO FALLBACK in his DDL. What is that? Answer: FALLBACK requests that a second copy of each row inserted into a table be stored on another AMP in the same cluster. This is used when an AMP goes down or a disk fails. Q.My table got locked during MLOAD due to a failed job. What do I do to perform other operations on it? Answer: Use RELEASE MLOAD. It removes access locks from the target tables in Teradata. It must be entered from BTEQ and not from MultiLoad. To proceed, issue RELEASE MLOAD on the target table. Q.How to find duplicates in a table? Answer: Group by those fields: select id, count(*) from table group by id having count(*) > 1. Q.How do you verify a complicated SQL? Answer: I use the EXPLAIN statement to check whether the query is doing what I wanted it to do. Q.How many tables can you join in V2R5? Answer: Up to 64 tables. Q.Did you ever use the UPPER function? Answer: The UPPER function is used to convert all characters in a column to upper case. Q.What does the LOWER function do? Answer: The LOWER function is used to convert all characters in a column to lower case. Q.How do you see the DDL for an existing table? Answer: By using the SHOW TABLE command. Q.Which is more efficient, GROUP BY or DISTINCT, to find duplicates? Answer: With more duplicates GROUP BY is more efficient; if only a few duplicates exist DISTINCT is more efficient. Q.Syntax for the CASE WHEN statement? Answer: CASE value_expression_1 WHEN value_expression_n THEN scalar_expression_n END; Q.What's the difference between TIMESTAMP(0) and TIMESTAMP(6)? Answer: TIMESTAMP(0) is CHAR(19) and TIMESTAMP(6) is CHAR(26). Everything is the same except that TIMESTAMP(6) has microseconds too. Q.How do you determine the number of sessions? Answer: Based on Teradata performance and workload, client platform type, performance and workload, channel performance for channel-attached systems, network topology and performance for network-attached systems, and the volume of data to be processed by the application. Q.What is a node? How many nodes and AMPs were used in your previous project? Answer: A node is a database server. We used 318 nodes and each node had 2 to 4 AMPs. Q.What is a clique? Answer: A clique is a group of disk arrays physically cabled to a group of nodes. Q.Interviewer explains about their project (environment, nature of work). Answer: Listen to them carefully so that at the end of the interview you can ask questions about the project when you are given the chance. Q.Tell us something about yourself? Answer: Describe your project experience and technical skill set, and mention that you are hard working, a good team player, a self-learner and self-motivated. Q.What is the best project you ever worked with and why is it the best project? Answer: All the projects I have worked on so far are the best projects. I treat every project as equal and work hard for the success of the project. Q.What makes a project successful and how have you contributed to the success of the project? Answer: Good team members, technical knowledge of team members, hard work, sharing knowledge among the team, and each individual's contribution to the project. Explain that you possess all the skills mentioned above. Q.Have you worked under stress and how did you handle it? Answer: Yes. Many times, to deliver the project on schedule, we were under a lot of pressure.
During those times we work extra hours and help each other in the team to deliver the project on schedule. Team effort is key factor for the success of the project. Q.What is the difference between FastLoad and MultiLoad? Answer: FastLoad uses multiple sessions to quickly load large amount of data on empty table.MultiLoad is used for high-volume maintenance on tables and views. It works with non-empty tables also. Maximum 5 tables can be used in MultiLoad. Q.Have you used procedures? Answer: No. I have not used procedures. But I have expertise knowledge writing procedures. My company have not encouraged me to write procedures because it becomes single AMP operation, as such uses lot of resources and expensive in terms of resource and time. Q.What is the purpose of indexes? Answer: An index is a mechanism that can be used by the SQL query optimizer to make table access more performant. Indexes enhance data access by providing a more-or-less direct path to stored data and avoiding the necessity to perform full table scans to locate the small number of rows you typically want to retrieve or update. Q.What is primary index and secondary index? Answer: Primary index is the mechanism for assigning a data row to an AMP and a location on the AMP’s disks. Indexes also used to access rows from a table without having to search the entire table.Secondary indexes enhance set selection by specifying access paths less frequently used than the primary index path. Secondary indexes are also used to facilitate aggregate operations. If a secondary index covers a query, then the Optimizer determines that it would be less costly to accesses its rows directly rather than using it to access the base table rows it points to. Sometimes multiple secondary indexes with low individual selectivity can be overlapped and bit mapped to provide enhanced Q.What are the things to considered while creating secondary index? Answer: Creating a secondary index causes Teradata to build a sub-table to contain its index rows, thus adding another set of rows that requires updating each time a table row is inserted, deleted, or updated. Secondary index sub-tables are also duplicated whenever a table is defined with FALLBACK, so the maintenance overhead is effectively doubled. Q.What is collect statistics? Answer: Collects demographic data for one or more columns of a table, hash index, or join index, computes a statistical profile of the collected data, and stores the synopsis in the data dictionary. The Optimizer uses the synopsis data when it generates its table access and join plans. Q.Can we collect statistics on multiple columns? Answer: Yes we can collect statistics on multiple columns. Q.Can we collect statistics on table level? Answer: Yes we can collect statistics on table level. The syntax is COLLECT STAT ON TAB_A; Q.What is inner join and outer join? Answer: An inner join gets data from both tables where the specified data exists in both tables.An outer join gets data from the source table at all times, and returns data from the outer joined table ONLY if it matches the criteria. Q.When Tpump is used instead of MultiLoad? Answer: TPump provides an alternative to MultiLoad for the low volume batch maintenance of large databases under control of a Teradata system. Instead of updating Teradata databases overnight, or in batches throughout the day, TPump updates information in real time, acquiring every bit of data from the client system with low processor utilization. 
It does this through a continuous feed of data into the data warehouse, rather than the traditional batch updates. Continuous updates result in more accurate, timely data. And, unlike most load utilities, TPump uses row hash locks rather than table-level locks. This allows you to run queries while TPump is running. This also means that TPump can be stopped instantaneously. As a result, businesses can make better decisions that are based on the most current data. Q.What is spool space, and if a running job reaches the maximum spool space, how do you solve the problem? Answer: Spool space is used to hold intermediate rows during processing, and to hold the rows in the answer set of a transaction. Spool space reaches its maximum when the query is not properly optimized. Use appropriate conditions in the WHERE clause of the query to limit the answer set. Q.What is your level of expertise in using the MS Office suite? Answer: Expert level. I have been using it for the last 8 years for documentation. Q.Have you used NetMeeting? Answer: Yes. I used NetMeeting for team meetings when members of the team were geographically in different locations. Q.Do you have any questions? Answer: What is the team size going to be? What is the current status of the project? What is the project schedule? Q.What is your available date? Answer: Immediate, or your available date for the project. Q.How much experience do you have in MVS? Answer: Intermediate. In my previous two projects I used MVS to submit JCL jobs. Q.Have you created a JCL script from scratch? Answer: Yes. I have created JCL scripts from scratch while creating jobs in the development environment. Q.Have you modified any JCL script and used it? Answer: Yes, I have modified JCL scripts. In my previous projects many applications were re-engineered, so the existing JCL scripts were modified according to the company coding standards. Q.Rate yourself on using Teradata tools like BTEQ, Queryman, FastLoad, MultiLoad and TPump! Answer: Intermediate to expert level. I have been using them extensively for the last 4 years. Also, I am certified in Teradata. Q.Which is your favorite area in the project? Answer: I enjoy working on every part of the project. I volunteer my time for my peers so that I can also learn and contribute more towards the project's success. Q.What is a data mart? Answer: A data mart is a special-purpose subset of enterprise data used by a particular department, function or application. Data marts may have both summary and detail data; however, usually the data has been pre-aggregated or transformed in some way to better handle the particular type of requests of a specific user community. Data marts are categorized as independent, logical and dependent data marts. Q.Difference between star and snowflake schemas? Answer: A star schema is de-normalized and a snowflake schema is normalized. Q.Why are OLTP database designs not generally a good idea for a data warehouse? Answer: OLTP designs are for real-time transactional data; they are highly normalized and not pre-aggregated, so they are not good for decision support systems. Q.What type of indexing mechanism do we need to use for a typical data warehouse? Answer: The Primary Index mechanism is the ideal type of index for a data warehouse. Q.What is VLDB? Answer: A Very Large Database. Q.What is real-time data warehousing? Answer: Real-time data warehousing is a combination of two things: 1) real-time activity and 2) data warehousing. Real-time activity is activity that is happening right now. The activity could be anything, such as the sale of widgets.
Once the activity is complete, there is data about it. Data warehousing captures business activity data. Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly. In other words, real-time data warehousing is a framework for deriving information from data as the data becomes available. Q.What is ODS? Answer: An operational data store (ODS) is primarily a "dump" of relevant information from a very small number of systems (often just one), usually with little or no transformation. The benefit is an ad hoc query database which does not affect the operation of the systems required to run the business. An ODS usually deals with data that is "raw" and "current" and can answer only a limited set of queries as a result. Q.What is real-time and near real-time data warehousing? Answer: The difference between real time and near real time can be summed up in one word: latency. Latency is the time lag between an activity completing and the completed activity data being available in the data warehouse. In real time the latency is negligible, whereas in near real time the latency is a tangible time frame such as two hours. Q.What are Normalization, First Normal Form, Second Normal Form and Third Normal Form? Answer: Normalization is the process of efficiently organizing data in a database. The two goals of the normalization process are to eliminate redundant data (storing the same data in more than one table) and to ensure data dependencies make sense (only storing related data in a table). First normal form: eliminate duplicate columns from the same table; create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key). Second normal form: remove subsets of data that apply to multiple rows of a table and place them in separate tables; create relationships between these new tables and their predecessors through the use of foreign keys. Third normal form: remove columns that are not dependent upon the primary key. Q.What is a fact table? Answer: The centralized table in a star schema is called the FACT table, i.e. a table that contains facts and is connected to dimensions. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key made up of all of its foreign keys. A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables). In the real world, it is possible to have a fact table that contains no measures or facts; these tables are called factless fact tables. Q.What is ETL? Answer: Extract, transformation, and loading. ETL refers to the methods involved in accessing and manipulating source data and loading it into a target database. The first step in the ETL process is mapping the data between the source systems and the target database (data warehouse or data mart). The second step is cleansing of the source data in a staging area. The third step is transforming the cleansed source data and then loading it into the target system. Note that ETT (extract, transformation, transportation) and ETM (extraction, transformation, move) are sometimes used instead of ETL. Q.What is an ER diagram? Answer: It is an Entity Relationship diagram.
Describes the relationship among the entities in the database model.Q.What is data mining?Answer: Analyzing of large volumes of relatively simple data to extract important trends and new, higher level information. For example, a data-mining program might analyze millions of product orders to determine trends among top-spending customers, such as their likelihood to purchase again, or their likelihood to switch to a different vendor.Q.What is Star schema?Answer: Star Schema is a relational database schema for representing multi-dimensional data. It is the simplest form of data warehouse schema that contains one or more dimensions and fact tables. It is called a star schema because the entity-relationship diagram between dimensions and fact tables resembles a star where one fact table is connected to multiple dimensions. The center of the star schema consists of a large fact table and it points towards the dimension tables. The advantages of star schema are slicing down, performance increase and easy understanding of data.Q.What is a level of Granularity of a fact table?Answer: The components that make up the granularity of the fact table correspond directly with the dimensions of the data model. Thus, when you define the granularity of the fact table, you identify the dimensions of the data model. The granularity of the fact table also determines how much storage space the database requires. For example, consider the following possible granularities for a fact table:Product by day by region Product by month by regionThe size of a database that has a granularity of product by day by region would be much greater than a database with a granularity of product by month by region because the database contains records for every transaction made each day as opposed to a monthly summation of the transactions. You must carefully determine the granularity of your fact table because too fine a granularity could result in an astronomically large database. Conversely, too coarse granularity could mean the data is not detailed enough for users to perform meaningful queries against the database.Q.What is a dimension table?Answer: Dimension table is one that describes the business entities of an enterprise, represented as hierarchical, categorical information such as time, departments, locations, and products. Dimension tables are sometimes called lookup or reference tables. In a relational data modeling, for normalization purposes, country lookup, state lookup, county lookup, and city lookups are not merged as a single table. In a dimensional data modeling (star schema), these tables would be merged as a single table called LOCATION DIMENSION for performance and slicing data requirements. This location dimension helps to compare the sales in one region with another region. We may see good sales profit in one region and loss in another region. If it is a loss, the reasons for that may be a new competitor in that area, or failure of our marketing strategy etc.Q.What are the various Reporting tools in the Market?Answer: Crystal reports, Business objects, micro strategy and etc.,Q.What are the various ETL tools in the Market?Answer: Ab Initio, Informatica and etc.,Q.What is a three-tier data warehouse?Answer: The three-tier differs from the two-tier architecture by strictly enforcing a logical separation of the graphical user interface, business logic, and data. The three-tier is widely used for data warehousing today. 
Organizations that require greater performance and scalability, the three-tier architecture may be more appropriate. In this architecture, data extracted from legacy systems is cleansed, transformed, and stored in high –speed database servers, which are used as the target database for front-end data access.Q.Importance of Surrogate Key in Data warehousing?Answer: Surrogate Key is a Primary Key for a Dimension table. Most importance of using it is independent of underlying database. i.e. Surrogate Key is not affected by the changes going on with a databaseQ.Differentiate Primary Key and Partition Key?Answer: Primary Key is a combination of unique and not null. It can be a collection of key values called as composite primary key. Partition Key is a just a part of Primary Key. There are several methods of partition like Hash, DB2, and Random etc. While using Hash partition we specify the Partition Key. Q.Differentiate Database data and Data warehouse data? Answer: Data in a Database is Detailed or Transactional, Both Readable and Write able and current.Data in data warehouse is detailed or summarized, storage place for historical data. Q.What are OLAP, MOLAP, ROLAP, DOLAP and HOLAP? Examples?Answer: OLAP:OLAP stands for Online Analytical Processing. It uses database tables (fact and dimension tables) to enable multidimensional viewing, analysis and querying of large amounts of data. E.g. OLAP technology could provide management with fast answers to complex queries on their operational data or enable them to analyze their company's historical data for trends and patterns. MOLAP:Stands for Multidimensional OLAP. In MOLAP cubes the data aggregations and a copy of the fact data are stored in a multidimensional structure on the Analysis Server computer. It is best when extra storage space is available on the Analysis Server computer and the best query performance is desired. MOLAP local cubes contain all the necessary data for calculating aggregates and can be used offline. MOLAP cubes provide the fastest query response time and performance but require additional storage space for the extra copy of data from the fact table. ROLAP:Stands for Relational OLAP. In ROLAP cubes a copy of data from the fact table is not made and the data aggregates are stored in tables in the source relational database. A ROLAP cube is best when there is limited space on the Analysis Server and query performance is not very important. ROLAP local cubes contain the dimensions and cube definitions but aggregates are calculated when they are needed. A ROLAP cube requires less storage space than MOLAP and HOLAP cubes. HOLAP:Stands for Hybrid OLAP. A HOLAP cube has a combination of the ROLAP and MOLAP cube characteristics. It does not create a copy of the source data however; data aggregations are stored in a multidimensional structure on the Analysis Server computer. HOLAP cubes are best when storage space is limited but faster query responses are needed Q.What is OLTP? Answer: OLTP stands for Online Transaction Processing. OLTP uses normalized tables to quickly record large amounts of transactions while making sure that these updates of data occur in as few places as possible. Consequently OLTP database are designed for recording the daily operations and transactions of a business. E.g. a timecard system that supports a large production environment must record successfully a large number of updates during critical periods like lunch hour, breaks, startup and close of work. Q.What is staging area? 
Answer: The data staging area is a system that stands between the legacy systems and the analytics system, usually a data warehouse and sometimes an ODS. The data staging area is considered the "back room" portion of the data warehouse environment. The data staging area is where the extract, transform and load (ETL) takes place and it is out of bounds for end users. Some of the functions of the data staging area include: extracting data from multiple legacy systems; cleansing the data, usually with a specialized tool; integrating data from multiple legacy systems into a single data warehouse; transforming legacy system keys into data warehouse keys, usually surrogate keys; transforming disparate codes for gender, marital status, etc., into the data warehouse standard; transforming the heterogeneous legacy data structures to the data warehouse data structures; and loading the various data warehouse tables via automated jobs in a particular sequence through the bulk loader provided with the data warehouse database or a third-party bulk loader. Q.What is a subject area? Answer: Subject area means the fundamental entities that make up the major components of the business, e.g. customer, product, employee. Q.What is tenacity? Answer: The number of hours a Teradata utility will try to establish a connection to the system. The default is 4 hours. Q.What is a checkpoint? Answer: Checkpoints are entries posted to a restart log table at regular intervals during the data transfer operation. If processing stops while a job is running, you can restart the job at the most recent checkpoint. Q.What is a slowly changing dimension? Answer: In a slowly changing dimension the attributes of a record vary over time. There are three ways to solve this problem. Type 1 – replace the old record with a new record; no historical data is available. Type 2 – keep the old record and insert a new record; historical data is available but it is resource intensive. Type 3 – in the existing record, maintain extra columns for the new values. Q.What is sleep? Answer: The number of minutes the Teradata utility will wait between logon attempts. The default is 6 minutes. Q.Difference between MultiLoad and TPump? Answer: TPump provides an alternative to MultiLoad for low-volume batch maintenance of large databases under control of a Teradata system. TPump updates information in real time, acquiring every bit of data from the client system with low processor utilization. It does this through a continuous feed of data into the data warehouse, rather than the traditional batch updates. Continuous updates result in more accurate, timely data. TPump uses row hash locks rather than table-level locks. This allows you to run queries while TPump is running. Q.Different phases of MultiLoad? Answer: Preliminary phase, DML phase, Acquisition phase, Application phase and End phase. Q.Explain modifier? Answer: The EXPLAIN modifier generates an English translation of the parser's plan. The request is fully parsed and optimized but not executed. EXPLAIN returns text showing how a statement will be processed, an estimate of how many rows will be involved, and a relative cost of the request in units of time. This information is useful for predicting row counts, predicting performance, testing queries before production and analyzing various approaches to a problem. Q.Difference between an Oracle and a Teradata warehouse? Answer: Teradata can handle multiple terabytes of data. Teradata is linearly expandable, uses a mature optimizer and a shared-nothing architecture.
It uses data parallelism. Teradata DBAs never have to reorganize data or index space, pre-allocate table/index space, format partitions, tune buffer space, ensure the queries run in parallel, pre-process data for loading, or write or run programs to split the input data into partitions for loading. Q.What is dimensional modeling? Answer: Dimensional data modeling comprises one or more dimension tables and fact tables. Good examples of dimensions are location, product, time, promotion, organization, etc. Dimension tables store records related to that particular dimension, and no facts (measures) are stored in these tables. Q.How will you solve a problem that occurs during an update? Answer: When there is an error during the update process, an entry is posted in the error log table. Query the log table, fix the error and restart the job. Q.Can you connect MultiLoad from Ab Initio? Answer: Yes, we can connect. Q.What interface is used to connect to Windows-based applications? Answer: The WinCLI interface. Q.What is data warehousing? Answer: A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision-making process. Q.What is data modeling? Answer: A data model is a conceptual representation of the data structures (tables) required for a database and is very powerful in expressing and communicating the business requirements. Q.What is a logical data model? Answer: A logical data model is the version of a data model that represents the business requirements (entire or part) of an organization and is developed before the physical data model. A sound logical design should streamline the physical design process by clearly defining data structures and the relationships between them. A good data model is created by clearly thinking about the current and future business requirements. A logical data model includes all required entities, attributes, key groups, and relationships that represent business information and define business rules. Q.Tell us something about data modeling tools? Answer: Data modeling tools transform business requirements into a logical data model, and a logical data model into a physical data model. From the physical data model, these tools can be instructed to generate the SQL code for creating database entities. Q.Steps to create a data model? Answer: Get the business requirements; create a high-level conceptual data model; create the logical data model; select the target DBMS where the data-modeling tool creates the physical schema; create a standard abbreviation document according to business standards. Q.What is the maximum number of DML statements that can be coded in a MultiLoad script? Answer: A maximum of 5 DML statements can be coded in a MultiLoad script. Q.There is a load to a table every hour, 24/7. Morning traffic is high, afternoon traffic is less, night traffic is high. In this situation which utility do you use and how do you load? Answer: TPump is suggested here. By increasing or decreasing the packet size we can handle the traffic. Q.A FastLoad script has failed and the error tables are available; how do you restart? Answer: There are 2 ways. 1. In case of running the old file: don't drop the error tables; simply rectify the error in the script or file and run again, so that it resumes from where it left off. 2. In case of running a new file: drop the error tables and try to run the script with only the BEGIN and END LOADING statements, so that it releases the lock on the target table; if possible, remove the record from the fastlog table, then run the script freshly with the new file.
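To make the restart rules above concrete, here is a minimal FastLoad sketch; the database, table, column and file names are made up, and details such as the record format vary by site. BEGIN LOADING names the two error tables the answers above refer to, and END LOADING triggers the final move of the data from the AMPs into the target table:
LOGON tdpid/username,password;
/* hypothetical empty staging table and its two error tables */
BEGIN LOADING stg.customer
      ERRORFILES stg.customer_err1, stg.customer_err2
      CHECKPOINT 100000;
SET RECORD VARTEXT ",";
DEFINE in_cust_id   (VARCHAR(10)),
       in_cust_name (VARCHAR(50))
FILE = /data/customer_file01.csv;
INSERT INTO stg.customer (cust_id, cust_name)
VALUES (:in_cust_id, :in_cust_name);
END LOADING;
LOGOFF;
When several files feed the same table, END LOADING is left out until the last file, as described earlier.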
Q.What are the advantages of other ETL tools (Informatica, DataStage, SSIS, etc.) over the Teradata utilities, or vice versa? Answer: The TD utilities run faster than other ETL tools in the case of file-to-table (import) or table-to-file (export) operations. INMOD and OUTMOD also help us to do better programming while importing and exporting. Q.What are the major advantages of other ETL tools over the TD utilities? Answer: 1. We can operate on heterogeneous sources and destinations (Oracle, SQL Server, Excel, flat files, etc.). 2. As they provide full GUI support, debugging is much easier. 3. Reusable components (in Informatica: mapplets, worklets, etc.) are available in ETL tools, so if we change them in the main location, the other applications (mappings) that use these components are updated instantly. 4. Pivoting (normalization) and unpivoting (denormalization) can be implemented very easily in ETL tools. 5. Caching is one more advantage when we work with a (heterogeneous) source which is not changing frequently; sometimes the cache can be shared across applications. Q.A FastLoad job aborted in Phase 2 after data acquisition completed; what do you do? Answer: Simply keep the BEGIN and END LOADING statements in the script and run again, so that the data is applied from the AMPs to the table. Q.Why do MultiLoad and FastLoad not support SI, JI, RI and triggers? Answer: All of the above require communication between multiple AMPs. FastLoad and MultiLoad do not provide any communication between multiple AMPs; the AMPs operate independently. Since this would violate that principle, and it would take time to implement the above operations, they are not allowed. Q.Why does MultiLoad not support a USI but support a NUSI? Answer: For a NUSI, the index subtable row is on the same AMP as the data row, so each AMP operates independently and in parallel. Q.The client system restarted while executing MLOAD? Answer: We need to manually submit the script, so that it loads data from the last checkpoint. Q.The Teradata server restarted while executing MLOAD? Answer: Along with the server, the MLOAD script is also restarted and runs from the last checkpoint. Q.There is a file containing 100 records; you need to load 60 records, skipping the first 20 records and the last 20 records. Answer: Use the BTEQ utility for this task with SKIP = 20 and REPEAT 60 in the script. Q.How do you see the current Teradata version? Answer: .SHOW VERSION Q.What is a Node? Answer: A node is nothing but a collection of hardware and software components. Typically a server is called a node. Q.What is PDE? Answer: Parallel Database Extensions: a software interface layer on top of the operating system that enables the database to operate in a parallel environment. It was created by NCR to support the parallel environment. Q.What is a Trusted Parallel Database (TPD)? Answer: A database is called a TPD if it runs under PDE. Teradata is a database which runs under PDE, so we call Teradata a pure parallel database or a trusted parallel database. Q.What is the channel driver?
Answer: Software that communicates between the PEs and applications running on channel-attached clients. Q.What is the Teradata Gateway? Answer: The Teradata Gateway software provides communication between the application and the PEs assigned to network-attached clients. There is one Gateway per node. Q.What is a virtual disk? Answer: A collection of cylinders (physical disks) arranged in an array fashion is called a Vdisk or virtual disk. Traditionally this is called a disk array or array of disks. Q.What is an AMP? Answer: Access Module Processor. It is a virtual processor responsible for managing one portion of the database (a collection of virtual disks). This portion is not sharable by any other AMP, so we call this architecture a shared-nothing architecture. The AMP contains the database manager subsystem and performs the operations below: a. performing DDL, b. performing DML, c. implementing joins and aggregations, d. applying and releasing locks, etc. Q.What is the Parsing Engine? Answer: The PE is a type of vproc; it takes a SQL request and delivers the SQL response. It has software components to break the SQL into steps and send the steps to the AMPs. Session Control: a session is nothing but a logical connection between the user and the application; session control handles authorization, and if the logon is valid it logs the session on, otherwise it logs it off. Parser: checks for syntactical errors, checks for semantic errors, and checks the existence of objects. Dispatcher: it takes the set of requests and keeps them in a queue, and delivers the set of responses using the same queue; in other words it does request-response flow control. Q.What is the maximum number of sessions a PE handles at a time? Answer: A PE handles a maximum of 120 sessions. Q.An open batch session failed because of the following error: WRITER_1_*_1> WRT_8229 Database errors occurred: FnName: Execute -- Duplicate unique prime key error in CFDW2_DEV_CNTL.CFDW_ECTL_CURRENT_BATCH. FnName: Execute -- Function sequence error. Answer: Whenever you want to open a fresh batch id, first you should close the existing batch id and then open the fresh batch id. Q.Can`t determine current batch ID for Data Source 47. Answer: For any fresh stage load you should open a batch id for the current data source id. Q.Unique primary key violation on the CFDW_ECTL_CURRENT_BATCH table. Answer: In the CFDW_ECTL_CURRENT_BATCH table a unique primary key is defined on the ECTL_DATA_SRCE_ID and ECTL_DATA_SRCE_INST_ID columns. At any point of time you should have only one record for the ECTL_DATA_SRCE_ID, ECTL_DATA_SRCE_INST_ID columns. Q.Can't insert a NULL value in a NOT NULL column. Answer: First find all the NOT NULL columns in the target table, cross-verify them with the corresponding source columns, identify for which source column you are getting the NULL value, and take the necessary action. Q.The source is a flat file and I am staging this flat file in Teradata. I found that the initial zeros are being truncated in Teradata. What could be the reason? Answer: The reason is that in Teradata you have defined the column datatype as INTEGER; that is why the initial zeros are truncated. So, change the target table data type to VARCHAR; the VARCHAR datatype won't truncate the initial zeros.
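A hedged sketch of that fix (the table and column names are invented): declaring the business key as VARCHAR in the staging DDL keeps the zeros, whereas an INTEGER column would store '00012345' as 12345.
CREATE TABLE stg.customer_stage
(
  acct_num  VARCHAR(10),   /* '00012345' is stored exactly as received */
  cust_name VARCHAR(50)
)
PRIMARY INDEX (acct_num);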
Q. How do you see the current Teradata version?
Answer: .SHOW VERSION
Q. What is a Node?
Answer: A node is a collection of hardware and software components. Typically a server is called a node.
Q. What is PDE?
Answer: Parallel Database Extensions — a software interface layer on top of the operating system that enables the database to operate in a parallel environment. It was created by NCR to support the parallel environment.
Q. What is a Trusted Parallel Database (TPD)?
Answer: A database is called a TPD if it runs under PDE. Teradata runs under PDE, so we call Teradata a pure parallel database or trusted parallel database.
Q. What is the channel driver?
Answer: Software that communicates between the PEs and applications running on channel-attached clients.
Q. What is the Teradata Gateway?
Answer: The Teradata Gateway software provides communication between applications and the PEs assigned to network-attached clients. There is one gateway per node.
Q. What is a virtual disk?
Answer: A collection of cylinders (physical disks) arranged in an array fashion is called a vdisk or virtual disk. Traditionally this is called a disk array or array of disks.
Q. What is an AMP?
Answer: Access Module Processor. It is a virtual processor responsible for managing one portion of the database (a collection of virtual disks). This portion is not shared by any other AMP, so we call this architecture a shared-nothing architecture. An AMP contains the database manager subsystem and performs the following operations:
a. Performing DDL
b. Performing DML
c. Implementing joins and aggregations
d. Applying and releasing locks, etc.
Q. What is the Parsing Engine?
Answer: The PE is a type of vproc that takes a SQL request and delivers the SQL response. It has software components to break SQL into steps and send the steps to the AMPs.
Session Control: a session is a logical connection between the user and the application; session control handles authorization — if the logon is valid it logs the user on, otherwise it rejects it.
Parser: checks for syntax errors, checks for semantic errors, and checks the existence of objects.
Dispatcher: takes the set of request steps and keeps them in a queue, and delivers the set of responses through the same queue; in other words it handles request/response flow control.
Q. What is the maximum number of sessions a PE handles at a time?
Answer: A PE handles a maximum of 120 sessions.
Q. Opening a batch session failed with the following error: WRITER_1_*_1> WRT_8229 Database errors occurred: FnName: Execute -- Duplicate unique prime key error in CFDW2_DEV_CNTL.CFDW_ECTL_CURRENT_BATCH. FnName: Execute -- Function sequence error.
Answer: Whenever you want to open a fresh batch id, you should first close the existing batch id and then open the fresh one.
Q. The source is a flat file and I am staging it in Teradata. I found that the leading zeros are being truncated in Teradata. What could be the reason?
Answer: The reason is that the Teradata column is defined with the INTEGER data type, which is why the leading zeros are dropped. Change the target table data type to VARCHAR; a VARCHAR column will not truncate the leading zeros.
Q. "Can't determine current batch ID for Data Source 47."
Answer: For any fresh stage load you should open a batch id for the current data source id.
Q. Unique primary key violation on the CFDW_ECTL_CURRENT_BATCH table.
Answer: In the CFDW_ECTL_CURRENT_BATCH table a unique primary key is defined on the ECTL_DATA_SRCE_ID and ECTL_DATA_SRCE_INST_ID columns. At any point in time you should have only one record for that combination of columns.
Q. "Cannot insert a NULL value into a NOT NULL column."
Answer: First find all the NOT NULL columns in the target table, cross-verify them with the corresponding source columns, identify the source column for which you are getting a NULL value, and take the necessary action.
Q. I am passing one record to the target lookup but the lookup is not returning the matching record, although I know the record is present in the lookup. What action will you take?
Answer: Use LTRIM and RTRIM in the lookup SQL override. This removes the unwanted blank spaces, and the lookup will then find the matching record.
Q. I am getting duplicate records for the natural key (ECTL_DATA_SRCE_KEY). What will you do to eliminate the duplicates?
Answer: We will concatenate two, three or more source columns and check for duplicate records again. If you no longer get duplicates after concatenating, use those columns to populate the ECTL_DATA_SRCE_KEY column in the target.
Q. ACCTI_ID is a NOT NULL column in the AGREEMENT table, and you are getting a NULL value from the CFDW_AGREEMENT_XREF lookup. What will you do to eliminate the NULL records?
Answer: After the stage load, I will populate the CFDW_AGREEMENT_XREF table (this table basically contains surrogate keys). Once the XREF table is populated you will not get any NULL records for the ACCTI_ID column.
Q. Unique primary key violation on the CFDW_ECTL_BATCH_HIST table.
Answer: In the CFDW_ECTL_BATCH_HIST table the unique primary index is defined on the ECTL_BTCH_ID column, so there should be only one record per ECTL_BTCH_ID value.
Q. When will you use the ECTL_PGM_ID column in the target lookup SQL override?
Answer: When you are populating a single target table (for example the AGREEMENT table) from multiple mappings in the same Informatica folder, we use ECTL_PGM_ID in the target lookup SQL override. This eliminates unnecessary updates.
Q. You defined the primary keys as per the ETL spec but you are still getting duplicate records. How will you handle this?
Answer: Apart from the primary key columns in the spec, I will first add one more column as a primary key candidate and check for duplicate records. If I no longer get duplicates, I will ask the modeller to add this column to the primary key.
Q. Teradata reports the error "no more room in database".
Answer: I speak with the DBA to add space to that database.
Q. Though a column is available in the target table, MultiLoad reports that the column is not available. Why?
Answer: The load was going through a view, and the view had not been refreshed to include the new column; hence the error message. Refresh the view definition to add the new column.
Q. While deleting from the target table, I wanted to delete only some data, but by mistake all the data got deleted from the development table.
Answer: Add ECTL_DATA_SRCE_ID and PGM_ID to the WHERE clause of the delete query.
Q. While updating the target table, an error says multiple rows are trying to update a single row.
Answer: There are duplicates in the source that match the WHERE condition of the update query; these duplicate records need to be eliminated.
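A small sketch of how the duplicates behind that error can be located before the update; the staging-table name is a placeholder, and the key column just follows the ECTL naming used in these answers:

-- Source rows that would update the same target row more than once
SELECT ectl_data_srce_key, COUNT(*) AS cnt
FROM stg_agreement
GROUP BY ectl_data_srce_key
HAVING COUNT(*) > 1;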
Q. I have a file with a header, data records and a trailer. The data records are comma-delimited, while the header and trailer are fixed width and start with HDR and TRA. I need to skip the header and trailer while loading the file with MultiLoad.
Answer: Code the MultiLoad utility to apply only the data records, excluding the header and trailer records.
Q. What is the BYNET?
Answer: It acts as the message communication layer between components. It is responsible for: 1. sending messages, 2. merging data, and 3. sorting data.
Q. What is a clique?
Answer: A clique protects against node failure.
1. A clique is a collection of nodes that share common disk drives.
2. Whenever a node goes down, its vprocs automatically migrate from the failed node to another node in the clique so that data can still be retrieved from the common disk drives.
Q. List the different types of locks in Teradata.
Answer: Teradata can apply four types of locks: a. Access lock b. Read lock c. Write lock d. Exclusive lock.
Q. At what levels can Teradata apply a lock?
Answer:
1. Database-level lock – all objects inside the database are locked.
2. Table-level lock – all rows inside the table are locked.
3. Row-hash-level lock – only the corresponding row is locked.
Q. How many AMPs are involved in a primary index access?
Answer: Always one AMP.
Q. What about the UPSERT command in Teradata?
Answer: UPSERT means "update, else insert". Teradata supports this option.
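Roughly, Teradata's atomic upsert takes the form below; the table, columns and values are invented for illustration, and the WHERE clause should identify the target row by its primary index:

UPDATE mydb.emp
SET sal = 50000
WHERE emp_id = 101
ELSE INSERT INTO mydb.emp (emp_id, ename, sal) VALUES (101, 'SMITH', 50000);

Newer releases also support the ANSI MERGE statement for the same purpose.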
Q. What is the advantage of PPI?
Answer: We mainly use it for range-based or category-based data storage. Range queries do not have to do a full-table scan; they go directly to the corresponding partition and skip the other partitions. FastLoad and MultiLoad work with PPI tables, but not with all secondary indexes.
Q. What are the sizes of BYTEINT, SMALLINT and INTEGER?
Answer:
BYTEINT – 1 byte (8 bits): -128 to 127
SMALLINT – 2 bytes (16 bits): -32,768 to 32,767
INTEGER – 4 bytes (32 bits): -2,147,483,648 to 2,147,483,647
Q. What is the difference between a user and a database in Teradata?
Answer:
Database: a database is a passive repository; it stores database objects such as tables, macros and views; it does not have a password.
User: a user is active; it can also own objects such as tables, macros and views; it has a password and can log on.
Q. What is the difference between a primary key and a primary index?
Answer:
Primary Index: it is mandatory; limited to 64 columns per table; allows NULLs and duplicates; it is a physical mechanism; it affects data distribution.
Primary Key: it is optional; has no column limit; does not allow NULLs or duplicates; it is a logical mechanism; it does not affect data distribution.
Q. What is the use of spool space?
Answer: Teradata spool space is unused perm space that is used for running queries. Teradata recommends allocating about 20% of the available perm space as spool space. It is used to hold the intermediate results of queries and volatile tables.
Q. What is the need for performance tuning?
Answer: We go for performance tuning to identify bottlenecks and resolve them. A bottleneck is not an error, but it delays system performance. Example: a query that is supposed to run in 2 minutes runs for half an hour before finally succeeding; in this situation we need to identify the bottleneck and resolve it. To identify bottlenecks we use: a. the EXPLAIN request modifier b. Teradata Visual Explain c. Performance Monitor d. Teradata Manager.
Q. Define the explain plan.
Answer: The explain plan displays the execution plan for the SQL statement that is going to be executed by the database. The plan is produced by the component called the optimizer. Generally it displays the following information:
a. number of AMPs involved
b. amount of spool memory it occupies
c. number of rows it affects
d. type of join strategy it takes
e. estimated time to execute
f. locks it uses, etc.
Syntax: EXPLAIN <query>
Example: EXPLAIN SEL * FROM PARTY;
Q. What is COLLECT STATISTICS?
Answer: Collect stats derives the data demographics of the table. It is an important concept in Teradata: collected statistics allow the PE to come up with the least-cost plan for a requested query. Collect stats defines the confidence level of the PE in estimating how many rows it is going to access, how many unique values a table has, how many null values it contains, and so on, and all this information is stored in the data dictionary. Once you submit an EXPLAIN for a query, the parsing engine checks whether statistics are available for the requested table. If collected statistics are already available, the PE generates a plan with "high confidence"; if they are not, it reports "low confidence".
Syntax: COLLECT STATISTICS ON <tablename> COLUMN <columnname> (or INDEX <indexname>)
Q. What is the least-cost plan?
Answer: It executes along the shortest path, taking less time.
Q. What is the highest-cost plan?
Answer: It executes along the longest path, taking more time.
Q. How many confidence levels are there?
Answer: a. Low b. No c. High d. Join
Q. If collect stats is not done on a table, what will happen?
Answer: Teradata uses a cost-based optimizer, and cost estimates are based on statistics. If you have not collected statistics, the optimizer uses a dynamic AMP sampling method to estimate them. If the table is big and the data is unevenly distributed, dynamic sampling may not get the right information, and your performance will suffer.
Q. What are the 5 phases in a MultiLoad utility?
Answer:
* Preliminary phase – basic setup
* DML phase – get the DML steps down to the AMPs
* Data acquisition phase – send the input data to the AMPs and sort it
* Application phase – apply the input data to the appropriate target tables
* End phase – basic cleanup
Q. What are the MultiLoad utility limitations?
Answer: MultiLoad is a very powerful utility, but it has the following limitations:
* MultiLoad does not support the SELECT statement.
* Concatenation of multiple input data files is not allowed.
* MultiLoad does not support arithmetic functions (ABS, LOG, etc.) in the MLOAD script.
* MultiLoad does not support exponentiation or aggregate operators (AVG, SUM, etc.) in the MLOAD script.
* MultiLoad does not support USIs (unique secondary indexes), referential integrity, join indexes, hash indexes or triggers.
* Import tasks require the use of a PI (primary index).
Q. What are the TPump utility limitations?
Answer: The following are the limitations of the Teradata TPump utility:
* Use of the SELECT statement is not allowed.
* Concatenation of data files is not supported.
* Exponentiation and aggregate operators are not allowed.
* Arithmetic functions are not supported.
Q. Explain Teradata's competitive advantages in detail.
Answer:
1. Automatic, even data distribution: in Teradata, uniform (parallel, random) distribution is automatic.
2. High scalability: if you increase the number of nodes, users or workload, Teradata does not sacrifice performance; it scales linearly, which is why we call it linear scalability.
3. Mature optimizer: the powerful optimizer supports 64 joins per query, 64 subqueries per query, formatting commands and aggregate commands.
4. Models the business: Teradata supports any business model — star schema, snowflake schema, hybrid schema, normalization, etc.
5. Low TCO (total cost of ownership): it is easy to install, manage and work with, has full GUI support, and is cheaper in price.
Q. How do you set the session mode parameters in BTEQ?
Answer:
.SET SESSION TRANSACTION ANSI — sets ANSI mode
.SET SESSION TRANSACTION BTET — sets Teradata transaction mode
These commands have to be entered before logging on to the session.
Q. How does Teradata make sure that no duplicate rows are inserted when it is a SET table?
Answer: Teradata redirects the newly inserted row, as per its PI, to the target AMP on the basis of its row hash value. If it finds the same row hash value in that AMP (a hash synonym), it compares the whole row to find out whether it is a duplicate. If it is a duplicate, it silently skips it without throwing any error.
Q. List the types of hash functions used in Teradata.
Answer: There are HASHROW, HASHBUCKET, HASHAMP and HASHBAKAMP. The SQL hash functions are:
HASHROW (column(s))
HASHBUCKET (hashrow)
HASHAMP (hashbucket)
HASHBAKAMP (hashbucket)
Q. What is a derived table?
Answer:
1. It stores intermediate results and calculations.
2. You specify a derived table inside an SQL statement (typically a SELECT).
3. The table is created and dropped as part of the query.
4. It is stored in spool memory.
5. Once the query finishes executing, the table is no longer available.
6. These tables are also called inline query tables.
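For instance, a minimal derived-table query might look like this (emp, deptno and sal are hypothetical names used only for illustration):

SELECT d.deptno, d.avg_sal
FROM (SELECT deptno, AVG(sal) AS avg_sal
      FROM emp
      GROUP BY deptno) AS d
WHERE d.avg_sal > 50000;

The inner SELECT is materialized in spool as "d" for the duration of the query and then discarded.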
Q. What is journaling? Why does Teradata require journaling?
Answer: Journaling is a data protection mechanism in Teradata; it protects against image failure. Journals maintain before-images and after-images of a DML transaction, starting and ending at checkpoints. When a DML transaction fails, the table is restored back to the last available checkpoint using the journal images. There are three types of journals: Permanent, Transient and Recovery.
Q. How many types of images does journaling support?
Answer: Four types of images are supported by journaling:
a. Single image – one copy of the data is taken.
b. Dual image – two copies of the data are taken.
c. Before image – a copy of the row is taken before changes occur.
d. After image – a copy of the row is taken after changes happen.
Q. What is the transient journal?
Answer: The transient journal is an area of space in the DBC database which is used primarily for storing rollback information during inserts, deletes and updates of tables. In detail: the transient journal maintains a copy of the before-images of all rows affected by the transaction. In the event of transaction failure, the before-images are reapplied to the affected tables, the images are deleted from the journal, and a rollback operation is completed. In the event of transaction success, at the point of transaction commit, the before-images for the transaction are discarded from the journal. In summary, if a transaction fails for whatever reason, the before-images in the transient journal are used to return the data in the tables involved to its original state.
Q. What is the permanent journal?
Answer: The purpose of the permanent journal is to provide selective or full database recovery to a specified point in time. It permits recovery from unexpected hardware or software disasters. The permanent journal also reduces the need for full-table backups, which can be costly in both time and resources.
Q. What are the different image options for the permanent journal?
Answer: There are four image options: Before journal, After journal, Dual before journal, and Dual after journal.
Q. Permanent journals are automatically purged in Teradata. True or false?
Answer: False. The permanent journal must be purged manually from time to time.
Q. Where does Teradata store the transient journal?
Answer: In perm space, in DBC.TransientJournal. That special table can, however, grow beyond DBC's perm limit, until the whole system runs out of perm space.
Q. What are the different return codes (severity errors) in the Teradata utilities?
Answer: There are three basic return codes: 4 – warning, 8 – user error, 12 – system error.
Q. How will you connect from one database server to another server?
Answer: We can connect from one server to another in UNIX using ssh, FTP or su; for example: ssh user_id@server_name.
Q. What is the meaning of skewness in Teradata?
Answer: Data or AMP skew occurs in Teradata due to uneven distribution of data across all the AMPs; often this also leads to spool space errors. To avoid skew, try to select a primary index that has as many unique values as possible. PI columns like month or day have very few unique values, so during data distribution only a few AMPs hold all the data, resulting in skew. If a column (or combination of columns) that enforces uniqueness on the table is chosen as the PI, the data distribution will be even and the data will not be skewed.
Q. Is choosing the primary index column important?
Answer: The success of a Teradata warehouse starts with choosing the correct column for the primary index. Try to choose a column with unique values so that data is distributed evenly among all AMPs; otherwise skew comes into the picture. The primary index also gives a direct path to retrieve the data.
Q. What are the basic rules that define how a PI is defined in Teradata?
Answer: These are the rules for how a primary index is defined:
a. Only one primary index per table.
b. It is the physical mechanism that assigns a row to an AMP.
c. A primary index value can be unique or non-unique.
d. A primary index can be a composite of up to 64 columns.
e. The primary index of a populated table cannot be modified.
Q. What are the basic criteria for selecting the primary index column for a given table?
Answer: A thumb rule of "ADV" demographics is followed:
Access demographics – identify index candidates that maximize one-AMP operations; columns most frequently used for access (value and join).
Distribution demographics – identify index candidates that optimize parallel processing; columns that provide good distribution.
Volatility demographics – identify index candidates with low maintenance I/O.
Q. Can you explain how the PE and the AMPs communicate?
Answer: When a user connects to the Teradata database, a session is opened with a parsing engine (PE). Thereafter, when the user submits a query:
1. The PE takes the query, checks the syntax and verifies the user's access rights.
2. If everything looks okay, the PE prepares an action plan for the AMPs — which AMP should respond, which row ID to read, and so on.
3. The PE sends the action plan to the AMPs via BYNET signals.
4. The corresponding AMPs execute the plan, read the data and send it back to the PE, and the PE sends the data to the user.
Q. Do permanent journals and secondary indexes require perm space?
Answer: Yes.
Q. Which objects require perm space in Teradata?
Answer: Tables and stored procedures require perm space; views, macros and triggers do not.
Q. What is a LOG TABLE?
Answer: A log table maintains a record of all checkpoints related to a load job; it is mandatory to specify a log table in the job. This table is useful when the job aborts or restarts for any reason.
Q. What is the use of a partition?
Answer: If you create a PPI on a table, the data on each AMP is ordered by the partitioning column. For example, if we partition on DEPTNO, all dept 10 records are stored together on the AMP and all dept 20 records are stored together.
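As a rough illustration of a partitioned primary index, here is a sketch using an invented sales table partitioned by month; RANGE_N is the usual partitioning expression for this kind of range-based storage:

CREATE TABLE mydb.sales
( sale_id  INTEGER
, sale_dt  DATE
, amount   DECIMAL(12,2)
)
PRIMARY INDEX (sale_id)
PARTITION BY RANGE_N (sale_dt BETWEEN DATE '2023-01-01'
                              AND     DATE '2023-12-31'
                              EACH INTERVAL '1' MONTH);

A range query such as WHERE sale_dt BETWEEN DATE '2023-03-01' AND DATE '2023-03-31' then touches only the March partition instead of scanning the whole table.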
Q. Is it possible to alter the NO RANGE and UNKNOWN partitions on a table?
Answer: Yes, if the table is empty we can alter the NO RANGE and UNKNOWN partitions.
Q. Can you apply a read lock on a table where a write lock is already applied?
Answer: No.
Q. What are low and high confidence in the EXPLAIN output?
Answer: EXPLAIN displays the plan prepared by the optimizer. Confidence levels indicate how well the optimizer knows the demographics of the data for a particular step.
High confidence: the optimizer knows the number of rows that will be returned by the step — for example when PI statistics exist, when column or range statistics exist, or when no join is involved.
Low confidence: some statistics are available — a join is involved and statistics exist on both sides of the join.
No confidence: no statistics are available and a join is involved.
Q. Why does the FastLoad utility not support multiset tables and duplicates?
Answer: A multiset table supports duplicate rows, but FastLoad does not. The restart logic is one of the reasons: FastLoad loads data in 64K blocks, and during a restart it may send some of the rows that come after a checkpoint a second time; because of this, FastLoad rejects duplicates. For example, consider 20 rows to be loaded with a checkpoint every 5 rows. If a restart occurs after the 7th row, FastLoad may send rows 6 and 7 to the AMPs again during the restart, and those records would be treated as duplicates and rejected.
Q. Can you recover the password of a user in Teradata?
Answer: No, you cannot recover a user's password in Teradata. Passwords are stored in the data dictionary table DBC.DBASE using a one-way encryption method. You can view the encrypted passwords using: SEL * FROM DBC.DBASE;
Q. What is the difference between a subquery and a correlated subquery?
Answer:
Subquery: queries written in a nested manner; the inner query is executed first, and only once.
Correlated subquery: the correlated subquery is executed once for each row of the parent query; the inner query executes many times, driven by the outer query.
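A short sketch of the two forms side by side (emp, deptno and sal are hypothetical names):

-- Plain subquery: the inner SELECT runs once
SELECT ename, sal
FROM emp
WHERE sal > (SELECT AVG(sal) FROM emp);

-- Correlated subquery: the inner SELECT is re-evaluated for each outer row
SELECT e.ename, e.sal
FROM emp e
WHERE e.sal > (SELECT AVG(x.sal) FROM emp x WHERE x.deptno = e.deptno);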
Q. What is the FILLER command in Teradata?
Answer: While running FastLoad or MultiLoad, if you do not want to load a particular field from the data file into the target table, use the FILLER keyword for that field.
Q. What is the difference between access logging and query logging in Teradata?
Answer:
1. Access logging is concerned with security (who is doing what). With access logging you ask the database to log who is doing what on a given object; the information stored is based on the object, not on the SQL fired or the user who fired it.
2. Query logging (DBQL) is used for debugging (what is happening around the system). With DBQL the database keeps tracking various parameters such as the SQL, resources, spool usage and other details that help you understand what is going on; the information is useful for debugging a problem.
Q. What is Basic Teradata Query (BTEQ)?
Answer:
1. It allows us to write SQL statements along with BTEQ commands; we can use BTEQ for importing, exporting and reporting purposes.
2. The commands start with a dot (.) and can be terminated with a semicolon (;), although the semicolon is not mandatory for them.
3. BTEQ assumes anything written without a leading dot is a SQL statement, which does require a semicolon to terminate it.
Q. How can you track the login parameters of users in Teradata?
Answer: You can view these parameters in the data dictionary table DBC.LOGONOFF:
SELECT LOGDATE, LOGTIME, USERNAME, EVENT FROM DBC.LOGONOFF;
Q. How can you use the hash functions to view data distribution across all AMPs in Teradata?
Answer: The hash functions can be used to view the distribution of rows for a chosen primary index:
SELECT HASHAMP(HASHBUCKET(HASHROW(<PI columns>))) AS "AMP#", COUNT(*)
FROM <tablename>
GROUP BY 1
ORDER BY 2 DESC;
HASHROW returns the row hash value for a given value, HASHBUCKET gives the bucket for a specific hash value, and HASHAMP gives the AMP associated with the hash bucket. By looking at the result set of the query above you can easily see the data distribution across all AMPs in your system and identify uneven distribution.
Q. How do you transfer large amounts of data in Teradata?
Answer: Transferring large amounts of data can be done with the Teradata utilities: BTEQ, FastLoad, MultiLoad, TPump and FastExport. BTEQ (Basic Teradata Query) supports all four DML statements (SELECT, INSERT, UPDATE and DELETE) and also supports IMPORT/EXPORT protocols. FastLoad, MultiLoad and TPump transfer data from the host to Teradata, while FastExport exports data from Teradata to the host.
Q. How can you determine I/O and CPU usage at the user level in Teradata?
Answer: You can find I/O and CPU usage in the data dictionary table DBC.AMPUSAGE:
SELECT ACCOUNTNAME, USERNAME, SUM(CPUTIME) AS CPU, SUM(DISKIO) AS DISKIO
FROM DBC.AMPUSAGE
GROUP BY 1, 2
ORDER BY 3 DESC;
Q. What is normalization?
Answer: Normalization is the process of reducing a complex data structure into a simple, stable one. Generally this process involves removing redundant attributes, keys and relationships from the conceptual data model.
Q. How many types of indexes are present in Teradata?
Answer: There are five different indexes in Teradata:
1. Primary index – unique primary index, non-unique primary index
2. Secondary index – unique secondary index, non-unique secondary index
3. Partitioned primary index – case partitioning, range partitioning
4. Join index – single-table join index, multi-table join index, sparse join index (a join index with a constraint applied in the WHERE clause)
5. Hash index
Q. What are the 6 A's that Teradata supports?
Answer: Active Load, Active Access, Active Events, Active Workload Management, Active Enterprise Integration, Active Availability.
Q. Which is faster — a MultiLoad delete or a normal DELETE command?
Answer: A MultiLoad delete is faster than the normal DELETE command, since the deletion happens in 64-kilobyte data blocks, whereas the DELETE command deletes data row by row. The transient journal maintains entries only for the DELETE command, since the Teradata load utilities do not use the transient journal.
Q. What tools can be used for Active Load in Teradata?
Answer: ETL tools can use queue tables and triggers, and the FastLoad, MultiLoad and TPump utilities.
Q. How do you skip or get the first and last record from a flat file through MultiLoad?
Answer: The .IMPORT command in MLOAD has an option to give the record number from which processing should begin, i.e. FROM m, where m is the logical record number, as an integer, of the record in the identified data source where processing is to begin.
You can specify m as 2, and processing will start from the second record. THRU k and FOR n are two further options of the same .IMPORT command, and they work the same way towards the end of the processing. In addition, if FROM (the start record) and FOR/THRU (the stop record) are not mentioned, MLOAD considers the records from the start of the file through to the end.
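A rough MultiLoad sketch of those options; the file, layout and label names are invented for illustration, and the exact clause order should be checked against the MultiLoad reference:

.IMPORT INFILE emp.dat
        FROM 2 FOR 100
        LAYOUT emp_layout
        APPLY insert_emp;

Here FROM 2 skips the first record (for example a header) and FOR 100 stops after 100 data records have been processed.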
Q. What is the use of TEMP space?
Answer: TEMPORARY (TEMP) space: a database may or may not have TEMP space; however, it is required if global temporary tables are used.
Tibco Interview Questions
Q. How do you configure a client for a fault-tolerant connection?
Answer: Specify multiple servers as a comma-separated list of URLs; both URLs must use the same protocol, either tcp or ssl.
Q. What are the different types of acknowledgement modes in EMS message delivery?
Answer: Auto, Client, Dups_ok, No_ack, Explicit, Explicit_client_dups_ok, Transactional, Local transactional.
Q. What are the different types of messages that can be used in EMS?
Answer: Text, Simple, Bytes, Map, XML text, Object, Object ref, Stream.
Q. Tell me about bridges. Why do we use them, what is the syntax to create bridges, and what is the use of a message selector?
Answer: Some applications require the same message to be sent to more than one destination, possibly of different types, so we use bridges. A message selector on a bridge restricts which messages flow across it.
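For the syntax part of that question, bridges are typically declared in the EMS bridges.conf file, roughly like the sketch below; the destination names and selector are made up for illustration, so check the EMS configuration reference for the exact form:

[topic:orders.all]
  queue=orders.audit
  queue=orders.eu selector="region='EU'"

Every message published to the topic orders.all is then also delivered to orders.audit, while only messages whose region property equals 'EU' cross the bridge into orders.eu.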
Q. Why do we need routes?
Answer: To transfer messages between different EMS servers.
Q. What is the purpose of stores.conf?
Answer: a. This file defines the locations, either store files or a database, where the EMS server will store messages or metadata. b. Each configured store is either a file-based store or a database store.
Q. In how many modes are messages written to a store file?
Answer: Two modes, sync or async; when the mode is absent, the default is async.
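As a sketch of what stores.conf might contain for file-based stores, using the store file names mentioned later in these questions; treat the store names and properties as an assumption rather than a definitive listing:

[$sys.failsafe]
  type=file
  file=sync.db
  mode=sync

[$sys.nonfailsafe]
  type=file
  file=async.db
  mode=async

[$sys.meta]
  type=file
  file=meta.db

Destinations then reference a store by name through their store property.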
Q. What is tibemsd.conf?
Answer: It is the main configuration file that controls the characteristics of the EMS server.
Q. How many delivery modes are there for messages?
Answer: Persistent, non-persistent and reliable delivery.
Q. What is the maximum message size?
Answer: EMS supports a maximum message size of 512 MB.
Q. Name three destination properties and explain them.
Answer: global, secure, maxmsgs, maxbytes, flowControl, sender_name, sender_name_enforced, trace, maxRedelivery.
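Destination properties like these are set in queues.conf or topics.conf (or via the admin tool). A hedged example of what a queues.conf entry could look like, with an invented queue name:

sample.queue  secure,maxmsgs=10000,maxbytes=100MB,flowControl=10MB,maxRedelivery=5

Here the queue accepts at most 10,000 pending messages or 100 MB of pending data, flow control engages once 10 MB of messages are pending, and redelivery attempts stop after 5 tries (what happens then depends on the preserve-undelivered setting covered in a later question).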
Q. What are the different modes of installation in EMS?
Answer: a. GUI mode b. Console mode c. Silent mode
Q. What are the messaging models supported by JMS?
Answer: a. Point-to-point b. Publish-subscribe c. Multicast
Q. Tell me about routes.
Answer: Routes transfer messages between EMS servers. Global topics and queues can be used across routes, and routed topic messages can traverse multiple hops (m-hops).
Q. What happens if a message expires or exceeds the value specified by the maxRedelivery property on a queue?
Answer: If the preserve-undelivered property (JMS_TIBCO_PRESERVE_UNDELIVERED) is set to true, the server moves the message to the undelivered message queue; if it is set to false, the message is deleted by the server.
Q. In how many ways can a destination be created?
Answer: a. Static – created by the administrator/user. b. Dynamic – created by the EMS server on the fly. c. Temporary destinations.
Q. Tell me about flow control on destinations.
Answer: Sometimes a producer may send messages faster than the consumers can receive them, so the message capacity on the server gets exhausted; to handle this we use flow control. Flow control can be specified per destination.
Q. Tell me about flow control on bridges and routes.
Answer: Flow control has to be specified on both sides of a bridge, whereas on routes it operates differently on the sender side and the receiver side. Get a practical explanation of flow control at Tibco Online Training.
Q. Name three configuration files and tell me what they consist of.
Answer: a. queues.conf b. topics.conf c. routes.conf d. factories.conf e. stores.conf f. groups.conf, users.conf, transports.conf
Q. Name some administrative-level destination properties.
Answer: a. View b. Create c. Delete d. Modify e. Purge
Q. How can you change the configuration properties of the EMS server?
Answer: You can change them in the tibemsd.conf file, or you can change them using the EMS admin console.
Q. What are the permissions that you can grant to users to access queues?
Answer: a. Receive b. Send c. Browse
Q. What are the permissions that you can grant to users to access topics?
Answer: a. Subscribe b. Publish c. Durable d. Use_durable
Q. Tell me about multicasting in EMS.
Answer: a. Multicast is a messaging model that broadcasts messages to many consumers at once rather than sending messages individually to each consumer. EMS uses Pragmatic General Multicast (PGM) to broadcast messages published to multicast-enabled topics. b. Each multicast-enabled topic is associated with a channel.
Q. What is the default maximum size of a message?
Answer: 512 MB.
Q. What is a transition in BW?
Answer: A transition moves the data from one activity to another, optionally with conditions on the data.
Q. What are the different kinds of condition types you can have on a transition? Explain them.
Answer:
a) Success: take this transition unconditionally — that is, always transition to the activity the transition points to if the activity completes successfully. This is the default condition for transitions.
b) Success with condition: specify a custom condition using XPath. If the activity completes successfully and the condition you create evaluates to true, the transition is taken to the activity it points to. You can type in an XPath condition, and you can use the XPath formula builder to drag and drop XPath expressions and data into the condition.
c) Success if no matching condition: take this transition when the activity completes successfully, but only if no other transitions are taken. This is useful when multiple transitions with conditions are drawn to other activities; this condition type can handle any cases not handled by the conditions on the other transitions.
d) Error: take this transition if there is an error during processing of the activity.
Q. What is the Generate Error activity? What are its applications?
Answer: This activity generates an error and causes an immediate transition to any error transitions. If there are no error transitions, the process instance halts execution. This activity is useful in a group or in a called process: if you would like to catch and raise your own error conditions, you can use this activity to do so.
Q. What are shared variables and process variables?
Answer:
Process variables: process variables are data structures available to the activities in a process. You can define your own process variables and assign values to them in your process definition. Process variables are defined on the Process Variables tab of the Process Definition resource, and values are assigned to them using the Assign activity.
Shared variables: a Shared Variable resource allows you to share data across process instances. All process instances can read and update the data stored in a shared variable. This type of variable is useful if you wish to pass data across process instances or make a common set of information available to all process instances. Get more on the differences between shared variables and process variables at Tibco Training.
Q. What is XPath?
Answer: XPath is an XML-based path language used to navigate an XML document and manipulate the data.
Q. What is XSD?
Answer: XML Schema Definition.
Q. What is a namespace in XSD?
Answer: A name conflict occurs when two different documents use the same element names, so each element is given a unique namespace.
Q. What is a web service?
Answer: An application or network component responding to a remote web request.
Q. What is a WSDL? What are the different types of WSDL?
Answer: Web services run on WSDL; it defines the structure of the schema. There are two types of WSDL:
Abstract WSDL: used on the server side; it contains the request, the response and the type of operation performed.
Concrete WSDL: used on the client side; it contains the abstract WSDL plus the transport used.
Q. In how many ways can we create EAR files?
Answer: We can build EAR files in two ways. One method is using the Enterprise Archive palette in TIBCO Designer and adding the process archive to build the EAR file. The other method is from the command prompt, using the AppManage and buildear utilities.
Q. What is a schema and why do we create a schema?
Answer: A schema is used to create an XML schema file in which we add the variables we want to use in our Designer processes. We can create elements under which we add the typed variables; the structure formed is a tree structure.
Q. What is the use of the Confirm activity?
Answer: The Confirm activity is used to confirm a confirmable message. For example, if a process starts on reception of a message, the Confirm activity sends a confirmation back to the sender of that message.
Q. What are the different modes of TIBCO BW installation?
Answer: There are three modes of installation: a) GUI b) console c) silent.
Q. When we save a project, what files are created under the saved project folder?
Answer: In that folder we see the AESchemas folder, all processes created in that project, and the vcrepo.dat file.
Q. What are the contents of vcrepo.dat?
Answer: This file contains the display name, RV encodings and description.
Q. What is the Group (grouping) construct?
Answer: Grouping is used to group certain activities in the Designer so that we can loop over those activities and iterate the group with conditions.
Q. What is the condition for a process in order to build the EAR file?
Answer: We need to have a process starter.
Q. How can we design exception handling?
Answer: The basic method is to route the process to another sub-process whenever an error occurs, by using the error transition.
Q. What is the use of the Render XML palette?
Answer: It is used to create an XML file by creating the tags used in the XML file.
Q. What are the elements in the WSDL file?
Answer: In an abstract WSDL file we have information about the messages (request, reply), the port type and the operation. In a concrete WSDL we additionally have the transport information.
Q. What is the use of global variables?
Answer: Global variables are useful for providing dynamic input at run time.
Q. What is a custom activity?
Answer: A custom activity is useful when we want others to use our process without allowing them to view its contents. We can add this process to the My Palettes section and then use it directly by dragging it into another process.
Q. In a web service, in how many ways can we create connections?
Answer: Two ways: HTTP and JMS.
Q. How does the File Poller activity work?
Answer: It is a starter activity that starts the process whenever there is an update to the file that is specified.
Q. What is the use of a project template?
Answer: In a project template we can save the standard processes that we want to reuse in the future; these are generally the ones containing commonly used activities.
Q. What is the optimum maximum number of connections in JDBC?
Answer: 10.
Q. What is a sub-process and what is its use?
Answer: Whenever we call a process from another process, the called process becomes the sub-process. Sub-processes help reduce the complexity of the design by moving activities into another process.
Q. What is TRA?
Answer: The TIBCO Runtime Agent is the main framework for all TIBCO software. It provides the runtime agent and the monitoring agent, and all the libraries required by the designs are provided by TRA.
Q. What are the process variables that are available to all activities as inputs?
Answer: Global variables and the process context.
Q. What are breakpoints?
Answer: Breakpoints are used to check the inputs and outputs of each activity while testing the design, so that we can debug it. We can place breakpoints on the input and output side of all activities.
Q. What are the encoding techniques in WSDL?
Answer: Encoded and literal.
Q. What are the conditions on transitions?
Answer: Success, Success with condition, Success if no matching condition, and Error.
Q. What are the different variables in BW?
Answer: Global variables, shared variables, process variables and job shared variables.
Q. Explain the process flow of your latest project, including activities.
Answer: For the various service implementations I designed processes using BW activities such as JMS Queue Receiver and XML Parser.
Q. What are the differences between versions 2.x and 5.2?
Answer:
Deployment: in 2.x deployment is done in the Designer; in 5.2 it is done using the Administrator tool.
Namespaces: in 2.x namespaces are prefixed with "tib"; in 5.2 there is no "tib" prefix.
Palettes: 5.2 adds extra palettes.
Iterate reset: in 2.x the output is not reset at the end of each iteration; in 5.2 the output is reset after each iteration.
Installation: in 2.x all the components (BW, ADB, File, etc.) come as one package; in 5.2 each component has to be installed separately.
File type: in 2.x all project files are .dat and have to be converted manually; 5.2 supports multi-file projects, so the .dat files do not have to be converted.
Get more on these differences from real-time experts at Tibco Course.
Q. What are the activities you worked on?
Answer: JMS Queue Receiver, Confirm, Checkpoint, XML Parser, Call Process, JDBC Update, SOAP, HTTP, Write to Log (widely used) and Assign.
Q. What does the Inspector activity do?
Answer: The Inspector activity is used to obtain the output of any activity, or of all the activities and process variables. Scenario: you can use the Inspector activity to write the output of any activity or process variable in the current process. Activities and process variables in a sub-process are not available to the Inspector activity (but the output of a Call Process activity can be written using it). If you wish to obtain the output from one or more activities or process variables in a sub-process, place the Inspector activity in the process definition of the sub-process.
Q. Can you name at least four starter activities and explain when they get executed?
Answer:
1) Adapter Subscriber / Adapter Request-Response Server: starts a process whenever a message arrives on a destination queue or the network; the request-response server listens for a request from an adapter and sends a response back to that adapter.
2) File Poller: polls for any changes that occur in a file and, on any change, grabs the whole file.
3) Timer / Receive Notification: the Timer starts a process at the specified time (for example before a JDBC query); Receive Notification starts a process on receiving data from a Notify activity.
4) HTTP Receiver: starts a process based on a request from an HTTP client.
5) JMS Queue Receiver / JMS Topic Subscriber: starts a process whenever a new message arrives on the specified queue, or whenever there is a new message on the specified topic.
6) Receive Mail: the Receive Mail process starter polls a POP3 mail server for new mail. When new mail is detected and retrieved, it starts a new process for the process definition it resides in and passes the mail data to the next activity in the process flow.
7) Rendezvous Subscriber: the Rendezvous Subscriber process starter creates a process when a TIBCO Rendezvous message on the given subject is received.
8) RMI Server: the RMI Server process starter registers the specified remote object name with the specified registry server and then creates process instances to handle incoming requests for the object; the process definition acts as the implementation of the specified object.
9) SOAP Event Source: the SOAP Event Source process starter creates a process instance for incoming SOAP requests. SOAP is a standard protocol for invoking web services, so this allows you to create a web service using process definitions. At runtime, a client can retrieve the WSDL file for a process containing this process starter using an HTTP request; once the WSDL is retrieved, the client can perform a SOAP request to invoke the web service.
10) TCP Receiver: the TCP Receiver process starter starts a new process when a client requests a TCP connection.
Q. What is the purpose of the JMS Queue Receiver activity and the Queue Sender activity?
Answer: The JMS Queue Receiver starts a process whenever a new message comes into the specified queue; a Queue Sender activity sends messages to the specified queue.
Q. What are acknowledgement modes, where do you set them, and what is the applicability of each mode?
Answer: The acknowledge mode for incoming messages can be one of the following:
• Auto — the message is automatically acknowledged when it is received.
• Client — the message is acknowledged at a later point by using the Confirm activity. If the message is not confirmed before the process instance ends, the message is redelivered and a new process instance is created to handle the new incoming message. Ensure that your process definition confirms the message when using this acknowledge mode.
• TIBCO EMS Explicit Client Acknowledge — behaves exactly the same as Client mode, except that the session is not blocked and one session can handle all incoming messages.
• Dups OK — the message is acknowledged automatically when it is received. JMS provides this mode for lazy acknowledgement, but TIBCO BusinessWorks acknowledges messages upon receipt.
• Transactional — this mode is used when a transaction that can process JMS messages is included in the process definition; the message is acknowledged when the transaction commits. See the TIBCO BusinessWorks Process Design Guide for more information about creating transactions that JMS activities can participate in.
Q. What are the Checkpoint activity and the Confirm activity?
Answer: A checkpoint saves the current process data and state so that they can be recovered at a later time in the event of a failure. If a process engine fails, all process instances can be recovered and resume execution at the location of their last checkpoint in the process definition. The Confirm activity confirms any confirmable messages received by the process instance; for example, if a process is started because of the receipt of an RVCM message, the Confirm activity can send a confirmation message to the publisher of the RVCM message.
Q. What happens if you use the Checkpoint activity first and Confirm next?
Answer: In the case of confirmable messages you must consider the consequences of performing a checkpoint before or after a Confirm activity. If the checkpoint is taken before the Confirm activity and a crash occurs after the checkpoint but before the confirm, the original message is resent. In that case the restarted process can no longer send the confirmation; however, a new process is started to handle the resent message, and you can implement your process to handle the restarted and new processes appropriately. If the checkpoint is taken after the Confirm activity, there is potential for a crash to occur after the Confirm but before the checkpoint. In that case the message is confirmed and therefore not redelivered, and the process instance is not restarted, because the crash occurred before the checkpoint. You must consider the type of processing your process definition performs to determine when a checkpoint is appropriate if it receives confirmable messages.
Q. What is the potential problem with the JMS Queue Requestor?
Answer: When we specify a reply-to queue, there is a chance of other processes sending messages to the same queue; the JMS Queue Requestor may interpret one of those as the actual response and send this wrong message to the client.
Q. What are the modes of installation of TIBCO BW applications?
Answer: a. GUI mode b. Console mode c. Silent mode
Q. What is the thread count in TIBCO Administrator?
Answer: a. 8–32 threads for BW engines b. 10/75 for HTTP connections c. JMS uses a single-thread model.
Q. What are Max Jobs, Flow Limit and Activation Limit?
Answer:
a. Max Jobs: specifies the number of process instances that are kept in memory while executing.
b. Flow Limit: the maximum number of jobs that can be spawned before the process starter is suspended.
c. Activation Limit: specifies that once a process instance is loaded, it must be kept in memory until it completes execution.
Q. What are the TIBCO best practices for users?
Answer: a. The user "tibco" should be the master of all applications. b. The user "tibcou" should have read-only access to TIBCO applications and read/write access to logs owned by the developer groups.
Q. In how many ways can you deploy an EAR file?
Answer: a. Using the TIBCO Administrator GUI. b. Using the AppManage utility to deploy the EAR file into the targeted domains.
Q. Whether to use the Checkpoint or Confirm activity first?
Answer: See the discussion of checkpoint and confirm ordering above (BW documentation, General palette, Checkpoint/Confirm).
Q. What information can be found in the AppManage batch file?
Answer: The component paths and service instance paths that are required to run the AppManage utility.
Q. What is the difference between EMS and RV?
Answer: EMS is centralized, whereas RV is bus-based.
Q. What are the two message port types?
Answer: EMS has two message port types: i. point-to-point (queues) ii. pub-sub (topics).
Q. What is the difference between a TIBCO adapter and a BW component?
Answer: a. Adapters are connectors that use a messaging channel and can be configured over source/target systems in publication, subscription or request-response mode. b. The BW components are the Designer, the Administrator and the BW engine.
Q. Why are routes used?
Answer: When we have to send messages from one server to another.
Q. If you have installed a particular version of TIBCO software, e.g. TIBCO BW X.Y.Z, what do the numbers X, Y and Z stand for?
Answer: X is the major version, Y is the minor version and Z is the patch level.
Q. What is the role of TRA?
Answer: TRA stands for TIBCO Runtime Agent. The TRA has two main functions:
- It supplies an agent that runs in the background on each machine. The agent is responsible for starting and stopping processes that run on the machine according to the deployment information, and it monitors the machine; that information is then visible via TIBCO Administrator.
- It supplies the run-time environment, that is, all shared libraries including third-party libraries.
Q. What resources get included in the EAR file created by TIBCO Designer?
Answer: An EAR file can contain local project resources, LibraryBuilder resources, and files as specified in AliasLibrary resources. In addition, the TIBCO Designer classpath may include references to other files that are included in the EAR file.
Q. What revision control system options are available in TIBCO Designer?
Answer: File sharing, VSS, Perforce, XML Canon, ClearCase, iPlanet, CVS, PVCS.
Q. What are the different modes of service invocation?
Answer: Services can be invoked in several ways. A one-way operation is executed once and does not wait for a response. A request-response operation is executed once and waits for one response; in a request-response service, communication flows in both directions, the complete interaction consists of two point-to-point messages — a request and a response — and the interaction is only considered complete after the response has arrived. Publication (notification) means an operation sends information on an as-needed basis, potentially multiple times. Subscription means incoming information is processed on an as-needed basis, potentially multiple times.
Q. What is vcrepo.dat?
Answer: TIBCO Designer creates a file named vcrepo.dat in the project root directory when you first save the project. This file is used to store properties such as the display name, TIBCO Rendezvous encoding and description. It can be used for identification in place of the project root directory and can be used as the repository locator string (repoUrl).
Q. What are the TIBCO BW activities that can participate in transactions?
Answer: Not all TIBCO BusinessWorks activities can participate in a transaction. Only the following types of activities have transactional capabilities: JDBC activities, JMS activities, ActiveEnterprise Adapter activities that use JMS transports, EJB activities, and TIBCO iProcess BusinessWorks Connector activities.
Q. What are the different types of transactions TIBCO provides?
Answer: TIBCO BusinessWorks offers a variety of transaction types that can be used in different situations; you can use the type that suits the needs of your integration project. When you create a transaction group, you must specify the type of transaction. TIBCO BusinessWorks supports the following types: JDBC, Java Transaction API (JTA) UserTransaction, and XA Transaction.
Q. What activities are supported in a JTA transaction?
Answer: The Java Transaction API (JTA) UserTransaction type allows JDBC, JMS, ActiveEnterprise Adapter (using JMS transports) and EJB activities to participate in transactions.
Q. What activities are supported in an XA transaction?
Answer: The XA Transaction type allows JDBC activities, ActiveEnterprise Adapter activities that use the JMS transport, and JMS activities to participate in transactions.
Q. What are the possible error outputs of the Read File activity?
Answer:
FileNotFoundException: thrown when the file does not exist.
UnsupportedEncodingException: thrown when the text file's encoding is not valid and the content of the file is read into process data.
FileIOException: thrown when an I/O exception occurs while trying to read the file.
Q. What is the purpose of the Inspector activity?
Answer: The Inspector activity is used to write the output of any or all activities and process variables to a file and/or stdout. This is particularly useful when debugging process definitions and you wish to see the entire schema instead of mapping specific elements to the Write File activity.
Q. What are the maximum/minimum numbers of threads available for incoming HTTP?
Answer: 75 maximum / 10 minimum.
Q. How can unauthorized users be prevented from triggering a process?
Answer: By giving 'write' access for the process engine to only selected users; only users with 'write' access can perform activities such as deploying applications and starting/stopping process engines.
Q. What are the mandatory configuration parameters for an FTP connection, and for FTP with a firewall?
Answer: The mandatory parameters for an FTP connection are the FTP host, port, username and password. If a firewall is enabled, the proxy host and port are additionally required.
Q. How do you use the legacy .dat file format with the latest Designer?
Answer: Convert the .dat file to a multi-file project using the Administration tab while starting up Designer (the other one being the Project tab), and then open the multi-file project in the normal way.
Q. What are the encodings supported by Designer?
Answer: The encodings supported by Designer are ISO8859-1 (Latin-1) and UTF-8.
Q. What are the four main panels of the Designer window?
Answer: The project panel, the palette panel, the design panel and the configuration panel.
Q. How do you determine whether there are broken references in the project?
Answer: Project -> Validate for deployment.
Q. Where are the Designer preferences stored?
Answer: Designer preferences are stored in a file called Designer.prefs in the user's home directory.
Q. What are the options for configuring storage for a process engine's checkpoint repository?
Answer: The options are a local file or a database. Fault-tolerant engines can recover from a checkpoint only when a database is used.
Q. Process engines in a fault-tolerant group can be configured as peers or as master/secondary. How do these differ?
Answer: Peer means all of them have the same weight; when one engine fails, another one takes over and continues processing until it fails. In a master/secondary configuration the weights are unequal: the secondary starts processing when the master fails, but when the master recovers, the secondary stops and the master continues processing.
Q. What are the uses of grouping activities?
Answer: The uses of grouping activities are:
- Create a set of activities having a common error transition.
- Repeat a group of activities based on a condition: iterate over a list, repeat until a condition is true, or repeat on error until a condition is true.
- Group activities into a transaction.
- Create a critical section area that synchronizes process instances.
- A 'Pick First' group allows you to wait for the occurrence of multiple events and proceed along the path following the first event to occur.
Q. What is the purpose of a Lock shared configuration resource?
Answer: A Lock is specified for a 'Critical Section' group when the scope is 'Multiple'. It can be used to ensure synchronization across process instances belonging to multiple process definitions, or across process instances in different engines (check the multi-engine flag for the lock in this case; the BW engine then needs to be configured with database persistence at deployment). If synchronization is only needed for process instances belonging to the same process definition inside one engine, just specify the scope as 'Single'. Get more explanation at Tibco Course Online.
Q. How do you control the sequence of execution of process instances created by a process starter?
Answer: Use the sequencing key field on the Misc tab of any process starter. Process instances with the same value for this field are executed in the sequence in which they are started.
Q. Can there be two error transitions out of an activity?
Answer: No. There can be only one Error transition and one 'Success if no matching condition' transition out of each activity.
Q. When is a 'No Action' group used?
Answer: A 'No Action' group is used to give a set of activities a common error transition.
Q. What activity can be used to set the value of a user-defined process variable?
Answer: The Assign activity.
Q. Which mechanism can be used to pass data between a process instance and a called sub-process, other than mapping from/to the callee's input/output?
Answer: This can be accomplished using job shared variables, unless the 'Spawn' flag is enabled in the Call Process activity, in which case the called sub-process is a new job and hence gets a fresh copy of the job shared variable initialized as per its configuration. A shared variable can overcome this limitation, as its scope is not limited to one job.
Q. What are the three scenarios where the BW engine has to be configured with database persistence instead of a local file?
Answer: The three scenarios are: shared variables across BW engines, locking across groups in multiple BW engines, and Wait/Notify across BW engines.
Q. If you want a group to be executed when there is some unhandled error, but subject to some maximum number of iterations, which group do you use?
Answer: We can use 'Repeat on Error until true'.
Q. When is a 'Generate Error' activity useful?
Answer: When you handle an error inside a called sub-process or group and want to rethrow the error to the caller (which happens by default if you don't handle the error in the called process).
Q. Which activity is used for detecting duplicate message processing?
Answer: The Checkpoint activity. Specify a unique ID for the duplicate key field; the engine maintains a list of these key values. When a process reaches the Checkpoint activity with a value for the duplicate key that already exists, it throws a DuplicateException, and an error transition can then handle this case.
Q. Give an example where graceful migration of a service from one machine to another is not possible.
Answer: HTTP Receiver. In this case the receiver on the new machine starts listening on the same port, but you need to redirect requests from the old machine to the new one.
Q. What are the types of adapter services?
Answer: Subscriber service, Publisher service, Request-Response service, and Request-Response Invocation service.
Q. If the business process needs to invoke another web service, which resource do you use?
Answer: The SOAP Request Reply activity. If the business process needs to be exposed as a SOAP service, use a SOAP Event Source in conjunction with SOAP Send Reply or SOAP Send Fault.
Q. What is the functionality of the Retrieve Resources resource?
Answer: It can be used to serve the WSDL file of a SOAP Event Source to an HTTP client. Construct a process like: HTTP Receiver -> Retrieve Resources -> Send HTTP Response. The WSDL file for a SOAP service can then be retrieved using the HTTP request http://<host>:<port>/<path>/<resourceName>?wsdl, where <path> is the folder path to the SOAP Event Source process and <resourceName> is the name of the process. Example: http://purch:8877/Purchasing/GetPurchaseOrder?wsdl
Q. What is the scope of user-defined process variables?
Answer: The scope of a user-defined process variable is only the process in which it is defined (not even a sub-process invoked from that process).
Q. What is the difference between a shared variable and a job shared variable?
Answer: Both of them can be manipulated via the palette resources 'Get Shared Variable' and 'Set Shared Variable'. A job shared variable is private to one instance of a job; in other words, each job gets a fresh copy. In the case of a shared variable the same copy is shared across all job instances; it can even be persisted, can survive BW engine restarts, and can be shared across multiple BW engines (when deployed using DB persistence).
Q. How do wait-notify resources work?
Answer: Basically, Wait and Notify should share a common notification configuration, which is just a schema definition for the data that will be passed from the notifier to the waiter. Specific instances of waiter and notifier are correlated via a key. For example, when one process is in a wait state for the key 'Order-1', it waits until another process issues a notification with the same key value.
Q. What is the default axis in XPath?
Answer: The child axis. This means that when you select "BOOK" from the current context, it selects a child node with that name, not a sibling with that name. Other axes are parent, self, sibling, etc.
Q. Suppose you get an error while accessing a queue saying that you do not have the necessary permissions to access it. What might be the reason or solution?
Answer: The user that is assigned permissions on the queue and the user used while creating the connection must match; grant the required permissions to the connecting user.

Q. How does the secondary server know that the primary server has failed?
Answer: Based on heartbeat intervals.

Q. What is the JMS Queue Requestor?
Answer: The JMS Queue Requestor activity is used to send a request to a JMS queue and receive a response back from the JMS client (see the Java sketch after this group of questions).

Q. What is the JMS Topic Requestor?
Answer: The JMS Topic Requestor activity is used to communicate with a JMS application's request-response service. This service invokes an operation with input and output. The request is sent to a JMS topic and the JMS application returns the response to the request.

Q. How do you add an EMS server to Administrator?
Answer: Using Domain Utility.

Q. How do you remove individual messages from destinations?
Answer: Using the purge command.

Q. What is a TIBCO domain?
Answer: A domain is a collection of hardware and software components that are used for business process integration; it defines the TIBCO BusinessWorks environment. Each domain must contain one and only one administration server and must have a unique domain name. A domain may contain one or more machines, but no single machine can belong to multiple domains. Each machine may host more than one type of software component.

Q. What is a deployment?
Answer: A deployment is a completely configured instance of an integration or project. TIBCO Designer is used to configure projects and deployments in the current version; in future versions, TIBCO Administrator will be used to manage the deployment of projects.

Q. What are the main responsibilities of the Admin server?
Answer: It manages data storage for the Administrator, manages transport options for applications, and enforces security for the domain.

Q. What is the TIBCO Hawk agent?
Answer: It is an independent process that monitors applications and systems.

Q. What are the scripting utilities?
Answer: There are two scripting utilities: buildear and AppManage.

Q. What are the components of TIBCO Administrator and what are they used for?
Answer: The Administration Server, which manages resources in the administration domain, and the Administrator GUI, which provides a web browser interface that lets you configure users and applications, deploy applications, and monitor processes and machines in the administration domain.

Q. Can we run multiple administrators in the same domain?
Answer: No. Only one administration server is installed and configured per domain.

Q. What is the default port where Administrator runs?
Answer: Port 8080 on host localhost.

Q. What is the UDDI module?
Answer: UDDI stands for Universal Description, Discovery and Integration. The module creates connections between UDDI servers and the web services contained in those servers. If you are granted permissions, you can publish web service information through UDDI servers.

Q. What is Resource Management?
Answer: It creates application domains, can customize the machine display, and displays information about machines and the processes running on each machine.

Q. What is an application domain?
Answer: An application domain stores its data separately, in a repository independent of the administration domain repository.

Q. What is an Application Archive?
Answer: It provides information about the enterprise archive file, including package name, version, description, and creation date.

Q. Can we change an adapter from one domain to another?
Answer: Yes, but you will need to uninstall the existing adapter that has joined the current domain and then reinstall the adapter and join it to the new domain. You can also change the domain information directly using Domain Utility.
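The JMS Queue Requestor and Topic Requestor activities described above follow the standard JMS request/reply pattern. Below is a minimal sketch using the plain javax.jms API; the server URL, credentials, and queue name are placeholders, and the vendor connection-factory class is an assumption (it ships with the EMS client jar, so verify the exact class name against your installation).

    import javax.jms.JMSException;
    import javax.jms.Queue;
    import javax.jms.QueueConnection;
    import javax.jms.QueueConnectionFactory;
    import javax.jms.QueueRequestor;
    import javax.jms.QueueSession;
    import javax.jms.Session;
    import javax.jms.TextMessage;

    public class QueueRequestorSketch {
        public static void main(String[] args) throws JMSException {
            // Assumption: the EMS client jar provides com.tibco.tibjms.TibjmsQueueConnectionFactory.
            QueueConnectionFactory factory =
                    new com.tibco.tibjms.TibjmsQueueConnectionFactory("tcp://localhost:7222");
            QueueConnection connection = factory.createQueueConnection("admin", "");
            try {
                QueueSession session = connection.createQueueSession(false, Session.AUTO_ACKNOWLEDGE);
                Queue requestQueue = session.createQueue("sample.request.queue"); // placeholder name
                connection.start();

                // QueueRequestor sends the request and blocks until a reply arrives on a
                // temporary queue it sets as JMSReplyTo -- the same pattern the BW activity uses.
                QueueRequestor requestor = new QueueRequestor(session, requestQueue);
                TextMessage request = session.createTextMessage("status of Order-1?");
                TextMessage reply = (TextMessage) requestor.request(request);
                System.out.println("Reply: " + reply.getText());
                requestor.close();
            } finally {
                connection.close();
            }
        }
    }

A Topic Requestor works the same way, using TopicRequestor with a TopicSession and a topic destination.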
Q. How do you view tracing results for a process engine?
Answer: Go to Application Management, click All Service Instances, click the process engine name, open the Tracing tab, and view the details.

Q. What is the deployment choice?
Answer: When configuring an administration domain, you can set how the admin server creates and stores application data: 1) local application data or 2) server-based application data.

Q. What are the versions of TIBCO Administrator?
Answer: There are two editions: 1) Repository Edition and 2) Enterprise Edition.

Q. Can we move a machine from one domain to another later?
Answer: Yes, you can add or remove a machine from a domain using Domain Utility.

Q. What are the restrictions when using the TIBCO Administrator GUI via the secondary server?
Answer: You cannot perform user management, deploy applications, or perform any other activity for which read-write access is required.

Q. What is in the User Management module?
Answer: Users, Roles, and Security.

Q. What does TIBCO Administrator do?
Answer: TIBCO Administrator supports security administration as well as monitoring and management of processes and machines. It consists of the TIBCO Administration Server and the web-browser-based TIBCO Administrator GUI.

Q. What are the TIBCO Administrator modules?
Answer: User Management, Resource Management, and Application Management.

Q. What are the cases where a business process cannot proceed correctly after a restart from a checkpoint?
Answer: Sending an HTTP response, confirming an email or JMS message, and similar operations. The confirmation or HTTP response has to be done in the same session, and when the engine crashes those sessions are closed at the socket level. In such cases, send the response or confirm before the checkpoint.

Q. Which group do you use to wait for multiple events and proceed with the first one to occur?
Answer: A 'Pick First' group.

Q. What are the two storage methods used by the TIBCO EMS server?
Answer: File-based and database.

Q. What files are created with the file-based data storage method?
Answer: sync.db, async.db, and meta.db.

Q. What information does meta.db contain?
Answer: Durable subscribers, fault-tolerant connections, and other metadata.

Q. What does the flow control property specify?
Answer: It specifies the maximum size of the pending messages held on the server.

Q. What are the destinations for messages?
Answer: Topics and queues.

Q. In how many ways can destinations for messages be created?
Answer: Static: the administrator creates the destinations and client programs use them. Dynamic: the client program creates destinations at runtime. Temporary: servers connected through routes communicate through temporary destinations.

Q. What are the messaging models supported by the EMS server?
Answer: Point-to-point (queues), publish-subscribe (topics), and multicast (topics).

Q. What is the difference between exclusive and non-exclusive queues?
Answer: With an exclusive queue only one receiver can take messages, whereas with a non-exclusive queue many receivers can receive messages.

Q. How long is a message stored for durable subscribers?
Answer: As long as the durable subscriber exists, or until the message expiration time is reached or the storage limit has been reached.

Q. What are the different delivery modes supported by EMS?
Answer: Persistent, non-persistent, and reliable (illustrated in the sketch after this group of questions).

Q. What is the disadvantage of reliable delivery mode?
Answer: In reliable mode the producer keeps sending messages to the server without knowing the status of the consumer.
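As a small illustration of the persistent and non-persistent delivery modes mentioned above, the sketch below sets them through the standard javax.jms API. The session and queue are assumed to have been created as in the earlier requestor sketch, and EMS's vendor-specific reliable mode is only noted in a comment because it is enabled through the EMS client library rather than the standard DeliveryMode constants.

    import javax.jms.DeliveryMode;
    import javax.jms.JMSException;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;

    public class DeliveryModeSketch {
        // 'session' and 'queue' are assumed to come from a connection like the one above.
        static void send(Session session, Queue queue, String text) throws JMSException {
            MessageProducer producer = session.createProducer(queue);

            // PERSISTENT: the server secures the message in its store before the send completes.
            producer.setDeliveryMode(DeliveryMode.PERSISTENT);
            producer.send(session.createTextMessage(text));

            // NON_PERSISTENT: no store write, so the message is lost if the server fails.
            producer.setDeliveryMode(DeliveryMode.NON_PERSISTENT);
            producer.send(session.createTextMessage(text));

            // EMS also offers a vendor-specific RELIABLE mode in which the producer does not
            // wait for any acknowledgement from the server; it is enabled via the EMS client
            // library rather than the standard javax.jms DeliveryMode constants.
            producer.close();
        }
    }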
Q. What is the condition for a persistent message to be stored on disk for topics?
Answer: There must be at least one durable subscriber, or a subscriber connected to the EMS server over a fault-tolerant connection.

Q. How do you distinguish dynamic queues from static queues?
Answer: Dynamic queues are shown with an asterisk (*) before the queue name.

Q. What happens if the npsend_checkmode parameter in tibemsd.conf is enabled?
Answer: The server sends an acknowledgement for non-persistent messages. Get more practical explanation at Tibco Training Online.

Q. What is the shared state in fault-tolerant operation?
Answer: The primary server and the backup server both have a connection to a shared state, which contains information about client connections and persistent messages.

Q. In how many ways can a backup server detect failure of the primary server?
Answer: Heartbeat failure: the primary server sends heartbeat messages to the backup server to indicate that it is working, and the backup treats their absence as a failure. Connection failure: the backup server detects the failure of its TCP connection with the primary server.

Q. What is the use of locking in fault-tolerant operation?
Answer: It prevents the backup server from taking over the primary role while the primary is healthy: during normal operation the primary server holds a lock on the shared state, and when the primary fails the backup server acquires the lock and takes over.

Q. If authorization is enabled in tibemsd.conf, what is the condition for configuring the EMS servers for fault tolerance?
Answer: The server name and password must be the same for the primary and backup servers, and the username and password for both servers should match the server and password parameters in tibemsd.conf.

Q. What changes need to be made in the configuration file for EMS fault-tolerant operation?
Answer: In the primary server's configuration, set the ft_active parameter to the URL of the backup server; in the backup server's configuration, set ft_active to the URL of the primary server (a configuration sketch follows this group of questions).

Q. What are the different types of zones?
Answer: Multi-hop zones and one-hop zones.

Q. What is failsafe?
Answer: In failsafe mode, messages are first stored on disk before being sent, so that no messages are lost.

Q. What is the default port number for the EMS server?
Answer: 7222.

Q. What is the difference between Rendezvous and EMS?
Answer: Rendezvous (rvd) uses a bus-based architecture, whereas EMS uses a centralized architecture.

Q. What are the different acknowledgement modes?
Answer: DUPS_OK_ACKNOWLEDGE, AUTO_ACKNOWLEDGE, CLIENT_ACKNOWLEDGE, and NO_ACKNOWLEDGE.

Q. In how many ways can we determine the lifespan of a message in a queue, and what are they?
Answer: Through the expiration parameter in the queue configuration file, and through the JMS expiration time set on the Queue Sender. The JMS expiration time in the queue sender overrides any value given in the configuration.

Q. What are the message storage mechanisms for queues?
Answer: Persistent and non-persistent. Persistent: messages are written to external storage before being sent. Non-persistent: messages are not written to any external storage and are not available for retrieval.

Q. What is the condition for creating a bridge?
Answer: The queues and topics must be defined as global.

Q. What are the advantages and disadvantages of multicast?
Answer: Advantages: the message is broadcast only once, which reduces the bandwidth used in the publish-subscribe model and reduces network traffic. Disadvantages: it offers only last-hop delivery, so it cannot be used to send messages between servers.

Contact us for more on Tibco Online Training.
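Pulling the fault-tolerance answers together, a minimal tibemsd.conf sketch for an active/backup pair could look like the following. The host names, server name, and store path are placeholders, and other parameters a real environment needs (authorization, users, routes and so on) are omitted, so treat this purely as an illustration of where ft_active points on each member.

    # Primary member's tibemsd.conf (illustrative values only)
    server    = EMS-SERVER                # same logical server name on both members
    listen    = tcp://7222
    ft_active = tcp://backup-host:7222    # URL of the other member
    store     = /shared/ems/datastore     # shared state reachable by both members

    # Backup member's tibemsd.conf
    server    = EMS-SERVER
    listen    = tcp://7222
    ft_active = tcp://primary-host:7222
    store     = /shared/ems/datastore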