Ab Initio Interview Questions

Q.What is surrogate key?
Answer: A surrogate key is a system-generated sequential number which acts as a primary key.

Q.What are the differences between Ab Initio and Informatica?
Answer: Informatica and Ab Initio both support parallelism, but Informatica supports only one type of parallelism while Ab Initio supports three: component parallelism, data parallelism, and pipeline parallelism. Ab Initio has no built-in scheduler like Informatica; you schedule through scripts or run graphs manually. Ab Initio supports different types of text files, meaning you can read the same file with different record structures, which is not possible in Informatica; Ab Initio is also more user friendly than Informatica. Informatica is an engine-based ETL tool: the power of the tool is in its transformation engine, and the code it generates after development cannot be seen or modified. Ab Initio is a code-based ETL tool: it generates ksh or bat code, which can be modified to achieve goals that cannot be handled through the ETL tool itself. Initial ramp-up time with Ab Initio is quick compared to Informatica; when it comes to standardization and tuning, both probably fall into the same bucket. Ab Initio doesn't need a dedicated administrator (a UNIX or NT admin will suffice), whereas Informatica needs a dedicated administrator. With Ab Initio you can read data with multiple delimiters in a given record, whereas Informatica forces all fields to be delimited by one standard delimiter. Error handling: in Ab Initio you can attach error and reject files to each transformation and capture and analyze the messages and data separately; Informatica has one huge log, which is very inefficient when working on a large process with numerous points of failure.

Q.What is the difference between rollup and scan?
Answer: Rollup produces one summary record per group; it cannot generate cumulative (running) summary records. For that we use scan.

Q.Why do we go for Ab Initio?
Answer: Ab Initio is designed to support the largest and most complex business applications. We can develop applications easily using the GDE for business requirements. Data processing is very fast and efficient compared to other ETL tools, and it is available on both Windows NT and UNIX.

Q.What is the difference between partitioning with key and round robin?
Answer: PARTITION BY KEY: here we specify the key on which the partition will occur. Because distribution depends on the key values, the data is not guaranteed to be balanced; it is useful for key-dependent parallelism. PARTITION BY ROUND ROBIN: here records are distributed sequentially, spreading the data evenly in blocksize chunks across the output partitions. It is not key based and results in well-balanced data, especially with a blocksize of 1; it is useful for record-independent parallelism.

Q.How to create a surrogate key using Ab Initio?
Answer: A key is a field or set of fields that uniquely identifies a record in a file or table. A natural key is a key that is meaningful in some business or real-world sense; for example, a social security number for a person, or a serial number for a piece of equipment, is a natural key. A surrogate key is a field added to a record, either to replace the natural key or in addition to it, and has no business meaning. Surrogate keys are frequently added to records when populating a data warehouse, to help isolate the records in the warehouse from changes to the natural keys by outside processes.

Q.What are the most commonly used components in Ab Initio graphs?
Answer: input file / output file, input table / output table, lookup / lookup_local, reformat, gather / concatenate, join, run sql, join with db, compression components, filter by expression, sort (single or multiple keys), rollup, partition by expression / partition by key.

Q.How do we handle a DML that changes dynamically?
Answer: There are many ways to handle DMLs that change dynamically within a single file.
Some suitable methods are to use a conditional DML, or to call the vector functionality while calling the DMLs.

Q.What is meant by limit and ramp in Ab Initio? In which situations are they used?
Answer: Limit and ramp are the variables used to set the reject tolerance for a particular graph. This is one of the options for the reject-threshold property; limit and ramp values must be supplied if this option is enabled. The graph stops execution when the number of rejected records exceeds limit + (ramp * number_of_records_processed). The default value is 0.0. The limit parameter contains an integer representing a number of reject events; the ramp parameter contains a real number representing a rate of reject events per record processed. Typical limit and ramp settings:
Limit = 0, Ramp = 0.0: abort on any error
Limit = 50, Ramp = 0.0: abort after 50 errors
Limit = 1, Ramp = 0.01: abort if more than 2 in 100 records cause errors
Limit = 1, Ramp = 1: never abort

Q.What are data mapping and data modeling?
Answer: Data mapping deals with the transformation of the extracted data at FIELD level, i.e. the transformation of the source field to the target field is specified by the mapping defined on the target field. The data mapping is specified during the cleansing of the data to be loaded. For example:
source: string(35) name = "Siva Krishna      ";
target: string("01") nm = NULL(""); /* maximum length is string(35) */
Then we can have a mapping like: straight move, trimming the leading or trailing spaces. The above mapping specifies the transformation of the field nm.

Q.What is the difference between a DB config and a CFG file?
Answer: A .dbc file has the information required for Ab Initio to connect to the database to extract or load tables or views, while a .cfg file is the table configuration file created by db_config when using components like Load DB Table.

Q.What is meant by layout?
Answer: A layout is a list of host and directory locations, usually given by the URL of a file or multifile. If a layout has multiple locations but is not a multifile, it is a list of URLs called a custom layout. A program component's layout is the list of hosts and directories in which the component runs; a dataset component's layout is the list of hosts and directories in which the data resides. Layouts are set on the Properties Layout tab. The layout defines the level of parallelism; parallelism is achieved by partitioning data and computation across processors.

Q.What are Cartesian joins?
Answer: A Cartesian join gives you a Cartesian product: every row of one table is joined to every row of another table. You can also get one by joining every row of a table to every row of itself.

Q.What function would you use to convert a string into a decimal?
Answer: To convert a string to a decimal we typecast it using the following syntax:
out.decimal_field :: (decimal(size_of_decimal)) string_field;
This converts the string and populates the decimal field in the output.

Q.How do we handle a DML that changes dynamically?
Answer: There are many ways to handle DMLs that change dynamically within a single file. Some suitable methods are to use a conditional DML, or to call the vector functionality while calling the DMLs. We can also use the MULTIREFORMAT component to handle dynamically changing DMLs.

Q.Explain the differences between api and utility mode.
Answer: API and UTILITY are the two possible interfaces for connecting to databases to perform certain user-specific tasks. These interfaces allow the user to access certain functions (provided by the database vendor) to perform operations on the databases; the functionality of each interface depends on the database. API has more flexibility but is often considered slower than UTILITY mode.
The trade-off is between performance and flexibility.

Q.What are the uses of the is_valid and is_defined functions?
Answer: is_valid and is_defined are predefined functions. is_valid() tests whether a value is valid. The is_valid function returns:

  • The value 1 if expr is a valid data item.
  • The value 0 otherwise.

If expr is a record type that has field validity checking functions, is_valid calls each field validity checking function, and returns 0 if any of them returns 0 or NULL. Examples:
is_valid(1) → 1
is_valid("oao") → 1
is_valid((decimal(8))"1,000") → 0
is_valid((date("YYYYMMDD"))"19960504") → 1
is_valid((date("YYYYMMDD"))"abcdefgh") → 0
is_valid((date("YYYY MMM DD"))"1996 May 04") → 1
is_valid((date("YYYY MMM DD"))"1996*May&04") → 0
is_defined() tests whether an expression is not NULL. The is_defined function returns the value 1 if expr evaluates to a non-NULL value, and the value 0 otherwise. The inverse of is_defined is is_null.

Q.What is meant by merge join and hash join? Where are they used in Ab Initio?
Answer: The command-line syntax for the Join component consists of two commands. The first one calls the component, and is one of two commands:

  • mp merge join to process sorted input
  • mp hash join to process unsorted input

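As a rough illustration of the difference (a plain-Python sketch, not Ab Initio code; the records and keys are invented for the example), a merge join walks two already-sorted inputs in step, while a hash join builds an in-memory table from one input and therefore needs no sorting:

```python
# Sketch (not Ab Initio code): merge join over sorted inputs vs.
# hash join over unsorted inputs, joining (key, value) pairs on key.

def merge_join(left, right):
    """Both inputs must be sorted by key; walk them in step."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit all pairings for this key (handles duplicate right keys).
            j2 = j
            while j2 < len(right) and right[j2][0] == lk:
                out.append((lk, left[i][1], right[j2][1]))
                j2 += 1
            i += 1
    return out

def hash_join(left, right):
    """No sort required; build a hash table from one input."""
    table = {}
    for k, v in right:
        table.setdefault(k, []).append(v)
    return [(k, v, rv) for k, v in left for rv in table.get(k, [])]

left = [(1, "a"), (2, "b"), (3, "c")]
right = [(2, "x"), (3, "y"), (3, "z")]
print(merge_join(left, right))  # inner-join result
print(hash_join(left, right))   # same rows
```

This is also why the sorted-input parameter matters: the merge strategy is cheap only when the cost of sorting has already been paid.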
Q.What are data mapping and data modelling?
Answer: Data mapping deals with the transformation of the extracted data at FIELD level, i.e. the transformation of the source field to the target field is specified by the mapping defined on the target field. The data mapping is specified during the cleansing of the data to be loaded.

Q.What is the difference between a sandbox and the EME? Can we perform checkin and checkout through a sandbox? Can anybody explain checkin and checkout?
Answer: Sandboxes are work areas used to develop, test or run code associated with a given project. Only one version of the code can be held within a sandbox at any time. The EME datastore contains all versions of the code that have been checked into it. A particular sandbox is associated with only one project, whereas a project can be checked out to a number of sandboxes.

Q.What are graph parameters?
Answer: Graph parameters are parameters added to the respective graph; you can add them by selecting Edit > Parameters from the menu. For example, if you want to run the same graph for n files in a directory, you can assign a graph parameter to the input file name and supply the parameter value from a script before invoking the graph.

Q.How to schedule graphs in Ab Initio, like workflow scheduling in Informatica? And where do we use UNIX shell scripting in Ab Initio?
Answer: Ab Initio has no built-in scheduler; graphs are deployed as shell scripts, which are scheduled with external tools or run manually. This is one of the places UNIX shell scripting is used in Ab Initio.

Q.How to improve performance of graphs in Ab Initio? Give some examples or tips.
Answer: There are many ways to improve graph performance in Ab Initio. Here are a few points:

  • Use MFS; partition using partition by round robin.
  • If data is large, use lookup_local rather than lookup where possible.
  • Take out unnecessary components such as filter by expression; use select expressions in reformat/join/rollup instead.
  • Use gather instead of concatenate.
  • Tune MAX_CORE for optimal performance.
  • Try to avoid too many phases.
  • Go parallel as soon as possible using Ab Initio partitioning techniques.
  • Once data is partitioned, do not bring it to serial and then back to parallel; repartition instead.
  • For small processing jobs, serial may be better than parallel.
  • Do not access large files across NFS; use the FTP component.
  • Use ad hoc MFS to read many serial files in parallel, and use the concatenate component.
  • Using phase breaks lets you allocate more memory to individual components and make your graph run faster.
  • Use a checkpoint after the sort rather than landing the data to disk.
  • Use the in-memory feature of join and rollup.
  • Best performance is gained when components can work in memory, governed by MAX_CORE.
    • MAX_CORE for SORT is calculated from the size of the input data file.
    • For an in-memory join, the memory needed equals the non-driving data size plus overhead. If an in-memory join cannot fit its non-driving inputs in the provided MAX_CORE, it drops all its inputs to disk and in-memory processing makes no sense.
  • Use rollup and filter by expression as soon as possible to reduce the number of records.
  • When joining a very small dataset to a very large dataset, it is more efficient to broadcast the small dataset to MFS using the broadcast component, or to use the small file as a lookup.
  • Use round robin partitioning or load balancing if you are not joining or rolling up.
  • Filter the data at the beginning of the graph; filter unwanted fields/records as early as possible.
  • Use lookups instead of joins if you are joining a small table to a large table.
  • Take out old components; use new components (for example, join instead of match merge).
  • Avoid sorting smaller datasets by using in-memory join.
  • Use an Ab Initio layout instead of the database default to achieve parallel loads.
  • Change the AB_REPORT parameter to increase the monitoring duration.
  • Use catalogs for reusability.
  • Use sort after a partition component, not before.
  • Partition the data as early as possible and departition it as late as possible.
  • Try to avoid using the join with db component.

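Two of the tips above ("go parallel as soon as possible" and "use rollup and filter by expression as soon as possible to reduce the number of records") can be illustrated with a plain-Python sketch (not Ab Initio code; the data is invented): aggregating inside each partition first means only small summaries, rather than every record, cross partition boundaries later in the graph.

```python
# Sketch: why rolling up early shrinks the data that downstream
# components must move. Plain Python standing in for an Ab Initio graph.
from collections import Counter

records = [("GB", 10), ("US", 5), ("GB", 7), ("DE", 3)] * 1000  # 4000 records

# Naive: carry every record to the end, then aggregate once.
late = Counter()
for country, amount in records:
    late[country] += amount

# "Rollup early": aggregate inside each of 4 round-robin partitions,
# then combine the per-partition summaries.
partitions = [records[i::4] for i in range(4)]
partials = []
for part in partitions:
    c = Counter()
    for country, amount in part:
        c[country] += amount
    partials.append(c)

early = Counter()
for c in partials:
    early.update(c)  # Counter.update adds counts together

# Same answer, but only 4 small summaries crossed partition boundaries
# instead of 4000 records.
print(early == late)  # True
```

The same reasoning applies to filtering: dropping unwanted records in the first phase means every later component touches less data.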
Q.How does the force_error function work? If we set never abort in reformat, will force_error stop the graph, or will it continue to process the next set of records?
Answer: force_error, as the name suggests, forces an error when a stated condition is not met; the function can be used as the requirement dictates. If you want to stop execution of the graph when a specific condition is not met (say you must reconcile input and output record counts and the graph should fail if they differ), then set the reject-threshold to "Abort on first reject" so that the graph stops.
Note: force_error directs all records meeting the condition to the reject port, with the error message going to the error port. In certain special circumstances you can also treat the reject port as an additional data flow path leaving the component. When using force_error to direct valid records to the reject port for separate processing, remember that invalid records will also be sent there.

Q.What are the most commonly used components in an Ab Initio graph? Can anybody give a practical example of a transformation of data, say customer data in a credit card company, into meaningful output based on business rules?
Answer: The most commonly used components in any Ab Initio project are input file/output file, input table/output table, lookup file, reformat, gather, join, run sql, join with db, compress components, sort, trash, partition by expression, partition by key, and concatenate.

Q.How to work with parameterized graphs?
Answer: One of the main purposes of parameterized graphs is that if we need to run the same graph n times for different files, we set up graph parameters like $INPUT_FILE, $OUTPUT_FILE etc. and supply the values under Edit > Parameters; these parameters are substituted at run time. We can set different types of parameters: positional, keyword, local etc.
The idea is that instead of maintaining different versions of the same graph, we maintain one version for different files.

Q.What is the use of the unused port in the join component?
Answer: While joining two input flows, records that match the join condition go to the output port, and records that do not meet the join condition appear on the unused ports.

Q.What is meant by dedup sort with a null key?
Answer: If we don't use any key in the sort component while using dedup sort, the output depends on the keep parameter. The whole input is treated as one group: first keeps only the first record, last keeps only the last record, and unique_only produces no records in the output file.

Q.Can anyone tell me what happens when the graph runs? The Co>Operating System is on the host and we run the graph from somewhere else; how does the Co>Operating System interact with the native OS?
Answer: The Co>Operating System is layered on top of the native OS. When a graph is executed it is deployed using the host settings and a connection method such as rexec, telnet, rsh or rlogin; this is how the graph interacts with the Co>Operating System. Whenever you press the Run button in the GDE, the GDE generates a script, which is transferred to the host specified in your GDE run settings. The Co>Operating System then interprets this script and executes it on different machines (if required) as sub-processes; after completion, each sub-process returns a status code to the main process, which in turn returns the error or success code of the job to the GDE.

Q.What is the difference between conventional loading and direct loading? When is each used in real time?
Answer: Conventional load: before loading the data, all table constraints are checked against the data. Direct load (faster): all constraints are disabled and the data is loaded directly.
Later the data is checked against the table constraints, and bad data is not indexed. API mode corresponds to conventional loading, and utility mode to direct loading.

Q.Explain environment variables with an example.
Answer: Environment variables serve as global variables in a UNIX environment. They are used for passing values from one shell/process to another, and are inherited by Ab Initio as sandbox variables/graph parameters such as AI_SORT_MAX_CORE, AI_HOME, AI_SERIAL, AI_MFS etc. To see what variables exist in your UNIX shell, find out the naming convention and use a command such as env | grep AI_. This lists the matching variables set in the shell; you can then look at the graph parameters/components to see how these variables are used inside Ab Initio.

Q.How to find the number of arguments defined in a graph?
Answer: $* is the list of shell arguments. Then what are $# and $?: $# is the number of positional parameters, and $? is the exit status of the last executed command.

Q.How many inputs does the join component support?
Answer: Join supports a maximum of 60 inputs; the minimum is 2.

Q.What is max-core? Which components use MAX_CORE?
Answer: The MAX_CORE parameter determines the maximum amount of memory, in bytes, that a specified component will use. If the component is running in parallel, the value of MAX_CORE represents the maximum memory usage per partition. If MAX_CORE is set too low, the component runs slower than expected; too high, and the component uses too many machine resources and slows down dramatically. The MAX_CORE parameter can be defined in the following components:

  • SCAN
  • in-memory SCAN
  • ROLLUP
  • in-memory ROLLUP
  • in-memory JOIN
  • SORT

Whenever these components are used with the parameter set to "In memory: Input need not be sorted", a max-core value must be specified.

Q.What does dependency analysis mean in Ab Initio?
Answer: Dependency analysis analyses the project for the dependencies within and between graphs. The EME examines the project and develops a survey tracing how data is transformed and transferred, field by field, from component to component. Dependency analysis has two basic steps:

  • Translation
  • Analysis

Analysis Level:  In the check in wizard’s advanced options, the analysis level can be specified as one of the following:

  • None: No dependency analysis is performed during the check-in.
  • Translation only: The graph being checked in is translated to datastore format, but no error checking is done. This is the minimum requirement during check-in.

  • Translation with checking (default): Along with the translation, errors that would interfere with dependency analysis are checked for. These include:

  • Absolute paths
  • Undefined parameters
  • DML syntax errors
  • Parameter references to objects that cannot be resolved
  • Wrong substitution syntax in parameter definitions
  • Full dependency analysis: Full dependency analysis is done during check-in. It is not generally recommended, as it takes a long time and can delay the check-in process.

What to analyse:

  • All files: Analyse all files in the project.
  • All unanalysed files: Analyse all files that have been changed, or that depend on or are required by files that have changed, since the last time they were analysed.
  • Only my checked-in files: All files checked in by you are analysed if they have not been analysed before.
  • Only the file specified: Apply analysis to the specified file only.

Q.What is the difference between .dbc and .cfg files?
Answer: A .cfg file is for the remote connection and a .dbc file is for connecting to the database. A .cfg file contains:

  • The name of the remote machine
  • The username/pwd to be used while connecting to the db.
  • The location of the Co>Operating System on the remote machine.
  • The connection method.

 .dbc file contains :

  • The database name
  • Database version
  • Userid/pwd
  • Database character set and some more.

Q.What are graph parameters?
Answer: There are 2 types of graph parameters in Ab Initio: 1. local parameters and 2. formal parameters (parameters supplied at run time).

Q.How many types of joins are there in Ab Initio?
Answer: Join is based on a match key for its inputs; the join component has an out port, unused ports, reject ports and a log port.
Inner joins: The most common case is when join-type is Inner Join. In this case, if each input port contains a record with the same value for the key fields, the transform function is called and an output record is produced. If some of the input flows have more than one record with that key value, the transform function is called multiple times, once for each possible combination of records, taken one from each input port. Whenever a particular key value does not have a matching record on every input port and Inner Join is specified, the transform function is not called and all incoming records with that key value are sent to the unused ports.
Full outer joins: Another common case is when join-type is Full Outer Join: if each input port has a record with a matching key value, Join does the same thing as it does for an inner join. If some input ports do not have records with matching key values, Join applies the transform function anyway, with NULL substituted for the missing records; the missing records are in effect ignored. With an outer join, the transform function typically requires additional rules (compared to an inner join) to handle the possibility of NULL inputs.
Explicit joins: The final case is when join-type is Explicit. This setting allows you to specify True or False for the record-required n parameter for each in n port. The settings you choose determine when Join calls the transform function.
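The inner versus full-outer behaviour described above can be mimicked in a few lines of Python (a sketch only, not Ab Initio code; in0/in1 are dictionaries keyed by the join key, and None stands in for NULL):

```python
# Sketch: on a key miss, an inner join sends records to "unused",
# while a full outer join still emits a row with None (NULL) standing
# in for the missing side.

def join(in0, in1, join_type="inner"):
    out, unused = [], []
    for k in sorted(set(in0) | set(in1)):
        r0, r1 = in0.get(k), in1.get(k)
        if r0 is not None and r1 is not None:
            out.append((k, r0, r1))       # transform called normally
        elif join_type == "full_outer":
            out.append((k, r0, r1))       # NULL substituted for the miss
        else:
            unused.append((k, r0 if r0 is not None else r1))
    return out, unused

in0 = {1: "a", 2: "b"}
in1 = {2: "x", 3: "y"}
print(join(in0, in1))                # inner: only key 2 matches
print(join(in0, in1, "full_outer"))  # keys 1 and 3 appear with None
```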
The join-type and record-required n parameters: imagine two intersecting ovals representing the key values of the records on the two input ports, in0 and in1. For each possible setting of join-type, or (if join-type is Explicit) each combination of settings for record-required n, a shaded region of such a diagram represents the inputs for which Join calls the transform function. Join ignores the records whose key values fall in the unshaded regions, and consequently those records go to the unused port.

Q.What is a semi-join?
Answer: A left semi-join on two input files connected to ports in0 and in1 is an inner join where the dedup0 parameter is set to "Do not dedup this input" but dedup1 is set to "Dedup this input before joining": duplicates are removed only from the in1 port, that is, from Input File 2. Semi-joins can be achieved by using the join component with the Join Type parameter set to explicit join and the record-required0 and record-required1 parameters set one to true and the other to false, depending on whether you require a left or right semi-join. In Ab Initio there are 3 types of join: 1. inner join, 2. outer join and 3. semi-join. For an inner join the record-required n parameter is true for all in ports; for an outer join it is false for all in ports; for a semi-join you set record-required n to true for the required port and false for the others.

Q.How do we run sequences of jobs, e.g. when the output of job A is the input to job B? How do we coordinate the jobs?
Answer: By writing wrapper scripts we can control the sequence of execution of more than one job.

Q.How would you do performance tuning for an already built graph? Can you give some examples?
Answer: Examples: 1) There is no use putting a sort in front of a merge component, because sorting is built into merge. 2) Use a lookup instead of a join or merge component where appropriate. 3)
Suppose we want to join the data coming from 2 files and we don't want duplicates; we can use a union function instead of adding an additional dedup component.

Q.What is the relation between the EME, the GDE and the Co>Operating System?
Answer: EME stands for Enterprise Metadata Environment, GDE for Graphical Development Environment, and the Co>Operating System can be regarded as the Ab Initio server. The relation between them is as follows: the Co>Operating System is the Ab Initio server, installed on a particular OS platform called the native OS. The EME is just like the repository in Informatica; it holds metadata, transformations, db config files, and source and target information. The GDE is the end-user environment where we develop graphs (mappings, as in Informatica); the designer uses the GDE and saves graphs to the EME or a sandbox. The GDE is on the user side, whereas the EME is on the server side.

Q.When do we use dynamic DML?
Answer: Dynamic DML is used when the input metadata can change. Example: different input files with different DMLs are received for processing at different times. In that case we can use a flag in the DML; the flag is read first from the input file received, and its corresponding DML is used accordingly.

Q.Explain the differences between Replicate and Broadcast.
Answer: Replicate takes records from its input flow and gives a copy of that flow to each component connected to its output port. Broadcast is a partition component that copies each input record to every flow connected to its output port. For example, if an input file contains 4 records and the level of parallelism is 3, Replicate gives 4 records to each component connected to its out port, whereas Broadcast gives 12 records to each component connected to its out port.

Q.How do you truncate a table?
Answer: From Ab Initio, run the DDL "truncate table" via a Run SQL component, or use the Truncate Table component in Ab Initio.

Q.How to get the DML using utilities in UNIX?
Answer: By using the command m_db gendml -table.

Q.Explain the difference between REFORMAT and REDEFINE FORMAT.
Answer: Reformat changes the record format by adding or deleting fields in the DML record; the length of the record can change. Redefine Format copies its input flow to its out port without any transform; it is used to rename the fields in the DML, and the record length must not change.

Q.How to work with parameterized graphs?
Answer: Parameterized graphs specify everything through parameters: data locations of input/output files, DMLs, etc.

Q.What is the driving port? When do you use it?
Answer: When you set the sorted-input parameter of the JOIN component to "In memory: Input need not be sorted", you can set the driving port. The driving port is generally used to improve performance in a graph. The driving input is the largest input; all other inputs are read into memory. For example, suppose the largest input to be joined is on the in1 port: specify a port number of 1 as the value of the driving parameter, and the component reads all other inputs to the join (for example, in0 and in2) into memory. The default is 0, which specifies that the driving input is on port in0. Join improves performance by loading all records from all inputs except the driving input into main memory. The driving port supplies the data that drives the join: every record from the driving port is compared against the data from the non-driving ports. We set the driving port to the larger dataset so that the smaller non-driving data can be kept in main memory, speeding up the operation.

Q.How can we test Ab Initio graphs manually and automatically?
Answer: Running a graph through the GDE is a manual test; running a graph using the deployed script is an automated test.

Q.What is the difference between partitioning with key and round robin?
Answer: Partition by key (hash partition) is a partitioning technique used to partition data when the keys are diverse; if some key values occur in very large volumes, there can be large data skew. Even so, this method is often used for key-dependent parallel data processing. Round robin partitioning is another technique, which distributes the data uniformly across the destination partitions; the skew is zero when the number of records is divisible by the number of partitions. A real-life example is how a pack of 52 cards is distributed among 4 players in a round-robin manner.

Q.What are skew and skew measurement?
Answer: Skew is a measure of how unevenly data flows to each partition; it is an indirect measure of graph performance. Suppose the input comes from 4 partitions totalling 1 GB: 1000 MB = 100 MB + 200 MB + 300 MB + 500 MB, so the average is 1000/4 = 250 MB. The skew of the 100 MB partition is (100 - 250)/500, a negative value; calculate similarly for 200, 300 and 500. Skew values close to zero are desirable.

Q.What is the error 'depth not equal'?
Answer: When two components are linked together and their layouts do not match, this problem can occur during compilation of the graph. A solution is to use a partitioning component in between where there is a change in layout.

Q.Which is faster to process, fixed-length DMLs or delimited DMLs, and why?
Answer: Fixed length, because for a delimited DML the component has to check for the delimiter every time, while for a fixed-length DML the length is taken directly.

Q.What kinds of layouts does Ab Initio support?
Answer: Ab Initio supports two kinds of layouts: serial and multi. In Ab Initio the layout tells each component where it should run, and it also gives the level of parallelism: for a serial layout the level of parallelism is 1, and for a multi layout it depends on the data partitioning.

Q.How can you run a graph infinitely?
Answer: To run a graph infinitely:

  • Make the end script of the graph call the .ksh file of the graph: if the graph is named abc.mp, the end script should call abc.ksh. The graph will then run infinitely.
  • Run the deployed script in a loop infinitely.

Q.What are local and formal parameters?
Answer: Both are graph-level parameters, but for a local parameter you must initialize the value at the time of declaration, whereas a formal parameter need not be initialized; you are prompted for its value when the graph runs. A local parameter is like a local variable in the C language, whereas a formal parameter is like a command-line argument passed at run time.

Q.What are BROADCAST and REPLICATE?
Answer: Broadcast can do everything that Replicate does; Broadcast can also send a single file to an MFS without splitting it, making multiple copies of the single file across the multifile. Replicate receives a single flow and writes a copy of that flow to each of its output flows; it generates multiple straight flows, whereas Broadcast produces a single fanout flow. Replicate improves component parallelism, whereas Broadcast improves data parallelism.
Broadcast takes data from multiple inputs, combines it, and sends it to all output ports. E.g., with 2 incoming flows on a Broadcast component (data or component parallelism), one with 10 records and the other with 20, every outgoing flow (however many there are) will have 10 + 20 = 30 records.
Replicate replicates the data of each partition and sends it out to multiple out ports, maintaining partition integrity. E.g., if the incoming flow to Replicate has a data parallelism level of 2, with one partition of 10 records and the other of 20, and there are 3 output flows from Replicate, then each flow will have 2 data partitions with 10 and 20 records respectively.

Q.What is the importance of the EME in Ab Initio?
Answer: The EME is a repository in Ab Initio, used for checkin and checkout of graphs; it also maintains graph versions.

Q.What is m_dump?
Answer: m_dump is a Co>Operating System command used to view data from the command prompt.
m_dump prints the data in a formatted way.

Q.What is the syntax of the m_dump command? Answer: m_dump <record-format-file> <data-file>

Q.What are the differences between GDE versions (1.10, 1.11, 1.12, 1.13 and 1.15), and between different versions of the Co>Operating system? Answer: 1.10 is a non-key version and the rest are key versions. Many components were added and revised in the later versions.

Q.How do you run a graph without the GDE? Answer: In the run directory a graph can be deployed as a .ksh file. This .ksh file can then be run at the command prompt as: ksh <graph-name>.ksh

Q.What is the difference between a DML expression and an XFR expression? Answer: A DML file stores Ab Initio DML, which describes data in terms of record formats, expressions that perform simple computations, and keys that specify grouping or non-grouping; in other words, DML files are non-embedded record-format files. An XFR file is simply a non-embedded transform file: transform functions expressing business rules, local variables, and statements, together with the connections between those elements and the input and output fields.

Q.How does MAXCORE work? Answer: max-core is the temporary memory used by a component, for example to sort records. It is a value (typically in KB); whenever the component executes, it may use up to the amount of memory specified. In other words, max-core is the maximum memory a component can use during its execution.

Q.What is $mpjret? Where is it used in Ab Initio? Answer: $mpjret is the return value of the shell command "mp run" that executes an Ab Initio graph. It is generally treated as the graph's execution-status return value.

Q.What is the latest version available in Ab Initio? Answer: The latest version of the GDE is 1.15, and of the Co>Operating system, 2.14.

Q.What is meant by the Co>Operating system, and why is it special for Ab Initio?
Answer: The name "Co>Operating system" itself says a lot: it is not merely an engine or interpreter. As the name suggests, it is an operating system that co-exists with another operating system. In layman's terms, Ab Initio, unlike other applications, does not sit simply as a layer on top of an OS. It has many operating-system-level capabilities of its own, such as multifiles and memory management, and in this way it integrates completely with the host OS and works jointly with it on the available hardware resources. This synergy with the OS optimizes the utilization of the hardware. Unlike most other applications (including most other ETL tools), it does not merely interpret commands through a layer. That is the major difference from other ETL tools; it is the reason why Ab Initio is much faster than any other ETL tool, and also considerably more expensive.

Q.How do you take input data from an Excel sheet? Answer: There is a Read Excel component that reads the Excel file either from the host or from a local drive; the DML is a default one. Through the Read Excel component in $AB_HOME we can read Excel files directly.

Q.How will you test a .dbc file from the command prompt? Answer: You can test a .dbc file from the command prompt (Unix) using the m_db test command, which checks the database connection and reports the database version and user.

Q.Which is faster to process, fixed-length DMLs or delimited DMLs, and why? Answer: Fixed-length DMLs are faster, because the data can be read directly by length without any comparisons; with delimited DMLs every character must be compared against the delimiter, which introduces delays.

Q.What are the continuous components in Ab Initio? Answer: Continuous components are used to create graphs that produce useful output while running continuously. Examples: Continuous Rollup, Continuous Update, Batch Subscribe.

Q.How can I calculate the total memory requirement of a graph? Answer:

  • You can roughly calculate the memory requirement as follows.
  • Each partition of a component uses approximately 8 MB plus its max-core (if any).
  • Add the size of the lookup files used in the phase (if multiple components use the same lookup, count it only once), and multiply by the degree of parallelism. Add up all the components in a phase; that is how much memory is used in that phase.
  • Add the size of the input and output datasets. The total memory requirement of the graph is at least the requirement of the largest-memory phase in the graph.
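As a rough illustration of the arithmetic above, here is a plain-Python sketch; the 8 MB per-partition overhead follows the rule of thumb in the list, and the component sizes and lookup sizes in the example are invented for illustration:

```python
# Rough per-phase memory estimate following the rule of thumb above:
# each partition uses ~8 MB plus its max-core, multiplied by the degree
# of parallelism, plus lookup files used in the phase (counted once).

MB = 1024 * 1024

def phase_memory(max_cores, lookup_sizes, parallelism):
    """max_cores: one max-core value in bytes per component (0 if none)."""
    per_partition = sum(8 * MB + mc for mc in max_cores)
    return per_partition * parallelism + sum(lookup_sizes)

# Example: a 4-way phase with a Sort (100 MB max-core), a Reformat
# (no max-core), and one 50 MB lookup file shared by both components.
estimate = phase_memory([100 * MB, 0], [50 * MB], parallelism=4)
print(estimate // MB)  # 514 (MB)
```

The estimate for the whole graph is then the maximum over its phases, plus the input and output dataset sizes.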

Q.What is a multistage component? Answer: Multistage components are transform components in which records are transformed in five stages: input selection, temporary record initialization, processing, finalization, and output selection. Examples of multistage components are:

  • Rollup
  • Scan
  • Normalize
  • Denormalize Sorted
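The five-stage life cycle can be illustrated with a plain-Python analogy of a Rollup-style component; this is a simplification for intuition, not Ab Initio's implementation or DML:

```python
from itertools import groupby

def rollup(records, key, initialize, do_rollup, finalize):
    """Sketch of Rollup's multistage flow: for each key group, call
    initialize once, the rollup function per record, finalize once."""
    out = []
    for k, group in groupby(sorted(records, key=key), key=key):
        temp = initialize(k)             # temporary-record initialization
        for rec in group:
            temp = do_rollup(temp, rec)  # per-record processing
        out.append(finalize(k, temp))    # finalization / output selection
    return out

# Count records per (hypothetical) customer id:
records = [{"cust": 1}, {"cust": 2}, {"cust": 1}]
counts = rollup(records,
                key=lambda r: r["cust"],
                initialize=lambda k: 0,
                do_rollup=lambda t, r: t + 1,
                finalize=lambda k, t: (k, t))
print(counts)  # [(1, 2), (2, 1)]
```

The same initialize/rollup/finalize pattern is what the expanded-mode Rollup questions later in this document refer to.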

Q.What is the use of Aggregate when we have Rollup? We know the Rollup component in Ab Initio is used to summarize groups of data records, so where would we use Aggregate? Answer: Rollup gives better control over record selection, grouping, and aggregation than Aggregate. Rollup is effectively an updated version of Aggregate, and in template mode it offers the same aggregation functions, so it is usually better to use Rollup.

Q.Phase versus checkpoint? Answer: Phases are used to break up a graph so that it does not use up all available memory; they limit the number of active components, reducing the number of components running in parallel and thereby improving performance. Phases make possible the effective utilization of resources such as memory, disk space, and CPU. So when we have memory-consuming components in a straight flow and millions of records in flight, we can separate the process into its own phase so that more CPU is allocated and the whole process finishes sooner. Temporary files created during a phase are deleted on completion of that phase. Do not put a phase break after Replicate or Sort, or across all-to-all flows. Checkpoints, in contrast, are save points used for recovery: they are required if we need to restart the graph from the last saved recovery file (a phase break with checkpoint) after an unexpected failure. At job start, output datasets are copied into temporary files, and on completion of a checkpoint all datasets and the job state are copied into temporary files, so if a failure occurs the job can be rerun from the last committed checkpoint. Phase breaks that include checkpoints degrade performance somewhat but guarantee a restartable run. The major difference between the two is that phasing deletes the intermediate files made at the end of each phase as soon as the graph enters the next phase, whereas checkpointing stores these intermediate files until the end of the graph; thus we can use the intermediate files to restart the process from where it failed, which cannot be done with phasing alone. We can have phases without checkpoints, but we cannot have checkpoints without phases.

Q.In Ab Initio, how can you display records 50 through 75? Answer: Suppose the input dataset has 100 records and we want records 50 through 75. Use m_dump -start 50 -end 75. For serial and multifile data, several components can be used: 1. Filter by Expression with next_in_sequence() >= 50 && next_in_sequence() <= 75. 2. Multiple Leading Records components can also be combined to meet the requirement. If you have access to the Co>Operating system, an alternative is to use the Run Program component in the GDE with a command such as: `sed -n '50,75p' file1 > file2`

Q.What is the order of evaluation of parameters? Answer: When you run a graph, parameters are evaluated in the following order:

  • The host setup script is run.
  • Common (i.e., included) sandbox parameters are evaluated.
  • Sandbox parameters are evaluated.
  • The project-start.ksh script is run.
  • Graph parameters are evaluated.
  • The graph Start Script is run.
  • Component processes are started simultaneously, based on the components' layouts.
  • Lookup files are loaded.
  • The graph metadata is checked.
  • The input and output file paths are checked.
  • The graph runs in order of phase 0, phase 1, phase 2, and so on.

Q.How do you convert a 4-way MFS to an 8-way MFS? Answer: By repartitioning; we can use any partition method. The partitioning methods are: Partition by Round-robin, Broadcast, Partition by Key, Partition by Expression, Partition by Range, Partition by Percentage, and Partition with Load Balance.

Q.For data parallelism we can use partition components, and for component parallelism the Replicate component. Which component(s) can we use for pipeline parallelism? Answer: Pipeline parallelism occurs when a connected sequence of components on the same branch of a graph executes concurrently. Components like Reformat, where we distribute an input flow to multiple output flows using output-index with some selection criteria and process those output flows simultaneously, create pipeline parallelism. Components like Sort, where the entire input must be read before a single record is written to the output, cannot achieve pipeline parallelism.

Q.What is meant by fencing in Ab Initio? Answer: The phrase "ab initio" means "from the beginning". Did you mean "fanning", "fan-in", or "fan-out"?

Q.Which component is used to retrieve data from a database? Answer: To unload (retrieve) data from a database such as DB2, Informix, or Oracle, we have components like Input Table and Unload DB Table; using these two components we can unload data from the database.

Q.What is the relation between the EME, the GDE, and the Co>Operating system? Answer: EME stands for Enterprise Meta Environment, GDE for Graphical Development Environment, and the Co>Operating system can be regarded as the Ab Initio server. The relationship among them is as follows: the Co>Operating system is the Ab Initio server, installed on a particular OS platform that is called the native OS. The EME is a repository, much like the repository in Informatica; it holds the metadata, transformations, database configuration files, and source and target information. The GDE is the end-user environment in which we develop graphs (analogous to mappings in Informatica). The designer uses the GDE to design graphs and saves them to the EME or to a sandbox on the user side, whereas the EME sits on the server side.

Q.What is the use of Aggregate when we have Rollup, given that the Rollup component in Ab Initio is used to summarize groups of data records? Where would we use Aggregate? Answer: Aggregate and Rollup can both summarize data, but Rollup is much more convenient to use, and a Rollup transform is much more self-explanatory than an Aggregate when you try to understand how a particular summarization is performed. Rollup also offers additional functionality, such as input and output filtering of records.

Q.What kinds of layouts does Ab Initio support? Answer: Basically, Ab Initio supports serial and parallel layouts, and a graph can have both at the same time. A parallel layout depends on the degree of data parallelism: if the multifile system is 4-way parallel, a component in the graph can run 4 ways parallel if its layout matches that degree of parallelism.

Q.How can you run a graph infinitely? Answer: To run a graph infinitely, the end script of the graph should call the graph's own .ksh file. Thus if the graph is named abc.mp, the end script should contain a call to abc.ksh; this way the graph will run indefinitely.

Q.How do you add default rules in the transformer? Answer: Double-click the transform parameter in the Parameters tab of the component's properties; this opens the Transform Editor. In the Transform Editor, open the Edit menu and select Add Default Rules from the drop-down. It shows two options: 1) Match Names, 2) Wildcard.

Q.Do you know what a local lookup is? Answer: If your lookup file is a multifile partitioned/sorted on a particular key, then the lookup_local function can be used instead of the lookup function. It is local to a particular partition, depending on the key.
A lookup file consists of data records that can be held in main memory. This lets a transform function retrieve records much faster than retrieving them from disk, and allows a transform component to process records from multiple files quickly.

Q.What is the difference between a lookup file and a lookup, with a relevant example? Answer: Generally a lookup file represents one or more serial (flat) files whose data is small enough to be held in memory; this allows transform functions to retrieve records much more quickly than they could from disk. A lookup is a component of an Ab Initio graph where we can store data and retrieve it by a key parameter, while a lookup file is the physical file where the lookup's data is stored.

Q.How do you handle a DML that changes dynamically in Ab Initio? Answer: If the DML changes dynamically, both the DML and the XFR must be passed as graph-level parameters at run time: by parameterization, by a conditional record format, or by metadata.

Q.Explain what a lookup is. Answer: A lookup is basically a keyed dataset. It can be used to map values according to the data present in a particular file (serial or multifile). The dataset can be static or dynamic (for example, when the lookup file is generated in a previous phase and used as a lookup in the current phase). Sometimes hash joins can be replaced by a Reformat plus a lookup, when one of the inputs to the join has few records with a slim record length. Ab Initio has built-in functions to retrieve values from a lookup by key.

Q.What is a ramp limit? Answer: The limit parameter is an integer representing a number of reject events, and the ramp parameter is a real number representing a rate of reject events per record processed. The number of bad records allowed = limit + (number of records × ramp). Ramp is essentially a proportion (from 0 to 1), and together the two parameters define the threshold of tolerable bad records.
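The reject-threshold formula above (tolerated bad records = limit + ramp × records processed) can be checked with a tiny sketch; the numbers are illustrative, not Ab Initio defaults:

```python
def reject_threshold(limit, ramp, records_processed):
    """Number of reject events tolerated before a component aborts:
    limit is an absolute count, ramp a rate per record processed."""
    return limit + ramp * records_processed

# With limit=50 and ramp=0.01, after 10,000 records the component
# tolerates up to 150 rejects before failing.
print(reject_threshold(50, 0.01, 10_000))  # 150.0
```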
Q.Have you worked with packages? Answer: Multistage transform components use packages by default. However, a user can create his own set of functions in a transform function and include them in other transform functions.

Q.Have you used the Rollup component? Describe how. Answer: If you want to group records on particular field values, Rollup is the best way to do it. Rollup is a multistage transform component, and it contains the following mandatory functions: 1. initialize, 2. rollup, 3. finalize. You also need to declare a temporary variable if you want, for example, counts per group. For each group, Rollup first calls the initialize function once, then calls the rollup function for each record in the group, and finally calls the finalize function once after the last rollup call.

Q.How do you add default rules in the transformer? Answer: In the case of a Reformat, if the destination field names are the same as (or a subset of) the source fields, there is no need to write anything in the XFR, unless you want a real transform beyond reducing the set of fields or splitting the flow into several flows. 1) If it is not already displayed, display the Transform Editor grid. 2) Click the Business Rules tab if it is not already displayed. 3) Select Edit > Add Default Rules; this opens the Add Default Rules dialog. Select one of the following: Match Names, which generates a set of rules that copies input fields to output fields with the same names, or Use Wildcard (.*) Rule, which generates one rule that copies input fields to output fields with the same names.

Q.What is the difference between partitioning with a key and round robin? Answer: Partition by Key (hash partitioning) is used to partition data when the keys are diverse. If one key value is present in large volume there can be a large data skew, but this method is the one used more often for key-dependent parallel data processing.
Round-robin partitioning is another technique, which distributes the data uniformly across the destination partitions. The skew is zero when the number of records is divisible by the number of partitions. A real-life example is how a pack of 52 cards is distributed among 4 players in round-robin fashion. If you take some 30 cards at random from the 52-card pack and distribute them using the card color (red or black) as the key, the number of cards in each partition may vary greatly; with round robin we distribute by block size, so the variation is limited to the block size. Partition by Key distributes records according to the key value. Partition by Round-robin distributes a predefined number of records to one flow, then the same number of records to the next flow, and so on; after the last flow it resumes the pattern and distributes the records almost evenly. This pattern is called round-robin fashion.

Q.How do you truncate a table? (Each candidate tends to name only one of the several ways to do this.) Answer: From Ab Initio, run the Run SQL component using the DDL "truncate table", or use the Truncate Table component. There are many ways to do it:

  1. Probably the easiest way is to use Truncate Table
  2. Run SQL or Update Table can be used to do the same thing
  3. Run Program
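The card-dealing analogy in the partitioning answer above can be made concrete. This plain-Python sketch (a simplification for intuition, not the real partition components) contrasts key-based and round-robin distribution; the "cards" and the parity key are invented for the example:

```python
def partition_by_key(records, key, n):
    """Same key value always lands in the same partition (hash-based)."""
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[hash(key(rec)) % n].append(rec)
    return parts

def partition_round_robin(records, n, blocksize=1):
    """Deal records to partitions in blocksize chunks, in rotation."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[(i // blocksize) % n].append(rec)
    return parts

# Dealing 52 "cards" to 4 players round-robin gives 13 each (zero skew);
# partitioning the same cards by a two-valued key (color encoded as
# card % 2) fills at most 2 of the 4 partitions, however the data looks.
cards = list(range(52))
rr = partition_round_robin(cards, 4)
print([len(p) for p in rr])  # [13, 13, 13, 13]
```

This is why Partition by Key balances well only when the key values are diverse, while round robin balances by construction.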

Q.Have you ever encountered an error called "depth not equal"? (This can occur when you create graphs extensively; it is a trick question.) Answer: When two components are linked together and their layouts do not match, this problem can occur during compilation of the graph. A solution is to use a partitioning component between them wherever the layout changes. Think of a situation where the component on the left-hand side is linked to a serial dataset and the downstream component on the right-hand side is linked to a multifile. The layout is propagated from neighbours, so without a partitioning component the jump in depth cannot be achieved; you need a partitioning component to resolve the depth discrepancy.

Q.What function would you use to transform a string into a decimal? Answer: No specific function is required if the sizes of the string and the decimal are the same; a decimal cast with the size in the transform function will suffice. For example, if the source field is defined as string(8), the destination as decimal(8), and the field name is field1: out.field1 :: (decimal(8)) in.field1 If the destination field is smaller than the input, string_substring can be used; say the destination field is decimal(5): out.field1 :: (decimal(5))string_lrtrim(string_substring(in.field1,1,5)) /* string_lrtrim trims leading and trailing spaces */

Q.How many kinds of parallelism are there in Ab Initio? Please define each. Answer: There are three kinds of parallelism: 1) data parallelism, 2) component parallelism, 3) pipeline parallelism.
When data is divided into small chunks and processed on different components simultaneously, we call it data parallelism. When different components work on different data sets, it is component parallelism. When a sequence of connected components processes the same data simultaneously, it is pipeline parallelism.

Q.What is a multidirectory? Answer: A multidirectory is a parallel directory composed of individual directories, typically on different disks or computers. The individual directories are partitions of the multidirectory. Each multidirectory contains one control directory and one or more data directories. Multifiles are stored in multidirectories.

Q.What is a multifile? Answer: A multifile is a parallel file composed of individual files, typically on different disks or computers. The individual files are partitions of the multifile. Each multifile contains one control partition and one or more data partitions. Multifiles are stored in distributed directories called multidirectories. The data in a multifile is usually divided across partitions by one of these methods: random or round-robin partitioning; partitioning based on ranges or functions; replication or broadcast, in which each partition is an identical copy of the serial data.

Q.What is meant by GDE and SDE? What is their purpose? Answer: GDE is the Graphical Development Environment, used for developing graphs. SDE is the Shell Development Environment, used for developing Korn shell scripts on the Co>Operating system.

Q.What is the difference between Rollup and Scan? Answer: Rollup evaluates a group of input records that have the same key and then generates data records that either summarize each group or select certain information from each group. Rollup can evaluate in two ways:

  1. Template Mode:

This mode evaluates groups using built-in aggregation functions such as sum, min, max, count, avg, product, first, and last.

  2. Expanded Mode:

This mode evaluates groups using user-defined functions (a temporary type plus initialize, rollup, and finalize functions in the transform property) rather than built-in aggregation functions. Scan, by contrast, generates a series of cumulative summary records, such as successive year-to-date totals for groups of data records; that is, Scan produces intermediate summary records. In short, Rollup is for group-by summaries and Scan is for successive totals: when we need running summaries we use Scan, and when we need to aggregate data per group we use Rollup.

Q.What is the runtime behavior of Rollup? Answer: Rollup supports two modes. 1. Template mode: this mode evaluates groups using built-in aggregation functions such as sum, min, max, count, avg, product, first, and last.

  2. Expanded Mode:

This mode evaluates groups using user-defined functions (a temporary type plus initialize, rollup, and finalize functions in the transform property) rather than built-in aggregation functions. Rollup's behavior also differs depending on whether its input is sorted. When the input is sorted, that is, when you set the sorted-input parameter to "Input must be sorted or grouped" (the default), Rollup requires records grouped according to the key parameter; if you need to group the records, use Sort with the same key specifier that you use for Rollup. In this mode Rollup produces sorted output on its out port. When the input is unsorted, that is, when you set the sorted-input parameter to "In memory: Input need not be sorted", Rollup accepts ungrouped input and itself groups all records according to the key parameter; it does not produce sorted output.

Q.How do you roll back in Ab Initio? Answer: Ab Initio has very good recovery options for runtime failures and for interruptions at development time. Development time: a recovery graph file is produced if an interruption occurs during development. Runtime: a recovery file is produced if a failure occurs during graph execution, so you can restart the execution; the recovery file holds the last checkpoint information, and the graph restarts from that checkpoint onwards. You can also roll back an Ab Initio graph with m_rollback; m_rollback -d deletes all intermediate files and checkpoints.

Q.What is the internal execution process of an Ab Initio graph in the Co>Operating system while the graph runs? Answer: Normally the Co>Operating system first checks that the GDE and Co>Operating system code versions are compatible. If any lookup files are used in the graph, their layouts are checked (lookup layout checking). The graph's input and output files are checked to confirm that their paths are correct. Given below is the sequence of processing done while running a graph. Lookup file layouts are checked.
The metadata is checked (whether the data types used are valid, i.e. DML checking on a per-component basis). Input files are checked. Output files are checked. Each component's layout is checked. Finally, the process flows are assigned.

Q.What does dependency analysis mean in Ab Initio? Answer: Dependency analysis answers questions about data lineage: where the data comes from, which applications produce it, which depend on it, and so on.

Q.What is meant by fencing in Ab Initio? Answer: In the software world, fencing means controlling jobs on a priority basis. In Ab Initio it actually refers to customized phase breaking: a well-fenced graph will not run into deadlocks no matter what the source data volume is, because fencing limits the number of simultaneous processes. In Ab Initio you also sometimes need to fence a job to stop its schedule; fencing is essentially changing the priority of a particular job.

Q.What is the function of the Fuse component? Answer: Fuse combines multiple input flows into a single output flow by applying a transform function to corresponding records of each flow. Runtime behavior of Fuse: the first time the transform function executes, it uses the first record of each flow; the second time, the second record of each flow; and so on. Fuse sends the result of the transform function to the out port. The component works as follows. It tries to read from each of its input flows: if all input flows are finished, Fuse exits; otherwise, Fuse reads one record from each still-unfinished input port and a NULL from each finished input port.

Q.What is data skew? How can you reduce data skew when using Partition by Key?
Answer: The skew of a data or flow partition is the amount by which its size deviates from the average partition size, expressed as a percentage of the largest partition: skew = (partition size − average partition size) × 100 / (size of largest partition).

Q.What is $mpjret? Where is it used in Ab Initio? Answer: $mpjret gives the status of a graph. You can use $mpjret in the end script, for example: if [ $mpjret -eq 0 ]; then echo "success"; else mailx -s "failed" mail_id; fi

Q.What are primary keys and foreign keys? Answer: In an RDBMS the relationship between two tables is represented as a primary key and foreign key relationship, where the primary-key table is the parent table and the foreign-key table is the child table. The criterion for the two tables is that there must be a matching column.

Q.What is an outer join? Answer: An outer join is used when you want to select all records from a port, whether or not they satisfy the join criteria. If you want to see all records of one input file regardless of whether there is a matching record in the other file, it is an outer join.

Q.What are Cartesian joins? Answer: A Cartesian join joins two tables without a join key (the key should be {}). A Cartesian join yields a Cartesian product: every row of one table is joined to every row of the other table. You can also get one by joining every row of a table to every row of itself.

Q.What is the difference between a DB config (.dbc) and a .cfg file? Answer: A .dbc file has the information Ab Initio requires to connect to a database to extract or load tables or views, while a .cfg file is the table configuration file created by db_config when using components like Load DB Table. Both are used for database connectivity and serve a similar purpose; the difference is that .cfg files are used for the Informix database, whereas .dbc files are used for other databases such as Oracle or SQL Server.

Q.What is the difference between a Scan component and a Rollup component?
Answer: Rollup is for group-by summaries and Scan is for successive totals. When we need running (cumulative) summaries we use Scan; Rollup is used to aggregate data per group.
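The distinction above, one summary record per group (Rollup) versus a running total per record (Scan), can be sketched in plain Python; the monthly figures are invented for the example:

```python
def rollup_sum(values):
    """Rollup-style: one summary record for the whole group."""
    return sum(values)

def scan_sum(values):
    """Scan-style: an intermediate cumulative record after each input."""
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out

monthly = [10, 20, 30]
print(rollup_sum(monthly))  # 60
print(scan_sum(monthly))    # [10, 30, 60]  (year-to-date style totals)
```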

Q.How will you test a .dbc file from the command prompt? Answer: Try "m_db test myfile.dbc".

Q.Explain the difference between the "truncate" and "delete" commands. Answer: Truncate is a DDL command, used to empty tables or clusters. Since it is a DDL command it auto-commits, so a rollback cannot be performed; it is faster than delete. Delete is a DML command, generally used to delete records, clusters, or tables; a rollback can be performed in order to retrieve what was deleted, and to make deletions permanent the "commit" command should be used.

Q.Which component is used to retrieve data from a database? Answer: To unload (retrieve) data from a database such as DB2, Informix, or Oracle, we have components like Input Table and Unload DB Table; using these two components we can unload data from the database.

Q.How many components are there in your most complicated graph? Answer: This is a tricky question: the number of components in a graph has nothing to do with a person's level of knowledge. On the contrary, a properly standardized, modular, parametric approach reduces the number of components to very few. In a well-thought-out modular and parametric design, most graphs will have three or four components, each doing a particular task and then calling another set of graphs to do the next, and so on. This way the total number of distinct graphs comes down drastically, and support and maintenance are much simpler. The bottom line is that there is a lot more to plan than just adding components.

Q.Do you know what a local lookup is?
Answer: lookup_local is similar to lookup, the difference being that it searches only the partition of the lookup file local to the current partition. It returns NULL when no record has the value mentioned in the arguments of the function; if it finds a matching record, it returns the complete record, that is, all fields with their values corresponding to the expression in the call. For example, lookup_local("LOOKUP_FILE", 81) returns NULL if the key on which the lookup file is partitioned holds no such value. Local lookup files are small files that can be accommodated in physical memory for use in transforms. Details like country code/country, currency code/currency, or forex rate/value can be kept in a lookup file and mapped during transformations. Lookup files are not connected to any component of the graph, but are available to a Reformat for mapping.

Q.How to create a surrogate key using Ab Initio? Answer: A key is a field or set of fields that uniquely identifies a record in a file or table. A natural key is a key that is meaningful in some business or real-world sense; for example, a social security number for a person, or a serial number for a piece of equipment, is a natural key. A surrogate key is a field that is added to a record, either to replace the natural key or in addition to it, and has no business meaning. Surrogate keys are frequently added to records when populating a data warehouse, to help isolate the records in the warehouse from changes to the natural keys by outside processes.

Q.How do you improve the performance of graphs in Ab Initio? Give some examples or tips. Answer: There are many ways to improve the performance of graphs in Ab Initio. A few points: 1. Use an MFS, partitioning with Partition by Round-robin. 2. If needed, use lookup_local rather than lookup when the data is large. 3. Remove unnecessary components such as Filter by Expression; apply the conditions in a Reformat/Join/Rollup instead. 4. Use Gather instead of Concatenate.
5. Tune max-core for optimal performance. 6. Try to avoid excessive phases. More generally, there are many ways the performance of a graph can be improved: 1) Use a limited number of components in a particular phase. 2) Use optimal max-core values for Sort and Join components. 3) Minimise the number of Sort components. 4) Minimise sorted Joins and, if possible, replace them with in-memory/hash joins. 5) Use only the required fields in Sort, Reformat, and Join components. 6) Use phasing/flow buffers in the case of merges and sorted joins. 7) If both inputs are huge, use a sorted join; otherwise use a hash join with the proper driving port. 8) For a large dataset, do not use Broadcast as a partitioner. 9) Minimise the use of regular-expression functions like re_index in transform functions. 10) Avoid repartitioning data unnecessarily.

Q.Describe the process steps you would perform when defragmenting a data table that contains mission-critical data. Answer: There are several ways to do this. 1) Move the table within the same or to another tablespace and rebuild all the indexes on the table ("alter table ... move"); this activity reclaims the fragmented space in the table. Then run "analyze table table_name compute statistics" to capture the updated statistics. 2) A reorg can be done by taking a dump of the table, truncating the table, and importing the dump back into the table.

Q.How do we handle a DML that changes dynamically? Answer: There are many ways to handle DMLs that change dynamically within a single file. Some suitable methods are to use a conditional DML, or to use the vector functionality while calling the DMLs.

Q.What are the graph parameters? Answer: There are two types of graph parameters in Ab Initio: 1. local parameters and 2. formal parameters (parameters resolved at run time).

Q.What is meant by fencing in Ab Initio? Answer: The phrase "ab initio" means "from the beginning".

Q.What is a ramp limit? Answer: Limit and ramp.
For most of the graph components, we can manually set the error threshold limit, after which the graph exits. Normally there are three levels of thresholds like "Never Exit" and "Exit on First Occurance", very clear from the text. They represent both the extremes. The third one is Limit along with Ramp. Limit talks about max limit where as RAMP talks in terms of percentage of processed records. For example a ramp value of 5 means, if less than 5% of the total records are rejected, continue running. If it crosses the ramp then it will come out of the graph. Typically development starts with never exit, followed by ramp and finally in production "Exit on First Occurance". Case to case basis RAMP can be used in production but definitely not a desired approach. Q.Difference between conventional loading and direct loading ? when it is used in real time ? Answer: Conventional Load: Before loading the data all the Table constraints will be checked against the data. Direct load:(Faster Loading) All the Constraints will be disabled. Data will be loaded directly. Later the data will be checked against the table constraints and the bad data won't be indexed. api conventional loading utility direct loading. Q.How do you done the unit testing in Ab-Initio? How will you perform the Ab-Initio Graphs executions? How will you increase the performance in Ab-Inito graphs? Answer: The Ab-Initio Co>operating system is handling the graph with multiple processes running simultaneously. This is primary performance. Follows the given below actions:

  1. For data separators, mostly use “\307” and “\007” instead of “~”, “,” and other special characters, since those printable characters may also appear in the data; Ab Initio handles such predefined, non-printing separators well.
  2. Avoid repeated aggregation in graphs: calculate a required aggregate once, store it in a file, and pass the value in through a parameter wherever it is required.
  3. Avoid an excessive number of components in a graph, and an excessive number of max-core (memory-intensive) components.
  4. Don’t write any kind of looping statement in the start script.
  5. Prefer flat files as sources.
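
The point about data separators (item 1) can be illustrated in Python; the field values and record layout here are made up:

```python
# Sketch: why a non-printing delimiter such as "\007" (BEL) is safer than ","
# when field values may themselves contain commas.
record_comma = "Smith, John,New York,NY"        # ambiguous: 3 fields intended, 4 tokens produced
record_bel = "Smith, John\x07New York\x07NY"    # unambiguous

print(record_comma.split(","))   # wrong field count: the name is split in two
print(record_bel.split("\x07"))  # correct: ['Smith, John', 'New York', 'NY']
```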

Q.How do you improve the performance of a graph? Answer: There are many ways the performance of a graph can be improved.

  • Use a limited number of components in a particular phase
  • Use optimum value of max core values for sort and join components
  • Minimize the number of sort components
  • Minimize sorted join component and if possible replace them by in-memory join/hash join
  • Use only required fields in the sort, reformat, join components
  • Use phasing/flow buffers in case of merge, sorted joins
  • If the two inputs are huge then use sorted join, otherwise use hash join with proper driving port
  • For large dataset don't use broadcast as partitioner
  • Minimize the use of regular expression functions like re_index in the transfer functions
  • Avoid repartitioning of data unnecessarily
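
The in-memory/hash-join tips above (points 4 and 7) can be sketched in Python: the smaller input is loaded into an in-memory hash table and the larger, driving input streams past it. All names and data are illustrative:

```python
# Sketch of an in-memory (hash) join: load the smaller input into a dict
# keyed on the join key, then stream the larger (driving) input against it.
small = [("NY", "New York"), ("CA", "California")]           # lookup side
large = [("cust1", "NY"), ("cust2", "CA"), ("cust3", "TX")]  # driving side

lookup = dict(small)  # the smaller input must fit in memory (cf. max-core)

joined = [(cust, state, lookup[state])
          for cust, state in large
          if state in lookup]  # inner join: unmatched driving records dropped

print(joined)  # [('cust1', 'NY', 'New York'), ('cust2', 'CA', 'California')]
```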

Q.How would you do performance tuning for an already built graph? Answer: Steps for performance tuning an already built graph:

  • Understand the functionality of the Graph.
  • Modularize (i.e., check for dependencies among components).
  • Give Phasing.
  • Check for correct Parallelism.
  • Check the DB components (i.e., take only the required data from the DB instead of the whole table, which consumes more time and memory).
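
The last point, taking only the required data from the DB component, can be illustrated with a parameterized query; sqlite3 stands in for the warehouse here, and the table and data are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "EAST", 10.0), (2, "WEST", 20.0), (3, "EAST", 30.0)])

# Push the filter and the column list into SQL instead of unloading the
# whole table and filtering inside the graph:
rows = conn.execute(
    "SELECT id, amount FROM orders WHERE region = ?", ("EAST",)
).fetchall()
print(rows)  # [(1, 10.0), (3, 30.0)]
```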

Q.What is .abinitiorc? What does it contain?
Answer: .abinitiorc is a configuration file for Ab Initio, found in the user's home directory and in $AB_HOME/Config. It contains credentials to connect to a host, such as 1) host IP, 2) user name, 3) password. It sets the Ab Initio home path, configuration variables (AB_WORK_DIR, AB_DATA_DIR, etc.), login information (id, encrypted password), and login methods for execution hosts (such as the EME host).

Q.Why might you create a stored procedure with the 'with recompile' option?
Answer: Recompile is useful when the tables referenced by the stored procedure undergo a lot of modification/deletion/addition of data. Due to the heavy modification activity the execution plan becomes outdated and the stored procedure's performance goes down. If we create the stored procedure with the recompile option, SQL Server won't cache a plan for it, and it will be recompiled every time it runs.

Q.What is the purpose of having stored procedures in a database?
Answer: The main purposes of stored procedures are to reduce network traffic and to execute the SQL statements server-side, which is faster. In Ab Initio we use the Run SQL and Join with DB components to run stored procedures.

Q.What is meant by the Co>Operating System and why is it special for Ab Initio?
Answer: The Co>Operating System is layered on top of the native operating system. It converts Ab Initio-specific code into a format the UNIX/Windows system can understand and feeds it to the native operating system, which carries out the task.

Q.Which component is used to retrieve data from a database into a source?
Answer: To unload (retrieve) data from a database such as DB2, Informix, or Oracle, we have the Input Table and Unload DB Table components; using these two components we can unload data from the database.
The Input Table component uses the following parameters: 1) a db_config file (which contains the credentials to interface with the database), 2) the database type, and 3) a SQL file (which contains the SQL queries that unload data from the table(s)).

Q.How do you execute a graph from start to end? And how do you run a graph on a non-Ab Initio system?
Answer: There are several ways: 1) you can run components phase by phase, according to the phases you defined; 2) you can also run the graph by deploying and executing the generated ksh/sh scripts.

Q.What is Join with DB?
Answer: The Join with DB component joins records from the flow or flows connected to its in port with records read directly from a database, and outputs new records containing data based on the transform function.

Q.How do you truncate a table?
Answer: Use the Truncate Table component to truncate a database table from Ab Initio. The Truncate Table component has the following parameters: 1) a db_config file (credentials to interface with the database), 2) the database type, and 3) a SQL file (the SQL statements that truncate the table(s)).

Q.Can we load multiple files?
Answer: Yes, we can load multiple files in Ab Initio.

Q.What is the syntax of the m_dump command?
Answer: m_dump prints data in a formatted way. The general syntax is: m_dump <metadata (dml file)> <data file>. For example, m_dump emp.dml emp.dat -start 10 -end 20 prints records 10 through 20 of the file emp.dat.

Q.How to create a surrogate key using Ab Initio?
Answer: A surrogate key is a substitute for the natural primary key; it is just a unique identifier or number for each record, like the ROWID of an Oracle table. Surrogate keys can be created using 1) next_in_sequence, 2) this_partition, and 3) number_of_partitions.

Q.Can anyone give an example of a real-time start script in a graph?
Answer: A start script is a script that gets executed before graph execution starts. If we want to export parameter values to the graph, we can write them in the start script; when the graph runs, those values are exported to it.
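
The three functions listed for surrogate keys can be combined so that each partition of a parallel layout generates non-overlapping keys. A sketch of the usual pattern, simulated in Python (the partition number, partition count, and sequence counter stand in for this_partition, number_of_partitions, and next_in_sequence):

```python
def surrogate_keys(partition, n_partitions, count, start=1):
    """Simulate next_in_sequence() * n_partitions + partition for one data
    partition: keys from different partitions can never collide."""
    return [(start + i) * n_partitions + partition for i in range(count)]

# Two-way parallel layout: partition 0 and partition 1 each emit 3 records.
print(surrogate_keys(0, 2, 3))  # [2, 4, 6]
print(surrogate_keys(1, 2, 3))  # [3, 5, 7]
```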
Q.What is the difference between sandbox and EME? Can we perform check-in and checkout through the sandbox? Can anybody explain check-in and checkout?
Answer: Sandboxes are work areas used to develop, test, or run code associated with a given project. Only one version of the code can be held within the sandbox at any time. The EME datastore contains all versions of the code that have been checked into it. A particular sandbox is associated with only one project, whereas a project can be checked out to a number of sandboxes.

Q.What is skew and skew measurement?
Answer: Skew is a measure of how unevenly data flows to each partition, and is an indirect measure of graph performance. Suppose the input comes from 4 files totalling 1 GB (100 MB + 200 MB + 300 MB + 500 MB). The average partition size is 1000 MB / 4 = 250 MB, and a partition's skew can be computed as (partition size - average size) / largest partition size. For the 100 MB partition this gives (100 - 250) / 500 = -0.3; the 200, 300, and 500 MB partitions are computed the same way. Skew values close to zero (a well-balanced load) are desirable.

Q.What is the latest version available in Ab Initio?
Answer: The latest version of the GDE is 1.15 and of the Co>Operating System is 2.14.

Q.What is the difference between a DML expression and an XFR expression?
Answer: The main difference between DML and XFR is that DML represents the record format (metadata), while XFR represents the transform functions, which contain the business rules.

Q.What are the most commonly used components in an Ab Initio graph? Can anybody give a practical example of a transformation of data, say customer data in a credit card company, into meaningful output based on business rules?
Answer: The components most commonly used in any Ab Initio project are: input file/output file, input table/output table, lookup file, Reformat, Gather, Join, Run SQL, Join with DB, compression components, Sort, Trash, Partition by Expression, Partition by Key, and Concatenate.

Q.Have you used the Rollup component? Describe how.
Answer: The Rollup component can be used in a number of different ways.
It basically acts on a group of records sharing a key. The simplest application is counting the number of records in a file or table. In that case there is no key: a temporary variable, e.g. temp.count, is incremented for every record that flows through the transform (since there is no key, all records are treated as one group), as in temp.count = temp.count + 1. The Rollup component can also be used to discard duplicates from a group, in which case it acts like the Dedup component.
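
The two uses just described, counting records per group and de-duplicating on a key, can be sketched in Python (field names and data are illustrative):

```python
from collections import OrderedDict

records = [("A", 1), ("A", 2), ("B", 3), ("A", 4), ("B", 5)]

# Rollup as a count per key: temp.count = temp.count + 1 for each record in a group.
counts = {}
for key, _ in records:
    counts[key] = counts.get(key, 0) + 1
print(counts)  # {'A': 3, 'B': 2}

# Rollup acting as dedup: keep only the first record of each key group.
first = OrderedDict()
for key, value in records:
    first.setdefault(key, (key, value))
print(list(first.values()))  # [('A', 1), ('B', 3)]
```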

Q.How to work with parameterized graphs?
Answer: One of the main purposes of parameterized graphs is reuse: if we need to run the same graph n times for different files, we set up graph parameters like $INPUT_FILE and $OUTPUT_FILE and supply their values under Edit > Parameters. These parameters are substituted at run time. We can set different types of parameters: positional, keyword, local, etc. The idea is that instead of maintaining different versions of the same graph, we maintain one version that works for different files.

Q.How does max-core work?
Answer: Max-core is a value (in KB). Whenever a memory-intensive component executes, it uses up to the amount of memory we specify there.

Q.What does layout mean in terms of Ab Initio? Answer: Before you can run an Ab Initio graph, you must specify layouts to describe the following to the Co>Operating System:

  • The location of files
  • The number and locations of the partitions of multifiles
  • The number of, and the locations in which, the partitions of program components execute

A layout is one of the following:

  • A URL that specifies the location of a serial file
  • A URL that specifies the location of the control partition of a multifile
  • A list of URLs that specifies the locations of:
    • The partitions of an ad hoc multifile
    • The working directories of a program component

Every component in a graph — both dataset and program components — has a layout. Some graphs use one layout throughout; others use several layouts and repartition data as needed for processing by a greater or lesser number of processors. During execution, a graph writes various files in the layouts of some or all of the components in it. For example:

  • An Intermediate File component writes to disk all the data that passes through it.
  • A phase break, checkpoint, or watcher writes to disk, in the layout of the component downstream from it, all the data passing through it.
  • A buffered flow writes data to disk, in the layout of the component downstream from it, when its buffers overflow.
  • Many program components — Sort is one example — write, then read and remove, temporary files in their layouts.
  • A checkpoint in a continuous graph writes files in the layout of every component as it moves through the graph.

Q.Can we load multiple files? Answer: Loading multiple files here means writing into more than one file at a time. Ab Initio provides a component called Write Multiple Files (in the Dataset component group) which can write several files at a time. The files to be written must be local files, i.e. they should reside on the machine running the graph. For more information on this component, read the help file.

Q.How would you do performance tuning for an already built graph? Can you give some examples?
Answer: Examples: 1) A Sort component placed immediately in front of a Merge component may be unnecessary, since Merge expects its inputs to be sorted already; check whether the extra Sort adds anything. 2) Use a lookup instead of a Join/Merge component where the reference data is small. 3) If we want to join the data coming from 2 files without duplicates, a union-style operation can replace an additional dedup component.

Q.Which is faster to process, fixed-length DMLs or delimited DMLs, and why?
Answer: Fixed-length DMLs are faster, because the data is read directly by length without any comparisons; with delimited DMLs every character has to be compared against the delimiter, which causes delays.

Q.What function would you use to transfer a string into a decimal?
Answer: To convert a string to a decimal we typecast it using the following syntax: out.decimal_field :: (decimal(size_of_decimal)) string_field; This statement converts the string to a decimal and populates the decimal field in the output.

Q.What is the importance of the EME in Ab Initio?
Answer: The EME is a repository in Ab Initio; it is used for check-in and checkout of graphs and also maintains graph versions.

Q.How do you add default rules in the transformer?
Answer: Double-click the transform parameter on the Parameters tab of the component's properties to open the Transform Editor. In the Transform Editor, click the Edit menu and select Add Default Rules from the dropdown.
It shows two options: 1) Match Names, 2) Wildcard.

Q.What is data mapping and data modeling?
Answer: Data mapping deals with the transformation of the extracted data at field level, i.e. the transformation of a source field into a target field as specified by the mapping defined on the target field. The data mapping is specified during cleansing of the data to be loaded. For example:
source: string(35) name = "Siva Krishna ";
target: string("01") nm = NULL(""); /* maximum length is string(35) */
Then we can have a mapping like: straight move, then trim the leading and trailing spaces. This mapping specifies the transformation of the field nm.

Q.What are the continuous components in Ab Initio?
Answer: Continuous components are used to build graphs that produce useful output while running continuously, e.g. Continuous Rollup, Continuous Update, Batch Subscribe.

Q.How do you add default rules in the transformer?
Answer: Click the transformer, go to Edit, then click Add Default Rules. Ab Initio also has a concept called rule priority, in which you can assign priorities to the rules in a transformer.
Let’s take an example:
out.var1 :1: in.var1 + 100;
out.var1 :2: 100;
This shows that the output variable is assigned the input variable plus 100, or, if the input variable does not have a value, the default value 100 is assigned to the output variable. The numbers 1 and 2 represent the priorities.

Q.How do we run sequences of jobs, where the output of job A is the input to job B? How do we coordinate the jobs?
Answer: By writing wrapper scripts we can control the sequence of execution of more than one job.

Q.What are BROADCAST and REPLICATE?
Answer: Broadcast takes data from multiple inputs, combines it, and sends it to all the output ports. E.g. you have 2 incoming flows (this can be data parallelism or component parallelism) into a Broadcast component, one with 10 records and the other with 20; then every outgoing flow (there can be any number of flows) will carry 10 + 20 = 30 records. Replicate replicates the data of a particular partition and sends it out to multiple output ports of the component, but maintains the partition integrity. E.g. your incoming flow to Replicate has a data-parallelism level of 2, with one partition holding 10 records and the other 20. If you have 3 output flows from Replicate, each flow will have 2 data partitions with 10 and 20 records respectively.

Q.When using multiple DML statements to perform a single unit of work, is it preferable to use implicit or explicit transactions, and why?
Answer: Implicit transactions are used for internal processing, while explicit transactions are opened explicitly where the user needs control over the unit of work.

Q.What kinds of layouts does Ab Initio support?
Answer: Basically, serial and parallel layouts are supported by Ab Initio, and a graph can have both at the same time. The parallel one depends on the degree of data parallelism.
If the multifile system is 4-way parallel, a component in the graph can run 4 ways parallel if its layout matches that degree of parallelism.

Q.What is the difference between a lookup file and a lookup, with a relevant example?
Answer: A lookup is a component of an Ab Initio graph where we can store data and retrieve it using a key parameter. A lookup file is the physical file where the data for the lookup is stored.

Q.How will you test a .dbc file from the command prompt?
Answer: A .dbc file can be tested using the m_db command, e.g.: m_db test <dbc_filename>

Q.Can we merge two graphs?
Answer: You cannot merge two Ab Initio graphs. You can use the output of one graph as the input of another, and you can also copy/paste contents between graphs.

Q.Explain the differences between API and utility mode.
Answer: API and utility are database interface modes. API mode uses SQL, and table constraints are checked against the data as it is loaded into the database. Utility mode uses bulk loading: table constraints are disabled first, the data is loaded, and then the constraints are checked against the data. Loading in utility mode is faster than in API mode. If a crash occurs while loading, API mode supports commit and rollback, whereas in utility mode the whole load must be redone.

Q.How do you schedule graphs in Ab Initio, like workflow scheduling in Informatica? And where must we use UNIX shell scripting in Ab Initio?
Answer: We can use Autosys, Control-M, or any other external scheduler to schedule graphs in Ab Initio. We can take care of dependencies in many ways. For example, if scripts should run sequentially, we can arrange this in Autosys, or we can create a wrapper script containing several sequential commands (nohup command1.ksh; nohup command2.ksh; etc.). We can even create a special graph in Ab Initio to execute individual scripts as needed.

Q.What is the Environment project in Ab Initio?
Answer: The Environment project is a special public project that exists in every Ab Initio environment. It contains all the environment parameters required by the private or public projects that constitute the Ab Initio standard environment.

Q.What is component folding? What is its use?
Answer: Component folding is a feature by which the Co>Operating System combines a group of components and runs them as a single process. Component folding improves the performance of a graph. Prerequisites for component folding:

  • The components must be foldable.
  • They must be in same phase and layout.
  • Components must be connected via straight flow

Q.How do you debug a graph if an error occurs while running? Answer: There are many ways to debug a graph; we can use:

  • Debugger
  • File Watcher
  • Intermediate files for debugging purposes.

Q.What do you mean by $RUN? Answer: $RUN is a parameter variable that contains the path of the project sandbox's run directory. Use this parameter instead of a hard-coded value; it is the default sandbox run-directory parameter. fin -------> top-level directory ( $AI_PROJECT )

  • |---- mp -------> second-level directory ( $AI_MP )
  • |---- xfr -------> second-level directory ( $AI_XFR )
  • |---- run --------> second-level directory ( $AI_RUN )
  • |---- dml -------> second-level directory ( $AI_DML )

Q.What is the importance of the EME in Ab Initio? Answer: The EME is a repository in Ab Initio, used for check-in and checkout of graphs; it also maintains graph versions. The EME is the source-code control system of the Ab Initio world. It is the repository where all the project-related code from the sandboxes (graph versions) is maintained; we simply check graphs out, modify them, and check them back in. A lock is put on an object once it is accessed by a user.

Q.What is the difference between sandbox parameters and graph parameters? Answer: Sandbox parameters are common parameters for the project and are accessible anywhere within the project. Graph parameters are used within a graph and cannot be accessed from other graphs; they are called local parameters.

Q.How do you connect the EME to the Ab Initio server? Answer: There are several ways of connecting to the EME:

  • Set AB_AIR_ROOT.
  • From the GDE, connect to the EME datastore.
  • Log in to the EME web interface.
  • Use the air command-line utility.

Q.What is the use of the Co>Operating System between the GDE and the host? Answer: The Co>Operating System is the heart of the GDE. It always refers to the host settings, environment variables, and functions while running graphs through the GDE, and it interfaces the connection settings between the host and the GDE.

Q.What is the use of a sandbox? What is it? Answer: A sandbox is a directory structure, each level of which is assigned a variable name, used to manage check-in and checkout of repository-based objects such as mp, run, dml, db, xfr, and sql (graphs, graph ksh files, wrapper scripts, dml files, xfr files, dbc files, sql files):

fin -------> top-level directory ( $AI_PROJECT )
  |---- mp -------> second-level directory ( $AI_MP )
  |---- xfr -------> second-level directory ( $AI_XFR )
  |---- run --------> second-level directory ( $AI_RUN )
  |---- dml -------> second-level directory ( $AI_DML )

The sandbox contains various directories, each used for a specific purpose. The mp directory stores the graphs (the data-mapping details between sources and targets, or components), with the file extension *.mp. The xfr directory stores the transform files, with the extension *.xfr. The dml directory stores the record-format (metadata) files using Ab Initio-supported data types, with the extension *.dml. The run directory contains the graphs' shell-script (Korn shell) files, which are created when a graph is deployed.

Q.What is meant by the EME datastore and what is its use in the enterprise world? Answer: The EME datastore (Enterprise Meta>Environment datastore) is an enterprise repository containing any number of projects (sandboxes) that share metadata between them. These sandbox project objects (mp, run, db, xfr, dml) can easily be managed through check-in and checkout of the repository objects.
Mode: In the EME Data-store Mode box of the EME Data-store Settings dialog, choose one of the following. Source Code Control: this is the recommended setting; when you set a datastore to this mode, you must check out a project in order to work on it, which prevents multiple users from making conflicting changes to a project. Full Access: this setting is strongly discouraged and is for advanced users only; it allows you to edit a project in the datastore without checking it out. Save Script When Graph Saved to Sandbox: in the EME Data-store Settings dialog, select this option to have the GDE save the script it generates for a graph when you save the graph. The script lets you run the graph without the GDE if, for example, you relocate the project.