Top Pentaho Interview Questions and Answers

Last updated on Feb 18 2022
Anudhati Reddy


What is Pentaho?

Pentaho addresses the barriers that block an organization’s ability to get value from all of its data. It is designed so that every member of a team, from developers to business users, can easily turn data into value.

Define the Pentaho BI Project?

The Pentaho BI Project is an ongoing effort by the Open Source community to provide organizations with best-in-class solutions for their enterprise Business Intelligence (BI) needs.

What major applications does the Pentaho BI Project comprise?

 The Pentaho BI Project encompasses the following major application areas:

  • Business Intelligence Platform
  • Data Mining
  • Reporting
  • Dashboards

Is Pentaho a Trademark?

 Yes, Pentaho is a trademark.

What do you understand by Pentaho Metadata?

 Pentaho Metadata is a piece of the Pentaho BI Platform designed to make it easier for users to access information in business terms.

What kind of data does a cube contain?

 The Cube will contain the following data:

  • 3 Fact fields – Sales, Costs, and Discounts
  • Time Dimension – with the following hierarchy: Year, Quarter, and Month
  • 2 Customer Dimensions – one with location (Region, Country) and the other with Customer Group and Customer Name
  • Product Dimension – containing a Product Name

How to do a database join with PDI?

  • If we want to join two tables from the same database, we can use a “Table Input” step and do the join in SQL itself.
  • If we want to join two tables that are not in the same database, we can use the “Database Join” step.
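As a sketch of the first option, here is the kind of join a “Table Input” step would push down to the database itself. SQLite stands in for the real source database, and the table and column names are invented for illustration:

```python
import sqlite3

# In-memory stand-in for the source database a "Table Input" step would query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0);
""")

# The join is done in SQL, inside the database, exactly once.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(rows)  # [('Acme', 350.0), ('Globex', 75.0)]
```

Pushing the join down like this is usually much faster than joining row by row in the ETL layer.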

How do you insert booleans into a MySQL database? PDI encodes a boolean as ‘Y’ or ‘N’, which can’t be inserted into a BIT(1) column in MySQL.

  • BIT is not a standard SQL data type. It’s not even standard on MySQL as the meaning (core definition) changed from MySQL version 4 to 5.
  • Also, a BIT uses 2 bytes on MySQL. That’s why in PDI we made the safe choice and went for a char(1) to store a boolean.

There is a simple workaround available: change the data type with a Select Values step to “Integer” in the metadata tab. This converts it to 1 for “true” and 0 for “false”, just like MySQL expects.
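The Select Values workaround described above amounts to the following mapping (a minimal sketch, not PDI’s actual code):

```python
def pdi_bool_to_int(value: str) -> int:
    """Map PDI's 'Y'/'N' boolean encoding to the 1/0 a MySQL BIT/TINYINT column expects."""
    if value == "Y":
        return 1
    if value == "N":
        return 0
    raise ValueError(f"not a PDI boolean: {value!r}")

print(pdi_bool_to_int("Y"), pdi_bool_to_int("N"))  # 1 0
```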

What are the benefits of Pentaho?

  • Open Source
  • Has a community that supports its users
  • Runs well on multiple platforms (Windows, Linux, Macintosh, Solaris, Unix, etc.)
  • Offers a complete package: reporting, ETL for data warehousing, an OLAP server, data mining, and dashboards

What are the applications of Pentaho?

1. The Pentaho Suite:

  • BI Platform (JBoss Portal)
  • Pentaho Dashboard
  • JFreeReport
  • Mondrian
  • Kettle
  • Weka

2. All are built on the Java platform.

Define Pentaho Data Mining?

 Pentaho Data Mining uses the Waikato Environment for Knowledge Analysis (Weka) to search data for patterns. It has functions for data processing, regression analysis, classification methods, etc.

Brief about the Pentaho Report designer?

 It is a visual, banded report writer, with features such as subreports, charts, and graphs.

Explain the Encrypting File system?

 It is the technology that enables files to be transparently encrypted to secure personal data from attackers with physical access to the computer.

What do you mean by repository?

 A repository is a storage location where we can store data safely, without any harm.

What is metadata?

 Metadata is stored in the repository by associating information with individual objects in the repository.

What are snapshots?

 Snapshots are read-only copies of a master table located on a remote node which can be periodically refreshed to reflect changes made to the master table.
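A rough sketch of the idea, using SQLite in place of a remote node (table names invented): the snapshot is a plain copy that goes stale until it is refreshed from the master.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE master (id INTEGER PRIMARY KEY, qty INTEGER)")
conn.execute("INSERT INTO master VALUES (1, 10), (2, 20)")

# Take the initial snapshot: a plain copy of the master table.
conn.execute("CREATE TABLE snapshot AS SELECT * FROM master")

# The master changes; the snapshot does not, until it is refreshed.
conn.execute("UPDATE master SET qty = 99 WHERE id = 1")
stale = conn.execute("SELECT qty FROM snapshot WHERE id = 1").fetchone()[0]

# Periodic refresh: rebuild the snapshot from the master.
conn.execute("DELETE FROM snapshot")
conn.execute("INSERT INTO snapshot SELECT * FROM master")
fresh = conn.execute("SELECT qty FROM snapshot WHERE id = 1").fetchone()[0]

print(stale, fresh)  # 10 99
```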

What is data staging?

 Data staging is actually a group of procedures used to prepare source system data for loading a data warehouse.

  • Full Load means completely erasing the contents of one or more tables and filling them with fresh data.
  • Incremental Load means applying ongoing changes to one or more tables based on a predefined schedule.
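The two load styles can be sketched as follows, again with SQLite and invented table names; a `loaded` flag stands in for whatever change-tracking a real schedule would use:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (id INTEGER PRIMARY KEY, loaded INTEGER DEFAULT 0);
    CREATE TABLE warehouse (id INTEGER PRIMARY KEY);
    INSERT INTO source (id) VALUES (1), (2), (3);
""")

def full_load(conn):
    # Full load: erase the target completely, then reload everything.
    conn.execute("DELETE FROM warehouse")
    conn.execute("INSERT INTO warehouse SELECT id FROM source")
    conn.execute("UPDATE source SET loaded = 1")

def incremental_load(conn):
    # Incremental load: apply only the rows not yet loaded.
    conn.execute("INSERT INTO warehouse SELECT id FROM source WHERE loaded = 0")
    conn.execute("UPDATE source SET loaded = 1 WHERE loaded = 0")

full_load(conn)
conn.execute("INSERT INTO source (id) VALUES (4)")   # a new row arrives later
incremental_load(conn)

count = conn.execute("SELECT COUNT(*) FROM warehouse").fetchone()[0]
print(count)  # 4
```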

Define mapplet?

 A mapplet creates and configures a set of transformations.

What do you understand by a three-tier data warehouse?

 A data warehouse is said to be a three-tier system, where a middle system provides usable data in a secure way to end users. On either side of this middle system are the end users and the back-end data stores.

What are the different versions of Informatica?

 Informatica PowerCenter 4.1, Informatica PowerCenter 5.1, Informatica PowerCenter 6.1.2, Informatica PowerCenter 7.1.2, etc.

What are the various tools in ETL?

 Ab Initio, DataStage, Informatica, Cognos Decision Stream, etc.

Define MDX?

 MDX (Multidimensional Expressions) is the main query language implemented by Mondrian.

How do you duplicate a field in a row in a transformation?

 Several solutions exist:

  • Use a “Select Values” step, renaming a field while also selecting the original one. The result is that the original field is duplicated under another name: fieldA is copied to fieldB and fieldC.
  • Use a Calculator step with, e.g., the NVL(A, B) operation. This has the same effect as the first solution: 3 fields in the output which are copies of each other: fieldA, fieldB, and fieldC.
  • Use a JavaScript step to copy the field. This has the same effect as the previous solutions: 3 fields in the output which are copies of each other: fieldA, fieldB, and fieldC.

We will be using PDI integrated into a web application deployed on an application server. We’ve created a JNDI data source in our application server. Of course, Spoon doesn’t run in the context of the application server, so how can we use the JNDI data source in PDI?

  • If you look in the PDI main directory you will see a sub-directory “simple-jndi”, which contains a file called “jdbc.properties”. You should change this file so that the JNDI information matches the one you use in your application server.
  • After that, in the connection tab of Spoon, set the “Method of access” to JNDI, the “Connection type” to the type of database you’re using, and the “Connection name” to the name of the JNDI data source (as used in “jdbc.properties”).
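For illustration, a simple-jndi entry in “jdbc.properties” typically takes the following shape; the data source name, driver, URL, and credentials below are invented placeholders, not values from any real installation:

```properties
SampleData/type=javax.sql.DataSource
SampleData/driver=com.mysql.jdbc.Driver
SampleData/url=jdbc:mysql://localhost:3306/sampledb
SampleData/user=dbuser
SampleData/password=dbpassword
```

The part before the slash is the JNDI name you would then enter as the “Connection name” in Spoon.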

The Text File Input step has a Compression option that allows you to select Zip or Gzip, but it will only read the first file in Zip. How can I use Apache VFS support to handle tarballs or multi-file zips?

 The catch is to specifically restrict the file list to the files inside the compressed collection.  Some examples:
You have a file with the following structure:

  • logs.tar.gz, containing:
      • log.1
      • log.2
      • log.3

To read each of these files in a File Input step:

File/Directory: tar:gz:/path/to/access.logs.tar.gz!/access.logs.tar!
Wildcard: .+

You have a simpler file, fat-access.log.gz.  You could use the Compression option of the File Input step to deal with this simple case, but if you wanted to use VFS instead, you would use the following specification:

 Note: If you only wanted certain files in the tarball, you could certainly use a wildcard like access.log..* or something.  .+ is the magic if you don’t want to specify the children’s filenames.  .* will not work because it will include the folder (i.e. tar:gz:/path/to/access.logs.tar.gz!/access.logs.tar!/ )

File/Directory: gz:file://c:/path/to/fat-access.log.gz!
Wildcard: .+

Finally, if you have a zip file with the following structure:

access.logs.zip/
  a-root-access.log
  subdirectory1/
    subdirectory-access.log.1
    subdirectory-access.log.2
  subdirectory2/
    subdirectory-access.log.1
    subdirectory-access.log.2

You might want to access all the files, in which case you’d use:

File/Directory: zip:file://c:/path/to/access.logs.zip!
Wildcard: a-root-access.log

File/Directory: zip:file://c:/path/to/access.logs.zip!/subdirectory1
Wildcard: subdirectory-access.log.*

File/Directory: zip:file://c:/path/to/access.logs.zip!/subdirectory2
Wildcard: subdirectory-access.log.*

Note: For some reason, the .+ wildcard doesn’t work in the subdirectories; they still show the directory entries.

Compare Pentaho and Tableau.

Criteria        Pentaho                         Tableau
Functionality   ETL, OLAP, and static reports   Data analytics
Availability    Open source                     Proprietary
Strengths       Data integration                Interactive visualizations

Define Pentaho and its usage.

Revered as one of the most efficient and resourceful data integration (DI) tools, Pentaho supports virtually all available data sources and allows scalable data clustering and data mining. It is a lightweight Business Intelligence suite executing Online Analytical Processing (OLAP) services, ETL functions, report and dashboard creation, and other data-analysis and visualization operations.

Name major applications comprising Pentaho BI Project.

  • Business Intelligence Platform
  • Dashboards and Visualizations
  • Data Mining
  • Data Analysis
  • Data Integration and ETL (also called Kettle)
  • Data Discovery and Analysis (OLAP)

What is the importance of metadata in Pentaho?

A metadata model in Pentaho maps the physical structure of your database into a logical business model. These mappings are stored in a central repository and allow developers and administrators to build business-logical DB tables that are cost-effective and optimized. It further simplifies the work of business users, allowing them to create formatted reports and dashboards while ensuring secure data access.
All in all, the metadata model provides an encapsulation around the physical definitions of your database and their logical representation, and defines the relationships between them.

What is MDX and its usage?

MDX is an acronym for ‘Multi-Dimensional Expressions,’ the standard query language introduced by Microsoft SQL OLAP Services. MDX is an integral part of the XML for Analysis API and has a different structure than SQL. A basic MDX query is:

SELECT {[quantity].[Unit Sales], [quantity].[Store Sales]} ON COLUMNS,
       {[Product].members} ON ROWS
FROM [Sales]
WHERE [Time].[1999].[2]

Define three major types of Data Integration Jobs.

  • Transformation Jobs: Used for preparing data; used only when there is no change in the data until the transformation job is finished.
  • Provisioning Jobs: Used for transmitting/transferring large volumes of data; used only when no data change is allowed unless the job transformation is done, and on large provisioning requirements.
  • Hybrid Jobs: Execute both transformation and provisioning jobs. No limitations on data changes; data can be updated regardless of success/failure. The transformation and provisioning requirements are not large in this case.

Explain how to sequentialize transformations?

Since PDI transformations support parallel execution of all the steps/operations, it is impossible to sequentialize transformations in Pentaho. To make this happen, users would need to change the core architecture, which would actually result in slow processing.

Explain Pentaho Reporting Evaluation.

Pentaho Reporting evaluation is a complete package of its reporting abilities, activities and tools, specifically designed for first-phase evaluation like accessing the sample, generating and updating reports, viewing them and performing various interactions. This evaluation consists of Pentaho platform components, Report Designer and ad hoc interface for reporting used for local installation.

How to use database connections from repository?

You can either create a new transformation/job or close and reopen the ones already loaded in Spoon.

Explain in brief the concept Pentaho Dashboard.

Dashboards are collections of various information objects on a single page, including diagrams, tables, and textual information. The Pentaho AJAX API is used to extract BI information, while the Pentaho Solution Repository contains the content definitions.

The steps involved in Dashboard creation include

  • Adding dashboard to the solution.
  • Defining dashboard content.
  • Implementing filters.
  • Editing dashboards.

Explain Pentaho Report Designer (PRD).

PRD is a graphical tool to execute report-editing functions and create simple and advanced reports, and it helps users export them as PDF, Excel, HTML, and CSV files. PRD consists of a Java-based report engine offering data integration, portability, and scalability. Thus, it can be embedded in Java web applications and also in other application servers like the Pentaho BA Server.

Define Pentaho Report types.

There are several categories of Pentaho reports:

  • Transactional Reports: Data comes from transactions. The objective is to publish detailed and comprehensive data for day-to-day organizational activities, like purchase orders and sales reporting.
  • Tactical Reports: Data comes from daily or weekly transactional data summaries. The objective is to present short-term information for instant decision making, like replacing merchandise.
  • Strategic Reports: Data comes from stable and reliable sources to create long-term business information reports, like seasonal sales analysis.
  • Helper Reports: Data comes from various resources and includes images and videos to present a variety of activities.

Explain MDX.

Multidimensional Expressions (MDX) is a query language for OLAP databases, much like SQL is a query language for relational databases. It is also a calculation language, with syntax similar to spreadsheet formulas.

Define Tuple?

A finite ordered list of elements is called a tuple.

Differentiate between transformations and jobs?

Transformations are about moving and transforming rows from source to target.
Jobs are more about high-level flow control.


What is the ETL process? Write the steps also.

ETL is the Extract, Transform, Load process. The steps are:

1. Define the source
2. Define the target
3. Create the mapping
4. Create the session
5. Create the workflow
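The steps above can be sketched as a minimal ETL flow; the source rows, table, and column names are invented, and SQLite stands in for the target:

```python
import sqlite3

# Extract: rows from a hypothetical source (an in-memory list standing in for a file or table).
source_rows = [("alice", "23"), ("bob", "31")]

# Transform: clean the names and convert ages to integers.
transformed = [(name.title(), int(age)) for name, age in source_rows]

# Load: write into the target table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", transformed)

result = conn.execute("SELECT name, age FROM people ORDER BY name").fetchall()
print(result)  # [('Alice', 23), ('Bob', 31)]
```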

Define mapping?

Data flow from source to target is called mapping.

Explain session?

It is a set of instructions that tells when and how to move data from a given source to the target.

What is Workflow?

It is a set of instructions that tells the Informatica server how to execute the tasks.

What is XML?

XML is an extensible markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.


Define multi-dimensional cube?

It is a cube for viewing data, where we can slice and dice the data. It has a time dimension, locations, and figures.
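A toy illustration of slicing and dicing such a cube (the dimension and figure values are invented):

```python
# A toy cube: cells keyed by (year, region) dimensions, holding a sales figure.
cube = {
    (2021, "East"): 100, (2021, "West"): 150,
    (2022, "East"): 120, (2022, "West"): 180,
}

def slice_cube(cube, year):
    """Slice: fix one dimension (a single year) and return the remaining dimension."""
    return {region: v for (y, region), v in cube.items() if y == year}

def dice_cube(cube, years, regions):
    """Dice: restrict several dimensions to subsets at once."""
    return {k: v for k, v in cube.items() if k[0] in years and k[1] in regions}

print(slice_cube(cube, 2022))             # {'East': 120, 'West': 180}
print(dice_cube(cube, {2021}, {"West"}))  # {(2021, 'West'): 150}
```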


How does Pentaho Metadata work?

 With the help of Pentaho’s open-source metadata capabilities, administrators can outline a layer of abstraction that presents database information to business users in familiar business terms.

Explain the use of Pentaho reporting?

 Pentaho Reporting enables businesses to create structured and informative reports to easily access, format, and deliver meaningful information to clients and customers. It also helps business users analyze and track consumer behavior over a specific time and functionality, thereby directing them toward the right path to success.

What is Pentaho Data Mining?

 Pentaho Data Mining refers to the Weka Project, which consists of a detailed toolset for machine learning and data mining. Weka is open-source software for extracting large amounts of information about users, clients, and businesses. It is built on Java.


What are the variables and arguments in transformations?

 The transformation dialog box consists of two different tables: one of arguments and the other of variables. While arguments refer to the command line specified during batch processing, PDI variables refer to objects that are set in a previous transformation/job or in the OS.

How to perform a database join with PDI (Pentaho Data Integration)?

 PDI supports joining two tables from the same database using a ‘Table Input’ step, performing the join in SQL itself. On the other hand, for joining two tables in different databases, users implement the ‘Database Join’ step. However, in the database join, a query executes on the target system for each input row from the main stream, resulting in lower performance as the number of queries run against the database increases. To avoid this situation, there is yet another option to join rows from two different Table Input steps: you can use the ‘Merge Join’ step, using SQL queries with an ‘ORDER BY’ clause. Remember, the rows must be perfectly sorted before implementing the merge join.
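The merge join idea can be sketched as follows; this is a simplified illustration assuming at most one row per key on each side, not PDI’s actual implementation:

```python
def merge_join(left, right):
    """Merge-join two row streams already sorted by their first field (the join key).
    A sketch of the idea behind a Merge Join step, assuming unique keys per side."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Same key: emit the combined row. (Duplicate keys on both sides
            # would need an extra inner loop; omitted to keep the sketch short.)
            result.append(left[i] + right[j][1:])
            i += 1
            j += 1
    return result

left = [(1, "Acme"), (2, "Globex"), (4, "Initech")]  # ORDER BY id
right = [(2, 75.0), (3, 10.0), (4, 99.0)]            # ORDER BY id
print(merge_join(left, right))  # [(2, 'Globex', 75.0), (4, 'Initech', 99.0)]
```

Because each input is consumed in key order exactly once, the sort is what makes a single streaming pass possible.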


Explain Hierarchy Flattening.

It is simply the construction of parent-child relationships in a database. Hierarchy flattening uses both horizontal and vertical formats, which enables easy and trouble-free identification of sub-elements. It further allows users to understand and read the main hierarchy of BI and includes a Parent column, a Child column, Parent attributes, and Child attributes.
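A minimal sketch of flattening a parent-child hierarchy into root-to-leaf paths (the node names are invented):

```python
# Parent-child rows: (child, parent); a parent of None marks the root.
rows = [("Company", None), ("Sales", "Company"), ("EMEA", "Sales"), ("APAC", "Sales")]

parent_of = {child: parent for child, parent in rows}

def flatten(node):
    """Walk up the parent links and return the root-to-node path (one 'flattened' row)."""
    path = []
    while node is not None:
        path.append(node)
        node = parent_of[node]
    return list(reversed(path))

print(flatten("EMEA"))  # ['Company', 'Sales', 'EMEA']
```

Each returned path corresponds to one horizontal row of level columns in a flattened hierarchy table.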

Which platform benefits from the Pentaho BI Project?

  • Java developers, who generally use project components to rapidly assemble custom BI solutions
  • ISVs, who can improve the value and capability of their solutions by embedding BI functionality
  • End users, who can quickly deploy packaged BI solutions that are comparable or superior to traditional commercial offerings at a dramatically lower cost


How to configure JNDI for Pentaho DI Server?

Pentaho offers JNDI connection configuration for local DI to avoid continuously running the application server during the development and testing of transformations. Edit the properties in the jdbc.properties file located at …\data-integration-server\pentaho-solutions\system\simple-jndi.

Mention the major features of Pentaho?

  • Direct Analytics on MongoDB: Authorizes business analysts and IT to access, analyze, and visualize MongoDB data.
  • Data Science Pack: Operationalizes analytical modeling and machine learning while allowing data scientists and developers to offload the labor of data preparation to Pentaho Data Integration.
  • Full YARN Support for Hadoop: Pentaho’s YARN integration enables organizations to exploit the full computing power of Hadoop while leveraging existing skill sets and technology investments.


What is Pentaho Reporting Evaluation?

 Pentaho Reporting Evaluation is a particular package of a subset of the Pentaho Reporting capabilities, designed for typical first-phase evaluation activities such as accessing sample data, creating and editing reports, and viewing and interacting with reports.



By default all steps in a transformation run in parallel. How can we make one row get processed completely until the end before the next row is processed?

 This is not possible, as in PDI transformations all the steps run in parallel, so we can’t sequence them. This would require architectural changes to PDI, and sequential processing would also result in very slow processing.

Why can’t we duplicate field names in a single row?

 We can’t; PDI will complain in most cases if we have duplicate field names. Before PDI v2.5.0 we were able to force duplicate fields, but even then only the first value of the duplicate fields could ever be used.

Differentiate between Arguments and variables?

  • Arguments are command-line arguments that we would normally specify during batch processing.
  • Variables are environment or PDI variables that we would normally set in a previous transformation in a job.

What do you understand by the term Pentaho Dashboard?

 Pentaho Dashboards give business users the critical information they need to understand and improve organizational performance.

What is the use of Pentaho reporting?

 Pentaho Reporting allows organizations to easily access, format, and deliver information to employees, customers, and partners.

Define Pentaho Schema Workbench?

 Pentaho Schema Workbench offers a graphical interface for designing OLAP cubes for Pentaho Analysis.

What do you understand by the term ETL?

 ETL (Extract, Transform, Load) is the process of extracting data from source systems, transforming it, and loading it into a target such as a data warehouse.

What do you understand by hierarchical navigation?

 A hierarchical navigation menu allows the user to come directly to a section of the site several levels below the top.

What are the steps to Decrypt a folder or file?

  • Right-click on the folder or file we want to decrypt, and then click on the Properties option.
  • Click the General tab, and then click Advanced.
  • Clear the Encrypt contents to secure data checkbox, click OK, and then click OK again.

Explain why we need the ETL tool?

 An ETL tool is used to get data from many source systems, like RDBMSs, SAP, etc., and transform it based on user requirements. It is required when data flows across many systems.



What is ODS?

 An ODS is an Operational Data Store, which comes in between the staging area and the data warehouse.

Differentiate between an ETL tool and an OLAP tool?

 An ETL tool is used for extracting data from the legacy system and loading it into the specified database, with some processing for cleansing the data.
An OLAP tool is used for the reporting process. Here data is available in a multidimensional model, so we can write simple queries to extract data from the database.


I’ve got a transformation that doesn’t run fast enough, but it is hard to tell in what order to optimize the steps. What should I do?

  • Transformations stream data through their steps.
  • That means that the slowest step is going to determine the speed of a transformation.
  • So you optimize the slowest steps first. How can you tell which step is the slowest? Look at the size of the input buffer in the log view.
  • In the latest 3.1.0-M1 nightly build you will also find a graphical overview of this: http://www.ibridge.be/?p=92 (the “graph” button at the bottom of the log view will show the details).
  • A slow step will have consistently large input buffer sizes; a fast step will consistently have low input buffer sizes.
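The streaming behavior described above can be illustrated with Python generators (a sketch only; PDI actually runs each step in its own thread with row buffers in between):

```python
log = []

def step1(rows):
    # First pipeline step: transform each row and pass it downstream immediately.
    for r in rows:
        log.append(f"step1:{r}")
        yield r * 10

def step2(rows):
    # Second pipeline step: consumes rows as soon as step1 produces them.
    for r in rows:
        log.append(f"step2:{r}")
        yield r

# Pull three rows through the two-step pipeline.
out = list(step2(step1([1, 2, 3])))

# Rows stream one at a time: step2 handles row 1 before step1 touches row 2,
# so no step ever waits for a full batch; the slowest step sets the pace.
print(log)
print(out)  # [10, 20, 30]
```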

Explain Pentaho Data Integration architecture?


Spoon is the design interface for building ETL jobs and transformations. Spoon provides a drag-and-drop interface that allows you to graphically describe what you want to take place in your transformations. Transformations can then be executed locally within Spoon, on a dedicated Data Integration Server, or on a cluster of servers.

The Data Integration Server is a dedicated ETL server whose primary functions are:

  • Execution: Executes ETL jobs and transformations using the Pentaho Data Integration engine.
  • Security: Allows you to manage users and roles (default security) or integrate security with your existing security provider, such as LDAP or Active Directory.
  • Content Management: Provides a centralized repository that allows you to manage your ETL jobs and transformations. This includes full revision history on content and features such as sharing and locking for collaborative development environments.
  • Scheduling: Provides the services allowing you to schedule and monitor activities on the Data Integration Server from within the Spoon design environment.

Pentaho Data Integration is composed of the following primary components:

  • Spoon. Introduced earlier, Spoon is a desktop application that provides a graphical interface and editor for transformations and jobs. Spoon lets you create complex ETL jobs without having to read or write code. When you think of Pentaho Data Integration as a product, Spoon is what comes to mind, because as a developer this is the application on which you will spend most of your time: any time you author, edit, run, or debug a transformation or job, you will be using Spoon.
  • Pan. A standalone command-line tool for executing transformations you created in Spoon. The Pan data transformation engine reads data from and writes data to various data sources and also lets you manipulate data.
  • Kitchen. A standalone command-line tool for executing jobs designed in the Spoon graphical interface, stored either as XML or in a database repository. Jobs are usually scheduled to run in batch mode at regular intervals.
  • Carte. Carte is a lightweight web container that lets you set up a dedicated, remote ETL server. It provides remote execution capabilities similar to the Data Integration Server, but without scheduling, security integration, or a content management system.

(Figure: Pentaho Data Integration components)

Explain the important features of Pentaho.

  • Pentaho can create advanced reporting algorithms regardless of the input and output data format.
  • It supports various report formats, including Excel spreadsheets, XML, PDF documents, and CSV files.
  • It is professionally certified DI software from Pentaho, headquartered in Florida, United States.
  • It offers enhanced functionality, including in-Hadoop execution.
  • It allows dynamic drill-down into progressively greater detail.
  • It offers rapid, interactive response optimization.
  • It lets users explore and view multidimensional data.

Define Pentaho Reporting Evaluation.

Pentaho Reporting Evaluation is a particular package of a subset of the Pentaho Reporting capabilities, designed for typical first-phase evaluation activities such as accessing sample data, creating and editing reports, and viewing and interacting with reports.

Explain the benefits of Data Integration.

  • The biggest benefit is that integrating data improves consistency and reduces conflicting and erratic data in the database. Integration lets users fetch exactly what they are looking for, so they can utilize and work with what they collected.
  • Accurate data extraction, which in turn facilitates flexible reporting and monitoring of the available volumes of data.
  • It helps meet deadlines for effective business management.
  • It tracks customer information and buying behavior to improve traffic and conversions in the future, thus advancing business performance.

Illustrate the difference between transformations and jobs.

While transformations shift and transform rows from a source system to a target system, jobs perform high-level operations such as running transformations, transferring files via FTP, and sending mail.
Another significant difference is that transformations execute their steps in parallel, whereas jobs execute their entries in order.

How to perform database join with PDI (Pentaho Data Integration)?

PDI supports joining two tables from the same database using a 'Table Input' step, performing the join in SQL itself.
For joining two tables in different databases, use the 'Database Join' step. However, with a database join a query executes on the target system for each input row from the main stream, so performance drops as the number of queries run against the database increases.
To avoid this, you can instead join rows coming from two different Table Input steps with the 'Merge Join' step, using SQL queries that include an 'ORDER BY' clause. Remember, the rows must be perfectly sorted before the merge join runs.
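The sorted-merge idea behind the 'Merge Join' step can be sketched as follows. This is an illustrative Python sketch, not PDI's implementation; the table names and data are invented, and it assumes unique keys per side.

```python
def merge_join(left, right, key):
    """Inner-join two row streams that are already sorted on `key`."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk == rk:
            out.append({**left[i], **right[j]})
            i += 1; j += 1   # simplification: assumes keys are unique per side
        elif lk < rk:
            i += 1           # advance the stream with the smaller key
        else:
            j += 1
    return out

# Each input must be sorted on the join key (the "ORDER BY" requirement).
customers = sorted([{"id": 2, "name": "Ada"}, {"id": 1, "name": "Bob"}],
                   key=lambda r: r["id"])
orders = [{"id": 1, "total": 30}, {"id": 2, "total": 99}]  # already sorted
print(merge_join(customers, orders, "id"))
```

The single forward pass over both streams is why the sort order is mandatory: if either side were unsorted, matching keys could be skipped.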

Can fieldnames in a row be duplicated in Pentaho?

No, Pentaho doesn’t allow field duplication.

Does transformation allow field duplication?

'Select Values' can rename a field while the original field is also still selected. The renamed field then carries the same name as the other field, and that duplication is not allowed.

How to use logic from one transformation/job in another process?

Transformation logic can be shared using sub-transformations, which allow seamless loading and transformation of variables and enhance the efficiency and productivity of the system. Sub-transformations can be called and reconfigured whenever required.
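As a rough analogy (not PDI's actual mechanism), a sub-transformation is like a reusable function that several different pipelines call with their own inputs; the function names and data below are invented.

```python
def clean_name(row):
    """Shared 'sub-transformation': trim and title-case a name field."""
    row["name"] = row["name"].strip().title()
    return row

# Two separate "pipelines" reuse the same logic instead of duplicating it.
pipeline_a = [clean_name({"name": "  alice "})]
pipeline_b = [clean_name({"name": "BOB"})]
print(pipeline_a, pipeline_b)
```

Centralizing the logic means a fix or improvement made once is picked up by every caller, which is exactly the benefit sub-transformations bring to PDI jobs.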

Explain the use of Pentaho reporting.

Pentaho Reporting enables businesses to create structured, informative reports that access, format, and deliver meaningful information to clients and customers. Reports also help business users analyze and track consumer behavior over a specific period, steering the business toward the right decisions.

What is Pentaho Data Mining?

Pentaho Data Mining refers to the Weka project, which consists of a detailed tool set for machine learning and data mining. Weka is open-source software for extracting large amounts of information about users, clients, and businesses. It is built in Java.

Is Data Integration and ETL Programming same?

No. Data Integration refers to passing data from one type of system to another within the same application. ETL, on the contrary, is used to extract and access data from different sources and transform it into other objects and tables.

Explain Hierarchy Flattening.

It is simply the construction of parent-child relationships in a database. Hierarchy Flattening uses both horizontal and vertical formats, which enables easy, trouble-free identification of sub-elements. It further allows users to understand and read the main BI hierarchy, and it includes a Parent column, Child column, Parent attributes, and Child attributes.
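Horizontal flattening can be sketched as walking a parent-child table up to the root so each node gets its full ancestry as columns. The region names below are invented for illustration.

```python
# Parent-child table: child -> parent (None marks the root).
edges = {"EMEA": None, "Germany": "EMEA", "Berlin": "Germany"}

def ancestry(node, parents):
    """Flatten one node into its root-first ancestry path."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return list(reversed(path))  # root first, leaf last

print(ancestry("Berlin", edges))  # -> ['EMEA', 'Germany', 'Berlin']
```

Each such path becomes one flattened row (Level1, Level2, Level3, ...), which is what makes the hierarchy easy to read and join against.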

What are variables and arguments in transformations?

The transformation dialog box contains two tables: one for arguments and one for variables. Arguments are command-line values specified during batch processing, while PDI variables are values set in a previous transformation/job or in the operating system environment.
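The distinction maps closely onto a familiar scripting pattern: arguments arrive on the command line for one run, while variables live in the environment and can be set by an earlier process. A hedged analogy (the names `run_date` and `TARGET_SCHEMA` are invented):

```python
import os
import sys

# "Argument": supplied on the command line for this run,
# e.g.  python job.py 2022-01-31
run_date = sys.argv[1] if len(sys.argv) > 1 else "1970-01-01"

# "Variable": set in the environment, possibly by an earlier job step.
os.environ.setdefault("TARGET_SCHEMA", "staging")

print(run_date, os.environ["TARGET_SCHEMA"])
```

Arguments change per invocation; variables persist across steps of the same run, which is why PDI treats them as separate tables in the dialog.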

How to configure JNDI for Pentaho DI Server?

Pentaho offers JNDI connection configuration for local DI so the application server does not need to keep running during the development and testing of transformations. Edit the properties in the jdbc.properties file located at …\data-integration-server\pentaho-solutions\system\simple-jndi.

How to sequentialize transformations?

It is not possible: in a PDI transformation, all the steps run in parallel, so they cannot be sequentialized.

How we can use database connections from repository?

We can create a new transformation, or close and re-open the ones already loaded in Spoon.

By default all steps in a transformation run in parallel, how can we make it so that 1 row gets processed completely until the end before the next row is processed?

This is not possible: in PDI transformations, all the steps run in parallel, so they cannot be sequentialized. That would require architectural changes to PDI, and sequential processing would also be very slow.

Why can’t we duplicate fieldnames in a single row?

We can't. Before PDI v2.5.0 it was possible to force duplicate fields, but even then only the first value of the duplicated fields could ever be used.
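A Python dict gives an intuition for why duplicate fieldnames are disallowed: a row's field map can hold only one value per name, so one of the duplicates is silently lost. (This is an analogy only; PDI before v2.5.0 kept the first value, whereas a dict keeps the last.)

```python
# Two fields with the same name collapse into one entry.
row = dict([("amount", 10), ("amount", 99)])
print(row)  # -> {'amount': 99}; only one of the two values survives
```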

Differentiate between Arguments and variables?

Arguments:

Arguments are command line arguments that we would normally specify during batch processing.

Variables:

Variables are environment or PDI variables that we would normally set in a previous transformation in a job.

Define Pentaho Schema Workbench?

Pentaho Schema Workbench offers a graphical interface for designing OLAP cubes for Pentaho Analysis.

Brief about Pentaho Report designer?

It is a visual, banded report writer with features such as subreports, charts, and graphs.

What do you understand by the term ETL?

ETL stands for Extract, Transform, Load: extracting data from source systems, transforming it into the required format, and loading it into a target such as a data warehouse. Pentaho Data Integration is Pentaho's ETL tool.
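A minimal ETL round-trip can be sketched in a few lines; the in-memory "CSV" source and dict target below are illustrative stand-ins for real source and target systems.

```python
# Extract: read raw records from a source (here, an in-memory "CSV").
raw = ["1,alice,100", "2,bob,250"]

# Transform: parse, type-convert, and normalize fields.
rows = [{"id": int(i), "name": n.title(), "total": float(t)}
        for i, n, t in (line.split(",") for line in raw)]

# Load: write into the target store (here, a dict keyed by id).
target = {r["id"]: r for r in rows}
print(target[2]["name"])  # -> Bob
```

Real ETL jobs add error handling, logging, and incremental logic, but the three phases stay the same.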

What are the steps to decrypt a folder or file?

  1. Right-click the folder or file you want to decrypt, and then click Properties.
  2. Click the General tab, and then click Advanced.
  3. Clear the "Encrypt contents to secure data" check box, click OK, and then click OK again.

What is metadata?

Metadata is information stored in the repository and associated with individual objects in the repository.

What is data staging?

Data staging is actually a group of procedures used to prepare source system data for loading a data warehouse.

Differentiate between Full Load and Incremental Load?

Full Load means completely erasing the contents of one or more tables and reloading them with fresh data.
Incremental Load means applying ongoing changes to one or more tables based on a predefined schedule.
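The two strategies can be sketched side by side; the dict stands in for a target table, and the data is invented.

```python
target = {1: "old", 2: "old"}

def full_load(target, source):
    target.clear()          # erase the table completely...
    target.update(source)   # ...then reload everything from the source

def incremental_load(target, changes):
    target.update(changes)  # apply only the rows that changed

full_load(target, {1: "fresh", 3: "fresh"})
incremental_load(target, {3: "patched"})
print(target)  # -> {1: 'fresh', 3: 'patched'}
```

Full loads are simple but expensive on large tables; incremental loads are cheaper per run but require tracking which rows changed since the last run.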

Define mapplet?

A mapplet creates and configures a reusable set of transformations.

What do you understand by three tier data warehouse?

A data warehouse is a three-tier system in which a middle tier provides usable data in a secure way to end users. On either side of this middle tier sit the end users and the back-end data stores.

What is ODS?

ODS stands for Operational Data Store, which sits between the staging area and the data warehouse.

Differentiate between Etl tool and OLAP tool?

An ETL tool extracts data from legacy systems and loads it into a specified database, with some cleansing and processing of the data along the way.
An OLAP tool is used for reporting: data is available in a multidimensional model, so simple queries can extract data from the database.

We will be using PDI integrated in a web application deployed on an application server. We’ve created a JNDI datasource in our application server. Of course, Spoon doesn’t run in the context of the application server, so how can we use the JNDI data source in PDI?

If you look in the PDI main directory you will see a sub-directory “simple-jndi”, which contains a file called “jdbc.properties”. You should change this file so that the JNDI information matches the one you use in your application server.

After that, in Spoon's connection tab, set the "Method of access" to JNDI, the "Connection type" to the type of database you are using, and the "Connection name" to the name of the JNDI datasource (as used in "jdbc.properties").
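The simple-jndi entries in jdbc.properties follow a name/key pattern, one datasource per name prefix. The sketch below is illustrative: the datasource name "myds" and the HSQLDB connection settings are invented placeholders, to be replaced with the values your application server uses.

```
# Illustrative simple-jndi entry; "myds" and the HSQLDB settings are examples.
myds/type=javax.sql.DataSource
myds/driver=org.hsqldb.jdbcDriver
myds/url=jdbc:hsqldb:hsql://localhost/sampledata
myds/user=pentaho_user
myds/password=password
```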

The Text File Input step has a Compression option that allows you to select Zip or Gzip, but it will only read the first file in Zip. How can I use Apache VFS support to handle tarballs or multi-file zips?

The catch is to specifically restrict the file list to the files inside the compressed collection.  Some examples:

You have a file with the following structure:

access.logs.tar.gz
access.log.1
access.log.2
access.log.3

To read each of these files in a File Input step:

File/Directory: tar:gz:/path/to/access.logs.tar.gz!/access.logs.tar!
Wildcard: .+

 Note: If you only wanted certain files in the tarball, you could use a wildcard like access.log..* instead. The .+ is the magic when you don't want to specify the children's filenames; .* will not work because it also matches the folder entry (i.e. tar:gz:/path/to/access.logs.tar.gz!/access.logs.tar!/).
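The difference between ".+" and ".*" noted above comes down to regular-expression behavior: ".*" matches the empty string while ".+" does not, and the folder entry effectively shows up as an empty child name. A quick illustration (the child names are hypothetical):

```python
import re

# "" stands in for the folder entry inside the archive.
children = ["", "access.log.1", "access.log.2"]

print([c for c in children if re.fullmatch(r".+", c)])  # folder excluded
print([c for c in children if re.fullmatch(r".*", c)])  # folder included
```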

You have a simpler file, fat-access.log.gz.  You could use the Compression option of the File Input step to deal with this simple case, but if you wanted to use VFS instead, you would use the following specification:

File/Directory: gz:file://c:/path/to/fat-access.log.gz!
Wildcard: .+

Finally, if you have a zip file with the following structure:
access.logs.zip/
a-root-access.log
subdirectory1/
subdirectory-access.log.1
subdirectory-access.log.2
subdirectory2/
subdirectory-access.log.1
subdirectory-access.log.2

You might want to access all the files, in which case you’d use:

File/Directory: zip:file://c:/path/to/access.logs.zip!
Wildcard: a-root-access.log

File/Directory: zip:file://c:/path/to/access.logs.zip!/subdirectory1
Wildcard: subdirectory-access.log.*

File/Directory: zip:file://c:/path/to/access.logs.zip!/subdirectory2
Wildcard: subdirectory-access.log.*

Note: For some reason, the .+ doesn't work in the subdirectories; they still show the directory entries.

Mention the major features of Pentaho?

  • Direct Analytics on MongoDB: authorizes business analysts and IT to access, analyze, and visualize MongoDB data.
  • Data Science Pack: Pentaho's Data Science Pack operationalizes analytical modeling and machine learning while allowing data scientists and developers to offload the labor of data preparation to Pentaho Data Integration.
  • Full YARN Support for Hadoop: Pentaho's YARN integration enables organizations to exploit the full computing power of Hadoop while leveraging existing skill sets and technology investments.

Which platforms benefit from the Pentaho BI Project?

 Java developers, who use project components to rapidly assemble custom BI solutions.

ISVs, who can improve the value and capability of their solutions by embedding BI functionality.

End users, who can quickly deploy packaged BI solutions that are comparable or superior to traditional commercial offerings at a dramatically lower cost.

Explain MDX?

 Multidimensional Expressions (MDX) is a query language for OLAP databases, much like SQL is a query language for relational databases. It is also a calculation language, with syntax similar to spreadsheet formulas.

Explain the Pentaho Report Designer (PRD)?

 PRD is a graphical report-editing tool for creating simple and advanced reports and exporting them to PDF, Excel, HTML, and CSV files. PRD includes a Java-based report engine offering data integration, portability, and scalability, so it can be embedded in Java web applications and in application servers such as the Pentaho server.

Explain in brief the concept of Pentaho Dashboard?

 Dashboards are collections of various information objects on a single page, including diagrams, tables, and textual information. The Pentaho AJAX API is used to extract BI information, while the Pentaho Solution Repository contains the content definitions. The steps involved in dashboard creation include:

  • Adding a dashboard to the solution
  • Defining dashboard content
  • Implementing filters
  • Editing dashboards

Define three major types of Data Integration Jobs?

  • Transformation Jobs: used for preparing data; used only when the data does not change until the transformation job has finished.
  • Provisioning Jobs: used for transmitting/transferring large volumes of data; used only when no data change is allowed except by the job transformation, and when there are large provisioning requirements.
  • Hybrid Jobs: execute both transformation and provisioning jobs, with no limitation on data changes; data can be updated regardless of success/failure. The transforming and provisioning requirements are not large in this case.

What are snapshots?

Snapshots are read-only copies of a master table located on a remote node which can be periodically refreshed to reflect changes made to the master table.

 

So, this brings us to the end of the Pentaho interview questions blog. This Tecklearn 'Top Pentaho Interview Questions and Answers' post covers commonly asked questions if you are looking for a job in Pentaho or the Business Intelligence domain. If you wish to learn Pentaho and build a career in the Business Intelligence domain, then check out our interactive Pentaho BI Training, which comes with 24*7 support to guide you throughout your learning period.

https://www.tecklearn.com/course/pentaho-bi-certification-training/

Pentaho BI Training

About the Course

Pentaho BI Training from Tecklearn teaches you how to develop Business Intelligence (BI) dashboard using Pentaho BI tool from scratch. Pentaho is an open-source comprehensive BI suite and provides integration with Hadoop distribution for handling large dataset and doing reporting on top of it. This course explores the fundamentals of Pentaho Data integration, creating an OLAP Cube, integrating Pentaho BI suite with Hadoop, and much more through the best practices. Our Online Pentaho Training Course also provides real-time projects to enhance your skills and successfully clear the Pentaho Business Analytics Certification exam.

Why Should you take Pentaho BI Training?

  • The average annual pay for a Pentaho Developer is $124,828 a year (ZipRecruiter.com).
  • Around 2,500 websites globally use Pentaho BI, and it has a global market share of around 3%.
  • Pentaho is a suite of Business Intelligence products providing data integration, OLAP services, reporting, dashboarding, data mining, and ETL capabilities, making it a one-stop solution for all business analytics needs.

What you will Learn in this Course?

Data Modelling

  • Why need Data Modelling
  • Data Modelling Scope and Benefits
  • Data Model Analogy
  • Case Study

Introduction to Pentaho BI Suite

  • Overview of Pentaho Business Intelligence and Analytics tools
  • Pentaho Data Integration (PDI)
  • Pentaho Report Designer (PRD)
  • Pentaho Metadata Editor (PME)
  • Pentaho Schema Workbench (PSW)
  • Dashboard Capabilities

Installation

  • Installation of Java
  • Installation steps for Pentaho ETL Tool
  • Spoon Installation
  • Spoon Overview
  • Connection to Database

Retrieving Data from Flat or Raw Files using Pentaho

  • Working with Flat Files or Delimited Files
  • Different Use Cases
  • Read Data from different Delimited Files using Pentaho

Clustering in Pentaho

  • Basics of clustering in Pentaho Data Integration
  • Creating a database connection
  • Working with CSV Files

Pentaho Report Designer

  • Designing Basic Report containing Graphical Chart
  • Conditional Formatting and Studying the PRPT File Format
  • Building a Basic Report in PDF Report
  • Data Source Connection and Query Designer
  • Working with Group (Group Header, Group Footer)
  • API Based Reporting

Pentaho Data Integration – Transformation

  • What is Data Transformation
  • Step, Hop, Variable
  • Various Input and Output Steps
  • Transformation Steps, Big Data Steps and Scripting

Different Types of Transformation

  • Transformation Steps in Detail
  • Add sequence and use calculator
  • Generating Output
  • Data Validation

Slowly Changing Dimensions (SCD)

  • Slowly Changing Dimensions
  • SCD Type I
  • SCD Type II
  • Deploying SCD

Pentaho Dashboard

  • Pentaho Dashboard
  • Passing parameters in Report and Dashboard
  • Drill-down of Report
  • Deploying Cubes for report creation
  • Working with Excel sheets
  • Pentaho Data integration for report creation

Understanding Cube

  • What is a Cube
  • Report and Dashboard creation with Cube
  • Creation and benefits of Cube

Pentaho Analyzer

  • Pentaho analytics for discovering
  • Blending various data types and sizes
  • Advanced analytics for visualizing data across multiple dimensions

Pentaho Data Integration (PDI) Development

  • PDI steps used to create an ETL job
  • PDI / Kettle steps to create an ETL transformation

Pentaho Administration

  • Creating and Managing Users and Roles
  • Security
  • Performance Tuning
  • Dashboard Creation with Advance Features

Got a question for us? Please mention it in the comments section and we will get back to you.
