ETL processes are key parts of Business Intelligence systems. Their creation involves many challenges that organizations must face if they want to base decision-making processes on real-time data analysis. And here, tools such as Pentaho Data Integration can help. It is on this idea that we will base another set of articles from the “Let’s integrate” series. The main topic? The integration of Oracle APEX with the previously mentioned Pentaho Data Integration (PDI).
In applications that we create for our clients, we often face challenges related to the design and implementation of ETL processes. For example, for systems whose main purpose is to prepare and send Standard Audit File for Tax files (SAF-T), the challenge will be to download sales data from files in a format such as xlsx, csv, txt and even MT940 that contain information about bank transfers. For example, ODS systems will be focused on integrating data from many sources where, in addition to files, source data is stored in databases such as PostgreSQL and MS SQL. Among these many use cases, one requirement is common to all: the solution should be ready for quick and easy addition of new and, most importantly, different data sources that can be transformed and loaded into previously created systems. This is exactly what Pentaho DI gives us, which, thanks to its advanced graphical interface, allows you to create ETL processes with no knowledge of programming languages. The implementation of a simple ETL process will be the main topic of this article, which I hope will inform the reader of the possibilities of this tool.
Pentaho, a few words of introduction
Pentaho is a Business Intelligence tool that provides the possibility of creating ETL processes, data integration, creating professional dashboards and reporting modules. In 2017, Pentaho became part of Hitachi Vantara, a new company that unites the activities of Pentaho, Hitachi Data Systems and the Hitachi Insight Group. The tool uses the Apache license version 2.0. In this article, we will focus on Pentaho Data Integration (one of the Pentaho modules). PDI, also known as Kettle, consists of a data integration engine and GUI application that allows the user to define ETL processes using jobs and transforms.
Pentaho DI consists of four modules:
- Spoon – a graphic tool that allows you to easily create ETL transformations. This tool provides typical data flow modeling functions, such as reading, validation, tuning, transformation, writing, and supports communication with a wide range of databases. Transformations created in the Spoon tool can be started directly from the user interface or using the Pan application.
- Pan – a tool used to run data transformation designed in Spoon and saved in the repository or in the XML file.
- Kitchen – an application that allows you to run jobs created in Spoon in batch mode.
- Carte – Carte Server is a simple web server that lets you run, monitor and stop transforms and jobs remotely.
In this series of articles, we will use Pentaho Data Integration 8.0, Oracle 12c database and Oracle APEX 18.1.
First transform = first ETL process
In this chapter, we will use the specified business case. Suppose we want to calculate the total salary of employees including bonuses. The input data is in a CSV file. The employee’s total salary should be loaded into the EMPLOYEES table.
From the main menu choose File-> New-> Transformation. Choose the following elements from the side menu:
- from the input section element CSV file input;
- from the Transform element Calculator section;
- from the section Output element Table Output;
Start building the ETL process. Link the elements in the following order: CSV file input (Extract), Calculator (Transform), Table Output (Load);
Then double-click the CSV file input element. Choose the location of the source file and choose the appropriate delimiter.
In the Calculator, element creates a new field TOTAL_SALARY which is the sum of the field values SALARY and BONUS.
In Table output, define the connection to the target database, the schema and the name of the table.
Choose Run from the top menu. The transform is started. On the screen, we can see the logs of launching our transform.
Pentaho DI, thanks to its graphic interface, allows the quick and easy design of an ETL process. In this article, we have shown only some of the functionality of this tool. In the next article, we’ll focus on running transforms from the Carte server using a solution implemented in Oracle APEX.