Pentaho Data Integration is a powerful platform that allows developers to create fast, scalable, multi-platform ETL processes. It delivers an easy-to-use graphical interface for data integration and supports most popular databases.

Today most commercial applications are developed as web applications, so it is important to be able to integrate ETL processing with Java code.

Pentaho Data Integration can be used to:

  • periodically import files from the file system into a database, with data format conversion,
  • migrate data from one database system to another (e.g. Apache Derby to Oracle DB),
  • parse Excel/XML files and upload the data into a database,
  • export reports from a database to remote locations (via SFTP/SCP).

How to work with Pentaho Data Integration?

There are three ways to start Pentaho jobs and transformations:

  • Spoon – a desktop application with a graphical user interface,
  • Kitchen – a console application,
  • Carte – a web server.

When developing transformations and jobs with Pentaho you can use the standalone graphical application called Spoon. It provides an easy-to-use graphical interface for development and testing. Unfortunately, being a GUI tool, it cannot be executed from a web application.

Sample transformation in Spoon

Pentaho also ships with the Kitchen console application, which can be executed from the command line.
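
A typical invocation (the job path below is a placeholder for your own) looks like this:

    sh kitchen.sh -file=/opt/etl/jobs/sample_job.kjb -level=Basic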

This way jobs can be run by cron or executed periodically from an application. The main disadvantage is that you have to monitor and parse the console output stream if you need the execution status or a stack trace.

The easiest way to run jobs from a web application is to use the Carte server. It communicates via GET requests and XML responses.

Carte jobs overview and Carte server detail pages

Communication with Carte

A new job can be started by sending a GET request to the following URL:
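
For example, with Carte listening on localhost:8080 (host, port and job path are placeholders):

    http://localhost:8080/kettle/runJob/?job=/opt/etl/jobs/sample_job.kjb&level=Info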

As a result, the Carte server responds with a status and a unique job id:
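
The response is a small webresult document along these lines (the id below is made up for illustration):

    <webresult>
      <result>OK</result>
      <message>Job started</message>
      <id>abc123f5-1b2a-4c3d-9e8f-0123456789ab</id>
    </webresult>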

The unique job id must be stored in order to fetch the details of the running job later.

Job details can also be downloaded as an XML document. The Carte server must be queried with the job name and id:
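
The status endpoint follows the same pattern (host and port are again placeholders):

    http://localhost:8080/kettle/jobStatus/?name=JOB_NAME&id=JOB_ID&xml=Y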

where JOB_NAME is the job file name in the repository (without the .kjb extension) and JOB_ID is the unique job id returned by the previous runJob command.

Example XML response:
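
An abbreviated jobstatus document (values shortened and made up for illustration) looks roughly like this:

    <jobstatus>
      <jobname>sample_job</jobname>
      <id>abc123f5-1b2a-4c3d-9e8f-0123456789ab</id>
      <status_desc>Finished</status_desc>
      <error_desc/>
      <log_text>...full job log...</log_text>
      <result>
        <lines_read>1000</lines_read>
        <lines_written>1000</lines_written>
        ...
      </result>
    </jobstatus>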

The result contains the status (status_desc), the full log of the transformation, including any stack trace (log_text), and statistics such as the number of lines read from a file or saved to the database, which are useful for debugging.

Java examples

To integrate the Carte server with a standalone or web Java application we can use Apache HttpClient together with an XML parser.

A sample method to retrieve an XML document from a URL could look like this:
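
A minimal sketch using Apache HttpClient 4.x and the standard DOM parser; the class and method names are our own, and the credentials shown are Carte's defaults (cluster/cluster), which you should change to match your configuration:

    import java.io.InputStream;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.apache.http.auth.AuthScope;
    import org.apache.http.auth.UsernamePasswordCredentials;
    import org.apache.http.client.CredentialsProvider;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.BasicCredentialsProvider;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.w3c.dom.Document;

    public class CarteClient {

        // Sends an HTTP GET request (with Basic auth, as Carte requires)
        // and parses the response body into a DOM document.
        public static Document getXml(String url) throws Exception {
            CredentialsProvider credentials = new BasicCredentialsProvider();
            credentials.setCredentials(AuthScope.ANY,
                    new UsernamePasswordCredentials("cluster", "cluster"));
            try (CloseableHttpClient client = HttpClients.custom()
                         .setDefaultCredentialsProvider(credentials).build();
                 CloseableHttpResponse response = client.execute(new HttpGet(url));
                 InputStream body = response.getEntity().getContent()) {
                return DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder()
                        .parse(body);
            }
        }
    }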

Starting a job is achieved by sending a GET request with the job file name and parsing the output, looking for the unique id of the started job:
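
A sketch building on the getXml() helper above (the class name and the Info log level are our own choices):

    import java.net.URLEncoder;
    import org.w3c.dom.Document;

    public class CarteJobRunner {

        // Starts a job via Carte's runJob servlet and returns the unique id
        // carried in the <id> element of the <webresult> response.
        public static String runJob(String carteUrl, String jobPath) throws Exception {
            String url = carteUrl + "/kettle/runJob/?level=Info&job="
                    + URLEncoder.encode(jobPath, "UTF-8");
            Document doc = CarteClient.getXml(url);
            return doc.getElementsByTagName("id").item(0).getTextContent();
        }
    }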

Checking the status is more complex because of the asynchronous nature of the XML communication. The application must poll the job status at intervals to find out whether the job has finished. In a web application this should be executed by a cron job or scheduler.

The response contains the status description as natural-language text, so we need to determine whether the status is final (success or error) or the job is still running. Carte returns the following statuses:

Value in status_desc      Description
OK                        Job started successfully
Finished                  Job finished successfully
Finished (with errors)    Job finished with execution errors
Stopped                   Job stopped by the user
Stopped (with errors)     Job stopped by the user, with execution errors

Example Java code:
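
A minimal polling sketch built on the pieces above; the helper names and the five-second interval are our own choices, and OK is excluded from the final statuses because it only acknowledges a successful start:

    import java.net.URLEncoder;
    import java.util.Arrays;
    import java.util.List;
    import org.w3c.dom.Document;

    public class CarteStatusChecker {

        // Statuses after which the job will not change state any more.
        private static final List<String> FINAL_STATUSES = Arrays.asList(
                "Finished", "Finished (with errors)",
                "Stopped", "Stopped (with errors)");

        // Reads status_desc from Carte's jobStatus servlet.
        public static String getStatus(String carteUrl, String jobName, String jobId)
                throws Exception {
            String url = carteUrl + "/kettle/jobStatus/?xml=Y&id=" + jobId
                    + "&name=" + URLEncoder.encode(jobName, "UTF-8");
            Document doc = CarteClient.getXml(url);
            return doc.getElementsByTagName("status_desc").item(0).getTextContent();
        }

        // Polls every five seconds until the job reaches a final status.
        public static String waitForJob(String carteUrl, String jobName, String jobId)
                throws Exception {
            String status = getStatus(carteUrl, jobName, jobId);
            while (!FINAL_STATUSES.contains(status)) {
                Thread.sleep(5000);
                status = getStatus(carteUrl, jobName, jobId);
            }
            return status;
        }
    }

In a web application the blocking waitForJob() loop would typically be replaced by a scheduler that checks each stored job id periodically, as described above.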

Summary

Pentaho Data Integration is a powerful platform that can easily be combined with Java applications.
It allows developers to create fast, multi-threaded ETL processes for business data processing.

An example project (with the sources used in this article) can be downloaded from: Pentaho example project
