Guided Tutorial for Pentaho Data Integration using Oracle
In the data integration exercise, you will use the Pentaho Data Integration tool to transform two data sources and load data into an Oracle fact table. You will perform transformations to parse date strings, combine fields, and perform validation checks. Before starting this tutorial, you need to install necessary software, download data sources, and create tables used in the tutorial.
1. Tutorial Prerequisites
Before starting this tutorial, you should download and install the server and client for either Oracle or MySQL server. You can find details in Module 1 about Oracle installation. If you have access to a remote Oracle server (perhaps through your employer), you do not need to install the server software on your own machine.
The tutorial uses the Store Sales data warehouse as depicted in Figure 1. Sales is the fact entity type surrounded by 1-M relationships with dimension entity types, Item, Customer, Store, and TimeDim. The schema design has a snowflake for the 1-M relationship from Division to Store. In the table design, table names have been preceded with the prefix “SS” to avoid conflicts with other tables. Thus, the fact table is SSSales, not Sales as shown in the ERD of Figure 1.
The class website contains documents for Oracle and MySQL. You need to create and populate the tables using one of these documents. The Oracle document also contains a statement to create a sequence object for the SSSales table.
2. Creating your First Transformation
1. After starting Pentaho Data Integration, you will see the opening window (Figure 2) and the Spoon window (Figure 3).
2. Click (New) in the upper left corner of the Spoon window.
Figure 3: Spoon Opening Window
3. Load the first data source from Excel
Step 3 – Save the transformation following File → Save. You will see the empty transformation window in the Spoon (Figure 5).
Figure 6: New Microsoft Excel Input Node
Select and drag a Microsoft Excel Input step into the canvas on the right.
In the tab named Files, click the button “Browse…” and locate the Excel file that you downloaded from the class website. Then, Click “Add” to add the file to the selected files area.
In the tab named Sheets, click the button “Get sheetname(s)…”. There will appear an Enter List (Figure 8) to choose sheets. Select Sheet 1, press “>” to move it into the right area. Click OK.
Figure 9: Fields Window for Microsoft Excel Input Property Editing
Click OK at the bottom of the window. The input icon will change to the SSExcel icon displayed in Figure 10.
Create a “hop” between the SSExcelSource (Excel file input) step and the Filter Rows step. Hops are used to describe the flow of data in your transformation. To create the hop, click the SSExcel Source (Excel file input) step, then press the <SHIFT> key down and draw a line to the Filter Rows step (Figure 11).
Double-click the Filter Rows step. The Filter Rows edit properties dialog box appears (Figure 13).
Figure 14: Condition Fields Selection Window
Click on the expression and add constraints for the next column similarly to what you did for “SalesUnits”
Click on UP. This will allow you to see both conditions joint by AND
Figure 16: Filter Conditions Window
Save your transformation.
Figure 17: Filter Results Selection List
Double-click the Sort Rows step to open its edit properties dialog box (Figure 18). Click “Get Fields” to obtain the fields. Delete other fields except the Day, Month and Year fields. Then click Ok.
4. Using a Database Connection to Lookup Columns from Oracles tables
Connections that are available for use with a transformation or job are listed under Database Connection node in the explorer View in Spoon.
There are several ways to define a new database connection:
Step 1 – Access the SSTimeDim table from Oracle database.
Under the Design tab, expand the contents of the Input node.
Figure 19: Property Edit Window of Table Input Node
Click “New…” next to the connection field. You must create a connection to the database. The Database connection dialog box appears.
Connection for Oracle Database Virtual Box Appliance. This connection requires that you have PDI installed on the Oracle Virtual Box, not Windows. The predefined connection in SQL Developer uses the privileged account, SYSTEM. You can use ORCL or CDB1 as the service name in the connection string.
Connection Name: Oracle12cDB
Port Number:
User Name: SYSTEM *** or other user name that you created. ***
Connection Type: Oracle
Host Name:
Password: *** use the administrative password that you gave during installation ***
Access: Native (JDBC)
Database Name:
(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=PDBORCL)))
Click “Test” to test the connection. Then success test result is shown by Figure 21.
Add another sort rows component Sort rows 2, and a hop connecting the SSTimeDim step. In the field specification (Figure 23), delete other fields except TIMEDAY, TIMEMOHTH, TIMEYEAR fields.
Figure 24: Two Sort Rows Nodes Connected to Merge Join Node
Double-click the Merge Join step to specify its properties (Figure 25). Set First step as Sort rows, Second step as Sort rows 2, and Join Type as INNER. Click both of the “Get key fields” at left and right to get the possible fields to join. In the left table, delete other fields except Day, Month and Year fields. In the right table, delete other fields except TIMEDAY, TIMEMONTH, and TIMEYEAR fields. Then click OK.
Similar to getting data from the SSTimeDim table in the previous section, inner joining these tables requires Table Input components. First, we set the connection and query properties for the SSItem table. Note that these tables should exist in your Oracle schema before these steps.
Drag and drop the Table Input 2 into the design pane.
The global view of all nodes and connections after Step 2 is shown by Figure 26.
For the SSStore step, connect the StoreID (from Excel file) and STOREID (from Database) fields.
The global view of all nodes and connections after Step 3 is shown by Figure 27.
Click and drag an Add sequence step into your transformation; create a hop between the Merge Join 4 and Add Sequence steps (Figure 28).
Double click on the newly created component to open its Basic Settings pane.
Figure 29: Property Edit Window of Add sequence node
5. Insert data into the SSSales table
Under the Design tab, expand the contents of the Output node.
Double click the Insert/Update component, to specify its properties (Figure 31). Set the step name as SSSales. Select the connection as Oracle12cDB. Type in the Target table as SSSales. DON’T click the button “Get fields”. Instead, select the names from the two table fields and set the comparator between them to “=”. The final window should look like Figure 31.
Select the SSSales step and run a preview by clicking on . In the transformation debug dialog click on Quick Launch (Figure 33).
Connect to your Oracle account (on your PC or remote server) so you can verify the number of rows in the SSSales table. You should see 104 rows with 8 new rows added to the 96 rows in the sample data (Figure 35).
6. Load second data source from Access
Under the Design tab, expand the Input node. Figure 36 shows the Design table and input node.
Figure 37: Property Edit Window of Microsoft Access Input Node
Set name for the Access Input as Sales and specify the Excel data source path in the Files tab.
Figure 39: Fields Window for Microsoft Access Input Property Editing
Figure 41: Sales Node Icon
Create a hop between the Sales (Access file input) step and the Filter Rows step. Hops are used to describe the flow of data in your transformation. To create the hop, click the Sales (Access file input) step, then press the <SHIFT> key down and draw a line to the Filter Rows step.
Alternatively, you can draw hops by hovering over a step until the hover menu appears. Drag the hop painter icon from the source step to your target step.
Click on the comparison operator (set to = by default) and select the IS NOT NULL function and click OK.
Click the button , add constraints for other columns (Figure 43).
7. Separate SalesDay fields into Day, Month, Year fields
In this part of the tutorial, you will use the Select Values step to change the format of the myDate field and the Split Fields step to parse the field into date components.
Under the Design tab, expand the contents of the Transform node.
Double-click the Select values step to open its edit properties dialog box.
In the tab named Metadata, click the button “Get fields to change”, to get the fields to change, which is shown by Figure 45. Change the Type of field myDate as String, change its Format as dd-MM-yyyy. Click OK.
Figure 46: Create Split Fields in Spoon
Figure 47: Property Edit Window of Field Splitter Node
Click OK.
8. Lookup Columns from the Oracle tables
Step 1 – Access the SSTimeDim table from Oracle database.
Under the Design tab, expand the contents of the Input node.
Provide the settings for connecting to the database as shown in the Figure 20.
Connection Name: Oracle12cDB
Access: Native (JDBC)
You should replace the IP address, port number and the service name. Also, you need to use your assigned user name and password. Do not use ISMG6480ClassStudent as the user name.
Double-click the Sort Rows step to open its edit properties dialog box. Click “Get fields” to obtain the fields. Delete other fields except the Day, Month and Year fields. Then click Ok.
Add one more sort rows component Sort rows 2, and a hop connecting the SSTimeDim step. In the field specification, delete other fields except TIMEDAY, TIMEMOHTH, TIMEYEAR fields.
Figure 49 shows the global view of all nodes and connections after Step 1.
For SSCustomer step, connect CustID (from Excel file) and CUSTID (from Database) fields.
For SSStore step, connect StoreID (from Excel file) and STOREID (from Database) fields.
Under the Design tab, expand the contents of the Transform node.
Click and drag Add sequence step into your transformation; create a hop between the Merge Join 4 and Add Sequence steps (Figure 51).
Figure 52: Property Edit Window of Add sequence node
9. Insert data into the SSSales table
Figure 53: Connect Insert/Update Node to Last Merge Join Node
Figure 54: Inserted Data in Oracle Database