Standardizing Values in the Data Warehouse - Pentaho Data Integration ETL Tutorial
Pentaho Data Integration ETL tool has many built in capabilities that assist the programmer in building business intelligent data for the organization.
What Will I Learn?
- You will learn Converting Values to Characters or Strings using ETL
- You will learn Converting Data to Integer
- You will learn Setting All Numeric Values in the Same Format
Requirements
Pentaho Data Integration ( ETL) - also called Kettle
Difficulty
- Intermediate
Tutorial Contents
- How to Convert Values to Characters or Strings
- Incoming Values as Dates
- The Value Converted to a Decimal Numeric
- Need a Value in Hex
- Converting to Integer
- All Numeric Values in the Same Format
Standardizing Values in the Data Warehouse - Pentaho Data Integration ETL Tutorial
There are multiple was to generate intelligence from data. A programmer can derive intelligence from an individual field, multiple fields, decision constructs, and subsets of field contents. Using the elements from the extract step the programmer can do many things utilizing function expressions.
How to Convert Values to Characters or Strings
ToChar returns the value of a date or a number as a character representation of the value. Using the ToChar on a date field requires the use of the format parameter. An example includes myage=46 would be converted as ToChar(myage) returning a value of '46'.
Converting dates is a bit trickier. Consider MyDate = July 13 1963, the example ToChar(MyDate, 'MM/DD/YY') would return '07/13/63' but ToChar(MyDate,'YYYYMMDD') would return 19630713 . It is important that the date formats used as a parameter are carefully considered to follow the accepted format of the data warehouse.
Additionally, a TRUE or FALSE boolean value can be returned as ToChar(1=2) would return FALSE.
Incoming Values as Dates
ToDate returns the value of a character, date, or integer value as a date representation. Again the parameter of the date format must be carefully considered. NewDate = '18/01/2018'. If the parameter is set incorrectly as in ToDate(NewDate,'DD/MM/YYYY') then the value returned would be January 18 2018 instead of December 01 2018.
If an integer value is used as the input value the date format must be Julian (the number of days since Dec.1, 4713BC). This function expression can also be used to apply date formatting to any value coming in as a date.
Suppose your invoicing system in Europe uses a different date format than your invoicing system in the United States. Using the ToDate function allows that all dates entering the warehouse can be easily standardized.
The Value Converted to a Decimal Numeric
ToDouble converts a value to a double precision, floating point number. When the incoming value is already a numeric value it simply returns the incoming value. When the value is a character the returned value represents the string from the left to the first character that is not a digit or a decimal point.
When the value is a date, ToDouble returns the Julian representation of the value, and for BOOLEAN values it returns an integer value of 1 or 0. Where Invalue = 123.45, ToDouble(Invalue) would return 123.45. In the case of Invalue = '123..45' then ToDouble(Invalue) would return 123.00 because when it found the second period in the value it would stop processing.
Utilizing ToDouble(1=2) would cause a FALSE to be generated and a value of 0 returned.
Need a Value in Hex
ToHex will convert any value incoming to the function into its hex string representation. While this is not very useful to a business user, if there tends to be data quality problems this function can be used to assist the programmer in finding and resolving the issue. ToHex('ABC') would return 0x41424300.
Converting to Integer
ToInteger will taking an incoming value and return a number as a type integer. For input values of type integer it returns the value unchanged. When the input value is a decimal or floating point number the returned value is the whole number with the decimal positions truncated.
If using this function on a date, the equivalent returned value is a Julian date. Finally, if the value being acted upon is a BOOLEAN type, the result will be a 1 or a 0. ToInteger(123.45) would return 123 and ToInteger(1=1) would return 1.
All Numeric Values in the Same Format
ToNumber is used to incorporate numeric formatting. The format of the function requires the programmer to predetermine the maximum number of digits the number can contain and the maximum number of digits to the right of the decimal position.
This can be useful when working with values that absolute precision is not necessary and causes overhead that is difficult to maintain. ToNumber(8.544444,15,2) would truncate the value to 8.54, losing the decimal precision.
While standardizing the values in the data warehouse can seem intimidating at first the tool has predefined functionality to allow the programmer to do bring the data into line regardless of the source system the data initiates from.
Posted on Utopian.io - Rewarding Open Source Contributors





Thank you for the contribution. It has been approved.
You can contact us on Discord.
[utopian-moderator]
Hey @yissakhar I am @utopian-io. I have just upvoted you!
Achievements
Suggestions
Get Noticed!
Community-Driven Witness!
I am the first and only Steem Community-Driven Witness. Participate on Discord. Lets GROW TOGETHER!
Up-vote this comment to grow my power and help Open Source contributions like this one. Want to chat? Join me on Discord https://discord.gg/Pc8HG9x