Components and Modules
Overview
There are 95 components available for data cleansing, validation, and standardization, along with a number of technical and auxiliary components:
- Generic (32)
- Bulgaria (2)
- Canada (14)
- Czech Republic (6)
- France (4)
- Germany (3)
- Russia (1)
- Slovakia (2)
- Slovenia (2)
- Technical (9)
- United Kingdom (7)
- United States of America (13)
All PUBLIC components are ready to use in a standard engine package for free.
Component description
Template projects
Package name | Description | Availability |
---|---|---|
Data Cleansing Package (CA) |
Data Cleansing Package (Canadian knowledge base) is a tutorial which shows how party attributes, such as a person’s name, date of birth, social insurance number, company name, address, email, can be cleansed using premade Canadian or generic cleansing and validation components. The tutorial plan expects three input entities: In addition to three cleansed output entities, a denormalized entity is created - addresses and documents belonging to the same party are grouped, concatenated, and joined to the corresponding party. This entity can be further used for unification purposes. The cleansing functionality is mainly intended for batch processing but it can be enabled via web services as well to allow real-time data cleansing. |
|
Data Masking Module (CA) |
Data Masking Module (Canadian knowledge base) consists of a set of prebuilt components for common entities such as personal identification, address, and so on, which can provide complex data masking services. The masking functionality is primarily intended for batch processing but can also be enabled via web services, allowing real-time masking of data coming into a system, such as a transactional system. Inherent data quality issues found in the source data are preserved during the masking process. When necessary, the data masking module populates the seed table with the results of the masking services so that a repository of the masks is kept in a centrally accessible location. Dynamic seed tables are technical database tables used to replace source strings, dates, and integers with plausible alternatives. The content of such a table is filled dynamically from source data and can be kept in a separate schema (with limited access rights to ensure the masking algorithms are not discovered). This serves several purposes:
|
|
Data Masking Module (US) |
Data Masking Module (US knowledge base) - See the description of Data Masking Module (CA). |
Generic
Component name | Description | Availability | ||||
---|---|---|---|---|---|---|
Cleanse and Convert Weight Unit and Value |
Used to parse values and units of weight and then convert the weight to the unit set as a parameter.
The component removes the duplicates from the input (if present) and then joins the input fields together. As a result, the component can be used even when all the information (both numerical value and unit) is entered into one field.
While there are no constraints on the |
Free |
||||
Cleanse Credit Card Number |
Used to validate a credit card number and determine the card details.
|
Free |
||||
Cleanse Date |
Used to parse dates or datetimes, validate them, apply business rules, and set them to the standard date and datetime formats.
Those are 'yyyy-MM-dd' for date and 'yyyy-MM-dd HH:mm:ss.SSS' for datetime, where The component has two optional parameters:
Where
Where When there is a datetime on input, the date and time parts have to be separated either by a space or a character When the source value contains the time zone part, the component adjusts the output value according to the local time zone (including daylight saving time adjustments).
This means the In addition to numeric dates, the component is also able to validate dates with a month name or its abbreviation (first three letters).
In this case, the date format contains By default, month names in 10 languages are supported: You can add other languages to the lookup.
See Java Examples of valid dates include: "January 1st 99", "January 1 1999". Example of invalid date: "January 1 99". In the latter case, we are able to determine that the number 99 can only mean the year, but for a different date in the same format, such as "January 12 13", this would no longer be possible. |
Free |
||||
Cleanse EAN Code (complex) |
Used to validate EAN codes. It includes:
Input: EAN code. Output: Standardized value of the EAN code, best existing (standardized or cleansed) value of the EAN code, data quality score of the EAN code, data quality imperfection explanations.
|
Free |
||||
Cleanse EAN Code (simple) |
Used to validate EAN codes. It includes only EAN-13. Input: EAN code. Output: Standardized value of the EAN code, data quality score of the EAN code, data quality imperfection explanations, best existing value of the EAN code (standardized, cleansed, or input value).
|
Free |
||||
Cleanse Email |
Used to validate and cleanse email addresses in the best possible manner. Although this component is powerful when it comes to transforming invalid email addresses into valid ones, keep in mind that a valid email address is not the same as an existing email address; a valid address only conforms to a standard. Therefore, any change in the source value can result in a non-existent output. In addition, not all email providers follow the RFC rules for email address format, and in that case, an existing email address is not the same as a valid email address (that is, the message is delivered even though the email address does not conform to the standard). Therefore, consider carefully what the component is used for.
|
Free |
||||
Cleanse International Bank Account Number |
Used to verify IBAN (International Bank Account Number).
|
Free |
||||
Cleanse Number |
Used to verify numbers (input as STRING). The output is provided in the following three formats: INTEGER, LONG, and FLOAT.
|
Free |
||||
Cleanse SWIFT code |
Used to verify SWIFT (Society for Worldwide Interbank Financial Telecommunication) codes. A SWIFT code, also known as the Bank Identifier Code (BIC), is a unique identification code for a particular bank.
The standardized value of SWIFT is the value which is valid according to the component behavior. The best existing value of SWIFT:
|
Free |
||||
Cleanse Vehicle Identification Number |
Used to validate VIN (Vehicle Identification Number).
|
Free |
||||
Country Code Identifier |
Used to extract a country code from an address.
The component performance depends on the quality of input data. For best results, we recommend using precleansed data.
|
Free |
||||
Derive Age from Birthdate |
Used to derive age from the date of birth. The current or custom date can be used as a baseline.
|
Free |
||||
Derive Name of a Day from Date |
Used to derive the name of a day from a date.
|
Free |
||||
Generic String to Boolean Conversion |
Used to convert certain strings with Boolean meaning (0/1, true/false) to Boolean type.
|
Free |
||||
Guess Contact Type |
Used to guess the contact type of the input value. It can identify possible emails, webpages, and phone numbers. Use this component in case you mostly have single contact values on input. If the component identifies only one single contact, you can use specific cleansing or validating components accordingly. This component is approximately twice as fast as Parse Contact Type component, which can parse the first occurrence of every (supported) contact type.
|
Free |
||||
Hierarchical Union Match of Person and Company |
Uses generic matching rules to identify and group the same records together depending on the party type (person or company). Hierarchical union strategy is used to minimize the number of false positive results. One matching key is chosen as the strongest one (personal identifier), and records with different keys are never grouped together. The output of the matching process strongly depends on the content and quality of input data. We recommend using preconfigured cleansing components to achieve the best matching results.
Mapping of input attributes depends on your matching requirements. Invalid mapping or invalid values might cause matching errors:
All input, changed, and discarded (for example, records with a timestamp that is older than the repository) records are returned from the matching repository. |
Free |
||||
Length Standardization |
Used to validate and standardize submitted units of length and their values. The component can convert length to the unit set as an optional parameter. Supported units: The component removes the duplicates from the input (if present) and then joins the input fields together. As a result, the component can be used even when all the information (both numerical value and unit) is entered into one field. Additional text is marked as comment and is also allowed. While there are no constraints on the
Rules for output: If a value and unit of length both can be parsed and the unit is verified, then the input is valid and can be processed without any transformation (example: input "15 m" converted into meters). Otherwise, the logic applied depends on the input and is described by the following explanation codes:
|
Free |
||||
Loqate Address Validate |
Uses the Loqate Server Engine to verify the input address string. The component supports the following service types:
Does not support CASS and AMAS verification.
Output of Verify (V) or Verify & Geocode service (V+G) includes major output attributes as defined in Loqate: Field Descriptions. The following output attributes are available only if Geocoding parameter is enabled:
|
Free |
||||
Loqate Cloud Address Web Lookup |
Uses the Loqate Cloud Service to verify or search input address string. The component supports the following service types:
The component does not support the following service types:
|
Free |
||||
Mask Credit Card Number (complex) |
Used to mask credit card numbers.
|
|||||
Mask Credit Card Number (simple) |
Used to anonymize credit card numbers.
|
|||||
Mask Date |
Used to mask arbitrary dates.
|
|||||
Mask Email |
Used to mask emails.
|
|||||
Mask English Word |
Used to mask words (English words are preferred).
|
|||||
Mask Number |
Used to mask numbers or IDs.
|
|||||
Parse Contact Type |
Used to identify and parse first occurrences of contact types from the input string. It can recognize emails, webpages, and phone numbers. Use this component when you expect multiple emails, phone numbers and/or web pages in one input column. The component tries to parse the first email, web page, and phone number into separate columns. The rest stays in its own output column. This component is approximately two times slower than the Guess Contact Type component.
|
Free |
||||
Standardize Currency |
Used to:
The component does several transformations and standardizations:
The rules for the output depend on the input type:
|
Free |
||||
Standardize Phone Number Format |
Used to format the worldwide variety of phone numbers into the international format
|
Free |
||||
Standardize String Length |
Used to standardize a string to a string of the specified length.
|
Free |
||||
Validate Country Code |
Used to check consistency between the country name (in various languages) and the country code (any of the three types of country codes: alpha-2, alpha-3 and numeric-3). If they match or only one of them is populated and valid, it returns the country name in the preferred language (English by default) and all three types of country codes.
|
Free |
||||
Validate Email |
Used to validate email addresses.
|
Free |
||||
Validate International Dialing Code |
Used to validate international dialing codes (IDCs).
|
Free |
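The Cleanse EAN Code (simple) entry above covers EAN-13 only. The catalog does not describe the component's internal rules, so the following is a minimal sketch of the publicly documented EAN-13 check digit calculation, not the component's actual implementation. The function names `ean13_check_digit` and `is_valid_ean13` are hypothetical.

```python
def ean13_check_digit(first12: str) -> int:
    """Return the check digit for the first 12 digits of an EAN-13 code."""
    if len(first12) != 12 or not (first12.isascii() and first12.isdigit()):
        raise ValueError("expected exactly 12 digits")
    # Weights alternate 1, 3, 1, 3, ... starting from the leftmost digit.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_ean13(code: str) -> bool:
    """Check length, character set, and check digit of an EAN-13 code."""
    # Assumption: spaces are the only separators we tolerate in this sketch.
    compact = code.replace(" ", "")
    if len(compact) != 13 or not (compact.isascii() and compact.isdigit()):
        return False
    return ean13_check_digit(compact[:12]) == int(compact[12])


if __name__ == "__main__":
    print(is_valid_ean13("4006381333931"))
    print(is_valid_ean13("4006381333932"))
```

A check like this only confirms that a code is well formed; it says nothing about whether the code is assigned to a real product.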
Bulgaria
Component name | Description | Availability |
---|---|---|
Cleanse Person Name BG |
Used to verify a person’s name, determine that person’s gender, split the full name into separate columns, and identify the Latin and Cyrillic analogs of the name.
|
Free |
Transliterate Cyrillic |
The component is used to transform Bulgarian Cyrillic input into Latin output.
|
Free |
Canada
Component name | Description | Availability | ||
---|---|---|---|---|
Address Identifier CA (step) |
Used for cleansing, standardization, and enrichment of CA addresses.
|
|||
Cleanse Business Number CA |
Used to validate Canadian Business Number. The business number (BN) is a common client identifier for businesses to simplify their dealings with federal, provincial, and municipal governments. It is based on the idea of one business, one number. Each business requires one BN for its legal entity.
Additional text before and after the identified business number is moved to the special output column.
The identified business number is standardized to the following format: |
Free |
||
Cleanse Company Name CA (simple) |
Used to cleanse Canadian company names (CN) and derive their legal form.
|
Free |
||
Cleanse Person Name CA |
Used to verify a person’s name, determine gender, and split name into separate columns.
|
Free |
||
Cleanse Phone Number CA (complex) |
Used to cleanse Canadian phone numbers. The component can also cleanse and validate foreign numbers (only at the level of the international dialing code, that is, the IDC). The component validates standard Canadian numbers; however, it does not include non-regional and special numbers such as toll-free, emergency, or non-standard contact numbers. In comparison to the simple version, this component can identify foreign numbers, fictive numbers, comments, extensions, and intervals in a number. The format of a valid output number can be modified.
|
Free |
||
Cleanse Phone Number CA (simple) |
Used to cleanse Canadian phone numbers.
|
Free |
||
Cleanse Social Insurance Number CA |
Used to validate Canadian Social Insurance Number (SIN).
The standardized value of SIN is the value which is valid according to the component behavior (a value consisting of nine digits only). The best existing value of SIN is as follows:
|
Free |
||
Generate Dummy Party and Contact Data |
Used to randomly generate a specified number of records from party domain using Canadian context (names, conventions, etc.). Use this component if you wish to quickly create a set of fake but realistically looking data to test your solution with specific data volumes.
|
Free |
||
Mask Address CA |
Used to mask Canadian addresses in a smart way so that some validity and cross-validity is preserved.
Municipality, province, and postal code are expected to be in the correct input column, upper case letters, and valid. In case of masking arbitrary invalid data, the input format is not specified. However, correct mapping of components to input columns is required in all cases. |
|||
Mask Company Name CA |
Used to mask Canadian company names.
|
|||
Mask Person Name CA |
Used to mask a person’s name.
|
|||
Mask Phone Number CA |
Used to mask Canadian phone numbers.
Where
All digits are masked randomly and other characters are left as they are. |
|||
Mask Social Insurance Number CA |
Used to mask Canadian SIN.
|
|||
SERP SoA Report Builder CA |
Used to generate the Statement of Accuracy (SoA) for Canada Post Software Evaluation and Recognition Program (SERP). Canada Post uses a Software Evaluation and Recognition Program for Address Accuracy under which software developers can evaluate their address preparation software packages (a “Software Package” from here on) to determine if the Software Package meets Canada Post’s current criteria to qualify as “Recognized Software”. If a developer’s Software Package meets the criteria, Canada Post issues a “Notice of Recognition” declaring it to be a Recognized Software for the period specified in the notice (the “Recognition Period”). The Software Evaluation and Recognition Program (Address Accuracy) is a program to enable Canada Post Corporation’s (CPC) customers to benefit from incentive postage rates based on the accuracy of the data they use to address mail which is inducted to CPC.
|
Free |
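The Cleanse Social Insurance Number CA entry above states that a standardized SIN is a nine-digit value but does not list the validation rules. SINs are publicly documented to carry a Luhn check digit, so the sketch below assumes that check. It is an illustration only, not the component's implementation, and the function names are hypothetical.

```python
def luhn_is_valid(digits: str) -> bool:
    """Return True if a string of ASCII digits passes the Luhn check."""
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, ch in enumerate(reversed(digits)):
        value = int(ch)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9  # same as summing the two digits of the product
        total += value
    return total % 10 == 0


def is_valid_sin(value: str) -> bool:
    """Check that a SIN has nine digits and a valid Luhn check digit."""
    # Assumption: spaces and hyphens are the only separators removed here.
    compact = value.replace(" ", "").replace("-", "")
    if len(compact) != 9 or not (compact.isascii() and compact.isdigit()):
        return False
    return luhn_is_valid(compact)


if __name__ == "__main__":
    # 046 454 286 is a widely used fictitious test value.
    print(is_valid_sin("046 454 286"))
    print(is_valid_sin("046 454 287"))
```

The same `luhn_is_valid` helper applies to credit card numbers, which also use the Luhn check digit.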
Czech Republic
Component name | Description | Availability |
---|---|---|
Address Quick Search CZ |
Used to propose addresses based on the input string containing incomplete parts of address.
|
|
Address Identifier CZ |
Used to cleanse and parse input address data in order to assign an address code (which is known as address identification). The address code is a unique ID in the official Czech address registry (RÚIAN). The core logic of the component is delivered by a set of Address Identifier steps with a predefined set of parsing rules; however, the component also works with replacement dictionaries. In addition to standard values, it also provides the output address in envelope-ready printable form. The component also supports evidentiary numbers ("evidenční číslo" in Czech).
Input requirements: As a standard in the Czech Republic, the first address line usually contains the street or the city district or part followed by a number, the second line usually contains the city or the city district or part, and the third line usually contains the postal code. The address structure is complex and the component can handle multiple situations when some of the information is missing (when there is no street net, for example, or by an error in ETL processes). However, the input address lines must contain enough information to unambiguously find the address code in the register. From this point of view, the component is capable of handling the following scenarios:
|
Free |
Cleanse Company Name and Registration Number CZ |
Used to validate and standardize company registration number and company name.
The standardized registration number has the |
Free |
Cleanse Phone Number CZ |
Used to cleanse and validate Czech phone numbers. The component validates all Czech numbers, including non-regional and special numbers such as VOIP, shared-price, and audiotex (nine-digit numbers). It does not include emergency or non-standard contact numbers, for example, 112, 150, 155, 156, 158, 1188, 116111.
Example of standardized phone number: |
Free |
Cleanse VAT Number CZ |
Used to validate and standardize Czech VAT numbers. The component does not check whether the VAT number exists, only the checksum digits.
The standardized VAT number is in format |
Free |
Decline Name CZ |
Used to decline Czech names into 4th and 5th grammatical case (accusative and vocative).
|
Free |
France
Component name | Description | Availability |
---|---|---|
Cleanse Person Name FR |
Used to verify a French person’s name, determine gender, and split the name into separate columns.
|
Free |
Cleanse Phone Number FR |
Used to cleanse and validate French phone numbers. The component validates all French numbers, including non-regional and special numbers such as VOIP, shared-price, and audiotex (nine-digit numbers). It also accepts emergency and non-standard contact numbers, for example, 15, 17, 18, 112.
Example of standardized phone number: |
Free |
Cleanse Social Security Number FR |
Used to validate French Social Security Numbers (Numéro de sécurité sociale).
The standardized SSN would be evaluated from any input containing 13-15 characters, which fits the conditions explained in the section about the component behavior. |
Free |
Cleanse VAT Number FR |
Used to validate and standardize French VAT number. The component does not check whether the VAT number exists, only the checksum digits.
The standardized VAT number has the |
Free |
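The Cleanse Social Security Number FR entry above accepts inputs of 13 to 15 characters, which matches the public structure of the French number: 13 digits followed by an optional two-digit control key. The sketch below shows the publicly documented key calculation (97 minus the 13-digit number modulo 97). It is an assumption that the component applies this check; the function names are hypothetical, and Corsican department codes (2A, 2B) are deliberately not handled.

```python
def nir_key(first13: str) -> int:
    """Return the two-digit control key for a 13-digit French SSN body."""
    if len(first13) != 13 or not (first13.isascii() and first13.isdigit()):
        raise ValueError("expected exactly 13 digits")
    # The key is 97 minus the remainder, so it always falls in 1..97.
    return 97 - int(first13) % 97


def is_valid_nir(value: str) -> bool:
    """Validate the full 15-digit form (13-digit body plus 2-digit key)."""
    # Assumption: spaces are the only separators removed in this sketch.
    compact = value.replace(" ", "")
    if len(compact) != 15 or not (compact.isascii() and compact.isdigit()):
        return False
    return nir_key(compact[:13]) == int(compact[13:])


if __name__ == "__main__":
    print(nir_key("1841276451089"))
    print(is_valid_nir("1 84 12 76 451 089 46"))
```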
Germany
Component name | Description | Availability |
---|---|---|
Cleanse Company Name DE (simple) |
Used to cleanse German company names.
|
Free |
Cleanse Person Name DE |
Used to verify a person’s name, determine gender, and split name into separate columns.
|
Free |
Validate Phone Number DE |
Used to verify a phone number and split a comment into separate columns. The component can verify mobile phone numbers and geographic phone numbers.
The standardized value of the phone number is the value which is valid according to the component behavior. The best existing value of phone number:
If the phone number is with IDC, then it starts with a plus sign ( |
Free |
Russia
Component name | Description | Availability |
---|---|---|
Address Identifier RU |
Used to cleanse, standardize, and enrich Russian addresses using FIAS reference data.
|
Slovakia
Component name | Description | Availability |
---|---|---|
Address Identifier SK |
Used to cleanse and parse input address data in order to assign an address code (which is known as address identification). The address code is a unique ID in the official Slovak address registry (Register adries - data.gov.sk). The core logic of the component is delivered by a set of Address Identifier steps with a predefined set of parsing rules; however, the component also works with replacement dictionaries. In addition to standard values, it also provides the output address in envelope-ready printable form.
Input requirements: As a standard in Slovakia, the first address line usually contains the street or the city district or part followed by a number, the second line usually contains the city or the city district or part, and the third line usually contains the postal code. The address structure is complex and the component can handle multiple situations when some of the information is missing (when there is no street net, for example, or by an error in ETL processes). However, the input address lines must contain enough information to unambiguously find the address code in the register. From this point of view, the component is capable of handling the following scenarios:
|
Free |
Address Quick Search SK |
Used to propose addresses based on the input string containing incomplete parts of address.
|
Slovenia
Component name | Description | Availability |
---|---|---|
Cleanse Person Name SI |
Used to verify a person’s name, determine the gender, and split the full name into separate columns.
|
Free |
Cleanse Phone Number SI |
Used to cleanse and validate Slovenian phone numbers. The component validates all Slovenian numbers, including non-regional and special numbers like VOIP, shared-price, audiotex (9-digit numbers). It also accepts emergency and non-standard contact numbers, for example, 112.
Example of standardized phone number: |
Free |
Technical
Component name | Description | Availability | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Accumulate Counter |
Used to simulate an accumulating counter (increment the sequence or get the current value of the sequence).
|
Free |
|||||||||
Combine Words |
Used to make 2-to-10-element combinations from input words.
|
Free |
|||||||||
DTAUS Reader |
Used to prepare Generic Data Reader step for reading DTAUS file (Datenträgeraustauschverfahren).
Rules for output: The output contains the part C only. The part A (header information) and part E (footer information) are also included but not added to the output. |
Free |
|||||||||
Find Related Words |
Used to find words related to the input word. The relation is defined by the input symbol, which is also used in the WordNet Project. The component works with synonyms of all possible contexts or meanings together - for example, "brother" as "sibling" and "brother" as "monk". This is a technical component that can be used for building a special dictionary or component. Although the component can be used in some linguistics- or statistics-based applications, use it in a precise production flow only with caution. It can be used (more or less directly):
Moreover, it contains a functionality that can be used for work with synonyms.
|
Free |
|||||||||
Generate Typo |
Used to generate random typos in the input string.
|
Free |
|||||||||
Search Company Details |
Used to search additional information about a company in the lookup by its name.
|
Free |
|||||||||
Smart Character Replacement |
Used to replace individual characters in the input string by characters from a replacement string based on their position.
Examples (put the phone number to different formats or masks):
|
Free |
|||||||||
Smart Hash Function for All Characters |
Used to hash individual characters (vowel to vowel, consonant to consonant, digit to digit, special character to special character).
|
Free |
|||||||||
Smart Hash Function for Digits |
Used to hash digits in the input string by using a seed table (digits in records with the same
|
Free |
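The Smart Hash Function for All Characters entry above describes a class-preserving mapping: vowel to vowel, consonant to consonant, digit to digit, special character to special character. The catalog does not say how the replacement is chosen, so the sketch below is only one possible way to get that behavior. The use of SHA-256, the `key` parameter, the alphabets, and the function names are all assumptions made for illustration.

```python
import hashlib

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"
DIGITS = "0123456789"
SPECIALS = "!#$%&*+-./:=?@_"  # illustrative set, not taken from the product


def _pick(alphabet: str, ch: str, position: int, key: str) -> str:
    """Deterministically choose a replacement from the same character class."""
    digest = hashlib.sha256(f"{key}:{position}:{ch}".encode("utf-8")).digest()
    return alphabet[digest[0] % len(alphabet)]


def smart_hash(text: str, key: str) -> str:
    """Replace each character with another one of the same class."""
    out = []
    for position, ch in enumerate(text):
        lower = ch.lower() if ch.isascii() else ch
        if lower in VOWELS:
            repl = _pick(VOWELS, lower, position, key)
        elif lower in CONSONANTS:
            repl = _pick(CONSONANTS, lower, position, key)
        elif ch in DIGITS:
            repl = _pick(DIGITS, ch, position, key)
        elif ch in SPECIALS:
            repl = _pick(SPECIALS, ch, position, key)
        else:
            repl = ch  # spaces and anything unclassified stay unchanged
        # Preserve the case of letters.
        out.append(repl.upper() if ch.isupper() else repl)
    return "".join(out)


if __name__ == "__main__":
    print(smart_hash("John.Smith42@example.com", key="demo"))
```

Because the replacement depends only on the key, the position, and the character, the same input always produces the same output, which keeps masked values consistent across runs.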
United Kingdom
Component name | Description | Availability |
---|---|---|
Address Identifier GB (step) |
Used to cleanse, standardize, and enrich GB addresses and, if possible, find UDPRN (Unique Delivery Point Reference Number).
|
|
Cleanse Company Name GB (complex) |
Used to standardize British company names and their legal forms using available dictionaries.
|
Free |
Cleanse National Health Service Number GB |
Used to validate British National Health Service numbers (NHS numbers).
The allowed input NHS number formats depend on optional comments before and/or after NHS number:
The standardized value of NHS number is the value which is valid according to the component behavior (10-digit-only value). The best existing value of NHS number:
|
Free |
Cleanse National Insurance Number GB |
Used to validate GB National Insurance Number (NINO). The number is described by the United Kingdom government as a "personal account number".
Allowed input NINO formats depend on optional comments before and/or after NINO.
If there are multiple NINOs, all except the first one are considered as comments.
For example, if
The standardized value of NINO is the value which is valid according to the component behavior (a value consisting only of two letters, six digits, and one letter). The best existing value of NINO:
|
Free |
Cleanse Person Name GB |
Used to verify a person’s name, determine gender, and split name into separate columns.
|
Free |
Cleanse Phone Number GB |
Used to validate a phone number. The component can verify mobile phone numbers (starting with '07') and geographic phone numbers (starting with '01' or '02'). It is also able to deal with redundant comments.
The standardized value of the phone number is the value which is valid according to the component behavior. The best existing value of phone number:
If the phone number is with IDC, then it begins with a plus sign ( |
Free |
Cleanse VAT Number GB |
Used to validate and standardize British VAT number. The component does not check whether the VAT number exists, only the checksum digits.
The standardized VAT number has one of the following formats (d = digit): |
Free |
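The Cleanse National Health Service Number GB entry above defines the standardized value as ten digits only. The publicly documented NHS number format uses a modulus 11 check digit in the tenth position, and the sketch below assumes the component relies on that check. The function names are hypothetical and the code is an illustration rather than the component's implementation.

```python
from typing import Optional


def nhs_check_digit(first9: str) -> Optional[int]:
    """Return the modulus 11 check digit, or None if no valid digit exists."""
    if len(first9) != 9 or not (first9.isascii() and first9.isdigit()):
        raise ValueError("expected exactly 9 digits")
    # Weights run from 10 down to 2 across the first nine digits.
    total = sum(int(d) * w for d, w in zip(first9, range(10, 1, -1)))
    check = 11 - total % 11
    if check == 11:
        return 0
    if check == 10:
        return None  # such a number cannot be valid
    return check


def is_valid_nhs_number(value: str) -> bool:
    """Check length, character set, and check digit of an NHS number."""
    # Assumption: spaces and hyphens are the only separators removed here.
    compact = value.replace(" ", "").replace("-", "")
    if len(compact) != 10 or not (compact.isascii() and compact.isdigit()):
        return False
    return nhs_check_digit(compact[:9]) == int(compact[9])


if __name__ == "__main__":
    print(is_valid_nhs_number("943 476 5919"))
    print(is_valid_nhs_number("943 476 5918"))
```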
United States of America
Component name | Description | Availability |
---|---|---|
Address Identifier US (step) |
Used to cleanse, standardize, and enrich US addresses.
|
|
Address Quick Search US |
Used to propose addresses based on the input string containing incomplete parts of address.
|
|
Cleanse Company Name US (simple) |
Used to cleanse US company name (CN) and search for the legal element.
|
Free |
Cleanse EIN and ITIN US |
Used to validate and cleanse US Employer Identification Number (EIN) and Individual Taxpayer Identification Number (ITIN).
|
Free |
Cleanse Person Name US |
Used to verify a person’s name, determine gender, and split name into separate columns.
|
Free |
Cleanse Phone Number US (complex) |
Used to cleanse US phone numbers. The component can also cleanse and validate foreign numbers (only at the level of the international dialing code, that is, the IDC). The component validates standard US numbers; however, it does not include non-regional and special numbers such as toll-free, emergency, or non-standard contact numbers. In comparison to the simple version, this component can identify foreign numbers, fictive numbers, comments, extensions, and intervals in a number. The format of a valid output number can be modified.
|
Free |
Cleanse Phone Number US (simple) |
Used to cleanse US phone numbers.
|
Free |
Cleanse Social Security Number US |
Used to validate US Social Security Numbers (SSN).
The standardized SSN is derived from any input containing seven to nine numbers (valid input according to the component behavior). |
Free |
Mask Address US |
Used to mask US addresses in a smart way so that some validity and cross-validity is preserved.
The expected input format for the street line: Street line can contain these components: primary address number, predirectional, street name, suffix, postdirectional, secondary address identifier, secondary address, rural road identifier and number, general delivery identifier, PO box identifier and number. Which elements are used depends on the address type. The following street line patterns are possible:
The city, state, and ZIP code are expected to be in the correct input column and valid. In case of masking arbitrary invalid data, the input format is not specified. However, correct mapping of address lines to input columns is required in all cases. |
|
Mask Company Name US |
Used to mask US company names.
|
|
Mask Person Name US |
Used to mask a person’s name using translation lookup files or transliteration.
The name can contain any characters if it is found in the lookup with masked names. If the name is not found, then any letter in the name is translated to another letter. Special characters and numbers in the name are not translated. |
|
Mask Phone Number US |
Used to mask US phone numbers.
Where
All digits are masked randomly and other characters are left as they are. |
|
Mask Social Security Number US |
Used to mask US SSN. SSN is masked randomly, using seed tables. The format and validity of SSN is preserved. Characters before or after the parsed SSN are not masked.
|
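The Mask Phone Number US and Mask Phone Number CA entries state that all digits are masked randomly and other characters are left as they are. A minimal sketch of that rule follows. The function name `mask_phone_digits` and the optional `rng` parameter are hypothetical, and any format-specific rules the components apply beyond this sentence are not covered.

```python
import random
from typing import Optional

ASCII_DIGITS = "0123456789"


def mask_phone_digits(phone: str, rng: Optional[random.Random] = None) -> str:
    """Replace every digit with a random digit; keep all other characters."""
    rng = rng or random.Random()
    return "".join(
        str(rng.randrange(10)) if ch in ASCII_DIGITS else ch
        for ch in phone
    )


if __name__ == "__main__":
    # Passing a seeded generator makes a run repeatable, which helps in tests.
    print(mask_phone_digits("(212) 555-0100", random.Random(42)))
```

Purely random replacement does not guarantee that the same source number maps to the same mask across runs; the seed tables described for the masking modules address that need.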