
Components and Modules

Overview

There are 95 components available for data cleansing, validation, and standardization, as well as a number of technical or auxiliary ones:

  • Generic (32)

  • Bulgaria (2)

  • Canada (14)

  • Czech Republic (6)

  • France (4)

  • Germany (3)

  • Russia (1)

  • Slovakia (2)

  • Slovenia (2)

  • Technical (9)

  • United Kingdom (7)

  • United States of America (13)

All PUBLIC components are ready to use in a standard engine package for free.

Component description

Template projects

Package name Description Availability

Data Cleansing Package (CA)

Data Cleansing Package (Canadian knowledge base) is a tutorial which shows how party attributes, such as a person’s name, date of birth, social insurance number, company name, address, email, can be cleansed using premade Canadian or generic cleansing and validation components.

The tutorial plan expects three input entities: party, address, and document. If this condition is not met, a few modifications are needed.

In addition to three cleansed output entities, a denormalized entity is created - addresses and documents belonging to the same party are grouped, concatenated, and joined to the corresponding party. This entity can be further used for unification purposes.

The cleansing functionality is mainly intended for batch processing but it can be enabled via web services as well to allow real-time data cleansing.

Contact us

Data Masking Module (CA)

Data Masking Module (Canadian knowledge base) consists of a set of prebuilt components for common entities such as personal identification, address, and so on, which can provide complex data masking services. The masking functionality is primarily intended for batch processing but can also be enabled via web services, allowing real-time masking of data coming into a system, such as a transactional system.

Inherent data quality issues found in the source data are preserved during the masking process. When necessary, the data masking module populates the seed table with the results of the masking services so that a repository of the masks is kept in a centrally accessible location.

Dynamic seed tables are technical database tables used to replace source strings, dates, and integers with plausible alternatives. The content of these tables is filled dynamically from source data and can be kept in a separate schema (with limited access rights to ensure the masking algorithms are not discovered). This serves several purposes:

  • A unique mask is generated for each unique value.

  • Access to the masked values and their original values can be handled securely at the database level.

  • It provides the ability to unmask a particular value if needed.
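The idea of a seed table can be illustrated with a minimal in-memory sketch. This is not the module's implementation: the class name SeedTable, its methods, and the shape-preserving random replacement are all assumptions made for the example, and a real seed table lives in a database schema with restricted access.

```python
import secrets
import string


class SeedTable:
    """In-memory stand-in for a dynamic seed table (hypothetical helper)."""

    def __init__(self):
        self._mask_by_value = {}  # original value -> mask
        self._value_by_mask = {}  # mask -> original value, used for unmasking

    def _random_like(self, value):
        # Keep the shape of the value: letters stay letters, digits stay digits.
        pools = (string.ascii_lowercase, string.ascii_uppercase, string.digits)
        return "".join(
            next((secrets.choice(pool) for pool in pools if ch in pool), ch)
            for ch in value
        )

    def mask(self, value):
        # A value seen before always gets the mask already stored for it.
        if value not in self._mask_by_value:
            candidate = self._random_like(value)
            # Retry on collision so that each unique value has a unique mask.
            while candidate in self._value_by_mask:
                candidate = self._random_like(value)
            self._mask_by_value[value] = candidate
            self._value_by_mask[candidate] = value
        return self._mask_by_value[value]

    def unmask(self, masked):
        # Only whoever can read the table can recover the original value.
        return self._value_by_mask[masked]


table = SeedTable()
masked = table.mask("John 42")
print(masked, "->", table.unmask(masked))
```

Because the mapping is stored rather than recomputed, repeated runs over the same source value return the same mask, and unmasking is a plain lookup.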

Contact us

Data Masking Module (US)

Data Masking Module (US knowledge base) - See the description of Data Masking Module (CA).

Contact us

Generic

Component name Description Availability

Cleanse and Convert Weight Unit and Value

Used to parse values and units of weight and then convert the weight to the unit set as a parameter.

  • Input: value (STRING), unit (STRING).

  • Output: Parsed value (STRING), validated unit (STRING), standardized value (STRING), standardized unit (STRING), parsed out comment (STRING), data quality score (INTEGER), explanation codes (STRING).

  • Supported units: Kilograms (kg), milligrams (mg), grams (g), metric tonnes (t; equal to 1000 kg), carats (ct; 200 milligrams), grains (gr), drachms (dr), ounces (oz), pounds (lb), stones (st), quarters (qr), imperial tons, slugs, short ton (used in the US), US hundredweight (defined through SI wherever applicable).

  • Allowed input value formats:

    • String matching any pattern from the following list.

    • String matching any pattern from the following list within a longer textual string.

The component removes the duplicates from the input (if present) and then joins the input fields together. As a result, the component can be used even when all the information (both numerical value and unit) is entered into one field.

  • Allowed value patterns:

    • D+

    • D+ (DS) D+

    • D+ [eE] D+

    • D+ [xX] 10^ D+

    • D+ [eE] [- ] D

    • D+ [xX] 10^ [-] D

    • D+ (DS) D+ [eE] D+

    • D+ (DS) D+ [xX] 10^ D+

    • D+ (DS) D+ [eE] [- ] D

    • D+ (DS) D+ [xX] 10^ [-] D

Space characters are parsed correctly only between the letters and digits in scientific notation (for example, 5.6 x 10^8; 7.5 E 4) or when used to separate groups of three digits (for example, 12 400 765; 1 429.587324). The whitespace in the patterns listed is there only for readability.

  • Not allowed (due to processing issues):

    • More than a single measurement unit in the input string.

    • Values entered as words (for example, "twelve point five").

    • More than a single value in the input string.

While there are no constraints on the decimal_separator parameter, it works best if it is either a comma or a dot (otherwise the component cannot handle digit groupings, for example).
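The value patterns and the unit conversion can be sketched in a few lines. The following is only an illustration under stated assumptions: the function names parse_value and to_kilograms are hypothetical, only a handful of units are included, and all whitespace is dropped before matching, which is more lenient than the component's rule about groups of three digits.

```python
import re

# Kilograms per unit; a small subset of the supported units.
KG_PER_UNIT = {"kg": 1.0, "g": 1e-3, "mg": 1e-6, "t": 1000.0, "ct": 2e-4}


def parse_value(text, decimal_separator="."):
    """Parse a value written in one of the listed patterns into a float."""
    ds = re.escape(decimal_separator)
    pattern = re.compile(
        rf"(\d+(?:{ds}\d+)?)"              # D+ or D+ (DS) D+
        rf"(?:(?:[eE]|[xX]10\^)(-?\d+))?"  # optional exponent part
    )
    # Spaces separate digit groups or pad the scientific notation, so drop them.
    compact = re.sub(r"\s+", "", text)
    match = pattern.fullmatch(compact)
    if match is None:
        raise ValueError(f"unsupported value pattern: {text!r}")
    mantissa = match.group(1).replace(decimal_separator, ".")
    exponent = match.group(2) or "0"
    return float(f"{mantissa}e{exponent}")


def to_kilograms(value_text, unit, decimal_separator="."):
    """Convert a parsed value to kilograms using the factor table above."""
    return parse_value(value_text, decimal_separator) * KG_PER_UNIT[unit]


print(parse_value("5.6 x 10^8"))
print(to_kilograms("2", "t"))
```

Values entered as words, such as "twelve point five", raise ValueError here, which mirrors the "not allowed" list only loosely.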

Free

Cleanse Credit Card Number

Used to validate a credit card number and determine the card details.

  • Input: Credit card number.

  • Output: Parsed credit card number, validated credit card number, score and explanation. If found, credit card type (like Visa or Mastercard), country, and issuer authority.
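This page does not state which checks the component applies. One check commonly used for card numbers is the Luhn checksum, sketched below as an assumption about what such a validation might include; the function name luhn_valid is hypothetical.

```python
def luhn_valid(number):
    """Return True if the digits in the input pass the Luhn checksum."""
    digits = [int(ch) for ch in number if "0" <= ch <= "9"]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))
```

A passing checksum only means the number is well formed. It says nothing about whether the card exists, and detecting the card type, country, or issuer needs separate lookup data.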

Free

Cleanse Date

Used to parse dates or datetimes, validate them, apply business rules, and set them to the standard date and datetime formats. Those are 'yyyy-MM-dd' for date and 'yyyy-MM-dd HH:mm:ss.SSS' for datetime, where yyyy - year, MM - month, dd - day, HH - hours in 24-hour format, mm - minutes, ss - seconds, SSS - milliseconds.

The component has two optional parameters:

  • SCORE_FUTURE (Boolean, default TRUE): Used for scoring future dates (TRUE for scoring future dates, FALSE for not scoring future dates, empty for default).

  • SCORE_PAST_DAY (DAY, default 01-01-1900): Used for scoring dates far in the past (date in the DAY format for threshold value, empty for default, NULL for not scoring dates in past).

  • Input: Date or datetime (STRING).

  • Output: Standardized date and datetime (business validations applied), the best existing date and datetime, date pattern, score, and explanation of data discrepancies. In the case of multiple dates in the input, only the first one is standardized.

  • The date part of the input can be specified by the following patterns:

    • dd.MM.(yy)yy - With dots as separators.

    • MM/dd/(yy)yy - With forward slashes as separators.

    • (yy)yy-MM-dd - With dashes as separators.

    • yyyyMMdd - With no separators.

Where dd - one or two digits of a day, MM - one or two digits of a month, (yy)yy - two or four digits of a year. For the yyyyMMdd format, there has to be exactly eight digits (four for year, two for month, and two for day).

  • The time part of the input can be specified by the following patterns:

    • HH:mm:ss (for example, 18:03:30)

    • HH:mm:ss{Z|XXX} (for example, 18:03:30+0100, 18:03:30+01:00)

    • HH:mm:ss.SSS (for example, 18:03:30.001)

    • HH:mm:ss.SSS{Z|XXX} (for example, 18:03:30.001+0100, 18:03:30.001+01:00)

Where HH - two digits of an hour (00-23), mm - two digits of a minute (00-59), ss - two digits of a second (00-59), SSS - one to three digits of a millisecond (0-9, 00-99, or 000-999), Z - time zone in the {+ -}hhmm format (for example, +0100), and XXX - time zone in the {+ -}hh:mm format (for example, +01:00).

When there is a datetime on input, the date and time parts have to be separated either by a space or a character 'T' (for example, 2018-07-17T18:03:30).

When the source value contains the time zone part, the component adjusts the output value according to the local time zone (including daylight saving time adjustments). This means the std_datetime and out_datetime are adjusted by the time zone offset, but also std_date and out_date might change to the next or previous date, in case the time conversion crosses midnight.

In addition to numeric dates, the component is also able to validate dates with a month name or its abbreviation (first three letters). In this case, the date format contains "MMM(M)" for month (MMMM for the full month name, MMM for abbreviation).

By default, month names in 10 languages are supported: en, de, nl, fr, es, pt, ru, it, sk, cs. All except the Czech language (cs), where some abbreviations would be ambiguous, support using the month abbreviations.

You can add other languages to the lookup. See Java utiltext documentation for Locale ID (the first two letters before the underscore suit the purpose). In case the month name is used, the combination of the day number (cardinal or ordinal) and year element must be unambiguous for the date to be valid.

Examples of valid dates include: "January 1st 99", "January 1 1999". Example of invalid date: "January 1 99". In the latter case, we are able to determine that the number 99 can only mean the year, but for a different date in the same format, such as "January 12 13", this would no longer be possible.
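The numeric date patterns can be tried out with Python's standard library. This is a rough sketch and not the component's logic: the function name standardize_date is hypothetical, month names, time parts, and scoring are left out, and two-digit years follow Python's own pivot rule for %y, which may differ from the component's.

```python
import re
from datetime import datetime

# Four-digit-year formats come first so that %y never consumes half of a long year.
DATE_FORMATS = [
    "%d.%m.%Y", "%d.%m.%y",  # dd.MM.(yy)yy
    "%m/%d/%Y", "%m/%d/%y",  # MM/dd/(yy)yy
    "%Y-%m-%d", "%y-%m-%d",  # (yy)yy-MM-dd
]


def standardize_date(text):
    """Return the date as yyyy-MM-dd, or None if no pattern fits."""
    text = text.strip()
    if re.fullmatch(r"\d+", text):
        # yyyyMMdd must have exactly eight digits.
        formats = ["%Y%m%d"] if len(text) == 8 else []
    else:
        formats = DATE_FORMATS
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(standardize_date("17.07.2018"))
print(standardize_date("07/17/18"))
```

Impossible dates such as 31.02.2018 match a pattern syntactically but fail calendar validation, so the sketch returns None for them.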

Free

Cleanse EAN Code (complex)

Used to validate EAN codes.

It includes:

  • EAN-8 (used for small goods - for example, chewing gums and goods with restricted circulation)

  • EAN-13 (used for various products)

  • EAN-14 (or GTIN-14)

  • ITF-14 (or SSC)

  • EAN-18 (or SSCC)

  • EAN-128 codes (or GS1-128)

Input: EAN code.

Output: Standardized value of the EAN code, best existing (standardized or cleansed) value of the EAN code, data quality score of the EAN code, data quality imperfection explanations.

  • Allowed input number format:

    • Only one EAN code of defined length without extensions.

    • Valid characters in EAN codes are digits, letters, and parentheses (round brackets).

    • Other characters than those specified here can be present in an EAN code, but they are removed (and scored) before validation.

  • Not allowed (due to processing issues):

    • Several values of an EAN code in the input value (parsing fails as multiple values are considered invalid).

    • Comments with the EAN input value are not processed (the EAN code value could be processed as expected but the comments would be lost).

    • Comments with the EAN input value containing numbers are not processed (the EAN code value would be processed as one string with digits from the comment, but the value would be invalid and the comments lost).

    • Simple EAN codes containing the left parenthesis.

    • Compound EAN codes without the left parenthesis.

Free

Cleanse EAN Code (simple)

Used to validate EAN codes. It includes only EAN-13.

Input: EAN code.

Output: Standardized value of the EAN code, data quality score of the EAN code, data quality imperfection explanations, best existing value of the EAN code (standardized, cleansed, or input value).

  • Allowed input number format is:

    • Any character in the input value of the EAN code is allowed (all except digits are deleted from the value before validation).

    • Null values are processed and scored.

  • Not allowed (due to processing issues):

    • Several values of an EAN code in the input value (parsing fails as, after non-digits are removed, multiple values would be considered to be one EAN code and, hence, invalid).

    • Comments with the EAN input value are not processed (the EAN code value might be processed as expected, but the comments would be lost).

    • Comments with the EAN input value containing numbers are not processed (the EAN code value would be processed as one string with digits from the comment, but the value would be invalid and the comments lost).
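As an illustration of the EAN-13 case, the sketch below removes all non-digit characters and then applies the standard GS1 modulo-10 check digit. This page does not describe the component's internal rules, so treat the function name ean13_valid and the exact checks as assumptions.

```python
def ean13_valid(raw):
    """Strip non-digits, then verify the length and the EAN-13 check digit."""
    digits = [int(ch) for ch in raw if "0" <= ch <= "9"]
    if len(digits) != 13:
        return False
    # The first twelve digits are weighted 1, 3, 1, 3, ... from the left.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    check_digit = (10 - total % 10) % 10
    return check_digit == digits[12]


print(ean13_valid("4006381333931"))
```

Since every non-digit is dropped first, two codes in one input merge into a single long digit string and fail the length test, which matches the limitation on multiple values.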

Free

Cleanse Email

Used to validate and cleanse email addresses in the best possible manner.

Although this component is powerful when it comes to transforming invalid email addresses into valid ones, keep in mind that a valid email address is not the same as an existing email address; a valid address only conforms to a standard. Therefore, any change in the source value can result in a non-existent output.

In addition, not all email providers follow the RFC rules for email address format, and in that case, an existing email address is not the same as a valid email address (that is, the message is delivered even though the email address does not conform to the standard). Therefore, consider carefully what the component is used for.

  • Input: Email column.

  • Output:

    • Standardized value of the email address (score less than 10 000).

    • Data quality score of the email address.

    • Data quality imperfection explanations.

    • Best existing value of the email address (standardized, parsed, or input value).

    • Standardized value of the email address owner if recognized (person’s name is populated if recognized).

  • Allowed input email address formats:

    • Email address found in the allowlist.

    • Email address with a valid top-level domain (TLD).

    • Email address in a valid format.

    • Email address with a string before <em@il.tld>.

    • Email address in the upper case - valid with no change (local part is case sensitive).

    • Email address with special characters, such as !#$%&'*+-/=?^_`{|}~.

    • Email address with accents and multiple consecutive dots or commas used instead of dots (for example, "john,smith@email.com").

    • Email address with a leading stopword delimited from the email address by a space or colon (for example, "email: mail@dom.tld").

    • Valid email address enclosed in angle brackets (<>) having a string before the email.

    • Valid email in any type of paired enclosing strings: (), {}, <>, [], "", '' (for example, "{email}", "(email)").

    • Spam-protected versions of the email address with an at sign (@) and/or dot codes (for example, name<dot>surname<at>domain<dot>tld).

    • Email address with the most common misspellings of TLDs.

    • Email address with a missing or misspelled TLD in case of well-known email providers.

    • Email address missing the last dot between the TLD and the second level domain name.

    • Any string that contains a valid email address as a substring (delimited by characters not supported in email addresses).

    • Email address with a temporary domain (for example, 5 and 10 minute emails).

    • Any viable combinations of the previous.

  • Not allowed (due to processing issues):

    • Empty email address (or whitespaces only).

    • Email address with unsupported characters, such as בײַשפּ@בײַשפּיל.טעסט.

    • String without an email address.

    • String with multiple emails.

    • Email address with only a TLD in the domain part (for example, "john.smith@com" - though RFC-valid, it is considered suspicious).

    • Email address found in the blocklist (for example, spam addresses, business defined emails).
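A few of the rules above (spam-protected codes, a leading stopword, enclosing brackets, a single address per input, no TLD-only domain) can be mimicked with a deliberately simplified sketch. The regular expression is much narrower than what the component or the RFCs accept, and the names deobfuscate and extract_single_email are hypothetical.

```python
import re

# Simplified address shape: the domain must contain at least one dot.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def deobfuscate(text):
    """Replace the spam-protection codes <at> and <dot> with real characters."""
    text = re.sub(r"<at>", "@", text, flags=re.IGNORECASE)
    return re.sub(r"<dot>", ".", text, flags=re.IGNORECASE)


def extract_single_email(text):
    """Return the only address found in the text, or None."""
    candidates = EMAIL.findall(deobfuscate(text))
    # Zero matches or several matches are both treated as "not allowed".
    return candidates[0] if len(candidates) == 1 else None


print(extract_single_email("name<dot>surname<at>domain<dot>tld"))
print(extract_single_email("email: mail@dom.tld"))
```

The sketch does not correct misspelled TLDs, consult an allowlist or blocklist, or score the result.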

Free

Cleanse International Bank Account Number

Used to verify IBAN (International Bank Account Number).

  • Input: IBAN column.

  • Output:

    • Standardized value of IBAN.

    • Data quality score of IBAN.

    • Data quality imperfection explanations.

    • Best existing value of IBAN (standardized, cleansed, or input value).

  • Allowed input number formats:

    • IBAN with a maximum length of 34 characters.

    • IBAN can contain any character possible. All except digits and letters are removed before verification.

    • Empty record or input value that does not contain any alphanumeric characters (IBAN is considered NULL).
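The input rules can be combined with the standard modulo-97 check for IBANs in a short sketch. The function name check_iban is hypothetical, and country-specific lengths and formats, which a full verification would also consider, are omitted.

```python
def check_iban(raw):
    """Return None for an empty input, otherwise True or False."""
    # Keep only ASCII letters and digits; everything else is removed first.
    iban = "".join(ch.upper() for ch in raw if ch.isascii() and ch.isalnum())
    if not iban:
        return None  # nothing alphanumeric: treated as NULL
    if len(iban) < 5 or len(iban) > 34:
        return False
    # Move the country code and check digits to the end, then map A=10 ... Z=35.
    rearranged = iban[4:] + iban[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(check_iban("GB82 WEST 1234 5698 7654 32"))
```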

Free

Cleanse Number

Used to verify numbers (input as STRING). The output is provided in the following three formats: INTEGER, LONG, and FLOAT.

  • Input: Number column.

  • Output: Best existing value of the number in each type and data quality imperfection explanations.

  • Allowed input number formats:

    • Numbers with spaces.

    • Numbers with one decimal point or decimal comma - in the middle of the number or at the end (for example, 1,2345 or 3433,).

    • Numbers with digit grouping - only thousands separators (for example, 1,123,123.00109 and 1.123.123,00109).

    • Numbers with one plus or minus sign (+ or -) at the beginning.

    • Numbers can be integer, long, or float.

  • Not allowed (due to processing issues):

    • Numbers containing letters.

    • Numbers containing more than one comma or dot (except acting as thousands separators).

    • Empty input.

    • Numbers in a wrong format (not compliant with the previously stated rules).

      • For FLOAT: The part before the decimal point is longer than 100 digits.

      • For LONG: The number is longer than 100 digits.

Free

Cleanse SWIFT code

Used to verify SWIFT (Society for Worldwide Interbank Financial Telecommunication) codes. A SWIFT code, also known as the Bank Identifier Code (BIC), is a unique identification code for a particular bank.

  • Input: SWIFT code.

  • Output: Standardized value of the SWIFT code, data quality score of the SWIFT code, data quality imperfection explanations, best existing value of the SWIFT code (standardized or cleansed value).

The standardized value of SWIFT is the value which is valid according to the component behavior.

The best existing value of SWIFT:

  • Standardized value.

  • Value in a correct structure (8 or 11 positions).

  • Cleansed input value.
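The "correct structure" test can be sketched as follows, assuming the commonly documented BIC layout: four letters for the institution, two letters for the country, two alphanumeric characters for the location, and an optional three-character branch code. Both the layout check and the cleansing step (dropping non-alphanumeric characters and upper-casing) are assumptions for this example, and the function name cleanse_swift is hypothetical.

```python
import re

# 8 positions, or 11 when a branch code is present.
BIC_PATTERN = re.compile(r"[A-Z]{4}[A-Z]{2}[A-Z0-9]{2}(?:[A-Z0-9]{3})?")


def cleanse_swift(raw):
    """Return (cleansed value, whether it has a valid 8- or 11-position structure)."""
    cleansed = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    return cleansed, BIC_PATTERN.fullmatch(cleansed) is not None


print(cleanse_swift("deut de ff"))
```

A structurally correct code is not necessarily an existing one. Confirming that would require a registry lookup, which is outside this sketch.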

Free

Cleanse Vehicle Identification Number

Used to validate VIN (Vehicle Identification Number).

  • Input columns:

    • in_vin - Input VIN

  • Output columns:

    • std_vin - Standardized value of VIN. Meets all rules of the component.

    • out_vin

    • std_region

    • std_manufacturer

    • std_model_year

    • exp_vin

    • sco_vin

Free

Country Code Identifier

Used to extract a country code from an address.

  • Input:

    • in_string (STRING): Input address or string from which you want to extract the country code.

  • Input address requirements:

    • String containing any alphanumeric characters and the following characters: , ; - _ / | .

    • Country names must be in English.

    • Country codes must be in upper case, otherwise they are not recognized as country codes.

The component performance depends on the quality of input data. For best results, we recommend using precleansed data.

  • Output:

    • in_string (STRING): Input address or string.

    • std_cc_alpha_2 (STRING): Standardized country code in alpha-2 format.

    • std_cc_alpha_3 (STRING): Standardized country code in alpha-3 format.

    • std_cc_num_iso (STRING): Standardized numerical country code.

    • std_country_name (STRING): Standardized country name.

    • out_country_name (STRING): Names of all countries identified on input (can be used, for example, for data cleansing).

    • sco_cc (INTEGER): Scoring value of the input address based on extracted alpha-2 country codes.

    • exp_cc (STRING): Scoring value explanation.

Free

Derive Age from Birthdate

Used to derive age from the date of birth. The current or custom date can be used as a baseline.

  • Input: Date of birth column in DATE format.

  • Output: Age and data quality imperfection explanations.

  • Allowed input birth date value:

    • Date in the past.

    • Today’s date.

  • Not allowed (due to processing issues):

    • Date in the future.

    • Empty value.

  • Parameters:

    • age_unit - Choose output age units (days, weeks, months or years). Default value is "YEARS".

    • current_date - Set up the baseline in DATE format (yyyy-MM-dd). Default value is "today".
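A minimal sketch of the derivation, assuming whole completed units are reported (the component's exact rounding is not documented here). The function name derive_age is hypothetical; the keyword arguments are named after the parameters above.

```python
from datetime import date


def derive_age(birth_date, current_date=None, age_unit="YEARS"):
    """Return the age in completed days, weeks, months, or years."""
    baseline = current_date or date.today()  # default baseline is today
    if birth_date is None:
        raise ValueError("empty birth date")
    if birth_date > baseline:
        raise ValueError("birth date is in the future")
    unit = age_unit.upper()
    days = (baseline - birth_date).days
    if unit == "DAYS":
        return days
    if unit == "WEEKS":
        return days // 7
    # Count month boundaries, then drop the last month if it is not complete.
    months = (baseline.year - birth_date.year) * 12 + baseline.month - birth_date.month
    if baseline.day < birth_date.day:
        months -= 1
    if unit == "MONTHS":
        return months
    if unit == "YEARS":
        return months // 12
    raise ValueError(f"unknown age unit: {age_unit!r}")


print(derive_age(date(1990, 5, 15), date(2020, 5, 14)))
```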

Free

Derive Name of a Day from Date

Used to derive the name of a day from a date.

  • Input: Date column in DATE format.

  • Output: Name of the day of the week.

  • Allowed input date value:

    • Date in the past.

    • Today’s date.

    • Date in the future.

  • Not allowed (due to processing issues):

    • Empty value.

  • Parameters:

    • Output language:

      • Select one of the ISO 639-1 language codes (first two letters of Locale ID).

      • Unknown or invalid value - English language is used.

    • Abbreviation flag:

      • true - Abbreviation of the name of a day is used.

      • false or invalid - Full name of a day is used.

Free

Generic String to Boolean Conversion

Used to convert certain strings with Boolean meaning (0/1, true/false) to Boolean type.

  • Input: String column.

  • Output: Boolean column.
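A minimal sketch of such a conversion, covering only the strings named above. How the component treats case, surrounding whitespace, and unrecognized values is an assumption here, and the function name string_to_boolean is hypothetical.

```python
TRUE_VALUES = {"1", "true"}
FALSE_VALUES = {"0", "false"}


def string_to_boolean(value):
    """Return True, False, or None when the string has no Boolean meaning."""
    if value is None:
        return None
    normalized = value.strip().lower()
    if normalized in TRUE_VALUES:
        return True
    if normalized in FALSE_VALUES:
        return False
    return None


print(string_to_boolean(" TRUE "))
```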

Free

Guess Contact Type

Used to guess the contact type of the input value. It can identify possible emails, webpages, and phone numbers.

Use this component in case you mostly have single contact values on input. If the component identifies only one single contact, you can use specific cleansing or validating components accordingly.

This component is approximately twice as fast as the Parse Contact Type component, which can parse the first occurrence of every (supported) contact type.

  • Input: Contact string.

  • Output: Contact type, cleansing explanation.

Free

Hierarchical Union Match of Person and Company

Uses generic matching rules to identify and group the same records together depending on the party type (person or company). Hierarchical union strategy is used to minimize the number of false positive results. One matching key is chosen as the strongest one (personal identifier), and records with different keys are never grouped together.

The output of the matching process strongly depends on the content and quality of input data. We recommend using preconfigured cleansing components to achieve the best matching results.

  • Input:

    • Primary key

    • Party type (person or company)

    • Gender code

    • First name

    • Middle name

    • Last name

    • Person identifier (such as social security number, social insurance number, birth number, ITIN)

    • Date of birth

    • Company name

    • Company identifier from a business registry

    • Address (permanent or contact address for person, address of company branch or headquarters for companies)

    • Contacts (such as emails, phone numbers)

    • Document identifiers (such as driver’s license)

    • Cleansing score

    • Delete flag

    • Date of last modification

  • Output:

    • Candidate group identifier

    • Previous candidate group identifier

    • Matching identifier

    • Previous matching identifier

    • Processing status

    • Processing timestamp

    • Instance role

    • Previous instance role

    • Matching rule name

Mapping of input attributes depends on your matching requirements. Invalid mapping or invalid values might cause matching errors:

  • False-positive match - Records which should not be matched together are matched.

  • False-negative match - Records which should be matched together are not matched.

All input, changed, and discarded (for example, records with a timestamp that is older than the repository) records are returned from the matching repository.

Free

Length Standardization

Used to validate and standardize submitted units of length and their values. The component can convert length to the unit set as an optional parameter.

Supported units: mm, cm, dm, km, thou (0.0000254 m), mil (0.0000254 m), line (0.002116 m), inch (0.0254 m), foot (0.3048 m), yard (0.9144 m), mile (1 609 m), league (4 828 m).

The component removes the duplicates from the input (if present) and then joins the input fields together. As a result, the component can be used even when all the information (both numerical value and unit) is entered into one field. Additional text is marked as comment and is also allowed.

While there are no constraints on the decimal_separator parameter, it works best if it is either a comma or a dot (otherwise the component cannot handle digit groupings, for example).

  • Input: value (STRING), unit (STRING).

  • Output:

    • std_value (FLOAT): Input value converted into target unit of length.

    • std_unit (STRING): Target unit of length.

    • out_value (FLOAT): The best value of input value.

    • out_unit (STRING): The best value of input unit.

    • out_comment (STRING): Parsed out comment.

    • sco_length (INTEGER): Data quality score.

    • exp_length (STRING): Explanation codes.

Rules for output: If a value and unit of length both can be parsed and the unit is verified, then the input is valid and can be processed without any transformation (example: input "15 m" converted into meters). Otherwise, the logic applied depends on the input and is described by the following explanation codes:

  • Input is valid and some transformation was applied to input:

    • LGT_VALUE_STANDARDIZED: The input value of length was standardized. Example: "14.999,999,999 meters" to "14.999999999 m".

    • LGT_UNIT_STANDARDIZED: The input unit of length was standardized. Example: "15 meters" to "15 m".

    • LGT_COMMENT_FOUND: The input contains additional information other than value and unit of length. Example: "Length is 15 meters".

    • LGT_CONVERTED: The value of length was converted to its corresponding value in the target unit. Example: "1500 centimeters" to "15 m".

    • LGT_NOT_PARSED: No valid parsing pattern was found and input was not parsed. Example: "14. m".

    • LGT_MULTIPLE_SEPARATORS: Input was parsed but more than one decimal separator was found in the value. Example: "14.999.999,999,999 meters".

    • LGT_EMPTY: No input was submitted.

  • Input is invalid:

    • LGT_VALUE_MISSING: Unit is verified but no value of length was found. Example: "meters".

    • LGT_UNIT_MISSING: Value of length is parsed but no unit was verified. Example: "15".

    • LGT_MULTIPLE_UNITS_FOUND: Input contains more than one unit of length. Example: "15 meters kilometer".

    • LGT_INVALID: No value and no unit of length were parsed. Example: "Length".
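The conversion and some of the explanation codes can be approximated with a small sketch. It is a simplification under stated assumptions: tokens must be separated by whitespace, only a dot is accepted as the decimal separator, the alias table is hypothetical, and codes such as LGT_NOT_PARSED, LGT_VALUE_STANDARDIZED, and LGT_MULTIPLE_SEPARATORS are not covered. The conversion factors are the ones quoted in the list of supported units.

```python
import re

# Meters per unit.
METERS_PER_UNIT = {
    "mm": 0.001, "cm": 0.01, "dm": 0.1, "m": 1.0, "km": 1000.0,
    "thou": 0.0000254, "mil": 0.0000254, "line": 0.002116, "inch": 0.0254,
    "foot": 0.3048, "yard": 0.9144, "mile": 1609.0, "league": 4828.0,
}
# Hypothetical spelling variants mapped to standardized units.
ALIASES = {"meter": "m", "meters": "m", "centimeters": "cm",
           "kilometer": "km", "kilometers": "km"}


def standardize_length(text, target_unit="m"):
    """Return (std_value, std_unit, explanation_codes)."""
    tokens = (text or "").split()
    if not tokens:
        return None, None, ["LGT_EMPTY"]
    values, units, codes, comment = [], [], [], []
    for token in tokens:
        unit = ALIASES.get(token.lower(), token.lower())
        if re.fullmatch(r"\d+(\.\d+)?", token):
            values.append(float(token))
        elif unit in METERS_PER_UNIT:
            units.append(unit)
            if unit != token:
                codes.append("LGT_UNIT_STANDARDIZED")
        else:
            comment.append(token)  # anything else is treated as a comment
    if not values and not units:
        return None, None, ["LGT_INVALID"]
    if not values:
        return None, None, ["LGT_VALUE_MISSING"]
    if not units:
        return None, None, ["LGT_UNIT_MISSING"]
    if len(units) > 1:
        return None, None, ["LGT_MULTIPLE_UNITS_FOUND"]
    if comment:
        codes.append("LGT_COMMENT_FOUND")
    if units[0] != target_unit:
        codes.append("LGT_CONVERTED")
    std_value = values[0] * METERS_PER_UNIT[units[0]] / METERS_PER_UNIT[target_unit]
    return std_value, target_unit, codes


print(standardize_length("1500 centimeters"))
```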

Free

Loqate Address Validate

Uses the Loqate Server Engine to verify the input address string.

The component supports the following service types:

  • Verify (V)

  • Verify & Geocode (V+G)

The component does not support CASS or AMAS verification.

A valid Loqate Server license key (license.lfs) is required to run this component.

Output results depend on the Loqate Server capabilities for the given area. The Loqate Server provides different address accuracy and completeness for each country.

  • Input:

    • Primary key

    • Address lines 1-4

    • Country code

    • Postal code

  • Output:

    • Out address attributes

    • Address label

    • Explanation code

    • Data quality score

    • Original Loqate address verification code

Output of Verify (V) or Verify & Geocode service (V+G) includes major output attributes as defined in Loqate: Field Descriptions. The following output attributes are available only if Geocoding parameter is enabled:

  • out_geo_accuracy

  • out_geo_distance

  • out_latitude

  • out_longitude

Free

Loqate Cloud Address Web Lookup

Uses the Loqate Cloud Service to verify or search for the input address string.

The component supports the following service types:

  • Verify (V)

  • Verify & Geocode (V+G)

  • Search (S)

  • Search & Geocode (S+G)

The component does not support the following service types:

  • Query (Q)

  • AMAS Verify (A)

  • CASS Verify (C)

A valid Cloud Service license key is required to run this component.

Output results depend on the Cloud Service capabilities for the given area. The service provides different address accuracy and completeness for each country.

  • Input:

    • Primary key

    • Address lines 1-4

    • Country code

    • Postal code

  • Output:

    • out - Output of Verify (V) and Verify & Geocode service (V+G). Contains major output attributes as defined in the cloud service Field Descriptions and explanation codes (exp_instance).

    • out_proposals - Output of Search (S) or Search & Geocode service (S+G). Contains the same attributes as the out endpoint plus a primary key identifying the source request.

Free

Mask Credit Card Number (complex)

Used to mask credit card numbers.

  • Input: Credit card number.

  • Output: Masked credit card number.

  • Allowed input formats: digits, letters, and special characters.

Contact us

Mask Credit Card Number (simple)

Used to anonymize credit card numbers.

  • Input: Credit card number.

  • Output: Anonymized credit card number.

  • Acceptable input: Values consisting only of digits (12-19) with a delimiter allowed between the groups of four digits. Delimiter can be a hyphen or space (delimiter must be only one character: <space><hyphen><space> is considered invalid).

    • Examples of valid input:

      • 4012-3680-3379-4136

      • 501839120044

      • 6304 3356 1988 3456 223

  • Non-acceptable input: Credit card numbers containing non-digit characters (except delimiters) within the value.

    • Examples of invalid input:

      • creditcard4012-4023-5533-2221

      • 40-12-40-23-5533-22-21

      • 40 12 40 23 5533 22 21

      • 4012 - 4023 - 5533 - 2221
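The acceptance rule can be expressed as a regular expression plus a digit count. The sketch below is an interpretation of the rule and not the component's code: the name is_acceptable_card_input is hypothetical, and mixing hyphens and spaces within one value is accepted here, which the description leaves open.

```python
import re

# Groups of four digits, each optionally preceded by ONE hyphen or space;
# the last group may be shorter.
CARD_PATTERN = re.compile(r"\d{4}(?:[- ]?\d{4})*(?:[- ]?\d{1,3})?")


def is_acceptable_card_input(value):
    """Return True if the value fits the grouping rule and has 12-19 digits."""
    if CARD_PATTERN.fullmatch(value) is None:
        return False
    digit_count = sum(ch.isdigit() for ch in value)
    return 12 <= digit_count <= 19


for sample in ("4012-3680-3379-4136", "4012 - 4023 - 5533 - 2221"):
    print(sample, is_acceptable_card_input(sample))
```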

Contact us

Mask Date

Used to mask arbitrary dates.

  • Input: Date (datetime - no parsing or validation).

  • Output: Masked date (datetime).

Contact us

Mask Email

Used to mask emails.

  • Input: in_email (input email).

  • Output: out_masked_email (masked email).

  • Input limitations:

    • Component does not mask several email addresses in one record (everything up to the last at sign (@) is considered the local part and is transliterated).

    • Inputs with domain names containing spaces are incorrectly parsed: everything after the space is considered a suffix (not part of the email address) and remains unchanged. Spaces are allowed only in the local part.

    • Inputs with email address and suffix starting with a letter without any separator between them are incorrectly parsed: everything after the at sign is considered a domain part. Text suffix has to be separated from the value by a non-alphanumeric character.

Contact us

Mask English Word

Used to mask words (English words are preferred).

  • Input: in_english_text (input text).

  • Output: out_english_text (masked text).

Contact us

Mask Number

Used to mask numbers or IDs.

  • Input: in_number (ONE Engine type long - no parsing or validation).

  • Output: out_number (ONE Engine type long).

Contact us

Parse Contact Type

Used to identify and parse first occurrences of contact types from the input string. It can recognize emails, webpages, and phone numbers.

Use this component when you expect multiple emails, phone numbers and/or web pages in one input column. The component tries to parse the first email, web page, and phone number into separate columns. The rest stays in its own output column.

This component is approximately two times slower than the Guess Contact Type component.

  • Input: Contact string.

  • Output: First email, first web page, first phone, unrecognized or multiple contact types, explanation.

Free

Standardize Currency

Used to:

  • Validate and standardize a currency code from the input.

  • Extract possible comments surrounding the currency code.

  • Output the standardized currency code for a valid country name, if provided.

  • Input: A currency code (with or without an additional comment) or a country name according to ISO 4217.

  • Output: Standardized currency code, best currency code value, comment, score, explanations of data quality imperfections.

The component does several transformations and standardizations:

  • Any country name from the input is replaced by its valid currency code using the ___currency_country_names.lkp lookup file. The replacement is done only for countries which have a unique standardized currency code.

  • If the input consists only of currency codes, all the duplicates among them are removed ("CZK USD CZK" to "CZK USD").

  • If the input consists only of country names which cannot be uniquely replaced by a currency code (for example, Belarus), all the duplicates among them are removed ("BELARUS BELARUS" to "BELARUS").

  • All currency codes from the input are capitalized ("My currency is czk" to "My currency is CZK").

  • All country names which cannot be replaced uniquely by their currency code are capitalized ("My currency is Belarus" to "My currency is BELARUS").
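
The transformations listed above can be sketched roughly as follows. This is a minimal illustration, not the component's implementation: the three small sets stand in for the real lookup files, `transform` is a hypothetical name, only single-word country names are handled, and the choice to deduplicate after country names have been replaced is an assumption.

```python
# Hypothetical stand-ins for the component's lookup data (assumptions for illustration).
CURRENCY_CODES = {"CZK", "USD", "EUR", "BBD"}
UNIQUE_COUNTRY_CURRENCY = {"BARBADOS": "BBD"}  # countries with exactly one currency code
AMBIGUOUS_COUNTRIES = {"BELARUS"}              # countries without a unique currency code


def transform(text: str) -> str:
    """Apply the listed transformations to a whitespace-separated input string."""
    tokens = []
    for word in text.split():
        upper = word.upper()
        if upper in UNIQUE_COUNTRY_CURRENCY:
            tokens.append(UNIQUE_COUNTRY_CURRENCY[upper])  # country name -> currency code
        elif upper in CURRENCY_CODES or upper in AMBIGUOUS_COUNTRIES:
            tokens.append(upper)                           # capitalize codes and ambiguous countries
        else:
            tokens.append(word)                            # comment word, left untouched
    # Duplicates are removed only when the input is purely codes or purely ambiguous countries.
    only_codes = all(t in CURRENCY_CODES for t in tokens)
    only_ambiguous = all(t in AMBIGUOUS_COUNTRIES for t in tokens)
    if tokens and (only_codes or only_ambiguous):
        tokens = list(dict.fromkeys(tokens))  # keeps the first occurrence of each token
    return " ".join(tokens)


print(transform("CZK USD CZK"))
print(transform("My currency is czk"))
```

With these sample sets, "CZK USD CZK" becomes "CZK USD" and "My currency is czk" becomes "My currency is CZK", matching the examples in the list.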

The rules for the output depend on the input type:

  • Input:

    • A single valid currency code, with or without comments (not containing any ambiguous country name), for example, "My currency is CZK".

    • A single non-ambiguous country name, with or without comments (not containing any ambiguous country name), for example, "My currency is Barbados".

  • Output: out_currency and std_currency columns contain the standardized currency code, the comment is output to out_comment.

  • Input: A single invalid currency code (any three-letter word) without any comment.

  • Output:

    • If the code differs by a single letter from a standardized currency code and can be corrected uniquely, then the corrected value is output into out_currency and std_currency columns.

    • If the code cannot be corrected uniquely or at all, it is put into out_currency column; std_currency is left empty.

  • Input: Any combination of multiple currency codes, ambiguous country names, and comments.

  • Output: The transformations and standardizations described previously are performed and the transformed input is put into out_currency; std_currency and out_comment are left empty.

Free

Standardize Phone Number Format

Used to format the worldwide variety of phone numbers into the international format +<country_prefix> <phone>.

This component does not validate phone numbers. For example, an invalid Czech number such as 737 123 456 7 in the input is formatted as +420 7371234567.

  • Input: Phone number, country.

  • Output: Standardized value of the number, data quality score of the number, data quality imperfection explanations.

  • Allowed input phone number formats:

    • Number in the national format. The trunk prefix is required where one exists: for example, it is 0 in Slovakia, while the Czech Republic has none.

    • Number in a correct international format: begins with a plus sign (+) or the exit code followed by the country code. Only two exit codes are supported: 00 (Europe) and 011 (North America).

  • Allowed input country code formats:

    • ISO 3166-1 alpha-2 - Two-letter country codes.

    • ISO 3166-1 alpha-3 - Three-letter country codes.

    • ISO 3166-1 numeric - Three-digit country codes.

  • Not allowed (due to processing issues):

    • Phone numbers which contain any other characters than +(only at the beginning)[:digit:][:space:]-().

    • Phone numbers longer than 15 digits (after transformation into the international format).

    • Phone numbers shorter than 6 digits (without the country code).

    • Empty phone number or country column.
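
The formatting rules above can be illustrated with a short sketch. It is an approximation under stated assumptions, not the component's logic: the `DIALING` table is a hypothetical two-country lookup keyed by alpha-2 code only, `standardize_phone` is an invented name, and returning `None` stands in for whatever score and explanation the component produces for input that is not allowed.

```python
import re

# Hypothetical lookup: ISO 3166-1 alpha-2 code -> (country calling code, trunk prefix).
DIALING = {"CZ": ("420", ""), "SK": ("421", "0")}

# Only a leading plus sign, digits, whitespace, dashes, and round brackets are allowed.
ALLOWED = re.compile(r"^\+?[\d\s\-()]+$")


def standardize_phone(phone, country):
    """Return '+<country_prefix> <phone>', or None for input listed as not allowed."""
    if not phone or not country or country.upper() not in DIALING:
        return None
    phone = phone.strip()
    if not ALLOWED.match(phone):
        return None
    code, trunk = DIALING[country.upper()]
    digits = re.sub(r"\D", "", phone)
    if phone.startswith("+"):
        international = digits            # already international
    elif digits.startswith("00"):
        international = digits[2:]        # exit code 00
    elif digits.startswith("011"):
        international = digits[3:]        # exit code 011
    else:
        if trunk and digits.startswith(trunk):
            digits = digits[len(trunk):]  # drop the national trunk prefix
        international = code + digits
    if not international.startswith(code):
        return None                       # this sketch only handles the given country
    national = international[len(code):]
    if len(international) > 15 or len(national) < 6:
        return None
    return f"+{code} {national}"


print(standardize_phone("737 123 456 7", "CZ"))
print(standardize_phone("0903 123 456", "SK"))
```

The first call reproduces the example from the text (+420 7371234567): the number is formatted, not validated.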

Free

Standardize String Length

Used to standardize a string to a string of the specified length.

  • Input: String column.

  • Output:

    • String truncated to the maximum length (out_string).

    • String with the rest after truncation (out_rest).
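
As a minimal sketch of the two outputs, assuming the maximum length is passed as a parameter (the function and parameter names are hypothetical, and the behavior for strings shorter than the maximum is an assumption):

```python
def standardize_length(value: str, max_length: int) -> tuple[str, str]:
    """Split a string into (out_string, out_rest) at max_length characters."""
    return value[:max_length], value[max_length:]


out_string, out_rest = standardize_length("Ataccama", 5)
print(out_string, out_rest)
```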

Free

Validate Country Code

Used to check consistency between the country name (in various languages) and the country code (any of the three types of country codes: alpha-2, alpha-3 and numeric-3). If they match or only one of them is populated and valid, it returns the country name in the preferred language (English by default) and all three types of country codes.

  • Input: Country name and/or country code, optionally out_preferred_language.

  • Output: Country name in the preferred language (if supported, otherwise English), score input parameters (consistency, support), input parameter explanations.

  • Allowed input country names:

    • All country names in Dutch, English, French, Spanish, Portuguese, Russian, English (British), Italian, and German.

    • Country names in other languages - up to 30 (depending on the country).

  • Allowed input country codes:

    • ISO 3166-1 alpha-2 - Two-letter country codes.

    • ISO 3166-1 alpha-3 - Three-letter country codes.

    • ISO 3166-1 numeric - Three-digit country codes.

  • Preferred input languages for the output country name:

    • nl, en, fr, es, pt, ru, uk, it, de, cz_iso, en_iso.

    • An empty or unsupported value defaults to en_iso.

    • _iso preferred language parameter - official ISO country short name.

  • Not allowed (due to processing issues):

    • Empty country name and country code.

    • Invalid country name and/or invalid country code.

    • Country name and country code filled in and valid but inconsistent.

Free

Validate Email

Used to validate email addresses.

  • Input: Email column.

  • Output:

    • Standardized value of the email address (score less than 10 000).

    • Data quality score of the email address.

    • Data quality imperfection explanations.

    • Best existing value of the email address (standardized, parsed, or input value).

    • Standardized value of the email address owner, if recognized (a person’s name is populated if recognized).

  • Allowed input email address formats:

    • Email address with a known TLD.

    • Email address in a valid format.

    • Email address with a string before <email>.

    • Valid email address in angle brackets (<>).

    • Email address in upper case - valid with no change.

    • Email address with special characters, such as !$%&'*+-/=?^_`{|}~.

  • Not allowed (due to processing issues):

    • Empty email address (or whitespaces only).

    • Email address in an invalid format.

    • Email address with accents (ˇ´).

    • Email address with an unknown TLD.

    • Email address with a string before the email address (with email address not being inside angle brackets <>).

    • Valid email address in quotation marks.

    • Invalid email address inside apostrophes and square brackets.

    • Email address with unsupported characters, such as בײַשפּ@בײַשפּיל.טעסט.

Free

Validate International Dialing Code

Used to validate international dialing codes (IDCs).

  • Input: IDC code column or phone number column (only one input column is necessary, validation is done on the first non-null column in this order: IDC, phone number).

  • Output:

    • Valid IDC code.

    • Explanation of data discrepancies in the IDC.

    • Data quality score of IDC code.

  • Allowed input formats:

    • IDC (phone number) containing only numbers and having an indication of a foreign dialing code (+, 00, or 011 - the most common codes). Other codes can be found at International Dialling Codes.

      Allowed code values can be changed in step 2 of the component (see [Technical description]).

  • Not allowed (due to processing issues):

    • Numbers or IDCs with alphanumeric comments (numbers in comments are considered to be part of the number or IDC).

    • Multiple values in one column.

Free

Bulgaria

Component name Description Availability

Cleanse Person Name BG

Used to verify a person’s name, determine that person’s gender, split the full name into separate columns, and identify the Latin and Cyrillic analogs of the name.

  • Input: First name, last name, and middle name. If there is a full name in the input, the last name column can be used for it.

  • Output: First name, last name, and middle name with a scoring and explanation of all discrepancies in the data. The output also contains the pattern and the gender on a scale of 1-5 (1 for male, 5 for female, 3 for unknown; the values in between describe the level of certainty). Additional columns with the suffixes _lat and _cyr contain the Latin and Cyrillic analogs of the out value.

  • Rules for output:

    • Name is not parsed - out_ and std_ values are empty.

    • Name is parsed and the whole name is verified - out_ and std_ values are the same. Both are capitalized.

    • Name is parsed and part of the full name is not verified - all out_ columns and some std_ columns are populated. std_ is populated only when all words in its corresponding out column are verified.

    • Initials go to the output without dots. If there are two initials and the first name is empty, then the first initial is set to the first name and the second initial is set to the middle name.

    • If the parsed last name contains a dash, then the output values do not contain spaces around it (for example, John - William would be John-William). Other special characters are not allowed in the output, except the apostrophe.

    • std values are set if the parsed value is verified in the lookup; however, they are taken from the parsed original value, not from the lookup value.

    • out columns have output in the same alphabet as the input.

    • out lat columns have only Latin values.

    • out cyr columns have only Cyrillic values.

Free

Transliterate Cyrillic

The component is used to transform Bulgarian Cyrillic input into Latin output.

  • Input: in_value (STRING).

  • Output:

    • out_value (STRING) - Transformed value.

    • sco_value (INTEGER) - Data quality score.

    • exp_value (STRING) - Explains data transformations.

Free

Canada

Component name Description Availability

Address Identifier CA (step)

Used for cleansing, standardization, and enrichment of CA addresses.

  • Input: Address attributes (street, municipality, province, post code, additional address information).

  • Output: Best existing values of address attributes (street, municipality, province, post code), validity level, and data quality imperfection explanations and scoring.

Contact us

Cleanse Business Number CA

Used to validate a Canadian Business Number. The business number (BN) is a common client identifier for businesses to simplify their dealings with federal, provincial, and municipal governments. It is based on the idea of one business, one number. Each business requires one BN for its legal entity.

  • Input: in_business_number.

  • Supported business number input patterns:

    • DDDDDDDDDLLDDDD

    • DDDDDDDDD*LL*DDDD

    • DDDDDDDDD LL DDDD

    • DDD DDD DDD LL DDDD

    • DDD-DDD-DDD-LL-DDDD

Additional text before and after the identified business number is moved to the special output column.

  • Output:

    • std_business_number - Valid business number. This value is set only if the number matches all critical validation conditions (score < 10000).

    • out_business_number - Best business number value that we can get. Major corrections are considered as not safe and in that case only out_ value is present.

    • std_major_program_account

    • exp_business_number

    • sco_business_number

The identified business number is standardized to the following format: DDDDDDDDDLLDDDD.
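
The pattern recognition and standardization described above might look roughly like this. It is a sketch under assumptions: `parse_business_number` is a hypothetical name, the regular expression is looser than the listed patterns (it tolerates mixed space and dash separators), the pattern written with asterisks is not handled because its notation is unclear, and none of the validation conditions behind the score are implemented.

```python
import re

# Nine digits (optionally grouped 3-3-3), two letters, four digits;
# a single space or dash may separate the groups.
BN_PATTERN = re.compile(
    r"(?<!\d)(\d{3})[ -]?(\d{3})[ -]?(\d{3})[ -]?([A-Za-z]{2})[ -]?(\d{4})(?!\d)"
)


def parse_business_number(text):
    """Return (number in DDDDDDDDDLLDDDD format or None, surrounding text)."""
    match = BN_PATTERN.search(text)
    if match is None:
        return None, text.strip()
    number = "".join(match.groups()).upper()
    # Text before and after the identified number is kept separately.
    rest = text[:match.start()] + " " + text[match.end():]
    return number, " ".join(rest.split())


# The sample number is made up for illustration.
print(parse_business_number("BN: 123-456-789-RT-0001 (GST)"))
```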

Free

Cleanse Company Name CA (simple)

Used to clean Canadian company names (CN) and derive their legal form.

  • Input: in_company_name (input company name).

  • Output:

    • std_company_name - Standardized value of company name (score < 10000).

    • out_company_name - Best known value of company name.

    • out_company_name_base - Company name without legal element.

    • std_legal_form - List of standardized legal forms found in company name.

    • exp_company_name - Explanation column.

    • sco_company_name - Score column.

  • Form of legal element:

    • std_company_name column contains the input form of legal entity; the component only standardizes the case and adds periods to legal elements if needed (for example, CORPORAtioN to Corporation, CORP to Corp., Ltd to Ltd.).

    • std_legal_form column contains a single standardized form of legal entity (full text, without abbreviations: for example, CORPORAtioN to Corporation, Corp. to Corporation, Ltd to Limited).

      Therefore, for matching, we recommend using out_company_name_base + std_legal_form.

Free

Cleanse Person Name CA

Used to verify a person’s name, determine gender, and split the name into separate columns.

  • Input: First name, last name, and middle name. If there is a full name on input, the last name column can be used for it.

  • Output: First name, last name, middle name, scoring and explanation of all discrepancies in the data. On the output, there is also the pattern, stop words, synonyms for the first name and the middle name, gender (with a scale 1-5, where 1 is male, 5 female, 3 unknown, and the values in between describe the certainty level).

  • Rules for the output:

    • If the name is not parsed, then out_ name columns are populated with input values without stop words. If names are the same as the stop word full value, then the output value is empty. std_ columns are empty.

    • If the name is parsed and the whole name is verified, then out_ and std_ values are the same. Both are capitalized.

    • If the name is parsed and part of the name is not verified, then all out_ columns and some std_ columns are populated. std_ is populated only when all words in the corresponding out column are verified.

    • Initials go to the output without dots. If there are two initials and the first name is empty, then the first initial is set to the first name and the second initial is set to the middle name.

    • If the parsed last name contains a dash, then the output values do not contain spaces around it (for example, John - William would be John-William). Other special characters are not allowed in the output, except the apostrophe.

    • std values are set if the parsed value is verified in a lookup; however, std values are set from parsed original values, not from the lookup values.

Free

Cleanse Phone Number CA (complex)

Used to cleanse Canadian phone numbers. The component can also clean and validate foreign numbers (only on the level of international dialing code, that is, IDC code). The component validates standard Canadian numbers; however, it does not cover non-regional and special numbers such as toll-free, emergency, or non-standard contact numbers.

In comparison to the simple version, this component can identify foreign numbers, fictitious numbers, comments, extensions, and intervals in the number. The format of a valid output number can be modified.

  • Input: Phone number column.

  • Output:

    • Standardized value of number.

    • Data quality score of number.

    • Data quality imperfection explanations.

    • Best existing value of number (standardized, foreign, or parsed value).

    • Best existing IDC (standardized or parsed value).

    • Best existing value of area code (value verified in the lookup file or parsed).

    • Best existing value of central office code (value verified in the lookup file or parsed).

    • Best existing value of station number (parsed value).

    • Comments.

    • Pattern of the valid output phone number (more in chapter Attribute Details, the out_pattern attribute).

  • Allowed input number formats:

    • Number with Canadian IDC (1) or without IDC.

    • Number with extension (1- to 5-digit long number).

    • Number with interval in the format <number>-<interval>.

    • Foreign numbers indicated by trunk code 00 or a plus sign (+) in front of the number. IDC 1 is not considered a foreign number; numbers from Canada and other countries in the North American numbering system are considered invalid Canadian numbers; to standardize these numbers, use us_phone_number_cleanse component.

    • Number with textual comment before and after the number.

    • Number can contain only digits, letters, dashes, round brackets, plus signs, hashes, white spaces, dots, colons, commas.

  • Not allowed (due to processing issues):

    • Foreign numbers not indicated by 00 or a plus sign (+) in front of the number. These numbers are considered Canadian and might even match some lookup file values.

    • Numbers with period where the period number has more than two digits.

Free

Cleanse Phone Number CA (simple)

Used to cleanse Canadian phone numbers.

  • Input: Phone number column.

  • Output:

    • Standardized value of number.

    • Data quality score of number.

    • Data quality imperfection explanations.

    • Best existing value of number (standardized, parsed, or input value).

    • Best existing IDC (parsed value).

    • Best existing value of area code (value verified in the lookup file or parsed).

    • Best existing value of central office code (value verified in the lookup file or parsed).

    • Best existing value of station number (parsed value).

  • Allowed input number formats:

    • Number with a Canadian IDC (in the form +1, 1, or 001) or without IDC.

    • Number can contain any character possible. All except digits are removed before parsing.

    • Number with comment is parsed but the comment is lost.

    • Empty record or input value that does not contain any characters - number is considered NULL.

  • Not allowed (due to processing issues):

    • Foreign numbers (considered invalid).

    • Number with extension or interval (considered invalid).

Free

Cleanse Social Insurance Number CA

Used to validate Canadian Social Insurance Number (SIN).

  • Input: Social insurance number (SIN).

  • Output: SIN (standardized and best existing value) with a scoring and explanation of all discrepancies in the data.

  • The allowed formats of input SIN depend on optional comments before and/or after SIN:

    • If comments do not contain any digit, less strict format requirements apply:

      • Nine digits only.

      • Nine digits with the same separators (spaces or dashes) at correct positions (between the 3rd-4th and 6th-7th digit).

      • Nine digits with different separators (space, dash, no separator) at correct positions (between the 3rd-4th and 6th-7th digit).

      • Nine digits with spaces anywhere in between.

    • If comments contain digits, only SINs in the ddddddddd (nine digits) or ddd-ddd-ddd format are searched for in the input string. If there are multiple SINs, all except the first one are considered as comments.

The standardized value of SIN is the value which is valid according to the component behaviour (9-digit-only value).

The best existing value of SIN is as follows:

  • Standardized value.

  • Parsed and cleansed value - has the SIN format but is not verified in the ONE Desktop SIN Validator step.

  • Input value.
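
The less strict format rules for comment-free input can be sketched as below. The function names are hypothetical. The checksum shown is the Luhn algorithm, which Canadian SINs are generally checked with; whether the SIN Validator step applies exactly this check is an assumption, as the text does not describe it.

```python
import re

# Nine digits with an optional space or dash after the 3rd and the 6th digit.
SEPARATED = re.compile(r"^(\d{3})[ -]?(\d{3})[ -]?(\d{3})$")


def parse_sin(value):
    """Return the nine-digit SIN for a comment-free input, or None."""
    value = value.strip()
    match = SEPARATED.match(value)
    if match:
        return "".join(match.groups())
    # Spaces anywhere between the digits are also tolerated.
    if re.fullmatch(r"[\d ]+", value):
        digits = value.replace(" ", "")
        if len(digits) == 9:
            return digits
    return None


def luhn_valid(sin):
    """Luhn checksum over a nine-digit string (assumed validation rule)."""
    total = 0
    for index, char in enumerate(sin):
        digit = int(char)
        if index % 2 == 1:  # every second digit, counting from the left
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


# 046 454 286 is a commonly cited fictitious SIN that passes the Luhn check.
sin = parse_sin("046 454-286")
print(sin, luhn_valid(sin))
```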

Free

Generate Dummy Party and Contact Data

Used to randomly generate a specified number of records from the party domain using Canadian context (names, conventions, and so on). Use this component if you wish to quickly create a set of fake but realistic-looking data to test your solution with specific data volumes.

  • Input: None.

  • Output: Two data flows - party and contact with respective attributes.

Free

Mask Address CA

Used to mask Canadian addresses in a smart way so that some validity and cross-validity is preserved.

  • Input: Canadian address separated into the following components: street line, municipality, province, postal code.

  • Output: Masked Canadian address separated into the same components as the input: street line, municipality, province, postal code.

  • The expected input format for street line (depends on address type):

    • Civic address:

      • UNIT_NUMBER NUMBER STREET_NAME STREET_TYPE DIRECTION

      • NUMBER STREET_NAME STREET_TYPE DIRECTION UNIT_TYPE UNIT_NUMBER

    • Civic address served by rural route:

      • STREET_NAME STREET_TYPE DIRECTION NUMBER UNIT_TYPE UNIT_NUMBER RR_IDENTIFIER RR_NUMBER

    • PO Box:

      • POBOX_IDENTIFIER POBOX_NUMBER INSTALLATION_TYPE INSTALLATION_NAME

    • Rural routes:

      • RR_IDENTIFIER RR_NUMBER INSTALLATION_TYPE INSTALLATION_NAME

    • General delivery:

      • GD_IDENTIFIER INSTALLATION_TYPE INSTALLATION_NAME

Municipality, province, and postal code are expected to be in the correct input column, upper case letters, and valid. In case of masking arbitrary invalid data, the input format is not specified. However, correct mapping of components to input columns is required in all cases.

Contact us

Mask Company Name CA

Used to mask Canadian company names.

  • Input: Company name.

  • Output: Masked company name.

Contact us

Mask Person Name CA

Used to mask a person’s name.

  • Input: First name, last name, and middle name.

  • Output: Masked first name, masked last name, and masked middle name. The name can contain any characters if it is found in the lookup with masked names. If the name is not found, then any letter in the name is translated to another letter. Special characters and numbers in the name are not translated.

Contact us

Mask Phone Number CA

Used to mask Canadian phone numbers.

  • Input: Input phone number (international dialing code, area code, central office code, station number).

  • Output: Masked phone number (CA international dialing code or random code, masked area code, masked central office code, masked station number).

  • Valid or recognized input format:

    • {1}{area code}{central office code}{station number}

    • {+1}{area code}{central office code}{station number}

    • {001}{area code}{central office code}{station number}

    • {area code}{central office code}{station number}

    • {1} {area code} {central office code} {station number}

    • {+1} {area code} {central office code} {station number}

    • {001} {area code} {central office code} {station number}

    • {area code} {central office code} {station number}

    • {1} ({area code}) {central office code} {station number}

    • {+1} ({area code}) {central office code} {station number}

    • {001} ({area code}) {central office code} {station number}

    • ({area code}) {central office code} {station number}

Here, the IDC is 1, +1, or 001, the area code and the central office code are values verified in the lookup file and consistent with each other, and the station number is four digits long. Comments before and/or after the phone number are allowed for each pattern.

  • Invalid or unrecognized input:

    • Not parsed phone number.

    • Incomplete Canadian phone number.

    • Multiple phone numbers.

All digits are masked randomly and other characters are left as they are.
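
The last rule, random masking of digits with all other characters preserved, can be sketched in a few lines. This is an illustration only: `mask_digits` is a hypothetical name, and the component's actual random source and its handling of recognized numbers (which keeps area and central office codes consistent) are not reproduced here.

```python
import random


def mask_digits(value, rng=None):
    """Replace every ASCII digit with a random digit; keep all other characters."""
    rng = rng or random.Random()
    return "".join(
        str(rng.randrange(10)) if ch in "0123456789" else ch
        for ch in value
    )


# Passing a seeded generator makes the masking repeatable for testing.
print(mask_digits("+1 (416) 555-0123", random.Random(0)))
```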

Contact us

Mask Social Insurance Number CA

Used to mask Canadian SIN.

  • Input: SIN.

  • Output: Masked SIN.

Contact us

SERP SoA Report Builder CA

Used to generate the Statement of Accuracy (SoA) for Canada Post Software Evaluation and Recognition Program (SERP).

Canada Post uses a Software Evaluation and Recognition Program for Address Accuracy under which software developers can evaluate their address preparation software packages (a “Software Package” from here on) to determine if the Software Package meets Canada Post’s current criteria to qualify as “Recognized Software”. If a developer’s Software Package meets the criteria, Canada Post issues a “Notice of Recognition” declaring it to be a Recognized Software for the period specified in the notice (the “Recognition Period”).

The Software Evaluation and Recognition Program (Address Accuracy) is a program to enable Canada Post Corporation’s (CPC) customers to benefit from incentive postage rates based on the accuracy of the data they use to address mail which is inducted to CPC.

  • Input: Address labels - overall classification of the address produced by the Address Identifier CA step ("certification" mode must be enabled), explanation codes - specific error or information codes produced by the Address Identifier CA step.

  • Output: There is no standard component output interface. The component generates only a single text file, the Statement of Accuracy, containing the following information:

    1. Customer Name and Address

    2. Canada Post Customer Number

    3. Total Number of Records Processed

    4. Address Accuracy Level (%)

      1. Questionable Apartment Addresses (%)

      2. Questionable Rural Addresses (%)

    5. Address Accuracy Expiry Date (yyyy/mm/dd)

    6. Software Company Name and Software Version

    7. Canada Post Address Data Used (yyyy/mm/dd)

The % values listed here are required. The definition is valid according to Address Accuracy Handbook 2014 provided by Canada Post.

Free

Czech Republic

Component name Description Availability

Address Quick Search CZ

Used to propose addresses based on the input string containing incomplete parts of address.

  • Input: Address in supported pattern.

  • Output: Number of proposals, proposed addresses, explanation (if there was a problem with the input).

Contact us

Address Identifier CZ

Used to cleanse and parse input address data in order to assign an address code (which is known as an address identification). Address code is a unique ID in the official Czech address registry (RÚIAN).

The core logic of the component is delivered by a set of Address Identifier steps with a predefined set of parsing rules; however, the component also works with replacement dictionaries. In addition to standard values, it also provides the output address in the envelope-ready printable form. The component also supports evidentiary numbers ("evidenční číslo" in Czech).

  • Input: Address defined by three attributes (representing three address lines, which is a usual address format in the Czech Republic). Not all attributes have to be filled in.

Input requirements: As a standard in the Czech Republic, the first address line usually contains the street or the city district or part followed by a number, the second line usually contains the city or the city district or part, and the third line usually contains the postal code. The address structure is complex, and the component can handle multiple situations in which some of the information is missing (for example, when there is no street network, or because of an error in ETL processes). However, the input address lines must contain enough information to unambiguously find the address code in the register.

From this point of view, the component is capable of handling the following scenarios:

  • Some input values are optional depending on the presence of the others, for example:

    • When the street, street number, and city are specified, then the postal code is not required.

    • When the city district, number, and postal code are unambiguous enough, then the city is not required.

  • The presence and mapping of input values vary, for example:

    • All of the input address attributes are filled: the street and number on the first address line, city on the second, and postal code on the third.

    • Two of the input address attributes are filled while the third is empty: the street and number on the first address line, city on the second.

    • Only the first input address attribute is filled while the rest is empty: the city, street, and number on the first address line.

  • Output:

    • Set of standardized address attributes (including std_address_code). These attributes are filled for the identified addresses and remain empty for unidentified addresses.

    • Best existing value of the address lines values. For the identified addresses, it includes standardized values in the envelope-ready printable form. For unidentified addresses, it is a copy of the input address lines.

Free

Cleanse Company Name and Registration Number CZ

Used to validate and standardize company registration number and company name.

  • Input: Registration number, company name, city of residence.

  • Output:

    • Standardized registration number, company name, and legal form (they all have to be verified in RES lookup).

    • Best existing value of registration number (verified, valid, parsed, or input), company name, legal form, and active flag.

    • Matching value of company name without legal form. It can be used for matching of unidentified companies.

    • Score and explanation of data discrepancies. An input record gets a score of 0 when it has a verified or valid registration number and its company name and city correspond to the RES values.

The standardized registration number has the dddddddd format (eight digits).

Free

Cleanse Phone Number CZ

Used to cleanse and validate Czech phone numbers. The component validates all Czech numbers, including non-regional and special numbers like VOIP, shared-price, and audiotex (nine-digit numbers). It does not include emergency or non-standard contact numbers, for example, 112, 150, 155, 156, 158, 1188, 116111.

  • Input: Phone number.

  • Output: Standardized and best existing value of input phone number, phone extension number, scoring and explanation of all discrepancies in the data.

Example of standardized phone number: +420777111222.

Free

Cleanse VAT Number CZ

Used to validate and standardize Czech VAT numbers. The component does not check whether the VAT number exists, only the checksum digits.

  • Input: Any string.

  • Output: std VAT number, best existing VAT, score, and data discrepancies.

The standardized VAT number is in format CZdddddddd (d = digit). There can be eight to ten digits.

Free

Decline Name CZ

Used to decline Czech names into the 4th and 5th grammatical cases (accusative and vocative).

  • Input: First name, middle name, last name, gender.

  • Output: First name in accusative, middle name in accusative, last name in accusative, first name in vocative, middle name in vocative, last name in vocative, explanation, score.

  • Input requirements:

    • First name and middle name can contain only one word (letters only).

    • Last name can have multiple words separated by a space or dash. However, only the last word is declined.

    • Gender and at least one name part are populated.

Free

France

Component name Description Availability

Cleanse Person Name FR

Used to verify a French person’s name, determine gender, and split the name into separate columns.

  • Input: First name, last name, and middle name. If there is a full name in the input, any of the input columns can be used for it.

  • Output: First name, last name, and middle name with a scoring and explanation of all discrepancies in the data. In the output, there are also columns with additional information, like the pattern of the name, titles, and gender.

  • Rules for the output:

    • If the name is parsed and the whole name is verified, then out_ and std_ values are the same. Both are capitalized.

    • If the name is parsed and part of the name is not verified, then all out_ columns and some std_ columns are populated. std_ values are populated only when all words in the corresponding out column are verified.

    • Initials go to the output without dots. If there are two initials and the first name is empty, then the first initial is set to the first name and the second initial is set to the middle name.

    • If the parsed last name contains a dash, then the output values do not contain spaces around it (for example, Shantel - Leroy would be Shantel-Leroy). Other special characters are not allowed in the output, except for the apostrophe.

    • If the parsed last name contains more than two last names separated with dashes, then the dashes are not shown in the output values (for example, Ladois-Bourgois-Delage would be Ladois Bourgois Delage).

    • std values are set if the parsed value is verified against the lookup. Capitalized source (original-parsed) values are used.

Free

Cleanse Phone Number FR

Used to cleanse and validate French phone numbers. The component validates all French numbers, including non-regional and special numbers like VOIP, shared-price, and audiotex (nine-digit numbers). It also accepts emergency and non-standard contact numbers, for example, 15, 17, 18, 112.

  • Input: Phone number.

  • Output: Standardized and best existing value of input phone number, phone extension number, scoring and explanation of all discrepancies in the data.

Example of standardized phone number: +33677666555.

Free

Cleanse Social Security Number FR

Used to validate French Social Security Numbers (Numéro de sécurité sociale).

  • Input: Social Security Number (NSS).

  • Output: NSS (standardized and best existing value) with a scoring and explanation of all discrepancies in the data.

  • Output columns:

    • Standardized value of NSS (cleansed and valid NSS).

    • Standardized value of the owner’s place of birth.

    • Standardized value of the owner’s gender.

    • Standardized value of the owner’s year of birth.

    • Standardized value of the owner’s month of birth.

    • Standardized value of the department where the owner was born.

    • Standardized value of the commune where the owner was born.

    • Standardized value of the postal code where the owner was born.

    • Standardized value of order number (owner’s birth order, in the specified location, year, and month).

    • Standardized value of control key (last two digits of the NSS).

    • Standardized value of proposed control key (proposed in case that the control key is empty or different than the computed one).

    • Best known value of NSS (standardized, cleansed, or input value).

    • Comments found before or after the NSS.

    • Data quality imperfection scoring.

    • Data quality imperfection explanations.

The standardized NSS is derived from any input containing 13-15 characters that meets the conditions explained in the section about the component behavior.
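As an illustration of how a proposed control key could be derived, the sketch below uses the commonly documented rule for the French NSS: the key is 97 minus the remainder of the first 13 digits divided by 97, with the Corsican department codes 2A and 2B replaced by 19 and 18. The function name is hypothetical, and whether the component computes the key exactly this way is an assumption.

```python
def proposed_control_key(nss_body: str) -> str:
    """Return the two-digit control key for the first 13 characters of an NSS."""
    body = nss_body.replace(" ", "").upper()
    # Letters can only appear in the department code (2A or 2B for Corsica);
    # they are conventionally replaced by 19 and 18 before the arithmetic.
    body = body.replace("2A", "19", 1).replace("2B", "18", 1)
    if len(body) != 13 or not (body.isascii() and body.isdigit()):
        raise ValueError("expected 13 digits (2A/2B allowed for the department)")
    return f"{97 - int(body) % 97:02d}"
```

Comparing this value with the last two digits of the input shows whether the supplied control key is empty or different from the computed one.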

Free

Cleanse VAT Number FR

Used to validate and standardize French VAT numbers. The component does not check whether the VAT number exists, only the checksum digits.

  • Input: Any string.

  • Output: std VAT number, best existing VAT, score, and data discrepancies.

The standardized VAT number has the FRccddddddddd (c = character, that is, letter or digit, d = digit) format.
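A minimal sketch of such a check, assuming the widely documented rule for numeric keys: key = (12 + 3 × (SIREN mod 97)) mod 97, where SIREN is the nine-digit part. The names `FR_VAT_PATTERN` and `check_fr_vat` are hypothetical. Keys containing letters are only format-checked here, and the component's actual logic may differ.

```python
import re

# FR + two key characters (letter or digit) + nine digits, as described above.
FR_VAT_PATTERN = re.compile(r"FR([0-9A-Z]{2})([0-9]{9})")


def check_fr_vat(value: str) -> bool:
    """Check the format and, for numeric keys, the checksum of a French VAT number."""
    cleaned = re.sub(r"[\s.\-]", "", value).upper()
    match = FR_VAT_PATTERN.fullmatch(cleaned)
    if match is None:
        return False
    key, siren = match.groups()
    if not key.isdigit():
        # Keys containing letters follow a different scheme not covered here.
        return True
    return int(key) == (12 + 3 * (int(siren) % 97)) % 97
```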

Free

Germany

Component name Description Availability

Cleanse Company Name DE (simple)

Used to cleanse German company names.

  • Input: in_company_name (input company name).

  • Output:

    • std_company_name - Company name which meets all the rules (score is < 10000).

    • out_company_name - Best existing value of company name.

    • std_legal_form - Legal form derived from company name.

    • out_company_name_base - Company name without the legal element.

    • exp_company_name - Explanation column.

    • sco_company_name - Score column.

  • Form of legal element:

    • std_company_name column contains the input form of legal entity; the component only standardizes the case and adds periods to legal elements if needed (for example, CORPORAtioN to Corporation, CORP to Corp., Ltd to Ltd.).

    • std_legal_form column contains a single standardized form of legal entity (full text, without abbreviations: for example, CORPORAtioN to Corporation, Corp. to Corporation, Ltd to Limited).

      Therefore, for matching, we recommend using out_company_name_base + std_legal_form.

Free

Cleanse Person Name DE

Used to verify a person’s name, determine gender, and split name into separate columns.

  • Input: First name, last name, and middle name. If there is a full name in the input, the last name column can be used for it.

  • Output: First name, last name, middle name with a scoring and explanation of all discrepancies in the data. In the output, there is also the pattern, titles, gender (with a scale 1-5, where 1 is male, 5 female, 3 unknown, and the values in between describe the certainty level).

  • Rules for output:

    • If the name is parsed and the whole name is verified, then out_ and std_ values are the same. Both are capitalized.

    • If the name is parsed and part of the name is not verified, then all out_ columns and some std_ columns are populated. std_ is populated only when all words in the corresponding out column are verified.

    • Initials go to the output without dots. If there are two initials and the first name is empty, then the first initial is set to the first name and the second initial is set to the middle name.

    • If the parsed last name contains a dash, then the output values do not contain spaces around it (for example, John - William would be John-William). Other special characters are not allowed in the output, except for the apostrophe.

    • std values are set if the parsed value is verified in the lookup. Capitalized source (original-parsed) values are used.

Free

Validate Phone Number DE

Used to verify a phone number and split a comment into separate columns. The component can verify mobile phone numbers and geographic phone numbers.

  • Input: Phone number.

  • Output: Standardized value of phone number, data quality score of phone number, data quality imperfection explanations, best existing value of phone number (standardized, cleansed, or input value), and comment value.

  • Allowed input number formats:

    • Phone number with or without trunk code.

    • Phone number with or without IDC code.

    • Phone number with or without a preceding or following comment.

The standardized value of the phone number is the value which is valid according to the component behavior.

The best existing value of phone number:

  • Standardized value

  • Cleansed value

  • Cleansed value not verified in the lookup

  • Parsed value

  • Input value

If the phone number includes an IDC, it starts with a plus sign (+).

Free

Russia

Component name Description Availability

Address Identifier RU

Used to cleanse, standardize, and enrich Russian addresses using FIAS reference data.

  • Input: Single address line in the following recommended structure: <region and region type> <area and area type> <city and city type> <locality and locality type> <street and numbers>. Other input patterns can lead to unknown address identification.

  • Output: Best existing values of address attributes (street, locality, region, city, state, postcode), validity level, and data quality imperfection explanations and scoring.

Contact us

Slovakia

Component name Description Availability

Address Identifier SK

Used to cleanse and parse input address data in order to assign an address code (which is known as address identification).

Address code is a unique ID in the official Slovak address registry (Register adries - data.gov.sk). The core logic of the component is delivered by a set of Address Identifier steps with a predefined set of parsing rules; however, the component also works with replacement dictionaries. In addition to standard values, it also provides the output address in the envelope-ready printable form.

  • Input: Address defined by three attributes (representing three address lines, which is the usual address format in Slovakia). Not all attributes have to be filled in.

Input requirements: By Slovak convention, the first address line usually contains the street or the city district or part followed by a number, the second line usually contains the city or the city district or part, and the third line usually contains the postal code. The address structure is complex, and the component can handle a number of situations in which some of the information is missing (for example, when there is no street network or because of an error in ETL processes). However, the input address lines must contain enough information to unambiguously find the address code in the register.

From this point of view, the component is capable of handling the following scenarios:

  • Some input values are optional depending on the presence of the others, for example:

    • When the street, street number, and city are specified, then the postal code is not required.

    • When the city district, number, and postal code are unambiguous enough, then the city is not required.

  • The presence and mapping of input values vary, for example:

    • All of the input address attributes are filled: the street and number on the first address line, city on the second, and postal code on the third.

    • Two of the input address attributes are filled while the third is empty: the street and number on the first address line, city on the second.

    • Only the first input address attribute is filled while the rest are empty: the city, street, and number on the first address line.

  • Output:

    • Set of standardized address attributes (including std_address_code). These attributes are filled for the identified addresses and remain empty for unidentified addresses.

    • Best existing value of the address lines values. For the identified addresses, it includes standardized values in the envelope-ready printable form. For unidentified addresses, it is a copy of the input address lines.

Free

Address Quick Search SK

Used to propose addresses based on an input string containing incomplete parts of an address.

  • Input: Address in the supported pattern.

  • Output: Number of proposals, proposed addresses, explanation (if there was a problem with input).

Contact us

Slovenia

Component name Description Availability

Cleanse Person Name SI

Used to verify a person’s name, determine the gender, and split the full name into separate columns.

  • Input: First name, last name, and middle name. If there is a full name in the input, the last name column can be used for it.

  • Output: First name, last name, middle name with a scoring and explanation of all discrepancies in the data. The output also contains the pattern and gender with a scale 1-5 (1 for male, 5 for female, 3 for unknown, the values between describe the level of certainty).

  • Rules for output:

    • The name is not parsed - out_ and std_ values are empty.

    • The name is parsed and the whole name is verified - out_ and std_ values are the same. Both are capitalized.

    • The name is parsed and part of the full name is not verified - all out_ columns and some std_ columns are populated. std_ is populated only when all words in the corresponding out column are verified.

    • Initials go to the output without dots. If there are two initials and the first name is empty, then the first initial is set to the first name and the second initial is set to the middle name.

    • If the parsed last name contains a dash, then the output values do not contain spaces around it (for example, John - William would be John-William). Other special characters are not allowed in the output, except for the apostrophe.

    • std values are set if the parsed value is verified in the lookup, but std is set from the parsed original value and not from the lookup value.

Free

Cleanse Phone Number SI

Used to cleanse and validate Slovenian phone numbers. The component validates all Slovenian numbers, including non-regional and special numbers like VOIP, shared-price, audiotex (9-digit numbers). It also accepts emergency and non-standard contact numbers, for example, 112.

  • Input: Phone number.

  • Output: Standardized and best existing value of input phone number, phone extension number, scoring, and explanation of all discrepancies in the data.

Example of standardized phone number: +38631222111.

Free

Technical

Component name Description Availability

Accumulate Counter

Used to simulate the accumulate counter (increase sequence or get actual value of the sequence).

  • Input: Increase (Boolean).

  • Output: Accumulate counter - sequence(0, step_size), which is increased by 'true' records and remains the same as in the previous record for 'false' records.
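The described behavior can be sketched as a small generator. This is an illustration only: the function name is hypothetical, and it assumes that the counter starts at 0 and that a 'true' record raises it by `step_size` before the value is emitted.

```python
from typing import Iterable, Iterator


def accumulate_counter(increase_flags: Iterable[bool], step_size: int = 1) -> Iterator[int]:
    """Yield one counter value per input record."""
    counter = 0
    for increase in increase_flags:
        if increase:
            # 'true' record: move the sequence forward by one step.
            counter += step_size
        # 'false' record: the previous value is repeated unchanged.
        yield counter
```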

Free

Combine Words

Used to make 2-to-10-element combinations from input words.

  • Input: String column with words separated by the input delimiter parameter.

  • Output: All combinations of the input words. Combinations are separated by the input delimiter parameter; elements in one combination are separated by the output delimiter parameter.

The processing of long inputs can be slow due to a high number of possible combinations.
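A minimal sketch of the idea, using order-insensitive combinations from the Python standard library. The function and parameter names are hypothetical, and whether the component treats element order the same way is an assumption.

```python
from itertools import combinations


def combine_words(text: str, in_delim: str = ";", out_delim: str = " ",
                  min_size: int = 2, max_size: int = 10) -> str:
    """Build all 2-to-10-element combinations of the delimited input words."""
    words = [w for w in text.split(in_delim) if w]
    combos = []
    # A combination cannot be longer than the number of available words.
    for size in range(min_size, min(max_size, len(words)) + 1):
        for combo in combinations(words, size):
            # Elements inside one combination use the output delimiter.
            combos.append(out_delim.join(combo))
    # Combinations themselves are separated by the input delimiter.
    return in_delim.join(combos)
```

Because the number of combinations grows quickly with the number of words, this also shows why long inputs are slow to process.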

Free

DTAUS Reader

Used to prepare the Generic Data Reader step for reading a DTAUS file (Datenträgeraustauschverfahren).

  • Input: No input interface defined for this component.

  • Output: File content, part C, file body. Names of columns are in English but the description is in German.

Rules for output: The output contains part C only. Part A (header information) and part E (footer information) are present in the file but are not added to the output.

Free

Find Related Words

Used to find words related to the input word. The relation is defined by the input symbol, which is also used in the WordNet Project. The component works with synonyms of all possible contexts or meanings together - for example, "brother" as "sibling" and "brother" as "monk".

This is a technical component that can be used for building a special dictionary or component. Although the component can be used in some linguistics- or statistics-based applications, using it in a precise production flow should be done only with caution.

It can be used (more or less directly):

  • As a source of antonyms (in_relation_symbol “!”, max_recursion_level = 1).

  • For drilling in the word meaning hierarchy ("writing table" is an instance of "working table", which is an instance of "TABLE"; what parts the human body consists of, etc.).

  • For creating a specialized domain dictionary for text analyses. For example: "What different names do people use for things designed as 'seat'?" "Chair", "armchair", and so on.

Moreover, it contains a functionality that can be used for work with synonyms.

  • Input: Word or collocation (multiple words), relation symbol, optional part of speech signs separated by a space, recursion level (1-12).

  • Output: Related words and synsets (meanings) of input words in defined (transitive) levels. Multiple related words are separated with semicolons, words in collocation with spaces.

Free

Generate Typo

Used to generate random typos in the input string.

  • Input: Any string, number of typos (restricted to values 1 or 2; for zero or negative values, no typos are generated; for three or more, only two typos are generated).

  • Output: String with typos.
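The clamping of the typo count can be sketched as follows. The kind of typo is not specified above, so this sketch assumes a typo is the substitution of one character with a different lowercase letter; the function name and the optional `rng` parameter are hypothetical.

```python
import random
import string
from typing import Optional


def generate_typo(text: str, typos: int, rng: Optional[random.Random] = None) -> str:
    """Return text with 0, 1, or 2 randomly substituted characters."""
    rng = rng or random.Random()
    # Zero or negative: no typos. Three or more: only two typos.
    count = max(0, min(typos, 2, len(text)))
    chars = list(text)
    # Pick distinct positions so that each typo changes a different character.
    for pos in rng.sample(range(len(chars)), count):
        choices = [c for c in string.ascii_lowercase if c != chars[pos]]
        chars[pos] = rng.choice(choices)
    return "".join(chars)
```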

Free

Search Company Details

Used to search for additional information about a company in the lookup by its name.

  • Input: Company name, country code (a valid ISO 3166-1 alpha-2 code).

  • Output:

    • Information from the lookup - company name, country code, lookup source, ID in source, address, phone, email, etc. (depending on the columns present in the lookup).

    • Scoring and explanations of all discrepancies in the data.

  • Rules for output:

    • The output contains information from the lookup.

    • If the company name is not found in the lookup, the output columns are empty.

    • Some columns might contain several values (for example, phone, email, and the like).

    • Each value is enclosed by single quotes and values are separated by semicolons.
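A downstream consumer might split such a multi-value column as sketched below. The function name is hypothetical, and the sketch assumes that values do not themselves contain single quotes, since no escaping rule is described.

```python
import re


def split_multi_value(cell: str) -> list[str]:
    """Split a column such as 'a';'b' into its individual values."""
    # Each value is enclosed by single quotes; the semicolons between
    # the quoted values are skipped by the pattern.
    return re.findall(r"'([^']*)'", cell or "")
```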

Free

Smart Character Replacement

Used to replace individual characters in the input string by characters from a replacement string based on their position.

  • Input: Any string, replacement string (the length of this string must be equal to the number of characters to be replaced).

  • Output: String with replaced characters.

Examples (putting the phone number into different formats or masks):

in_string | in_replacement | out_string

phone number xxx xxx xxx | 123456789 | phone number 123 456 789

phone number (xxx) xxx-xxx | 123456789 | phone number (123) 456-789
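The position-based replacement in the examples can be sketched as follows. The sketch assumes that the characters to be replaced are marked by a placeholder character (`x` in the examples); the `placeholder` parameter and the function name are hypothetical.

```python
def smart_replace(in_string: str, in_replacement: str, placeholder: str = "x") -> str:
    """Replace each placeholder in in_string with the next replacement character."""
    slots = in_string.count(placeholder)
    if slots != len(in_replacement):
        # The replacement length must equal the number of characters to replace.
        raise ValueError(f"expected {slots} replacement characters, got {len(in_replacement)}")
    replacement = iter(in_replacement)
    # Walk the input once; every placeholder consumes one replacement character.
    return "".join(next(replacement) if ch == placeholder else ch for ch in in_string)
```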

Free

Smart Hash Function for All Characters

Used to hash individual characters (vowel to vowel, consonant to consonant, digit to digit, special character to special character).

  • Input: Any string.

  • Output: String with hashed characters.

Free

Smart Hash Function for Digits

Used to hash digits in the input string by using a seed table (digits in records with the same trashNonDigit(input) value are hashed by the same hash function).

  • Input: Any string.

  • Output: String with hashed digits (all non-digit characters remain unchanged).

Free

United Kingdom

Component name Description Availability

Address Identifier GB (step)

Used to cleanse, standardize, and enrich GB addresses and, if possible, find UDPRN (Unique Delivery Point Reference Number).

  • Input: Five address lines and postcode attribute.

  • Output: Best output address data (validated, precleansed, or input) in seven components (out_building, out_thfare, out_locality, out_post_town, out_postcode, out_dps, out_udprn), cleansing code, score address, address label, and address validity level.

Contact us

Cleanse Company Name GB (complex)

Used to standardize British company names and their legal forms using available dictionaries.

  • Input:

    • Company name

    • Company identification number (optional)

  • Output:

    • Company name (standardized, best existing value, and value without the legal form) with a scoring and explanation of all discrepancies in the data.

    • Legal form (standardized, best existing value, and value derived from the company name in the full text version).

    • Company identification number (standardized and best existing value).

  • Company name:

    • Standardized value:

      • Value verified in the dictionary.

      • Value corrected according to cleansing rules defined in the component behavior.

    • Best existing value:

      • Standardized value.

      • Cleansed or parsed value (legal form identified). Keeping the input form of legal entity. The value of the legal form is capitalized and a dot is added if needed (CORPORAtioN to Corporation, Ltd to Ltd., CORP to Corp.).

      • Input value.

    • Matching value recommendation:

      • Use additional columns out_company_name_base plus out_legal_form_full for matching.

      • out_company_name_base is company name without the legal form.

      • out_legal_form_full is legal_form derived from the company name and represented in full text.

  • Legal form:

    • Standardized value:

      • Value verified in the dictionary based on the company name - full text, without abbreviations.

    • Best existing value:

      • Standardized value.

      • Value derived from the company name - unique standardized legal form (full text, without abbreviations; Corp. to Corporation, LIMITED to Limited, LTD. to Limited).

  • Company number:

    • Standardized value:

      • Value verified in the dictionary based on company name.

    • Best existing value:

      • Standardized value.

      • Cleansed value - only numbers and letters.

      • Input value.

Free

Cleanse National Health Service Number GB

Used to validate British National Health Service Numbers (NHS numbers).

  • Input: National Health Service Number (NHS number).

  • Output: NHS number (standardized and best existing value) with a scoring and explanation of all discrepancies in the data.

The allowed input NHS number formats depend on optional comments before and/or after NHS number:

  • If comments contain digits, the input format is:

    • Ten digits only. For example, 1234567890.

    • Ten digits with the same separators (spaces or dashes) at correct positions (between the 3rd-4th and 6th-7th digits). For example, 123-456-7890 or 123 456 7890.

    • Ten digits with different separators (space, dash, or no separator) at correct positions (between the 3rd-4th and 6th-7th digits). For example, 123-4567 890.

    • Ten digits with spaces anywhere in between. For example, 12 34 56 7 8 9 0.

  • If comments do not contain digits, the input format is:

    • Ten digits only. For example, 1234567890.

    • Ten digits with the same separators (spaces or dashes) at correct positions (between the 3rd-4th and 6th-7th digits). For example, 123-456-7890 or 123 456 7890.

  • In case of multiple NHS numbers, all except the first one are considered as comments.

The standardized value of NHS number is the value which is valid according to the component behavior (10-digit-only value).

The best existing value of NHS number:

  • Standardized value.

  • Parsed and cleansed value, placeholder value - has the NHS number format but is not verified by the check digit algorithm.

  • Input value.
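The check digit algorithm mentioned above is commonly documented as a modulus 11 check: the first nine digits are weighted 10 down to 2, and the check digit is 11 minus the remainder of the weighted sum divided by 11 (11 maps to 0, and 10 means the number is invalid). A minimal sketch follows; the function name is hypothetical, and that the component applies exactly this rule is an assumption.

```python
def nhs_check_digit_ok(number: str) -> bool:
    """Verify the modulus 11 check digit of a ten-digit NHS number."""
    cleaned = number.replace(" ", "").replace("-", "")
    if len(cleaned) != 10 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    digits = [int(ch) for ch in cleaned]
    # Weights 10, 9, ..., 2 for the first nine digits.
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - total % 11
    if check == 11:
        check = 0
    # A computed value of 10 can never match a single digit.
    return check != 10 and check == digits[9]
```

A value that has ten digits but fails this check corresponds to the "parsed and cleansed" level in the list above.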

Free

Cleanse National Insurance Number GB

Used to validate GB National Insurance Number (NINO). The number is described by the United Kingdom government as a "personal account number".

  • Input: National Insurance number (NINO).

  • Output: NINO (standardized and best existing value) with a scoring and explanation of all discrepancies in the data.

Allowed input NINO formats depend on optional comments before and/or after NINO.

  • Allowed formats:

    • Two letters and six digits and one letter (for example, AB123456C).

    • Two letters and six digits and one letter with whitespace separators anywhere (for example, AB12 3 4 5 6C).

If there are multiple NINOs, all except the first one are considered as comments. For example, if in_nino is "AB123456C DE456123D", out_nino would be "AB123456C".

  • NINO format:

    • Neither of the first two letters can be D, F, I, Q, U, or V. The second letter also cannot be O.

    • After the two prefix letters, the six digits are issued sequentially from 00 00 00 to 99 99 99.

    • The suffix letter is either A, B, C, or D, although F, M, and P have been used for temporary numbers in the past.

    • Temporary insurance numbers have "TN dd mm yy x" format where 'dd' is day, 'mm' is month, 'yy' is year, and 'x' is suffix letter (F, M, or P).

The standardized value of NINO is the value which is valid according to the component behavior (a value consisting only of two letters, six digits, and one letter).

The best existing value of NINO:

  • Standardized value.

  • Parsed and cleansed value - has the NINO format but is a temporary NINO.

  • Input value.
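The NINO format rules can be expressed as a pattern, as in the sketch below. The names are hypothetical, the handling of comments and multiple NINOs is omitted, and temporary numbers (suffix F, M, or P) are deliberately not treated as standardized.

```python
import re
from typing import Optional

_FIRST = "ABCEGHJKLMNOPRSTWXYZ"    # neither prefix letter may be D, F, I, Q, U, or V
_SECOND = _FIRST.replace("O", "")  # the second letter also cannot be O
NINO_PATTERN = re.compile(rf"[{_FIRST}][{_SECOND}][0-9]{{6}}[ABCD]")


def standardize_nino(value: str) -> Optional[str]:
    """Return the compact NINO if it matches the format rules, else None."""
    # Whitespace separators are allowed anywhere in the input.
    compact = re.sub(r"\s+", "", value).upper()
    return compact if NINO_PATTERN.fullmatch(compact) else None
```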

Free

Cleanse Person Name GB

Used to verify a person’s name, determine gender, and split name into separate columns.

  • Input: First name, last name, and middle name. If there is a full name in the input, the last name column can be used for it.

  • Output: First name, last name, middle name, scoring and explanation of all discrepancies in the data. In the output, there is also the pattern, titles, gender (with a scale 1-5, where 1 is male, 5 female, 3 unknown, and the values in between describe the level of certainty).

  • Rules for output:

    • If the name is parsed and the whole name is verified, then out_ and std_ values are the same. Both are capitalized.

    • If the name is parsed and part of the name is not verified, then all out_ columns and some std_ columns are populated. std_ is populated only when all words in the corresponding out column are verified.

    • Initials go to the output without dots. If there are two initials and the first name is empty, then the first initial is set to the first name and the second initial is set to the middle name.

    • If the parsed last name contains a dash, then the output values do not contain spaces around it (for example, John - William would be John-William). Other special characters are not allowed in the output, except for the apostrophe.

    • std values are set if the parsed value is verified in the lookup. Capitalized source (original-parsed) values are used.

Free

Cleanse Phone Number GB

Used to validate a phone number. The component can verify mobile phone numbers (starting with '07') and geographic phone numbers (starting with '01' or '02'). It is also able to deal with redundant comments.

  • Input: Phone number.

  • Output: Standardized value of phone number, data quality score of phone number, data quality imperfection explanations, best existing value of phone number (standardized, cleansed, input value), and comment value.

  • Allowed input number formats:

    • Phone number with or without trunk code ('0' in GB).

    • Phone number with or without IDC code ('44' in GB).

    • Phone number with or without a preceding or following comment.

The standardized value of the phone number is the value which is valid according to the component behavior.

The best existing value of phone number:

  • Standardized value

  • Cleansed value

  • Cleansed value not verified in the lookup

  • Parsed value

  • Input value

If the phone number includes an IDC, it begins with a plus sign (+).

Free

Cleanse VAT Number GB

Used to validate and standardize British VAT numbers. The component does not check whether the VAT number exists, only the checksum digits.

  • Input: Any string.

  • Output: std VAT number, best existing VAT, score, and data discrepancies.

The standardized VAT number has one of the following formats (d = digit): GBddddddddd (standard format), GBdddddddddddd (branch traders with additional three digits), GBGDddd (government departments), GBHAddd (health authorities).
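The four formats can be told apart with simple patterns, as in this sketch. It checks the format only and does not compute the checksum; the dictionary keys and the function name are hypothetical.

```python
import re
from typing import Optional

GB_VAT_FORMATS = {
    "standard": re.compile(r"GB[0-9]{9}"),      # GBddddddddd
    "branch": re.compile(r"GB[0-9]{12}"),       # standard format + three digits
    "government": re.compile(r"GBGD[0-9]{3}"),  # government departments
    "health": re.compile(r"GBHA[0-9]{3}"),      # health authorities
}


def classify_gb_vat(value: str) -> Optional[str]:
    """Return the name of the matching format, or None if no format matches."""
    compact = re.sub(r"\s+", "", value).upper()
    for name, pattern in GB_VAT_FORMATS.items():
        if pattern.fullmatch(compact):
            return name
    return None
```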

Free

United States of America

Component name Description Availability

Address Identifier US (step)

Used to cleanse, standardize, and enrich US addresses.

  • Input: Three address lines and a ZIP attribute.

    • Address line 1: Street preferred.

    • Address line 2: City preferred.

    • Address line 3: State preferred.

  • Output: Best existing values of address attributes (street, city, state, ZIP), validity level, and data quality imperfection explanations and scoring.

Contact us

Address Quick Search US

Used to propose addresses based on an input string containing incomplete parts of an address.

  • Input: Address in the supported pattern.

  • Output: Number of proposals, proposed addresses, explanation (if there was a problem with input).

Contact us

Cleanse Company Name US (simple)

Used to cleanse US company name (CN) and search for the legal element.

  • Input: in_company_name.

  • Output: There are two output columns for company name (std_ and out_). std value is filled only if it meets all the requirements (score is < 10000).

    • std_company_name - Standardized company name (score < 10000).

    • out_company_name - Best existing value of company name.

    • out_company_name_base - Company name without the legal element.

    • std_legal_form - Legal form derived from company name. If more than one legal form is found, a list of standardized forms is returned.

    • std_vulgar_words - Vulgar words found in company name.

    • sco_company_name - Score column for company name.

    • exp_company_name - Explanation column for company name.

  • Form of legal element:

    • std_company_name column contains the input form of legal entity; the component only standardizes the case and adds periods to legal elements if needed (for example, CORPORAtioN to Corporation, CORP to Corp., Ltd to Ltd.).

    • std_legal_form column contains a single standardized form of legal entity (full text, without abbreviations: for example, CORPORAtioN to Corporation, Corp. to Corporation, Ltd to Limited).

      Therefore, for matching, we recommend using out_company_name_base + std_legal_form.

Free

Cleanse EIN and ITIN US

Used to validate and cleanse US Employer Identification Number (EIN) and Individual Taxpayer Identification Number (ITIN).

  • Input: EIN/ITIN input column (mandatory), type of number (optional).

  • Output:

    • out_ein_itin - Best value of EIN/ITIN. This number can be invalid.

    • out_ein_itin_type - Input type, if it was provided. Otherwise, the derived type, if derivation was possible.

    • std_ein_itin - Standardized output, verified EIN prefix, or ITIN validity rules (score < 10000).

    • exp_ein_itin - Explanation column.

    • sco_ein_itin - Score column.

Free

Cleanse Person Name US

Used to verify a person’s name, determine gender, and split name into separate columns.

  • Input: First name, last name, and middle name. If there is a full name in the input, the last name column can be used for it.

  • Output: First name, last name, middle name, scoring and explanation of all discrepancies in the data. In the output, there is also the pattern, stop words, synonyms for the first name and middle name, gender (with a scale 1-5, where 1 is male, 5 female, 3 unknown, and the values in between describe the level of certainty).

  • Rules for output:

    • If the name is not parsed, then out_ name columns are populated with input values without stop words. If names are the same as the full value of stop words, then the output value is empty. std_ columns are empty.

    • If the name is parsed and the whole name is verified, then out_ and std_ values are the same. Both are capitalized.

    • If the name is parsed and part of the name is not verified, then all out_ columns and some std_ columns are populated. std_ is populated only when all words in the corresponding out column are verified.

    • Initials go to the output without dots. If there are two initials and the first name is empty, then the first initial is set to the first name and the second initial is set to the middle name.

    • If the parsed last name contains a dash, then the output values do not contain spaces around it (for example, John - William would be John-William). Other special characters are not allowed in the output, except for the apostrophe.

    • std values are set if the parsed value is verified in the lookup, but std is set from the parsed original value and not from the lookup value.

Free

Cleanse Phone Number US (complex)

Used to cleanse US phone numbers. The component can also cleanse and validate foreign numbers (only by international dialing code, that is, IDC code). The component also validates standard US numbers; however, it does not include non-regional and special numbers like toll-free numbers, emergency, or non-standard contact numbers.

In comparison to the simple version, this component can identify foreign numbers, fictive numbers, comments, extensions, and intervals in a number. The format of a valid output number can be modified.

  • Input: Phone number.

  • Output:

    • Standardized value of number.

    • Data quality score of number.

    • Data quality imperfection explanations.

    • Best existing value of number (standardized, foreign, or parsed value).

    • Best existing IDC (standardized or parsed value).

    • Best existing value of area code (value verified in the dictionary or parsed).

    • Best existing value of central office code (value verified in the dictionary or parsed).

    • Best existing value of station number (parsed value).

    • Comments.

  • Allowed input number formats:

    • Number with US IDC (1) or without IDC.

    • Number with extension (1- to 5-digit long number).

    • Number with interval in the format <number>-<interval>.

    • Foreign numbers indicated by trunk code 00 or a plus sign (+) in front of the number. IDC 1 is not considered a foreign number; numbers from Canada and other countries in the North American numbering system are considered invalid US numbers; to standardize these numbers, use ca_phone_number_cleanse component.

    • Number with textual comment before and after the number.

    • Number can contain only digits, letters, dashes, round brackets, plus signs, hashes, white spaces, dots, colons, commas.

  • Not allowed (due to processing issues):

    • Foreign numbers not indicated by 00 or a plus sign (+) in front of the number. These numbers are considered US numbers and might even match some lookup file values.

    • Numbers with period where the period number has more than two digits.

Free

Cleanse Phone Number US (simple)

Used to cleanse US phone numbers.

  • Input: Phone number.

  • Output:

    • Standardized value of number.

    • Data quality score of number.

    • Data quality imperfection explanations.

    • Best existing value of number (standardized, parsed, or input value).

    • Best existing IDC (parsed value).

    • Best existing value of area code (value verified in the lookup file or parsed).

    • Best existing value of central office code (value verified in the lookup file or parsed).

    • Best existing value of station number (parsed value).

  • Allowed input number formats:

    • Number with US IDC (in the form +1, 1, or 001) or without IDC.

    • Number can contain any character. All except digits are removed before parsing.

    • Number with comment is parsed but the comment is lost.

    • Empty record or input value that does not contain any characters - number is considered NULL.

  • Not allowed (due to processing issues):

    • Foreign numbers (considered invalid).

    • Numbers with extension or interval (considered invalid).
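A minimal sketch of the simple parsing behavior described above, under stated assumptions: the names ParsedPhone and parse_simple_us_phone are hypothetical, lookup-file verification of the area code and central office code is omitted, and the sketch returns None both for NULL input and for numbers it cannot parse.

```python
import re
from typing import NamedTuple, Optional


class ParsedPhone(NamedTuple):
    area_code: str
    central_office_code: str
    station_number: str


def parse_simple_us_phone(raw: Optional[str]) -> Optional[ParsedPhone]:
    """Strip everything except digits, drop the IDC, and split the rest."""
    if raw is None:
        return None
    digits = re.sub(r"[^0-9]", "", raw)  # comments and punctuation are lost
    if not digits:
        return None  # empty input is treated as NULL
    # "+1" becomes "1" once the plus sign is stripped, so two forms remain.
    if len(digits) == 13 and digits.startswith("001"):
        digits = digits[3:]
    elif len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None  # extensions, intervals, and foreign numbers end up here
    return ParsedPhone(digits[:3], digits[3:6], digits[6:])
```

Because every non-digit character is removed first, an input such as "212.555.0100 home" parses to the same three parts as "+1 (212) 555-0100".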

Free

Cleanse Social Security Number US

Used to validate US Social Security Numbers (SSN).

  • Input:

    • Source SSN (mandatory).

    • Date when SSN was issued (optional).

  • Output:

    • Standardized value of SSN (cleansed and valid SSN).

    • Best known value of SSN (standardized, cleansed, or input value).

    • Standardized value of SSN area code (area where SSN was issued).

    • Standardized issued date.

    • Data quality imperfection explanations.

The standardized SSN is derived from any input containing seven to nine digits (valid input according to the component behavior).
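The digit-count rule can be illustrated with a short sketch. The function name ssn_candidate_digits is hypothetical. The sketch checks only whether the input contains seven to nine digits. How the component expands shorter inputs or validates the result is not described here, so the sketch does not attempt either step.

```python
import re
from typing import Optional


def ssn_candidate_digits(raw: str) -> Optional[str]:
    """Return the digits of the input if there are seven to nine of them."""
    digits = re.sub(r"[^0-9]", "", raw)
    return digits if 7 <= len(digits) <= 9 else None
```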

Free

Mask Address US

Used to mask US addresses so that some validity and cross-validity are preserved.

  • Input: US address separated into the following components: STREET LINE (address line 1), CITY (address line 2), STATE (address line 3), ZIP CODE.

  • Output: Masked US address separated into the same components as the input: STREET LINE, CITY, STATE, ZIP CODE.

The street line can contain the following components: primary address number, predirectional, street name, suffix, postdirectional, secondary address identifier, secondary address, rural road identifier and number, general delivery identifier, PO box identifier and number. Which components are used depends on the address type.

The following street line patterns are possible:

  • {STREET_NUMBER} \{STREET_PREDIRECTION!} \{STREET_NAME!} \{STREET_SUFFIX!} \{STREET_POSTDIRECTION!} \{SEC_ADR_TYPE!} {SEC_ADR_NUMBER}

  • {STREET_NUMBER} \{STREET_PREDIRECTION!} \{STREET_NAME!} \{STREET_SUFFIX!} \{SEC_ADR_TYPE!} {SEC_ADR_NUMBER}

  • {STREET_NUMBER} \{STREET_NAME!} \{STREET_SUFFIX!} \{STREET_POSTDIRECTION!} \{SEC_ADR_TYPE!} {SEC_ADR_NUMBER}

  • {STREET_NUMBER} \{STREET_PREDIRECTION!} \{STREET_NAME!} \{SEC_ADR_TYPE!} {SEC_ADR_NUMBER}

  • {STREET_NUMBER} \{STREET_NAME!} \{STREET_SUFFIX!} \{SEC_ADR_TYPE!} {SEC_ADR_NUMBER}

  • {STREET_NUMBER} \{STREET_PREDIRECTION!} \{STREET_NAME!} \{STREET_SUFFIX!}

  • {STREET_NUMBER} \{STREET_NAME!} \{STREET_SUFFIX!} \{STREET_POSTDIRECTION!}

  • {STREET_NUMBER} \{STREET_NAME!} \{SEC_ADR_TYPE!} {SEC_ADR_NUMBER}

  • {STREET_NUMBER} \{STREET_PREDIRECTION!} \{STREET_NAME!}

  • {STREET_NUMBER} \{STREET_NAME!} \{STREET_POSTDIRECTION!}

  • {STREET_NUMBER} \{STREET_NAME!} \{STREET_SUFFIX!}

  • {STREET_NUMBER} \{STREET_NAME!}

  • \{GD_IDENTIFIER!}

  • \{POBOX_IDENTIFIER!} {BOX_NUMBER}

  • \{RR_IDENTIFIER!} {RR_NUMBER} {BOX} {BOX_NUMBER}

The city, state, and ZIP code are expected to be valid and in the correct input columns. When arbitrary invalid data is masked, the input format is not specified. However, correct mapping of address lines to input columns is required in all cases.

Contact us

Mask Company Name US

Used to mask US company names.

  • Input: Company name.

  • Output: Masked company name.

Contact us

Mask Person Name US

Used to mask a person’s name using translation lookup files or transliteration.

  • Input: First name, last name, and middle name.

  • Output: Masked first name, masked last name, and masked middle name.

A name that is found in the lookup of masked names can contain any characters. If the name is not found, each letter in the name is translated to another letter. Special characters and numbers in the name are not translated.
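The lookup-then-transliterate behavior can be sketched as follows. Everything in this sketch is an assumption made for illustration: the lookup contents, the case-insensitive match, and the fixed letter rotation stand in for the component's actual lookup files and translation rule, which are not described here.

```python
import string

# Hypothetical lookup of masked names; the real lookup files are not shown.
MASKED_NAME_LOOKUP = {"JOHN": "PETER", "SMITH": "BAKER"}

# Hypothetical letter translation: rotate the alphabet by seven positions.
_SHIFT = 7
_LOWER = string.ascii_lowercase
_UPPER = string.ascii_uppercase
_LETTER_TABLE = str.maketrans(
    _LOWER + _UPPER,
    _LOWER[_SHIFT:] + _LOWER[:_SHIFT] + _UPPER[_SHIFT:] + _UPPER[:_SHIFT],
)


def mask_name(name: str) -> str:
    """Use the lookup when the name is known, otherwise translate letters."""
    masked = MASKED_NAME_LOOKUP.get(name.upper())  # case-insensitive by assumption
    if masked is not None:
        return masked
    # Characters missing from the table (digits, punctuation) pass through.
    return name.translate(_LETTER_TABLE)
```

With this table, mask_name("John") returns the lookup value, while a name outside the lookup, such as "O'Neil-2", keeps its apostrophe, hyphen, and digit and has only its letters replaced.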

Contact us

Mask Phone Number US

Used to mask US phone numbers.

  • Input: Phone number (international dialing code, area code, central office code, station number).

  • Output: Masked phone number (US international dialing code or random code, masked area code, masked central office code, masked station number).

  • Valid or recognized input format:

    • {1}{area code}{central office code}{station number}

    • {+1}{area code}{central office code}{station number}

    • {001}{area code}{central office code}{station number}

    • {area code}{central office code}{station number}

    • {1} {area code} {central office code} {station number}

    • {+1} {area code} {central office code} {station number}

    • {001} {area code} {central office code} {station number}

    • {area code} {central office code} {station number}

    • {1} ({area code}) {central office code} {station number}

    • {+1} ({area code}) {central office code} {station number}

    • {001} ({area code}) {central office code} {station number}

    • ({area code}) {central office code} {station number}

In these patterns, the IDC is 1, +1, or 001; the area code and the central office code are values verified in the lookup file and must be consistent with each other; and the station number is four digits long. Comments before or after the phone number are allowed for each pattern.
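The recognized formats can be approximated with a single regular expression. This is a sketch under stated assumptions: the names US_PHONE and recognize are hypothetical, only spaces are accepted as separators because no other separator appears in the listed patterns, and verification of the area code and central office code against the lookup file is left out.

```python
import re
from typing import Dict, Optional

US_PHONE = re.compile(
    r"""
    (?<![0-9+])                     # do not start inside a longer number
    (?P<idc>\+1|001|1)?             # optional IDC
    \s*
    (?:\((?P<area_paren>[0-9]{3})\)|(?P<area>[0-9]{3}))
    \s*
    (?P<office>[0-9]{3})
    \s*
    (?P<station>[0-9]{4})
    (?![0-9])                       # do not stop inside a longer number
    """,
    re.VERBOSE,
)


def recognize(raw: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the parts of the phone number, or None if not recognized."""
    matches = list(US_PHONE.finditer(raw))
    if len(matches) != 1:
        return None  # nothing parsed, or multiple phone numbers
    m = matches[0]
    return {
        "idc": m.group("idc"),
        "area_code": m.group("area_paren") or m.group("area"),
        "central_office_code": m.group("office"),
        "station_number": m.group("station"),
    }
```

Text around the match is ignored, which corresponds to the comments allowed before or after the number.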

  • Invalid or unrecognized input:

    • Not parsed phone number.

    • Incomplete US phone number.

    • Multiple phone numbers.

All digits are masked randomly and other characters are left as they are.
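Digit-wise random masking, as described in the sentence above, can be sketched in a few lines. The function name mask_digits is hypothetical, and the sketch uses Python's standard random module rather than whatever random source the component uses.

```python
import random


def mask_digits(raw: str, rng: random.Random) -> str:
    """Replace every ASCII digit with a random digit; keep other characters."""
    return "".join(
        str(rng.randrange(10)) if ch in "0123456789" else ch
        for ch in raw
    )
```

Passing a seeded generator, for example mask_digits("call 555-0100 x12", random.Random(0)), makes the result reproducible, which is convenient in tests. The length of the value and the position of every non-digit character stay the same.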

Contact us

Mask Social Security Number US

Used to mask US Social Security Numbers (SSN). The SSN is masked randomly, using seed tables.

The format and validity of the SSN are preserved. Characters before or after the parsed SSN are not masked.

  • Input: in_ssn.

  • Output: out_ssn (masked SSN).

Contact us
