User Community Service Desk Downloads
If you can't find the product or version you're looking for, visit support.ataccama.com/downloads

ONE Expressions

This article describes expressions used within ONE steps. Places where the expressions can be used are mentioned in the description sections of the appropriate steps.

Operands

An operand is any expression with a type corresponding to a valid type for a given operation. Operands can be divided into four categories:

  • Literals - Numeric, string, or logical constants (TRUE, FALSE, UNKNOWN (deprecated), all keywords are case-insensitive) and null literal (case-insensitive).

  • Columns - Defined by their names. If there is a space character in the column name, the name must be enclosed in square brackets ([]). If the step retrieves data from multiple inputs, the column names are specified using the dot notation format, that is, input_name.column_name. If the step uses only one input, the dot notation may be omitted.

  • Set - Can be used only in combination with the IN operation where the set represents a constant expression. A set can only occur on the right side of the IN operation.

  • Complex expressions.

Expression operands can be of a defined column type, such as INTEGER, FLOAT, LONG, STRING, DATETIME, DAY, and BOOLEAN. If a number assigned to either an INTEGER or LONG variable overflows or underflows the interval of permitted values for that type (that is, -2147483648;+2147483647 for INTEGER and -9223372036854775808;+9223372036854775807 for LONG,) then the number wraps around the interval. For example, the value 2147483649 assigned to an INTEGER variable is interpreted as -2147483647.

Operands are automatically converted to a wider type if needed. This is relevant for numeric data types INTEGER, LONG, and FLOAT (widening INTEGERLONGFLOAT) and datetime types DAY and DATETIME (DAYDATETIME). In case of comparisons, set, and conditional operations, all operands are first converted to the most general type and then the operation is performed.

Handling null values

Operations and functions handle arguments with null value according to SQL rules. One important exception is for STRING data type. A null string and an empty string are considered equal, so null string arguments are handled as empty (zero length) strings.

For example, "abc" == NULL, or even "`abc" > NULL`, is a legal comparison and gives non-null Boolean result analogous to "abc" == "", or "abc" > "" respectively, whereas in SQL language these expressions give both null (UNKNOWN) value.

Variables

The expression can be formed as a sequence of assignment expressions followed by one resulting expression. Multiple expressions are delimited by a semicolon (;). Assignment expressions have the following syntax: variable := expression.

The first occurrence of a variable on the left-hand side defines this variable and its type. A reference to a variable in an expression is valid only after its definition. Each following occurrence of a variable, including on the left-hand side of an assignment expression, must conform to the variable’s type.

Example
a := 2;
b := 4 - a;
3 * b

Operations and functions

ONE provides the following function and operation categories:

  • Arithmetic operations

  • Logical operations

  • Comparison operations

  • Set operations

  • Other operations

  • Date functions

  • String functions

  • Bitwise functions

  • MinMax functions

  • Aggregating functions

  • Conditional expressions

  • Conversion and formatting functions

  • Word set operation functions

All functions and operations that do not have the "locale" parameter set or defined are set to use the default ONE locale. The step locale setting does not influence this behavior.

Arithmetic operations

This category includes common arithmetic operations—​addition, subtraction, multiplication, and division.

The result of an arithmetic operation applied to the types INTEGER and/or LONG is always INTEGER or LONG. The result is of type LONG if at least one operand was of type LONG.

The type NUMBER in the description of input and output types stands for data types INTEGER, LONG, or FLOAT.
Operator Usage Description

-

a - b

Subtraction of numeric operands a and b.

-

-a

Negation of numeric operand a, for example -(a*c).

A unary expression operator cannot immediately follow another arithmetical operator unless in parentheses. Therefore, the expression a*-b is invalid, use either -b*a or a*(-b) instead.

/

a / b

Division of numeric operands a and b.

*

a * b

Multiplication of numeric operands a and b.

%

a % b

Modulo: the remainder after numerical division of a by b.

+

a + b

Addition of numeric operands a and b, or string concatenation.

div

a div b

Division of integer operands without a remainder.

Logical operations

Common logical operations AND, OR, XOR, and NOT (all keywords are case-insensitive).

Operator Usage Description

AND

a AND b

Logical conjunction.

NOT

NOT a

Logical negation.

OR

a OR b

Logical sum.

XOR

a XOR b

Exclusive OR.

Comparison operations

Relational operators.

Operator Usage Description

<

a < b

Tests if the value a is less than b.

<=

a <= b

Tests if the value a is less than or equal to b.

<>, !=

a <> b

a != b

Tests the negated equivalence of two values.

=, ==

a = b

a == b

Tests the equivalence of two values.

>

a > b

Tests if the value a is greater than b.

>=

a >= b

Tests if the value a is greater than or equal to b.

Set operations

For sets, a few basic operations--IN, IS IN, NOT IN, IS NOT IN-- are implemented. Set members are literals of types defined for columns or column names themselves.

Operator Usage Description

in

a in {elem[, elem]…​}

Tests whether a is member of the specified set. As opposed to the "is in" operation, if a is not member of the set and a null value is a member of the set, then the result is NULL.

is in

a is in {elem[, elem]…​}

Tests whether a is a member of the specified set. Always returns TRUE or FALSE.

is not in

a is not in {elem[, elem]…​}

Tests whether a is not a member of the specified set.

not in

a not in {elem[, elem]…​}

Tests whether a is not a member of the specified set. As opposed to the "is not in" operation, if a is not member of the set and a null value is a member of the set, then the result is NULL.

Example
company IN {"Smith inc.", "Smith Moving inc.",
            "Speedmover inc.", [candidate column], clear_column}

a IN {1, 2, 5, 10}

b IN {TRUE, FALSE}

Other operations

Function Usage Description

getParameterValue

string getParameterValue(string key [, string default])

The function returns the value of the property or the default value if there is no property with that key. All property names must have a prefix defining where the property should be read from.

The following prefixes are available:

  • system. - Get a Java system property. A typical use is getParameterValue("system.java.version").

    A system property can be set or overridden by specifying the -D option to the Java command when running your program. For example, java.exe -Dmy.prop="my value" …​.

    To get the value of the system property, you can use getParameterValue("system.my.prop").

  • env. - Get an environment variable. A typical use is getParameterValue("env.JAVA_HOME").

  • http. - Get HTTP request parameters and some other properties related to HTTP request like getParameterValue("http.user-agent"), getParameterValue("http.soapAction"), getParameterValue("http.accept"), getParameterValue("http.my-header").

  • security. - Get information about the authenticated user, assigned security roles, and parameters connected with the authenticated user.

  • security.user - Returns the name of the authenticated user.

  • security.user.<attribute_name> - Returns the value of the 'attribute_name' attribute assigned to the current authenticated user. The value is assigned, for example, by setting LdapIdentityProvider and BasicAttribibuteResolver.

  • security.role.<role_name> - A Boolean indicator of whether the authenticated user has assigned the role with name 'role_name'.

getRuntimeVersion

string getRuntimeVersion()

The function returns the product version.

is

a is b

Tests if a is equal to b. Null values are allowed as operands.

A typical use is a is null.

is not

a is not b

Tests if a is not equal to b. Null values are allowed as operands.

A typical use is a is not null.

setParameterValue

string setParameterValue(string key, string value)

The function sets the value of the property. All property names must have a prefix defining the category where the property should be stored to. Not all property categories allow the definition of new properties (for example, Java system properties - prefix system. cannot be set by ONE Desktop).

Examples are: setParameterValue("http.outputParam", "paramValue"), setParameterValue("http.out-Header", "value").

geoDistance

float geoDistance(float lat1, float long1, float lat2, float long2)

Calculates the shortest distance (in meters) between two points on Earth. A point is defined by its latitude and longitude in degrees.

The decimal part of the value represents a proportion of the degree unit, not minutes of arc. For example, 43.25 represents angle 43o15'.

namedSequence

integer sequence(string name [, integer start=0 [, integer step=1]])

Generates the next number each time it is called.

Behind the scenes, the following happens: when called, the operation checks whether it was already called with the given name in the running plan or in the online server. If it was not called previously, it creates a counter object that stores the last generated number; in the first iteration, the number is the value of the start parameter.

Otherwise, it takes the already created counter object, increments the value of the object by the step parameter, and returns the stored number.

If there are more named sequences with different start and step parameters called in the plan, then the counter object is created with the parameters of the first call. The default values are namedSequence(0,1).

random

integer random([[integer from=0], integer to=1])

Generates a random number from the interval defined by the parameters from and to.

The from parameter must be a number lower or equal to the to parameter, otherwise the null value is returned. If any of the parameters evaluates to null value, null is returned as well.

The default values are 0 for the from parameter and 1 for the to parameter.

randomUUID

string randomUUID()

A static factory to retrieve a type 4 (pseudo randomly generated) UUID. The UUID is generated using a cryptographically strong pseudorandom number generator.

sequence

integer sequence(integer [, start=0 [, integer step=1]])

It generates the next number each time it is called.

Behind the scenes, the following happens: when called, the operation checks whether it was already called with the given parameters in the running plan or in the online server. If it was not called previously, it creates a counter object that stores the last generated number; in the first iteration, the number is the value of the start parameter.

Otherwise, it takes the already created counter object, increments the value of the object by the step parameter, and returns the stored number.

The default values are sequence(0,1).

Date functions

In ONE, a date is represented by DAY and DATETIME types. The DAY type represents a date to the detail level of days, DATETIME to the detail level of milliseconds.

The time values that are compatible with each format are named and described in the following table.

Date part name Range Included in date type

YEAR

Any positive number

DATETIME, DAY

MONTH

1 - 12

DATETIME, DAY

DAY

1 - max.month

DATETIME, DAY

HOUR

0 - 23

DATETIME

MINUTE

0 - 59

DATETIME

SECOND

0 - 59

DATETIME

A day starts at 00:00:00 and ends at 23:59:59. If a given function requires identification of a date part as parameter, the identifier is written in the expression in the form of a string literal, for example, "MONTH", otherwise the expression is evaluated as incorrect.

Identifiers are case sensitive and must be written in uppercase. For example: expression='dateAdd(inDate,10,"DAY")'.

All the listed date parts are represented by positive integers. Milliseconds are not supported by the date functions.

Date type DATE-TYPE stands for date types DAY or DATETIME in the description of input and output types.
Function Usage Description

dateAdd

datetype dateAdd(datetype srcDate, integer srcValue, string fieldName)

Adds the specified srcValue of the type specified by fieldName (YEAR, MONTH, DAY, HOUR, MINUTE, or SECOND) to the srcDate.

This function allows subtraction, so the srcValue can be negative. The return value is the result of the add(subtract) operation.

If any of the operands are invalid or if an unsupported fieldName is added to the data type, then the expression reports an error.

dateDiff

integer dateDiff(datetype startDate, datetype endDate, string fieldName)

Returns the difference between endDate and startDate expressed in fieldName units.

If the result exceeds the maximum range of INTEGER, then the value NULL is returned. If any of the parameters are invalid, the expression reports an error.

A combination of date type DAY and fieldName HOUR, MINUTE, SECOND can be used (the value of these fields is considered to be 0).

datePart

integer datePart(datetype srcDate, string fieldName)

This function returns the value of the field fieldName of srcDate.

If any of the parameters are invalid, the expression reports an error. For the fields HOUR, MINUTE, and SECOND set for the date-type DAY, the function returns 0.

dateTrunc

datetype dateTrunc(datetype srcDate, string fieldName)

The function truncates less important parts of the srcDate up to the level specified by fieldName. Truncation changes values of the fields by the following rules: MONTH and DAY to 1, HOUR, MINUTE, and SECOND to 0.

The function can be used even for the DAY type with the fieldName HOUR, MINUTE, and SECOND. The function does not have an effect on the data - the result and input values are the same.

If any of the parameters are invalid, the expression reports an error.

Example: for srcDate "5.5.1980 12:35:10" and fieldName "HOUR", the function returns 5.5.1980 12:00:00.

getDate

day getDate(datetype srcExpression)

Returns the date in the format defined by the specified srcExpression (type DAY or DATETIME), with the time set to zero (HH:mm:ss.SSS).

getRequestTime

datetime getRequestTime()

Returns the time when the current request processing was started. This is the ONE application start time in batch mode and the web service request time in online mode.

now

datetime now()

Returns the current time with the type DATETIME. This function always returns the time when it is evaluated, that is, the current time.

today

day today()

Returns the current date in type DAY. This function is designed to return the same value for all records (ONE application start date), even if ONE runs past midnight.

String functions

The following are common functions used for string processing.

Function Usage Description

capitalize

string capitalize(string srcStr)

Transforms all words in the string srcStr in the following manner: the first character of each word to uppercase and all following characters to lowercase.

A word consists of alphabetic characters (letters). All other characters are considered separators.

capitalizeWithException

string capitalizeWithException(string srcStr [, boolean leaveExcsAsTheyAre=false] [, string exc]…​)

Transforms all words in the string srcStr (with the exception of the words given as parameters exc) in the following manner: the first character of each word to uppercase and all following characters to lowercase.

A word consists of alphabetic characters (letters). All other characters are considered separators.

A current downside is that words from exc are transformed to lowercase. If the parameter leaveExcsAsTheyAre is set to true, the transformation leaves words from exc the way they are in the string srcStr, that is, without any change.

containsWord

boolean containsWord(string srcStr, string srcWord [, string separator=" "])

Searches for the occurrence of the word srcWord in the string srcStr.

A word is a sequence of letters with no whitespaces. Words in the string are defined as sequences of letters separated by a space (' ' or the given separator). The leading, trailing, and multiple spaces are ignored.

This function is case sensitive.

countNonAsciiLetters

integer countNonAsciiLetters(string srcStr)

Returns the number of characters from the string srcStr that include diacritical marks.

cpConvert

string cpConvert(string srcStr, string actualCp, string correctCp)

Takes a string as an input wrongly read using the actualCp charset and transforms it into a correct correctCp charset.

For example, if you have a file that is entirely in windows-1250 charset except for one column, a, which is in the latin2 charset, it will be read using the windows-1250 charset and for the a column the expression cpConvert(a, 'windows-1250', 'latin2') can be used.

diceCoefficient

float diceCoefficient(string srcStr1, string srcStr2 [, string tokenSeparator=null] [, integer ngramLength])

Returns the diceCoefficient for strings srcStr1 and srcStr2. Returned values are between 0 and 1, where 1 indicates a perfect match (identical strings) and 0 no similarity. The function is case insensitive.

The method uses a set similarity measure called Sørensen-Dice coefficient. It uses the following formula: DQ(A, B) = (2 * |A ∩ B|) / (|A| + |B|), where A, B are sets.

Since the measure is originally used for comparing two sets, this implementation offers three ways of slicing the input string to "tokens" (set elements). They are distinguished by the method signature:

  • diceCoefficient(srcStr1, srcStr2, ngramLength) - Sets are filled with extracted n-grams of constant length n created from each string. Any whitespace ise treated as any other character.

    For example, the word OVERVIEW with ngramLength set to 4 will be transformed to a set of following n-grams: OVER, VERV, ERVI, RVIE, VIEW.

  • diceCoefficient(srcStr1, srcStr2, tokenSeparator) - Sets are filled with source strings split by the tokenSeparator character. Therefore, the tokens compared are of varying length.

  • diceCoefficient(srcStr1, srcStr2, tokenSeparator, ngramLength) - Combines the two previous approaches. At first, the input string is split by tokenSeparator, then the separated tokens are transformed to n-grams of constant length (ngramLength) and inserted to a common comparsion set. If the token is shorter than ngramLength, it is inserted to the set as is. Therefore, tokens shorter than ngramLength might appear in comparative sets.

distinct

string distinct(string srcStr, null)

Returns a string which contains the concatenated first occurrence of parts of the original string srcStr. If the string srcStr is null or empty, the function returns NULL.

doubleMetaphone

string doubleMetaphone(string srcStr [, boolean isAlternate=false])

Encodes srcStr to a double metaphone primary or secondary string based on the parameter isAlternate. It removes accents from srcStr before evaluating the double metaphone value.

If isAlternate is true, it returns the secondary string, otherwise it returns the primary string.

editDistance

integer editDistance(string srcStr1, string srcStr2 [, integer maxDistance] [, boolean caseInsensitive=true])

Returns the edit distance between strings srcStr1 and srcStr2.

The parameter maxDistance defines the maximum value to which the distance is calculated (null or a negative value is considered as 0). Strings with distance greater than this value are considered to have distance maxDistance + 1.

The parameter caseInsensitive determines whether case sensitivity should be considered or not. A null value is considered as false. By default, the function is case insensitive.

The difference between Levenshtein and Edit distance lies in the definition of distance of two switched adjacent characters. The Levenshtein one considers the switch as two changes whereas the Edit distance considers the switch to be a single change.

If both of the strings are null, then the result is 0. If only one of the strings is null, then the result is the length of the other string.

This function is also known as Damerau-Levenshtein distance.

eraseSpacesInNames

string eraseSpacesInNames(string srcStr, integer minLength, boolean onlyUpper)

Removes spaces between separate characters (words of length 1``) in the string srcStr.

The parameter minLength specifies the minimum length of the newly created word (that is, spaces are removed only if, after their removal, the result has the length of at least minLength).

The parameter onlyUpper is a Boolean value which restricts the space removal. If set to TRUE, then only spaces between capital letters are processed. If set to FALSE, then all spaces between separate characters are processed.

find

boolean find(string srcRegex, string srcStr [, boolean caseInsensitive=false])

Verifies whether the string srcStr or its parts match the regular expression srcRegex.

The parameter caseInsensitive determines whether the case sensitivity should be considered or not. By default, the function is case sensitive.

If the string srcStr is null or empty, the function returns NULL. For information about regular expressions, see Regular expressions.

hamming

integer hamming(string srcStr1, string srcStr2 [, boolean caseInsensitive=true])

Returns the Hamming distance between strings srcStr1 and srcStr2.

The parameter caseInsensitive determines whether case sensitivity should be considered or not. A null value is considered as false. By default, the function is case insensitive.

If both of the strings are null, then the result is 0. If only one of the strings is null, then the result is the length of the other string.

indexOf

integer indexOf(string srcStr, string subStr [, integer fromIndex=0])

Returns the index within the string srcStr of the first occurrence of the specified substring subStr, starting at the index fromIndex.

If the substring is not found, the value NULL is returned. If the value fromIndex exceeds the length of the string srcStr, the value NULL is returned.

If the value fromIndex < 0, the start of the search is counted relative to the end of the string. However, if the counted start overlaps the string start, then the search starts at the beginning of the string srcStr instead.

The index of the first character is 0.

isInFile

boolean isInFile(string srcStr, string fileName)

Searches for the string srcStr in a file defined by the parameter fileName. The parameter fileName must be a constant expression and must point to a dictionary with simple values.

The function returns TRUE if srcStr is found in the dictionary, and FALSE otherwise. Before the search starts, the value of srcStr is trimmed (all whitespaces from the beginning and end of the string are removed), which can potentially lead to null value of the search value.

isNumber

boolean isNumber(string srcStr)

Verifies whether the string srcStr represents a number. All characters of the string must be digits except for the first character which can be either a plus (+) or a minus sign (-). Decimal numbers are evaluated as not being numbers (that is, the characters such as a dot (.) or comma (,) are not allowed).

jaccardCoefficient

float jaccardCoefficient(string srcStr1, string srcStr2 [, string tokenSeparator=null] [, integer ngramLength])

Returns the jaccardCoefficient for strings srcStr1 and srcStr2. The returned values are between 0 and 1, where 1 indicates a perfect match (identical strings) and 0 no similarity. The function is case insensitive.

The method uses a set similarity measure called Jaccard coefficient. It uses the following formula: JQ(A, B) = (|A ∩ B|) / (|A ∪ B|), where A, B are sets.

Since the measure is originally used for comparing two sets, this implementation offers three ways of slicing the input string to "tokens" (set elements). They are distinguished by the method signature:

  • jaccardCoefficient(srcStr1, srcStr2, ngramLength) - Sets are filled with extracted n-grams of constant length n created from each string. Any whitespace is treated as any other character.

    For example, the word OVERVIEW with ngramLength set to 4 is transformed to a set of the following n-grams: OVER, VERV, ERVI, RVIE, VIEW.

  • jaccardCoefficient(srcStr1, srcStr2, tokenSeparator) - Sets are filled with source strings split by the tokenSeparator character. Therefore, the tokens compared are of varying length.

  • jaccardCoefficient(srcStr1, srcStr2, tokenSeparator, ngramLength) - Combines the two previous approaches. At first, the input string is split by tokenSeparator, then the separated tokens are transformed to n-grams of constant length (ngramLength) and inserted to a common comparison set. If the token is shorter than ngramLength, it is inserted to the set as is. Therefore, tokens shorter than ngramLength might appear in comparative sets.

jaroWinkler

float jaroWinkler(string srcStr1, string srcStr2 [, boolean caseInsensitive=true] [, float boostTreshold=0, 7] [, integer prefixLength=4])

Returns the Jaro-Winkler score for strings srcStr1 and srcStr2. The returned values are between 0 and 1, where 1 indicates a perfect match (identical strings) and 0 no similarity. It is suited for evaluating short strings (for example, personal names).

This is a combination of Jaro’s algorithm with Winkler’s boost for identical prefixes. For an extensive description of the calculation, see Jaro-Winkler distance.

To turn off the Winkler’s prefix boost (obtain pure Jaro), set the boost threshold to a value greater than 1.0. The default values for boostTreshold and prefixLength are considered to be optimal (according to original Winkler’s papers), but you can adjust these to finetune your results.

By default, the function is case insensitive. When case sensitivity is turned on, uppercase and lowercase variants of the same letter are considered as two completely different characters.

There is currently no support or optimization for diacritical marks (the function considers c and č as two completely different characters).

lastIndexOf

integer lastIndexOf(string srcStr, string subStr [, integer fromIndex=0])

Returns the index within the string srcStr of the last (rightmost) occurrence of the substring subStr, starting at the index fromIndex.

If the substring is not found, the value NULL is returned. If the value fromIndex exceeds the length of the string srcStr, the value NULL is returned.

If the value fromIndex < 0, the start of search is counted relative to the end of the string. However, if the counted start overlaps the string start, then the search starts at the beginning of the string srcStr.

The index of the first character is 0.

left

string left(string srcStr, integer count)

Returns a new string that is a substring of srcStr containing the number of characters taken from the start of srcStr.

length

integer length(string srcStr)

Returns the number of characters in the string srcStr.

levenshtein

integer levenshtein(string srcStr1, string srcStr2 [, integer maxDistance] [, boolean caseInsensitive=true])

Returns the Levenshtein distance between strings srcStr1 and srcStr2.

The parameter maxDistance defines the maximum value to which the distance is calculated (null or a negative value is considered as 0). Strings with distance greater than this value are considered to have distance maxDistance + 1.

The parameter caseInsensitive determines whether the case sensitivity should be considered or not. A null value is considered as false. By default, the function is case insensitive.

If both of the strings are null, then the result is 0. If only one of the strings is null, then the result is the length of the other string.

lower

string lower(string srcStr)

Transforms all characters of the string srcStr to lowercase.

matches

boolean matches(string srcRegex, string srcStr [, boolean caseInsensitive=false])

Verifies whether the string srcStr matches exactly the pattern of the regular expression srcRegex.

The parameter caseInsensitive determines whether case sensitivity should be considered or not. By default, the function is case sensitive.

If the string srcStr is null or empty, the function returns NULL.

For information about regular expressions, Regular expressions.

metaphone

string metaphone(string srcStr)

Encodes the string srcStr to a metaphone string. It removes accents from srcStr before evaluating the metaphone value.

ngram

string ngram(string srcStr, string inSeparator, string outSeparator, integer size)

Takes`srcStr` as a sequence of items separated by inSeparator and creates another sequence of all n-grams of these items (subsequences of the original sequence) in the original order.

N-grams in the output sequence are separated by inSeparator, items inside n-grams by outSeparator. The size parameter is the "N", the size of the n-grams.

The length of both separators as well as of the n-grams has to be at least 1.

If there are not enough words for creating an n-gram, the function returns NULL.

preserveCase

string preserveCase(string srcStr, string maskStr)

Transforms all characters in srcStr to match the case of characters in maskStr. If maskStr is shorter than srcStr, the exceeding characters of srcStr are not modified.

removeAccents

string removeAccents(string srcStr)

Returns a copy of the string srcStr in which all characters containing diacritics are replaced by the corresponding characters without them. Only Latin characters with an accent defined in the Unicode Standard version 5.0.0 are changed.

replace

string replace(string srcStr, string what, string withWhat)

Replaces occurrences of the string what with the string withWhat in the string srcStr.

The function returns NULL when srcStr or what is null or an empty string. Overlapping occurrences of the string what are replaced only once, that is, replace("conoconoco", "conoco", "XXXX") returns "XXXXnoco".

replicate

string replicate(string srcStr, integer n)

Returns n copies of the string srcStr concatenated together without any separator. If n ⇐ 0 or srcStr = "", then the result value is NULL.

right

string right(string srcStr, integer count)

Returns a new string that is a substring of srcStr containing the number of characters taken from the end of srcStr.

sortWords

string sortWords(string srcStr [, string srcLocale=null [, string srcSeparator=" " [, boolean srcDesc=false]]])

Returns a string that consists of sorted parts of the string srcStr. If the parameter srcLocale is set, then the sort is done for the given locale.

The parameter srcSeparator specifies the separator of the string parts. If missing, set to null, or empty, the input string srcStr is parsed to separate characters that are then sorted.

If the Boolean parameter srcDesc is set to TRUE, reverse sort order is used.

If the string srcStr is null or empty, the function returns NULL.

soundex

string soundex(string srcStr)

Returns the soundex value of the srcStr parameter. It removes accents and non-ASCII characters from srcStr before evaluating the soundex value.

squeezeSpaces

string squeezeSpaces(string srcStr)

Removes whitespace characters from both ends of the string srcStr and reduces multiple whitespace characters within the string. The only whitespace character is the ' ' character.

substituteAll

string substituteAll(string srcPattern, string srcReplacement, string srcStr [, boolean caseInsensitiveFlag=false])

Replaces all occurrences of srcPattern in the string srcStr with srcReplacement.

If the parameter caseInsensitiveFlag is set to TRUE, then the search for srcPattern is case insensitive. For information about regular expressions, see Regular expressions.

substituteMany

string substituteMany(string srcPattern, string srcReplacement, string srcStr, integer srcVolume [, boolean caseInsensitiveFlag=false])

Replaces all occurrences of srcPattern in the string srcStr with srcReplacement, with the maximum number of replacements defined by the parameter srcVolume. If the total number of replacements in the string srcStr exceeds the srcVolume parameter, only the first srcVolume replacements are applied.

If the parameter caseInsensitiveFlag is set to TRUE, then the search for srcPattern is case insensitive. For information about regular expressions, see Regular expressions.

substr

string substr(string srcStr, integer beginIndex [, integer strLen])

Returns a new string that is a substring of the string srcStr. The substring begins at the index beginIndex and extends to the character at index beginIndex + strLen - 1.

If beginIndex < 0, then beginIndex is set to beginIndex + length(srcStr). If beginIndex is still < 0, beginIndex is set to 0.

If strLen is not specified or is greater than length(srcStr) - beginIndex, the rest of srcStr is returned. If strLen < 0, strLen is set to 0.

An empty substring is returned as a NULL string.

The index of the first character is 0.

transliterate

string transliterate(string srcStr, string charsFrom, string charsTo)

Transforms characters of the string srcStr.

The transformation replaces all occurrences of any character named in the parameter charsFrom with the corresponding character defined in the parameter charsTo at their corresponding positions, that is, transliterate("name@company.com","@","*") becomes "name*company.com".

trashConsonants

string trashConsonants(string srcStr)

Removes all consonants (only Latin letters are concerned) and their accented equivalents from the string srcStr. Other characters (digits, punctuation) remain unchanged.

trashDiacritics

Deprecated, replaced by the function removeAccents.

trashNonDigits

string trashNonDigits(string srcStr)

Returns a string that consists only of the digits included in the original string srcStr. All other characters are discarded.

trashNonLetters

string trashNonLetters(string srcStr)

Returns a string that consists only of the letters included in the original string srcStr. All other characters are discarded.

trashVowels

string trashVowels(string srcStr)

Removes all vowels (only Latin letters are concerned) and their accented equivalents from the string srcStr. Other characters (digits, punctuation) remain unchanged.

trim

string trim(string srcStr [, string fillChars])

Removes whitespace or specified characters from both ends of the string srcStr. Whitespace characters are '\t', '\n', '\f', '\r', and ' '.

For more information, see the trim method of the class java.lang.String in the Java API.

trimLeft

string trimLeft(string srcStr [, string fillChars])

Removes whitespace or specified characters from the left side of the string srcStr. Whitespace characters are '\t', '\n', '\f', '\r', and ' '.

For more information, see the trim method of the class java.lang.String in the Java API.

trimRight

string trimRight(string srcStr [, string fillChars])

Removes whitespace or specified characters from the right side of the string srcStr. Whitespace characters are '\t', '\n', '\f', '\r', and ' '.

For more information, see the trim method of the class java.lang.String in the Java API.

upper

string upper(string srcStr)

Transforms all characters of the string srcStr to uppercase.

word

string word(string srcStr, integer srcIdx [, string srcSeparator=" "])

Returns the srcIdx-th word from the string srcStr.

Words are defined as sequences of letters separated by srcSeparator. If the srcSeparator is missing or null, then the space character is assumed.

The index of the first word is 0.

wordCombinations

string wordCombinations(string srcStr, integer size, string srcSeparator, string combSeparator)

Returns all size-sized combinations of words in the string srcStr.

Words are defined as sequences of characters separated by srcSeparator. Leading, trailing, and multiple separators are ignored.

The words in the combination are separated by srcSeparator, the combinations are separated by combSeparator. The order of the words in the combinations is the same as the order of the words in the string srcStr.

If any of the arguments is null or size is less than 1, NULL is returned. If size is greater than or equal to the number of words in srcStr, srcStr is returned.

wordCount

integer wordCount(string srcStr [, string srcSeparator=" "])

Returns the number of non-empty words in the string srcStr.

Words are defined as sequences of characters separated by srcSeparator. Leading, trailing, and multiple separators are ignored. If the srcSeparator is missing or null, then the space character is assumed.

Coding functions

Functions for hashing, encoding, or decoding strings.

Function Usage Description

coding.fromBase64

string coding.fromBase64(string source)

Decodes the source string from Base64.

coding.HMacMD5

string coding.HMacMD5(string message, string key)

Creates a hash-based message authentication code using MD5 algorithm.

coding.HMacSHA1

string coding.HMacSHA1(string message, string key)

Creates a hash-based message authentication code using SHA-1 algorithm.

coding.HMacSHA512

string coding.HMacSHA512(string message, string key)

Creates a hash-based message authentication code using SHA-512 algorithm.

coding.htmlEscape

string coding.htmlEscape(string source, boolean allEntities)

Converts XML/HTML special characters (<, >, &, "…​) into their corresponding entities (&lt;, &gt;, &amp;, &quot;,…​). The optional allEntities parameter defaults to false.

If set to true, it also converts all non-ASCII characters into known entities or into the &#xHHHH; format.

coding.htmlUnescape

string coding.htmlUnescape(string source, boolean allEntities)

Transforms entities representing XML/HTML special characters (&lt;, &gt;, &amp;, &quot;,…​) back to their original symbols (<>&"…​). The optional allEntities parameter defaults to false. When true, it converts all entities, including those in the &#xHHHH; format, back to characters.

The entity &nbsp; is converted to a standard space.

coding.md5

string coding.md5(string source)

Creates a hash of the source string using MD5 algorithm.

coding.toBase64

string coding.toBase64(string source)

Encodes the source string to Base64.

coding.urlDecode

string coding.urlDecode(string source, string charset)

Decodes a URL-encoded string, converting hexadecimal byte representations (%HH) back to the original characters. The charset parameter is optional (default is according to system settings), and is essential for char-byte conversion.

coding.urlEncode

string coding.urlEncode(string source, string charset)

Encodes a string for safe URL use by converting unsupported characters into their hexadecimal (%HH) representations. Blank spaces are converted to the + symbol. Alphanumeric characters along with *-._ are inherently supported. The charset parameter is optional (default is according to system settings), and is essential for char-byte conversion.

encode.base64

Deprecated, use coding.toBase64 instead.

encode.HMacMD5

Deprecated, use coding.HMacMD5 instead.

encode.HMacSHA1

Deprecated, use coding.HMacSHA1 instead.

encode.HMacSHA512

Deprecated, use coding.HMacSHA512 instead.

encode.md5

Deprecated, use coding.md5 instead.

To use these functions, you need to have Protegrity Application Protector for Java installed and configured. For more information, see Protegrity Integration.

Function Usage Description

coding.protect

string coding.protect(string user, string dataElementName, string input)

Applies protection to the input string using Protegrity Application Protector through tokenization, encryption, or hashing methods.

coding.unprotect

string coding.unprotect(string user, string dataElementName, string input)

Removes protection from the input string using Protegrity Application Protector, reverting the tokenization or encryption process. Hashing cannot be reverted.

Math functions

Traditional mathematical functions. They return float value and usually have one float argument.

Function Usage Description

math.acos

float math.acos(float x)

Arc cosine of x. Defined for x ∈ [-1,1].

math.asin

float math.asin(float x)

Arc sine of x. Defined for x ∈ [-1,1].

math.atan

float math.atan(float x)

Arc tangent of x.

math.cos

float math.cos(float x)

Cosine of x. The argument is assumed in radians.

math.e

float math.e()

Returns Euler’s number (e) = 2,7182818285.

math.exp

float math.exp(float x)

Exponential function ex. Equivalent to math.pow(math.e(), x).

math.log

float math.log(float x)

Natural logarithm of x. Defined for x > 0 and math.exp(math.log(x)) = x.

math.log10

float math.log10(float x)

Common logarithm of x. Defined for x > 0 and math.pow(10, math.log10(x)) = x.

math.pi

float math.pi()

Returns π (PI) = 3,1415926536.

math.pow

  • long math.pow(long x, integer y)

  • float math.pow(float x, integer y)

  • float math.pow(float x, float y)

Returns x power to y. The function is undefined and returns NULL value for negative y and x = 0 or negative x and non-integer y. The function variant for integer x returns NULL when exponent y < 0 and x is not ±1.

math.sin

float math.sin(float x)

Sine of x. The argument is assumed in radians.

math.sqr

numeric math.sqr(numeric x)

Returns square of x.

math.sqrt

float math.sqrt(float x)

Square root of x. Defined for x >= 0.

math.tan

float math.tan(float x)

Tangent of x. The argument is assumed in radians. Undefined for π/2 + k*π.

Bitwise functions

Bitwise functions are logical operations applied on separate bits of the operands.

Function Usage Description

bitand

  • integer bitand(integer a, integer b)

  • long bitand(long a, long b)

Bitwise AND

bitneg

  • integer bitneg(integer a)

  • long bitneg(long a)

Bitwise NOT, or complement

bitor

  • integer bitor(integer a, integer b)

  • long bitor(long a, long b)

Bitwise inclusive OR

bitxor

  • integer bitxor(integer a, integer b)

  • long bitxor(long a, long b)

Bitwise exclusive or

MinMax functions

Functions used for computation of minimum or maximum values.

Function Usage Description

max

any max(any a, any b)

Returns the greater of two operands. If either of the operands is null, NULL is returned.

Strings are compared lexicographically. For Boolean values, the following applies: max(TRUE, ?) = TRUE.

min

any min(any a, any b)

Returns the lesser of two operands. If either of the operands is null, NULL is returned.

Strings are compared lexicographically. For Boolean values, the following applies: min(FALSE, ?) = FALSE.

safeMax

any safeMax(any a, any b)

Returns the greater of two operands. If either of the operands is null, then the value of the other operand is returned.

Strings are compared lexicographically. For Boolean values, the following applies: safeMax(TRUE, ?) = TRUE.

safeMin

any safeMin(any a, any b)

Returns the lesser of two operands. If either of the operands is null, then the value of the other operand is returned.

Strings are compared lexicographically. For Boolean values, the following applies: safeMin(FALSE, ?) = FALSE.

Aggregating functions

Aggregating functions are special functions that can be used only in the context of ONE steps that support grouping of records. Currently, there are two such steps: Representative Creator and Group Aggregator.

Depending on the context, expressions containing aggregate functions distinguish between two types of sources: inner, used in arguments of any aggregated function, and outer, used outside of functions. These might be generally different, for example when the sum of a certain attribute of all records in a group is added to another attribute of a record that has an entirely different format and usage.

Every aggregating function has a variant for conditional evaluation. The name of the variant is derived from the original name with the appended suffix if. The conditional variant has one extra argument that is inserted before the original arguments and contains a Boolean expression. The expression specifies when the appropriate record is included into aggregation. For example, the expression avg(salary) can have the conditional variant avgif(score < 100, salary).

Nesting of aggregate functions is not allowed. For example, the expression countif(salary < avg(salary)) is invalid.
Function Usage Description

avg

  • numeric avg(numeric expression)

  • datetype avg(datetype expression)

If the input is of an integer type, the function returns the average value of non-null values in a group. The result is an integer number rounded down. For example, avg(2, null, 4) = 6/2 = 3.

If the input is of a float type, the function returns the average value of non-null values in a group. The result is a number of a float type not rounded. For example avg(1.5, 2.6, 3.8) = 7.9/3 = 2.633333333333333.

If the input is of a day type, the function returns the average value of non-null values in a group. The result is a day type rounded down. For example, avg(2019-10-05, 2018-11-11, 2006-12-12) = 2015-03-10.

If the input is of a daytime type, the function returns the average value of non-null values in a group. The result is a daytime type rounded down. For example, avg(2019-10-05 10:20:30, 2018-11-11 00:00:00, 2006-12-12 12:00:00) = 2015-03-10 23:06:50.

avgif

  • numeric avgif(boolean boolExpression, numeric expression)

  • datetype avgif(boolean boolExpression, datetype expression)

Conditional version of avg.

See also the description of avg.

concatenate

string concatenate(string expression [, string separator=" "] [, integer maxLen])

Returns a concatenated string made up of non-null values in a group, separated by the value in separator (optional).

The resulting string never exceeds maxLen characters. Elements causing overflow are not added.

concatenateif

string concatenateif(boolean boolExpression, string expression [, string separator=" "], integer maxLen)

Conditional version of concatenate.

See also the description of concatenate.

count

integer count([any expression])

Returns the number of members or non-null expression values in a group. Both 32-bit and 64-bit integer data types are supported.

countDistinct

integer countDistinct(any expression)

Returns the number of distinct non-null values in a group.

countDistinctif

integer countDistinctif(boolean boolExpression, any expression)

Conditional version of countDistinct.

See also the description of countDistinct.

countif

integer countif(boolean boolExpression [, any expression])

Conditional version of count.

countUnique

integer countUnique(any expression)

Returns the number of non-null values in a group that occur only one time.

countUniqueif

integer countUniqueif(boolean boolExpression, any expression)

Conditional version of countUnique.

See also the description of countUnique.

detectCycles

  • string detectCycles(string parents, string children)

  • string detectCycles(string parents, string children, string separator)

Returns the first link of the oriented graph that creates a cycle. The function takes adjacency lists of parents and their children and performs recursive depth-first search to find cycles.

Returns NULL if a cycle is not found.

first

any first(any expression)

Returns the first value in a group (including null values). This aggregation value depends on the order of group members, which is given by context.

firstif

any firstif(boolean boolExpression, any expression)

Conditional version of first.

See also the description of first.

frequency

integer frequency(any expression)

The function assigns a number to each record that represents the total number of records in the same group (from the aggregation context) that share the same value of the expression argument.

The frequency function can be used to select a representative record as the one with the most or least common value of some attribute.

This function can be only used in Selection Rules section of Representative Creator step for sorting records and choosing the best record, which can be used later in other places (instance and representative attributes).

last

any last(any expression)

Returns the last value in a group (including null values). This aggregation value depends on the order of group members, which is given by context.

lastif

any lastif(boolean boolExpression, any expression)

Conditional version of last.

See also the description of last.

maximum

any maximum(any expression [, any result])

Returns the maximum of non-null values in a group. When the second argument is used, the function returns its (first non-null) value for record having maximum.

maximumif

any maximumif(boolean boolExpression, any expression [, any result])

Conditional version of maximum.

See also the description of maximum.

minimum

any minimum(any expression [, any result])

Returns the minimum of non-null values in a group. When the second argument is used, the function returns its (first non-null) value for record having minimum.

minimumif

any minimumif(boolean boolExpression, any expression [, any result])

Conditional version of minimum.

See also the description of minimum.

modus

any modus(any expression [, any result])

Returns the most frequent non-null value of expression in a group or, if specified, the first non-null value of result members having the most frequent non-null value expression. In case of more than one value with the same frequency, one of the matching values is chosen arbitrarily.

modusif

any modusif(boolean boolExpression, any expression [, any result])

Conditional version of modus.

Returns the first non-null value of result members having the most frequent non-null value expression. In case of more than one value with the same frequency, one of the matching values is chosen arbitrarily.

sum

  • numeric sum(numeric expression)

  • boolean sum(boolean expression)

Returns the sum of non-null values in a group.

For Boolean arguments, this function performs the logical sum (OR), for example, sum(true, true, false) = true. Both 32-bit and 64-bit integer data types are supported.

sumif

  • numeric sumif(boolean boolExpression, numeric expression)

  • boolean sumif(boolean boolExpression, boolean expression)

Conditional version of sum.

See also the description of sum.

Conditional expressions

Conditional expressions are a special type of expressions where the resulting value depends on the evaluation of certain conditions. These functions do not have strictly defined argument types, instead they are flexible and their arguments are defined by the specific functionality of each expression.

Function Usage Description

case

any case({boolean expr, any exprValue}…​ [, any defaultExpr=null])

Returns the value of expression exprValue immediately following the first expression expr whose value is TRUE. If none of the expressions expr is evaluated as TRUE, then defaultExpr is returned if defaultExpr is specified. Otherwise, NULL is returned.

The types of all values of exprValue must be the same.

decode

any decode(any decodeExpr {, any expr, any exprValue}…​ [, any defaultExpr=null])

Returns the value of expression exprValue immediately following the first expression expr whose value is equal to decodeExpr. If none of the expressions expr is evaluated as TRUE, then defaultExpr is returned if defaultExpr is specified. Otherwise, NULL is returned.

The types of all values of exprValue must be the same. Additionally, all types of the value of exprValue must correspond to the type of expression expr.

iif

any iif(boolean ifExpr, any trueExpr, any elseExpr)

Returns trueExpr if ifExpr is TRUE. If ifExpr is FALSE or UNKNOWN, elseExpr is returned.

nvl

any nvl({any expr}…​)

Returns the value of the first expression expr whose value is not-null. If no such value exists, then NULL is returned.

Example
case (
	id is null, "_" + input + "_",
	id = 1, substr(input, length(input) / 2),
	"default value"
)

decode (
	id,
	0,
	'zero',
	1,
	'one',
	2,
	'two',
	3,
	'three'
)

iif (
	value == 2,
	'ok',
	'bad'
)

nvl (
	value1,
	value2,
	value3
)

Conversion and formatting functions

Conversion functions are used for conversions and formatting the input expression.

Function Usage Description

getMilliseconds

long getMilliseconds(datetype srcExpression)

Returns internal representation of datetime value, that is, the number of milliseconds since the epoch which starts at January 1, 1970, 00:00:00 GMT. The returned value can be converted back into DAY/DATETIME using toDate or toDateTime function.

math.abs

numeric [math.]abs(numeric expr)

Returns the absolute value of the expression expr.

math.ceil, math.ceiling

  • integer [math.]ceil(float expr)

  • integer [math.]ceiling(float expr)

Converts the expression expr to the nearest higher integer value.

math.floor

integer [math.]floor(float expr)

Converts the expression expr to the nearest lower integer value.

math.longCeil, math.longCeiling

  • long [math.]longCeil(float expr) long [math.]longCeiling(float expr)

Converts the expression expr to the nearest higher long value.

math.longFloor

long [math.]longFloor(float expr)

Converts the expression expr to the nearest lower long value.

math.round

float [math.]round(float expr [, integer decimalPlaces=0])

Rounds the expression expr to a given number of decimal places, specified by decimalPlaces.

toDate

  • day toDate(datetime expr)

  • day toDate(long expr)

  • day toDate(string expr, string dateFormat [, string dateLocale])

Returns the date specified in the expression expr converted to date type DAY. If the conversion is not successful, then NULL is returned.

If a single argument is used, expr can be DAY or LONG expression. If expr is a STRING value, its format is defined by the dateFormat parameter (of type STRING) and the localization defined in the dateLocale parameter (of type STRING). The dateFormat and dateLocale strings depend on the classes SimpleDateFormat and Locale.

For example, if the date in string format is '30/05/2023', then the conversion is toDateTime('30/05/2023','dd/MM/yyyy'). You can also convert a date to string using the toString expression toString(date_col, 'yyyy-MM-dd') to get 2023-05-30 date.

toDateTime

  • datetime toDateTime(day expr)

  • datetime toDateTime(long expr)

  • datetime toDateTime(string expr, string dateFormat [, string dateLocale])

Returns the date specified in the expression expr converted to date type DATETIME. If the conversion is not successful, then NULL is returned.

If a single argument is used, expr can be DAY or LONG expression. If expr is a STRING value, its format is defined by the dateFormat parameter (of type STRING) and the localization defined in the dateLocale parameter (of type STRING). The dateFormat and dateLocale strings depend on the classes SimpleDateFormat and Locale.

For example, if the date in string format is '30/05/2023 15:00:00', then the conversion is toDateTime('30/05/2023 15:00:00','dd/MM/yyyy HH:mm:ss'). You can also convert a date to string using the toString expression toString(date_col, 'yyyy-MM-dd HH:mm:ss') to get 2023-05-30 15:00:00 date.

toFloat

  • float toFloat(string expr)

  • float toFloat(integer expr)

  • float toFloat(long expr)

Converts the expression expr to a FLOAT value. If the conversion is not successful, then NULL is returned.

toInteger

integer toInteger(string expr)

Converts the expression expr to an INTEGER value. If the conversion is not successful, then NULL is returned.

toLong

  • long toLong(string expr)

  • long toLong(integer expr)

Converts the expression expr to a LONG value. If the conversion is not successful, then NULL is returned.

toString

  • string toString(numeric expr [, string strFormat [, string strLocale]])

  • string toString(datetype expr, string strFormat [, string strLocale])

  • string toString(boolean expr)

  • string toString(string expr)

Converts the expression expr to a STRING value. If the conversion is not successful, then NULL is returned.

The parameter strFormat is required for expressions of type DATETIME (see the toDateTime function) or DAY. When only the expr parameter is set, then the default Java convert method (toString) is used for the conversion.

If the parameter strFormat is set, then it is used as the output format. If strLocale is not set, the default locale for the JVM instance is used.

If the strFormat parameter (and optionally, strLocale) is set, then only expressions of date or numeric types (DATETIME, DAY, INTEGER, LONG or FLOAT) can be converted. Conversions for other types with parameters strFormat and, optionally, strLocale specified are not defined. The strFormat and strLocale strings depend on the classes SimpleDateFormat and Locale.

XML functions

Functions for working with XML formatted data.

Function Usage Description

xpath

string xpath(string xml, string xpath)

Retrieves the XML element from the source XML string on the specified XPath.

Word set operation functions

Word set operation functions operate on two strings, interpreting them as sets of words separated by the given separator (or a space, by default) and return the integer cardinality of the resulting set or, in some cases, the resulting set itself.

If the parameter multiset is set to TRUE, the sets are treated as "multisets". For example, two identical words in one set form two members of the set rather than one. When multiset is FALSE (default), identical words are first deduplicated.

Three of the functions (intersection, difference, symmetricDifference) have variants with the prefix lcs (for example, lcsIntersection). These variants are based on the concept that the intersection of two (word) lists is defined as the Longest Common Subsequention. In this case, the order or duplicity of elements in the lists is significant.

For example, two lists a = 'C A D B A' and b = 'A E B C D', whose "ordinary" intersection is set 'A B C D', have the "lcs" intersection 'A B' (or 'C D', 'A D'), and so lcsIntersection(a, b) = 2, lcsDifference(a, b) = 2 (or 3 in the multiset case), lcsSymmetricDifference(a, b) = 5 (or 6).

The two types of difference functions can be executed with an optional integer parameter, singularity, which distinguishes sets that have common members from sets without common members. When this parameter is used, the function returns a value (typically a very large number) when the sets have an empty intersection.

For example: * difference('A B', 'C D') = 2: The typical result. The difference between completely different sets might have the same value as the difference between, for example, very similar sets 'A B C D E' and 'A B C F G'. * difference('A B', 'C D', 1000) = 1000: Using the singularity parameter produces a different result, which shows that the difference between completely different sets is high.

Most of the functions have a variant with the suffix Exp (for example, intersectionExp). These variants have one appended optional argument, comparator. This argument is a functional parameter and it defines how the elements in sets are compared to determine if they are the same. The functional parameter must take two string values and return a Boolean result. The result must not depend on the argument order.

The functional argument in general is similar to the function definition and has the following syntax:

(param1, param2, ..) { expression }

The columns and variables can be accessed in the function body.

When using the functional comparator, the calculation of intersection of two sets is quite different from an "ordinary" set operation, because we cannot assume the comparator conforms to the equality relation. In both set operands A respectively B, we determine subsets A' respectively B' of members that have at least one matching member in the opposite set. Since these subsets might have different sizes, the size of the intersection is defined as the first one.

Examples of using these functional parameters:

  • symmetricDifference(std_name, std_hh_names, ' ', multi): Ordinary comparison.

  • symmetricDifferenceExp(std_name, std_hh_names, ' ', multi, (x,y) { x == y }): Ordinary comparison, equivalent to the non-Exp behavior.

  • symmetricDifferenceExp(std_name, std_hh_names, ' ', multi, (x,y) { editDistance(x,y) < 2 }): Approximative matching, elements different at most in one error are assumed to be equal.

When std_name = "AAA BBB AAA ABC" and std_hh_names = "AAB AAA CCC BBC" and multi = TRUE, then the first case gives 6 (one pair of AAA), the second one is 6 (all three AAA paired), and the third one is 1 (only CCC remains unpaired).

Function Usage Description

set.approxSymmetricDifference

integer [set.]approxSymmetricDifference(string set1, string set2 [, string separator=" "] [, boolean multiset=false] [, integer tolerance] [, string diff_function] [, integer singularity])

Returns the number of words which cannot be matched with any word in the other set. The matching is based on one of difference functions, 'levenshtein' (default), 'editDistance', or 'hamming', and the tolerance as the maximum number of changes in a word. If the tolerance is specified as a non-integer number less than 1 and greater than 0, the means is relative - the ratio of the number of changes divided by the length of longer word.

This function is equivalent to thegeneric function:

symmetricDifferenceExp(set1, set2, separator, multiset, singularity, (x,y) { diff_function(x,y) <= tolerance })

Or:

or symmetricDifferenceExp(set1, set2, separator, multiset, singularity, (x,y) { diff_function(x,y) / max(length(x), length(y)) <= tolerance })

set.contains

boolean set.contains(string str, string item [, string separator=" "])

Checks whether the item is contained in the string set.

set.containsExp

boolean set.containsExp(string str, string item [, string separator=" "], function comparator)

Checks whether the item is contained in the string set. The elements are compared according to the specified comparator.

set.difference

integer [set.]difference(string set1, string set2 [, string separator=" "] [, boolean multiset=false] [, integer singularity])

Returns the cardinality of the difference of sets (set1 \ set2).

set.differenceExp

integer set.differenceExp(string set1, string set2 [, string separator=" "] [, boolean multiset=false] [, integer singularity], function comparator)

Returns the cardinality of the difference of sets (set1 \ set2).

set.differenceResult

string set.differenceResult(string set1, string set2 [, string separator=" "] [, boolean multiset=false])

Returns the difference of sets (set1 \ set2).

set.differenceResultExp

string set.differenceResultExp(string set1, string set2 [, string separator=" "] [, boolean multiset=false], function comparator)

Returns the difference of sets (set1 \ set2).

set.distinct

string set.distinct(string str [, string separator=" "])

Returns the set of distinct elements from the source string.

set.distinctExp

string set.distinctExp(string str [, string separator=" "], function comparator)

Returns the set of distinct elements from the source string. The elements are compared according to the specified comparator.

set.filterExp

string set.filterExp(set, string separator=" ", function callback)

Returns the string set containing elements for which the Boolean functional parameter is true. Equivalent to set.mapExp(set, separator, (x) { iif(function(x), x, "") }).

For example, filterExp('John W. Smith', ' ', (x) { length(x) > 2 }) returns "John Smith".

set.indexOf

integer set.indexOf(string str, string item [, string separator=" "])

Returns the position of item in the string set. If the item is not present, it returns NULL.

set.indexOfExp

integer set.indexOfExp(string str, string item [, string separator=" "], function comparator)

Returns the position of item in the string set. If the item is not present, it returns NULL. The elements are compared according to the specified comparator.

set.intersection

integer [set.]intersection(string set1, string set2 [, string separator=" "] [, boolean multiset=false])

Returns the cardinality of the intersection of sets.

set.intersectionExp

integer set.intersectionExp(string set1, string set2 [, string separator=" "] [, boolean multiset=false], function comparator)

Returns the cardinality of the intersection of sets.

set.intersectionResult

string set.intersectionResult(string set1, string set2 [, string separator=" "] [, boolean multiset=false])

Returns the intersection of sets.

set.intersectionResultExp

string set.intersectionResultExp(string set1, string set2 [, string separator=" "] [, boolean multiset=false], function comparator)

Returns the intersection of sets.

set.item

string set.item(string str, integer index [, string separator=" "])

Returns the element of the string set at the specified position, including empty elements.

set.lastIndexOf

integer set.lastIndexOf(string str, string item [, string separator=" "])

Returns the position of the last item in the string set. If the item is not present, it returns NULL.

set.lastIndexOfExp

integer set.lastIndexOfExp(string str, string item [, string separator=" "], function comparator)

Returns the position of the last item in the string set. If the item is not present, it returns NULL. The elements are compared according to the specified comparator.

set.lcsDifference

integer set.lcsDifference(string set1, string set2 [, string separator=" "] [, boolean multiset=false] [, integer singularity])

Returns the cardinality of the difference of sets (set1 \ set2) (lcs-based intersection).

set.lcsDifferenceExp

integer set.lcsDifferenceExp(string set1, string set2 [, string separator=" "] [, boolean multiset=false] [, integer singularity], function comparator)

Returns the cardinality of the difference of sets (set1 \ set2) (lcs-based intersection).

set.lcsDifferenceResult

string set.lcsDifferenceResult(string set1, string set2 [, string separator=" "] [, boolean multiset=false])

Returns the difference of sets (set1 \ set2) (lcs-based intersection).

set.lcsDifferenceResultExp

string set.lcsDifferenceResultExp(string set1, string set2 [, string separator=" "] [, boolean multiset=false], function comparator)

Returns the difference of sets (set1 \ set2) (lcs-based intersection).

set.lcsIntersection

integer set.lcsIntersection(string set1, string set2 [, string separator=" "] [, boolean multiset=false])

Returns the cardinality of the intersection of sets (lcs-based intersection).

set.lcsIntersectionExp

integer set.lcsIntersectionExp(string set1, string set2 [, string separator=" "] [, boolean multiset=false], function comparator)

Returns the cardinality of the intersection of sets (lcs-based intersection).

set.lcsIntersectionResult

string set.lcsIntersectionResult(string set1, string set2 [, string separator=" "] [, boolean multiset=false])

Returns the intersection of sets (lcs-based intersection).

set.lcsIntersectionResultExp

string set.lcsIntersectionResultExp(string set1, string set2 [, string separator=" "] [, boolean multiset=false], function comparator)

Returns the intersection of sets (lcs-based intersection).

set.lcsSymmetricDifference

integer set.lcsSymmetricDifference(string set1, string set2 [, string separator=" "] [, boolean multiset=false] [, integer singularity])

Returns the cardinality of the symmetric difference of sets (lcs-based intersection).

set.lcsSymmetricDifferenceExp

integer set.lcsSymmetricDifferenceExp(string set1, string set2 [, string separator=" "] [, boolean multiset=false] [, integer singularity], function comparator)

Returns the cardinality of the symmetric difference of sets (lcs-based intersection).

set.lcsSymmetricDifferenceResult

string set.lcsSymmetricDifferenceResult(string set1, string set2 [, string separator=" "] [, boolean multiset=false])

Returns the symmetric difference of sets (lcs-based intersection).

set.lcsSymmetricDifferenceResultExp

string set.lcsSymmetricDifferenceResultExp(string set1, string set2 [, string separator=" "] [, boolean multiset=false], function comparator)

Returns the symmetric difference of sets (lcs-based intersection).

set.mapExp

string set.mapExp(set [, string separator=" "], function callback)

Returns a string set whose each element is transformed by the functional parameter. For example, mapExp('John Walker Smith', ' ', (x) { left(x,3) }) returns "Joh Wal Smi".

set.size

integer set.size(string str [, string separator=" "])

Returns the number of elements in the string, including empty elements.

set.sort

string set.sort(string str [, string separator=" "] [, string locale=null] [, boolean desc=false])

Returns a sorted set of elements.

set.subSequence

string set.subSequence(string str, integer from [, integer count=null] [, string separator=" "])

Returns elements of the string set starting at the specified position, including empty elements.

The from index can be negative, in which case the starting index is assumed from the last (set.subSequence("A B C D", -3, 2) == "B C"). When the`count` is not specified or null, all remaining elements are selected (set.subSequence("A B C D", -3) == "B C D").

set.sumExp

numeric set.sumExp(set [, string separator=" "], function callback)

Returns the sum of values calculated by the functional parameter for each element. The functional parameter expression must return an integer, long, or float result.

For example, sumExp('John Walker Smith', ' ', (x) { length(x) }) returns 4 + 6 + 5 = 15. The result for an empty set (null or an empty string in the first argument) is 0.

set.symmetricDifference

integer [set.]symmetricDifference(string set1, string set2 [, string separator=" "] [, boolean multiset=false] [, integer singularity])

Returns the cardinality of the symmetric difference of sets.

set.symmetricDifferenceExp

integer set.symmetricDifferenceExp(string set1, string set2 [, separator=" "] [, boolean multiset=false] [, integer singularity], function comparator)

Returns the cardinality of the symmetric difference of sets.

set.symmetricDifferenceResult

string set.symmetricDifferenceResult(string set1, string set2 [, string separator=" "] [, boolean multiset=false])

Returns the symmetric difference of sets.

set.symmetricDifferenceResultExp

string set.symmetricDifferenceResultExp(string set1, string set2 [, string separator=" "] [, boolean multiset=false], function comparator)

Returns the symmetric difference of sets.

Example:

  • Data sources: src_string1 = "alfa beta", src_string2 = "alfa alfa beta".

  • Expression: set.symmetricDifferenceResultExp(src_string1, src_string2, ' ', true, (x,y) { x=y }).

  • Result: alfa.

set.union

integer [set.]union(string set1, string set2 [, string separator=" "] [, boolean multiset=false])

Returns the cardinality of the union of sets.

set.unionExp

integer set.unionExp(string set1, string set2 [, string separator=" "] [, boolean multiset=false], function comparator)

Returns the cardinality of the union of sets.

set.unionResult

string set.unionResult(string set1, string set2 [, string separator=" "] [, boolean multiset=false])

Returns the union of sets.

set.unionResultExp

string set.unionResultExp(string set1, string set2 [, string separator=" "] [, boolean multiset=false], function comparator)

Returns the union of sets.

Regular expressions

The syntax for regular expressions in ONE follows the rules for regular expressions used in Java, described in Class Pattern documentation.

The following sections contain information about regular expression usage extensions in ONE.

' @" ' - syntax (single escaping)

When writing regular expressions, take into consideration that a regular expression is manipulated as a Java string. In literal Java strings, the backslash (\) is an escape character. The literal string \\ is a single backslash.

In regular expressions, the backslash is also an escape character. The regular expression \\ matches a single backslash. As a Java string, this regular expression becomes \\\\.

To avoid having to use double escaping, prefix the string in quotes with an at sign (@). In that case, the string between the at sign and double quotation marks (@"some_string") is taken as a literal and no characters are considered escape characters in the context of the Java string.

For example, to substitute all occurrences of the caret character (^) and the closing bracket (]) with x in the string ab[^] (the expected result is the string ab[xx), you could use the following expression:

substituteAll("[\\^\\]]","x","ab[^]")

Or, using the ' @" ' syntax:

substituteAll(@"[\^\]]","x","ab[^]")

Capturing groups

Matching regular expressions in the input is done by analyzing the input expression string (the string resulting from applying the expression to the input). Sections of the input string (called capturing groups, enclosed in parentheses) are identified and marked for further use in creating the output. These capturing groups can be referenced using backreference.

In the case of a match, the matched data from the input is sent to the predefined output columns. Each output column has a substitution property that is the value which will be sent to the output. It can contain the previously mentioned backreferences with the following syntax:

  • $i, where i is a number between 0 and 9`: Backreference to a capturing group with group number lower than 10.

  • ${i}, where i is a natural number other than 0: Backreference to a capturing group with any natural group number.

  • $`: Returns the substring before the processed (matched) part of the input string.

  • $': Returns the substring after the processed (matched) part of the input string.

  • $&: Returns the processed (matched) part of the input string.

  • $$: returns the dollar sign characters ($).

The capturing groups can be used in expressions substituteAll, substitutaMany, and in the Regex Matching step.

For example, to substitute all pairs of letter-digit couples with only the digit from the couple (that is, the input string a1b2c3d4e5 leads to the output 12345), you could use the following expression:

substituteAll("([a-z])([0-9])","${2}","a1b2c3d4e5")

Was this page useful?