Table of Contents
Generate
Generate allows you to generate patterns or phonetic tokens used for deduplication, including: DQ Fonetix™, DQSoundex™, and DQMetaphone™.
Pattern
Generate a Pattern representation of your data e.g.
Input | Output |
---|---|
Thiss 112233 Is $$a Tesst!? | Ccvcc112211VcvCvcccPP |
Within the task pane, you can find sets of rules to customise the generated pattern. e.g. excluding certain characters.
If there are values present within the input in which you do not wish to tokenize, you can select these parameters within the Do Not Tokenize dropdown menu found within the Generate task pane.
Settings
Letter Settings
Case Sensitive - Set whether or not to differentiate between casing when generating a pattern.
E.g. when Case Sensitive is set to False:
Input | Output |
---|---|
Thiss 112233 Is $$a Tesst!? | ccvcc112211vcvcvcccpp |
Categorise Letters - Set whether or not to differentiate between vowels and consonants when generating a pattern.
E.g. when Categorise Letters is set to False:
Input | Output |
---|---|
Thiss 112233 Is $$a Tesst!? | Aaaaa112211AaaAaaaaPP |
Exclude Letters - Set whether or not to exclude vowels or consonants when generating a pattern.
E.g. when Exclude Letters is set to exclude Vowels:
Input | Output |
---|---|
Thiss 112233 Is $$a Tesst!? | Cccc112211cCcccPP |
Collapse Letters - Set whether or not to collapse vowels and consonants when generating a pattern.
E.g. when Collapse Letters is set to collapse consonants:
Input | Output |
---|---|
Thiss 112233 Is $$a Tesst!? | Ccvc112211VcvCvccPP |
Number Settings
Categorise Numbers - Set whether or not to differentiate between odd and even numbers when generating a pattern.
E.g. when Categorise Numbers is set to False:
Input | Output |
---|---|
Thiss 112233 Is $$a Tesst!? | Ccvcc999999VcvCvcccPP |
Exclude Numbers - Set whether or not to exclude numbers (odd and/or even) when generating a pattern.
E.g. when Exclude Numbers is set to exclude odd numbers:
Input | Output |
---|---|
Thiss 112233 Is $$a Tesst!? | Ccvcc22VcvCvcccPP |
Collapse Numbers - Set whether or not to collapse numbers (odd and/or even) when generating a pattern.
E.g. When Collapse Numbers is set to Odd:
Input | Output |
---|---|
Thiss 112233 Is $$a Tesst!? | Ccvcc1221VcvCvcccPP |
Symbol Settings
Exclude Symbols - Set whether or not to exclude symbols when generating a pattern.
E.g. when Exclude Symbols is set to False:
Input | Output |
---|---|
Thiss 112233 Is $$a Tesst!? | Ccvcc112211VcSvCvcccPP |
Collapse Symbols - Set whether or not to collapse symbols when generating a pattern.
E.g. when Collapse Symbols is set to False (alongside Exclude symbols being set to False):
Input | Output |
---|---|
Thiss 112233 Is $$a Tesst!? | Ccvcc112211VcSSvCvcccPP |
White Space Settings
Exclude White Space - Set whether or not to exclude white spaces when generating a pattern.
E.g. when Exclude White Space is set to False:
Input | Output |
---|---|
Thiss 112233 Is $$a Tesst!? | CcvccW112211WVcWvWCvcccPP |
Collapse White Space - Set whether or not to collapse white spaces when generating a pattern.
E.g when Collapse White Space is set to False (alongside Exclude White Spaces being set to False):
Input | Output |
---|---|
Thiss 112233 Is $$a Tesst!? | CcvccWW112211WVcWvWCvcccPP |
Non-Printing Settings
Exclude Non-Printing Characters - Set whether or not to exclude non-printing characters when generating a pattern.
E.g. when Exclude Non-Printing Characters is set to False:
Input | Output |
---|---|
Thiss 112233 Is $$a Tesst!? | Ccvcc112211VcvCNNvcccPP |
Collapse Non-Printing Characters - Set whether or not collapse non-printing characters when generating a pattern.
E.g. When Collapse Non-Printing Characters is set to False (alongside Exclude Non-Printing Characters being set to False)
Input | Output |
---|---|
Thiss 112233 Is $$a Tesst!? | Ccvcc112211VcvCNvcccPP |
Token
DQ for Excel™ allows a user to generate a phonetic token in line with a selected phonetic algorithm. This includes common phonetic algorithms such as Soundex and Metaphone as well as DQ Global's own custom algorithms. Descriptions and examples of these phonetic algorithms can be found below.
Soundex
Soundex retains the first letter of the input string to formulate its match token. Soundex removes vowels (a, e, i, o, u) and h and w from the input string. The remaining letters are assigned numbers using a lookup table to produce a token of 4 characters.
This means ‘Cathy’ and ‘Kathy’ will not match as their match tokens begin with a ‘C’ from Cathy and a ‘K’ from Kathy. As such, Soundex does not match well where the start of a word sounds the same but is not the same. Also, due to the numeric substitution it is possible to be shown non-matches (false positive) matches.
Metaphone
Metaphone improves the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation, to produce a more accurate encoding. This allows you to find more precise matches than the simple Soundex algorithm. Metaphone considers a larger set of character transformations than Soundex and therefore analyses a string phonetically with far more accuracy.
DQ for Excel™ also supports Double Metaphone, which is considered to make a number of fundamental design improvements over the original Metaphone algorithm.
DQSoundex
DQSoundex overloads Soundex with the advanced capabilities of DQFonetix™. This improves the start of word logic and modifies the first letter(s) of an input string. DQSoundex will de-pluralise and pre-process the start of words to manage variances like ‘C’ to ‘K’ as in 'Cathy' and 'Kathy', as well as ‘Ph’ as in Phonetix to ‘F’ in Fonetix.
DQMetaphone
DQMetaphone like DQSoundex is an enhanced Metaphone technology with the advanced capabilities of DQFonetix™. This improves the start of word logic and modifies the first letter(s) of an input string. DQSoundex will de-pluralise and pre-process the start of words to manage variances and improve matching.
In the case shown below (Christopher), Metaphone would have generated three of five names matches. However, after running DQ’s advanced algorithms and advanced logic, DQMetaphone allows ‘Kh’ from 'Khristopher' to match with the ‘Ch’ from 'Christopher'. Thus generating the same match key token.
DQFonetix™
DQFonetix™ contains our advanced phonetic algorithms developed over the last 25 years by DQ Global. The algorithm is property DQ Global and hence we do not share the specification of the process. However, DQFonetix™ has four key features:
- Five spoken languages – English, Spanish, French, Italian and German
- Avoids false matches
- Overcomes character variances
- Deals with diacritics
DQFonetix™ provides the most varied matching window to highlight duplicate matches that may not be picked up – or falsely matches - in Soundex and Metaphone.
Other
DQ for Excel™ also supports the Caverphone, Caverphone 2 and NYSIIS algorithms.
Custom Functions
DQ.GENERATE_TOKEN
The DQ.GENERATE_TOKEN function generates a token from a given input.
Parameters:
input (string), text used for generation
algorithm (string), tokenization algorithm, from list:
- DQ_Fonetix
- DQ_Metaphone
- DQ_Soundex
- Soundex
- Metaphone
- Double_Metaphone
- Caverphone
- Caverphone_2
- Nysiis
language (string), language of input, from list:
- English
- Spanish
- French
- Italian
- German
returns (string), generated token