User Tools

Site Tools


generate

Generate

Generate allows you to generate patterns or phonetic tokens used for deduplication, including: DQ Fonetix™, DQSoundex™, and DQMetaphone™.


Pattern

Generate a Pattern representation of your data e.g.

Input Output
Thiss 112233 Is $$a Tesst!? Ccvcc112211VcvCvcccPP

Within the task pane, you can find sets of rules to customise the generated pattern. e.g. excluding certain characters.

If there are values present within the input in which you do not wish to tokenize, you can select these parameters within the Do Not Tokenize dropdown menu found within the Generate task pane.


Settings

Letter Settings

Case Sensitive - Set whether or not to differentiate between casing when generating a pattern.

E.g. when Case Sensitive is set to False:

Input Output
Thiss 112233 Is $$a Tesst!? ccvcc112211vcvcvcccpp

Categorise Letters - Set whether or not to differentiate between vowels and consonants when generating a pattern.

E.g. when Categorise Letters is set to False:

Input Output
Thiss 112233 Is $$a Tesst!? Aaaaa112211AaaAaaaaPP

Exclude Letters - Set whether or not to exclude vowels or consonants when generating a pattern.

E.g. when Exclude Letters is set to exclude Vowels:

Input Output
Thiss 112233 Is $$a Tesst!? Cccc112211cCcccPP

Collapse Letters - Set whether or not to collapse vowels and consonants when generating a pattern.

E.g. when Collapse Letters is set to collapse consonants:

Input Output
Thiss 112233 Is $$a Tesst!? Ccvc112211VcvCvccPP

Number Settings

Categorise Numbers - Set whether or not to differentiate between odd and even numbers when generating a pattern.

E.g. when Categorise Numbers is set to False:

Input Output
Thiss 112233 Is $$a Tesst!? Ccvcc999999VcvCvcccPP

Exclude Numbers - Set whether or not to exclude numbers (odd and/or even) when generating a pattern.

E.g. when Exclude Numbers is set to exclude odd numbers:

Input Output
Thiss 112233 Is $$a Tesst!? Ccvcc22VcvCvcccPP

Collapse Numbers - Set whether or not to collapse numbers (odd and/or even) when generating a pattern.

E.g. When Collapse Numbers is set to Odd:

Input Output
Thiss 112233 Is $$a Tesst!? Ccvcc1221VcvCvcccPP

Symbol Settings

Exclude Symbols - Set whether or not to exclude symbols when generating a pattern.

E.g. when Exclude Symbols is set to False:

Input Output
Thiss 112233 Is $$a Tesst!? Ccvcc112211VcSvCvcccPP

Collapse Symbols - Set whether or not to collapse symbols when generating a pattern.

E.g. when Collapse Symbols is set to False (alongside Exclude symbols being set to False):

Input Output
Thiss 112233 Is $$a Tesst!? Ccvcc112211VcSSvCvcccPP

White Space Settings

Exclude White Space - Set whether or not to exclude white spaces when generating a pattern.

E.g. when Exclude White Space is set to False:

Input Output
Thiss 112233 Is $$a Tesst!? CcvccW112211WVcWvWCvcccPP

Collapse White Space - Set whether or not to collapse white spaces when generating a pattern.

E.g when Collapse White Space is set to False (alongside Exclude White Spaces being set to False):

Input Output
Thiss 112233 Is $$a Tesst!? CcvccWW112211WVcWvWCvcccPP

Non-Printing Settings

Exclude Non-Printing Characters - Set whether or not to exclude non-printing characters when generating a pattern.

E.g. when Exclude Non-Printing Characters is set to False:

Input Output
Thiss 112233 Is $$a Tesst!? Ccvcc112211VcvCNNvcccPP

Collapse Non-Printing Characters - Set whether or not collapse non-printing characters when generating a pattern.

E.g. When Collapse Non-Printing Characters is set to False (alongside Exclude Non-Printing Characters being set to False)

Input Output
Thiss 112233 Is $$a Tesst!? Ccvcc112211VcvCNvcccPP

Token

DQ for Excel™ allows a user to generate a phonetic token in line with a selected phonetic algorithm. This includes common phonetic algorithms such as Soundex and Metaphone as well as DQ Global's own custom algorithms. Descriptions and examples of these phonetic algorithms can be found below.

Soundex

Soundex retains the first letter of the input string to formulate its match token. Soundex removes vowels (a, e, i, o, u) and h and w from the input string. The remaining letters are assigned numbers using a lookup table to produce a token of 4 characters.

This means ‘Cathy’ and ‘Kathy’ will not match as their match tokens begin with a ‘C’ from Cathy and a ‘K’ from Kathy. As such, Soundex does not match well where the start of a word sounds the same but is not the same. Also, due to the numeric substitution it is possible to be shown non-matches (false positive) matches.

Metaphone

Metaphone improves the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation, to produce a more accurate encoding. This allows you to find more precise matches than the simple Soundex algorithm. Metaphone considers a larger set of character transformations than Soundex and therefore analyses a string phonetically with far more accuracy.

DQ for Excel™ also supports Double Metaphone, which is considered to make a number of fundamental design improvements over the original Metaphone algorithm.

DQSoundex

DQSoundex overloads Soundex with the advanced capabilities of DQFonetix™. This improves the start of word logic and modifies the first letter(s) of an input string. DQSoundex will de-pluralise and pre-process the start of words to manage variances like ‘C’ to ‘K’ as in 'Cathy' and 'Kathy', as well as ‘Ph’ as in Phonetix to ‘F’ in Fonetix.

DQMetaphone

DQMetaphone like DQSoundex is an enhanced Metaphone technology with the advanced capabilities of DQFonetix™. This improves the start of word logic and modifies the first letter(s) of an input string. DQSoundex will de-pluralise and pre-process the start of words to manage variances and improve matching.

In the case shown below (Christopher), Metaphone would have generated three of five names matches. However, after running DQ’s advanced algorithms and advanced logic, DQMetaphone allows ‘Kh’ from 'Khristopher' to match with the ‘Ch’ from 'Christopher'. Thus generating the same match key token.

DQFonetix™

DQFonetix™ contains our advanced phonetic algorithms developed over the last 25 years by DQ Global. The algorithm is property DQ Global and hence we do not share the specification of the process. However, DQFonetix™ has four key features:

  • Five spoken languages – English, Spanish, French, Italian and German
  • Avoids false matches
  • Overcomes character variances
  • Deals with diacritics

DQFonetix™ provides the most varied matching window to highlight duplicate matches that may not be picked up – or falsely matches - in Soundex and Metaphone.

Other

DQ for Excel™ also supports the Caverphone, Caverphone 2 and NYSIIS algorithms.


Custom Functions

DQ.GENERATE_TOKEN

The DQ.GENERATE_TOKEN function generates a token from a given input.

Parameters:

input (string), text used for generation

algorithm (string), tokenization algorithm, from list:

  • DQ_Fonetix
  • DQ_Metaphone
  • DQ_Soundex
  • Soundex
  • Metaphone
  • Double_Metaphone
  • Caverphone
  • Caverphone_2
  • Nysiis

language (string), language of input, from list:

  • English
  • Spanish
  • French
  • Italian
  • German

returns (string), generated token

generate.txt · Last modified: 2023/02/24 15:37 by hayden.law

Donate Powered by PHP Valid HTML5 Valid CSS Driven by DokuWiki