Discoverers

Discoverers are used to find specific values or types of values within the chosen data source, environment, schema and/or table and identify them as sensitive data. They are grouped by country or usage, and several general-purpose discoverers are available for cases where none of the more specific discoverers fits.

Some discoverers make use of picklists, large collections of first names, company names, cities, etc., which are used to identify certain types of sensitive data. Some discoverers also make use of the database metadata (e.g. the names of the columns) to find columns that contain sensitive data.

Table of contents
Croatia
Financial Data
General
Geographical Data
Human Data
Other
Switzerland
USA

Croatia

Discoverers in this category are used to identify sensitive data specific to Croatia.

Name Description
Tax identification (JMBG) Values identified as JMBGs have to be exactly 13 digits long. This data can be found anywhere in the string, i.e. the value will be found even if it has other characters before and/or after the JMBG. If a given value is 12 digits long, a zero is added to the start of the value. The first two digits have to be smaller than 32, but cannot have the value "00". The control digit of the JMBG is then calculated and needs to be correct for a value to be identified as a JMBG.
Table metadata is also analyzed when using this discoverer. In particular, the discoverer looks for tables with names containing the following values: JMBG
Bank account Values identified as bank account numbers have to be exactly 10 digits long. The control digit of the bank account number is then calculated and needs to be correct for a value to be identified as a bank account number.
BBAN Values identified as BBANs have to be exactly 17 digits long. The first seven digits represent the bank number, which has to match one of the values in the bank picklist. If the bank number is correct, the other ten digits are then checked with the Bank account discoverer. If the check is passed, the value is identified as a BBAN.
Drivers license Values identified as driver's license numbers have to be exactly eight digits long. The control digit of the driver's license number is then calculated and needs to be correct for a value to be identified as a driver's license number.
Health insurance number Values identified as health insurance numbers have to be exactly nine digits long. This data can be found anywhere in the string, i.e. the value will be found even if it has other characters before and/or after the health insurance number. The control digit of the health insurance number is then calculated and needs to be correct for a value to be identified as a health insurance number.
Tax identification (Matični broj) Values identified as matični broj have to be between six and eight digits long. This data can be found anywhere in the string, i.e. the value will be found even if it has other characters before and/or after the matični broj. If a given value is shorter than eight digits, zeros are added to the start of the value until the number of digits is eight. The control digit of the matični broj is then calculated and needs to be correct for a value to be identified as a matični broj.
Phone number Values identified as phone numbers have to match one of two phone number formats. The first format describes phone numbers in Croatia: a value in this format has a correct phone number prefix and a correct area or network number, and ends with a number that is six or seven digits long. Examples of phone numbers in this format are listed below:
01123456,
38523111222,
+3858001234567.

The second format describes phone numbers within the European Union: a value in this format has a correct phone number prefix followed by a number that is between 1 and 14 digits long. Examples of phone numbers in this format are listed below:
+887566,
+404865123,
+99123455634234.

Table metadata is also analyzed when using this discoverer. In particular, the discoverer looks for tables with names containing the following values:
PHONE, MOB, TEL
Tax identification (OIB) Values identified as OIBs have to be exactly 11 digits long. This data can be found anywhere in the string, i.e. the value will be found even if it has other characters before and/or after the OIB. The control digit of the OIB is then calculated and needs to be correct for a value to be identified as an OIB. Table metadata is also analyzed when using this discoverer. In particular, the discoverer looks for tables with names containing the following values:
OIB, TAX_N, TAXN
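
The control digit checks described above can be sketched in Python. This page does not spell out the checksum algorithms, so the sketch assumes the publicly documented ones: ISO 7064 MOD 11,10 for the OIB and the weighted modulo 11 scheme for the JMBG. The function names and the rule that a candidate must not be part of a longer run of digits are illustrative assumptions, not the product's API.

```python
import re

# Assumption: an OIB candidate is a run of exactly 11 digits that is not
# embedded in a longer run of digits.
OIB_CANDIDATE = re.compile(r"(?<![0-9])[0-9]{11}(?![0-9])")


def oib_check_digit(first_ten):
    """ISO 7064 MOD 11,10 check digit for the first ten digits of an OIB."""
    remainder = 10
    for ch in first_ten:
        remainder = (remainder + int(ch)) % 10
        if remainder == 0:
            remainder = 10
        remainder = (remainder * 2) % 11
    return (11 - remainder) % 10


def is_valid_oib(digits):
    """True if the value is 11 digits long and its last digit is correct."""
    if not re.fullmatch(r"[0-9]{11}", digits):
        return False
    return oib_check_digit(digits[:10]) == int(digits[10])


def find_oibs(text):
    """Return every valid OIB found anywhere in the text."""
    return [c for c in OIB_CANDIDATE.findall(text) if is_valid_oib(c)]


def is_valid_jmbg(digits):
    """Apply the JMBG rules: padding, day range and control digit."""
    if re.fullmatch(r"[0-9]{12}", digits):
        digits = "0" + digits  # a 12-digit value gets a leading zero
    if not re.fullmatch(r"[0-9]{13}", digits):
        return False
    if not 1 <= int(digits[:2]) <= 31:  # smaller than 32, but not "00"
        return False
    weights = [7, 6, 5, 4, 3, 2, 7, 6, 5, 4, 3, 2]
    total = sum(w * int(d) for w, d in zip(weights, digits))
    m = 11 - total % 11
    expected = m if m <= 9 else 0  # assumption: 10 and 11 both map to 0
    return expected == int(digits[12])
```

For example, `find_oibs("OIB: 12345678903")` returns the embedded number, because 3 is the correct control digit for 1234567890.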

Financial Data

Discoverers in this category are used to find values in the database that correspond to general financial data that isn't specific to a certain country.

Name Description
Credit card number Values identified as credit card numbers have to be between 13 and 19 digits long. This data can be found anywhere in the string, i.e. the value will be found even if it has other characters before and/or after the credit card number. Credit card numbers must not start with digits "00" or "99".
Table metadata is also analyzed when using this discoverer. In particular, the discoverer looks for tables with names containing the following values:
CCARD, CREDITC, CREDIT_CARD, CARD, KARTIC
Business Identifier Code (BIC/SWIFT) Values identified as BIC/SWIFT are those that correspond to the following formats:
BANKCC11,
BANKCC11XXX.

The first six characters have to be uppercase letters. They are followed by two characters, each of which is either an uppercase letter or a digit. If the last three characters are present, they have to be uppercase letters or digits.
IBAN Values identified as IBANs have to be longer than three characters. The first two characters represent the country code and need to be uppercase letters (values between "AA" and "ZZ"). The check digits of the IBAN are then calculated and need to be correct for a value to be identified as an IBAN. Table metadata is also analyzed when using this discoverer. In particular, the discoverer looks for tables with names containing the following values:
IBAN
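
The three financial checks above can be illustrated with a short Python sketch. It is not the product's implementation. The IBAN check assumes the standard ISO 13616 modulo 97 validation, the credit card pattern assumes that a candidate must not be part of a longer run of digits, and all function names are hypothetical. This page does not mention a Luhn check for credit card numbers, so none is applied here.

```python
import re

# Six uppercase letters, two uppercase letters or digits, optional three more.
BIC_PATTERN = re.compile(r"[A-Z]{6}[A-Z0-9]{2}(?:[A-Z0-9]{3})?")

# Assumption: a card candidate is a run of 13 to 19 digits that is not
# embedded in a longer run of digits.
CARD_PATTERN = re.compile(r"(?<![0-9])[0-9]{13,19}(?![0-9])")


def is_bic(value):
    """True if the whole value has the BANKCC11 or BANKCC11XXX shape."""
    return BIC_PATTERN.fullmatch(value) is not None


def is_valid_iban(value):
    """Check the country code shape and the modulo 97 check digits."""
    if len(value) <= 3:
        return False
    if not re.fullmatch(r"[A-Z]{2}[0-9]{2}[A-Z0-9]*", value):
        return False
    # Move the first four characters to the end, then map A..Z to 10..35.
    rearranged = value[4:] + value[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def find_card_candidates(text):
    """Return digit runs that satisfy the length and prefix rules."""
    return [m for m in CARD_PATTERN.findall(text)
            if not m.startswith(("00", "99"))]
```

The commonly published sample IBAN `GB82WEST12345698765432` passes `is_valid_iban`, while changing any single digit makes the check fail.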

Other

Discoverers in this category are those that don't belong in any of the other categories.

Name Description
Company A value will be identified as a company name if it exists within the company picklist. A countryCode can be provided to only look for companies from a specific country (e.g. "DE" for Germany).

Geographical Data

Discoverers in this category are used to identify geographical data (city names, etc.).

Name Description
City A value will be identified as a city name if it exists within the place picklist. A countryCode can be provided to only look for cities from a specific country (e.g. "DE" for Germany). Table metadata is also analyzed when using this discoverer. In particular, the discoverer looks for tables with names containing the following values: CITY

USA

Discoverers in this category are used to identify sensitive data specific to the United States of America.

Name Description
USA Social Security Number Values identified as USA social security numbers have to be either nine digits long, or eleven characters long if they contain hyphens (which have to be located in the fourth and seventh place within the value). USA social security numbers are divided into three groups of digits, each with its own criteria:
1. The first three digits must not be equal to any of the following values: "666", "900", "999", "000",
2. The fourth and fifth digits must not be equal to "00",
3. The last four digits must not be equal to "0000".

The entire USA social security number must not be equal to either of the following values: "078051120", "219099999". The digits that form a USA social security number must not all be the same, and must not form an incrementing sequence in which each digit is one greater than the previous one (e.g. the value "123456789" is not allowed).
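
The rules above translate fairly directly into code. The following Python sketch is one possible reading of them, not the product's implementation. It assumes that the second rule applies to the middle two-digit group (the fourth and fifth digits), and the function name is hypothetical.

```python
import re


def is_candidate_ssn(value):
    """Apply the format and exclusion rules to a single value."""
    if re.fullmatch(r"[0-9]{3}-[0-9]{2}-[0-9]{4}", value):
        digits = value.replace("-", "")  # hyphens in the 4th and 7th place
    elif re.fullmatch(r"[0-9]{9}", value):
        digits = value
    else:
        return False

    area, group, serial = digits[:3], digits[3:5], digits[5:]
    if area in {"000", "666", "900", "999"}:
        return False
    if group == "00" or serial == "0000":
        return False
    if digits in {"078051120", "219099999"}:
        return False
    if len(set(digits)) == 1:  # all nine digits are the same
        return False
    # Each digit one greater than the previous one, e.g. "123456789".
    if all(int(b) - int(a) == 1 for a, b in zip(digits, digits[1:])):
        return False
    return True
```

With this sketch, `is_candidate_ssn("123-45-6780")` is accepted, while `"123456789"` and `"078-05-1120"` are rejected.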

Human Data

Discoverers in this category are used to identify sensitive data pertaining to most people, regardless of country (e.g. first name or e-mail).

Name Description
Email Values identified as e-mails need to conform to the correct e-mail format. Examples of various correct e-mails are listed below:
john117@mail.com,
john_doe@mail.com,
john.doe@mail.com.hr,
john.doe@mail-archive.com,
john.doe_117@mail-archive.com.hr.

Table metadata is also analyzed when using this discoverer. In particular, the discoverer looks for tables with names containing the following values:
EMAIL, E_MAIL, MAIL.
Also, the discoverer ignores tables with names containing the following values:
FLAG, FLG, STOP, _ID, EMAILID, MAILID, HOLD, DATE, DONOT, DO_NOT, TOWN, STREET, CITY, STATE, DOMICILE, HOUSE_NO, BUILDING, SUBURB, PREMISE, COUNTRY, TYPE, POST, FORMAT
First name A value will be identified as a first name if it exists within the first name picklist. A countryCode can be provided to only look for first names from a specific country (e.g. "DE" for Germany).
First name only This discoverer works the same way as the First name discoverer, but if a value matches as a first name as well as a last name or e-mail, it isn't regarded as a first name. This discoverer is used if you want to eliminate false positives in cases where first names are found in e-mails or are the same as some last names, and you are already looking for those types of values by using their own respective discoverers.
First name and last name together This discoverer is a combination of the First name and Last name discoverers and uses their own respective picklists. However, the value is not regarded as a full name if it also matches as an e-mail.
Last name A value will be identified as a last name if it exists within the last name picklist. A countryCode can be provided to only look for last names from a specific country (e.g. "DE" for Germany).
Last name only This discoverer works the same way as the Last name discoverer, but if a value matches as a last name as well as a first name or e-mail, it isn't regarded as a last name. This discoverer is used if you want to eliminate false positives in cases where last names are found in e-mails or are the same as some first names, and you are already looking for those types of values by using their own respective discoverers.
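
The metadata analysis described for the Email discoverer combines a list of name fragments to look for with a list of fragments to ignore. A minimal Python sketch of that combination follows. It assumes that matching is case-insensitive and that an ignored fragment always wins over a matching one. Neither behavior is stated on this page, and the function name is hypothetical.

```python
# Name fragments taken from the Email discoverer description.
INCLUDE = ("EMAIL", "E_MAIL", "MAIL")
EXCLUDE = (
    "FLAG", "FLG", "STOP", "_ID", "EMAILID", "MAILID", "HOLD", "DATE",
    "DONOT", "DO_NOT", "TOWN", "STREET", "CITY", "STATE", "DOMICILE",
    "HOUSE_NO", "BUILDING", "SUBURB", "PREMISE", "COUNTRY", "TYPE",
    "POST", "FORMAT",
)


def name_suggests_email(name):
    """True if the name contains an e-mail fragment and no ignored one."""
    upper = name.upper()  # assumption: comparison ignores case
    if any(fragment in upper for fragment in EXCLUDE):
        return False  # assumption: ignored fragments take precedence
    return any(fragment in upper for fragment in INCLUDE)
```

Under these assumptions `CUSTOMER_EMAIL` is picked up, while `EMAIL_FLAG` and `MAILING_STREET` are skipped because of the `FLAG` and `STREET` fragments.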

Switzerland

Name Description
AHV number Values identified as AHV numbers have to be exactly 13 digits long. The first three digits must have the value "756". The control digit of the AHV number is then calculated and needs to be correct for a value to be identified as an AHV number.
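
A possible Python sketch of the AHV check is shown below. It assumes the standard 13-digit AHV format, in which the number begins with the Swiss country prefix 756 and the last digit is an EAN-13 check digit. The function names are hypothetical and the product's own implementation may differ.

```python
import re


def ean13_check_digit(first_twelve):
    """EAN-13 check digit: weights 1 and 3 alternate from the left."""
    total = sum(int(d) * (3 if i % 2 else 1)
                for i, d in enumerate(first_twelve))
    return (10 - total % 10) % 10


def is_valid_ahv(value):
    """True for a 13-digit value with prefix 756 and a correct last digit."""
    if not re.fullmatch(r"[0-9]{13}", value):
        return False
    if not value.startswith("756"):  # assumption: Swiss country prefix
        return False
    return ean13_check_digit(value[:12]) == int(value[12])
```

For the widely published sample number 7569217076985 the weighted sum of the first twelve digits is 125, so the expected control digit is 5 and the value passes.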

General

Name Description
XML Values are identified as XML values if the discoverer can read them as XML. A value has to pass XML validation to be read by the discoverer.
Keyword Values have to contain one of the supplied keywords to be matched as sensitive data. This discoverer is an alternative to the Dictionary discoverer when there are only a few keywords. The keywords are supplied as the discoverer's value, as a series of words separated by commas. An example is listed below:
keyword_one,keyword_two,keyword_three
Regular expression Values have to match the regular expression to be matched as sensitive data. The regular expression is provided as the value and has to be a correct regular expression for it to be used.
Dictionary Values have to contain one of the keywords in the dictionary to be matched as sensitive data. A dictionary is a file that lists keywords line by line. This file is saved in the custom dictionaries folder within the BizDataX Portal folder, and its name has to be supplied as the discoverer's value for the discoverer to use it. This discoverer is an alternative to the Keyword discoverer when there is a large number of keywords. An example of a dictionary can be downloaded here
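
The Keyword and Dictionary discoverers both come down to a "contains one of these keywords" test, differing only in where the keywords come from. The Python sketch below illustrates the idea. It assumes case-sensitive substring matching and one keyword per dictionary line, neither of which is confirmed on this page, and all function names are hypothetical.

```python
def parse_keywords(value):
    """Split a comma-separated keyword value, dropping empty entries."""
    return [kw.strip() for kw in value.split(",") if kw.strip()]


def load_dictionary(lines):
    """Read keywords from dictionary lines (assumption: one per line)."""
    return [line.strip() for line in lines if line.strip()]


def contains_keyword(cell, keywords):
    """True if the cell contains at least one keyword as a substring."""
    return any(kw in cell for kw in keywords)
```

A keyword value such as `keyword_one,keyword_two,keyword_three` yields three keywords, and any cell containing one of them is matched.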