Discoverers

Discoverers are used to find specific values or types of values within the chosen data source, environment, schema and/or table and identify them as sensitive data. They are grouped by country or usage, and several general-purpose discoverers are available for cases where no more specific discoverer fits.

Some discoverers make use of picklists, large collections of first names, company names, cities, etc., which are used to identify certain types of sensitive data. Some discoverers make use of the database metadata (e.g. the names of the columns) to find columns that contain sensitive data.

Table of contents
Croatia
Financial Data
General
Geographical Data
Human Data
Other
Switzerland
USA

Croatia

Discoverers in this category are used to identify sensitive data specific to Croatia.

Tax identification (JMBG)

Values identified as JMBGs have to be exactly 13 digits long. The JMBG can be found anywhere in the string, i.e. the value will be found even if it has other characters before and/or after the JMBG. If a given value is 12 digits long, a zero is added to the start of the value. The first two digits have to form a number smaller than 32 and cannot have the value "00". The control digit of the JMBG is then calculated and needs to be correct for a value to be identified as a JMBG.

Table metadata is also analyzed when using this discoverer. In particular, the discoverer looks for tables with names containing the following values:
JMBG
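The checks described above can be sketched in Python as follows. The function name is hypothetical, and the checksum weights are an assumption: they follow the commonly documented JMBG modulo-11 scheme, which this page does not spell out, so the discoverer's actual calculation may differ.

```python
import re

def looks_like_jmbg(value: str) -> bool:
    """Hypothetical helper: True if the value contains a plausible JMBG."""
    # Look for a run of exactly 12 or 13 digits anywhere in the string.
    for match in re.finditer(r"(?<![0-9])[0-9]{12,13}(?![0-9])", value):
        digits = match.group().zfill(13)  # a 12-digit value gets a leading zero
        if not 1 <= int(digits[:2]) <= 31:  # first two digits: 01..31
            continue
        # Assumed checksum: weights 7,6,5,4,3,2 repeated over the first 12 digits.
        weights = [7, 6, 5, 4, 3, 2] * 2
        total = sum(w * int(d) for w, d in zip(weights, digits[:12]))
        control = 11 - total % 11
        if control > 9:  # in this assumed scheme, 10 and 11 map to 0
            control = 0
        if control == int(digits[12]):
            return True
    return False
```

Because the search is not anchored, a value such as "id: 0101006500006;" is matched in the same way as the bare number.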

Bank account

Values identified as bank account numbers have to be exactly 10 digits long. The control digit of the bank account number is then calculated and needs to be correct for a value to be identified as a bank account number.

BBAN

Values identified as BBANs have to be exactly 17 digits long. The first seven digits represent the bank number, which has to match one of the values in the bank picklist. If the bank number is correct, the other ten digits are then checked with the Bank account discoverer. If the check is passed, the value is identified as a BBAN.

Drivers license

Values identified as drivers license numbers have to be exactly eight digits long. The control digit of the drivers license number is then calculated and needs to be correct for a value to be identified as a drivers license number.

Health insurance number

Values identified as health insurance numbers have to be exactly nine digits long. This data can be found anywhere in the string, i.e. the value will be found even if it has other characters before and/or after the health insurance number. The control digit of the health insurance number is then calculated and needs to be correct for a value to be identified as a health insurance number.

Tax identification (Matični broj)

Values identified as matični broj have to be between six and eight digits long. This data can be found anywhere in the string, i.e. the value will be found even if it has other characters before and/or after the matični broj. If a given value is shorter than eight digits, zeros are added to the start of the value until the number of digits is eight. The control digit of the matični broj is then calculated and needs to be correct for a value to be identified as a matični broj.

Phone number

Values identified as phone numbers have to match one of two phone number formats. The first format describes phone numbers in Croatia: a value in this format has a correct phone number prefix, a correct area or network number, and ends with a number that is six or seven digits long. Examples of phone numbers in this format are listed below:

01123456
38523111222
+3858001234567

The second format describes phone numbers within the European Union: a value in this format has a correct phone number prefix followed by a number that is between 1 and 14 digits long. Examples of phone numbers in this format are listed below:

+887566
+404865123
+99123455634234

Table metadata is also analyzed when using this discoverer. In particular, the discoverer looks for tables with names containing the following values:
PHONE, MOB, TEL

Tax identification (OIB)

Values identified as OIBs have to be exactly 11 digits long. The OIB can be found anywhere in the string, i.e. the value will be found even if it has other characters before and/or after the OIB. The control digit of the OIB is then calculated and needs to be correct for a value to be identified as an OIB.

Table metadata is also analyzed when using this discoverer. In particular, the discoverer looks for tables with names containing the following values:
OIB, TAX_N, TAXN
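A minimal sketch of the OIB check is shown below. It assumes that the control digit is computed with the ISO 7064 MOD 11,10 scheme, which is the scheme generally documented for OIB but is not stated on this page. The function names are hypothetical.

```python
import re

def oib_control_digit(first_ten: str) -> int:
    """Control digit for ten digits, assuming ISO 7064 MOD 11,10."""
    remainder = 10
    for ch in first_ten:
        remainder = (remainder + int(ch)) % 10
        if remainder == 0:
            remainder = 10
        remainder = (remainder * 2) % 11
    return (11 - remainder) % 10

def looks_like_oib(value: str) -> bool:
    """Hypothetical helper: True if the value contains a plausible OIB."""
    # An OIB is a run of exactly 11 digits, found anywhere in the string.
    for match in re.finditer(r"(?<![0-9])[0-9]{11}(?![0-9])", value):
        digits = match.group()
        if oib_control_digit(digits[:10]) == int(digits[10]):
            return True
    return False
```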

Financial Data

Discoverers in this category are used to find values in the database that correspond to general financial data that isn't specific to a certain country.

Credit card number

Values identified as credit card numbers have to be between 13 and 19 digits long. The credit card number can be found anywhere in the string, i.e. the value will be found even if it has other characters before and/or after the credit card number. Credit card numbers must not start with the digits "00" or "99".

Table metadata is also analyzed when using this discoverer. In particular, the discoverer looks for tables with names containing the following values:
CCARD, CREDITC, CREDIT_CARD, CARD, KARTIC
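The value rules for credit card numbers can be sketched with a regular expression. This is only an illustration of the length and prefix rules stated above; the names are hypothetical, and the sketch assumes that the characters surrounding the number are not digits.

```python
import re

# A run of 13 to 19 digits that is not part of a longer digit run.
CARD_RE = re.compile(r"(?<![0-9])[0-9]{13,19}(?![0-9])")

def find_card_candidates(value: str) -> list[str]:
    """Hypothetical helper: return digit runs that satisfy the stated rules."""
    return [
        m.group()
        for m in CARD_RE.finditer(value)
        if not m.group().startswith(("00", "99"))
    ]
```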

Business Identifier Code (BIC/SWIFT)

Values identified as BIC/SWIFT are those that correspond to the following formats:

BANKCC11
BANKCC11XXX

The first six characters have to be uppercase letters, followed by two characters that are each either an uppercase letter or a digit. If the last three characters are present, they have to be uppercase letters or digits.
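The two formats can be expressed as a single regular expression. The sketch below assumes that the whole value has to match the format; the names are hypothetical.

```python
import re

# Six uppercase letters, two uppercase letters or digits,
# and an optional group of three uppercase letters or digits.
BIC_RE = re.compile(r"[A-Z]{6}[A-Z0-9]{2}(?:[A-Z0-9]{3})?")

def looks_like_bic(value: str) -> bool:
    """Hypothetical helper: True if the whole value has a BIC/SWIFT shape."""
    return BIC_RE.fullmatch(value) is not None
```

With this pattern, "DEUTDEFF" and "DEUTDEFF500" match, while "deutdeff" and "DEUTDEFF50" do not.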

IBAN

Values identified as IBANs have to be longer than three characters. The first two characters represent the country code and need to be uppercase letters (values between "AA" and "ZZ"). The control digit of the IBAN is then calculated and needs to be correct for a value to be identified as an IBAN.

Table metadata is also analyzed when using this discoverer. In particular, the discoverer looks for tables with names containing the following values:
IBAN
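The IBAN check can be sketched as follows. The sketch assumes the standard modulo-97 validation used for IBANs (move the first four characters to the end, convert letters to numbers, and check that the remainder is 1); this page does not describe the calculation, so treat it as an illustration. The function name is hypothetical.

```python
def looks_like_iban(value: str) -> bool:
    """Hypothetical helper: True if the value passes the assumed IBAN checks."""
    if len(value) <= 3 or not value.isascii() or not value.isalnum():
        return False
    country, check = value[:2], value[2:4]
    if not (country.isalpha() and country.isupper()):  # "AA".."ZZ"
        return False
    if not check.isdigit():
        return False
    # Move the country code and check digits to the end of the string.
    rearranged = value[4:] + value[:4]
    # Convert each character to a number: 0-9 stay as-is, A=10, ..., Z=35.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```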

Other

Discoverers in this category are those that don't belong in any of the other categories.

Company

A value will be identified as a company name if it exists within the company picklist. A countryCode can be provided to only look for companies from a specific country (e.g. "DE" for Germany).

Geographical Data

Discoverers in this category are used to identify geographical data (city names, etc.).

City

A value will be identified as a city name if it exists within the place picklist. A countryCode can be provided to only look for cities from a specific country (e.g. "DE" for Germany).

Table metadata is also analyzed when using this discoverer. In particular, the discoverer looks for tables with names containing the following values:
CITY

USA

Discoverers in this category are used to identify sensitive data specific to the United States of America.

USA Social Security Number

Values identified as USA social security numbers have to be either nine digits long, or eleven digits long if they contain hyphens (which have to be located in the fourth and seventh place within the value). USA social security numbers are divided into three separate values that all have their own criteria:

  • The first three digits must not be equal to the following values: "666", "900", "999", "000"
  • The fourth and fifth digits (the middle group) must not be equal to "00"
  • The last four digits must not be equal to "0000"

The entire USA social security number must not be equal to the following values: "078051120", "219099999". The digits of a USA social security number must not all be the same, and they must not form an incrementing sequence in which each digit is one greater than the previous digit (e.g. the value "123456789" is not allowed).
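The rules for USA social security numbers can be sketched as a single function. The names are hypothetical, and the sketch assumes that the three groups are the first three digits, the two digits between the hyphen positions, and the last four digits.

```python
import re

# Either 123-45-6789 (hyphens in the fourth and seventh place) or 123456789.
SSN_RE = re.compile(
    r"([0-9]{3})-([0-9]{2})-([0-9]{4})|([0-9]{3})([0-9]{2})([0-9]{4})"
)
BLOCKED_NUMBERS = {"078051120", "219099999"}
BLOCKED_AREAS = {"666", "900", "999", "000"}

def looks_like_ssn(value: str) -> bool:
    """Hypothetical helper: True if the value satisfies the stated SSN rules."""
    match = SSN_RE.fullmatch(value)
    if match is None:
        return False
    area, group, serial = (g for g in match.groups() if g is not None)
    digits = area + group + serial
    if area in BLOCKED_AREAS or group == "00" or serial == "0000":
        return False
    if digits in BLOCKED_NUMBERS:
        return False
    if len(set(digits)) == 1:  # all digits are the same
        return False
    # Reject sequences where every digit is one greater than the previous one.
    if all(int(b) - int(a) == 1 for a, b in zip(digits, digits[1:])):
        return False
    return True
```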

Human Data

Discoverers in this category are used to identify sensitive data pertaining to most people, regardless of country (e.g. first name or e-mail).

Email

Values identified as e-mails need to conform to the correct e-mail format. Examples of various correct e-mails are listed below:

john117@mail.com
john_doe@mail.com
john.doe@mail.com.hr
john.doe@mail-archive.com
john.doe_117@mail-archive.com.hr

Table metadata is also analyzed when using this discoverer. In particular, the discoverer looks for tables with names containing the following values:
EMAIL, E_MAIL, MAIL

Also, the discoverer ignores tables with names containing the following values: FLAG, FLG, STOP, _ID, EMAILID, MAILID, HOLD, DATE, DONOT, DO_NOT, TOWN, STREET, CITY, STATE, DOMICILE, HOUSE_NO, BUILDING, SUBURB, PREMISE, COUNTRY, TYPE, POST, FORMAT
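The combination of value matching and metadata matching can be sketched as follows. The regular expression is a simplified assumption that accepts the examples above, not the discoverer's actual pattern. The helper names are hypothetical, and the name comparison is assumed to be case-insensitive.

```python
import re

# Simplified e-mail shape: local part, "@", and a dotted domain.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

INCLUDE = ("EMAIL", "E_MAIL", "MAIL")
EXCLUDE = (
    "FLAG", "FLG", "STOP", "_ID", "EMAILID", "MAILID", "HOLD", "DATE",
    "DONOT", "DO_NOT", "TOWN", "STREET", "CITY", "STATE", "DOMICILE",
    "HOUSE_NO", "BUILDING", "SUBURB", "PREMISE", "COUNTRY", "TYPE",
    "POST", "FORMAT",
)

def looks_like_email(value: str) -> bool:
    """Hypothetical helper: True if the whole value has an e-mail shape."""
    return EMAIL_RE.fullmatch(value) is not None

def name_suggests_email(name: str) -> bool:
    """Hypothetical helper: apply the include and ignore lists to a name."""
    upper = name.upper()
    if any(token in upper for token in EXCLUDE):
        return False  # ignored names win over included names
    return any(token in upper for token in INCLUDE)
```

In this sketch a name such as "CUSTOMER_EMAIL" is accepted, while "EMAIL_FLAG" is ignored because it contains "FLAG".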

First name

A value will be identified as a first name if it exists within the first name picklist. A countryCode can be provided to only look for first names from a specific country (e.g. "DE" for Germany).

First name only

This discoverer works the same way as the First name discoverer, but if a value matches as a first name as well as a last name or e-mail, it isn't regarded as a first name. This discoverer is used if you want to eliminate false positives in cases where first names are found in e-mails or are the same as some last names, and you are already looking for those types of values by using their own respective discoverers.
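The exclusion logic can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the picklists are tiny stand-ins for the real ones, the e-mail pattern is simplified, and names are matched against the alphabetic tokens of a value so that a first name inside an e-mail address counts as a match.

```python
import re

FIRST_NAMES = {"john", "ana", "martin"}  # toy stand-in for the first name picklist
LAST_NAMES = {"doe", "martin"}           # toy stand-in for the last name picklist
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def _tokens(value: str) -> list[str]:
    # Split the value into lowercase alphabetic tokens.
    return re.findall(r"[a-z]+", value.lower())

def is_first_name(value: str) -> bool:
    return any(token in FIRST_NAMES for token in _tokens(value))

def is_last_name(value: str) -> bool:
    return any(token in LAST_NAMES for token in _tokens(value))

def is_email(value: str) -> bool:
    return EMAIL_RE.fullmatch(value) is not None

def is_first_name_only(value: str) -> bool:
    """A first name that is not also matched as a last name or an e-mail."""
    return is_first_name(value) and not (is_last_name(value) or is_email(value))
```

Here "John" is a first name only, while "Martin" (also a last name) and "john@mail.com" (also an e-mail) are excluded.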

First name and last name together

This discoverer is a combination of the First name and Last name discoverers and uses their own respective picklists. However, the value is not regarded as a full name if it also matches as an e-mail.

Last name

A value will be identified as a last name if it exists within the last name picklist. A countryCode can be provided to only look for last names from a specific country (e.g. "DE" for Germany).

Last name only

This discoverer works the same way as the Last name discoverer, but if a value matches as a last name as well as a first name or e-mail, it isn't regarded as a last name. This discoverer is used if you want to eliminate false positives in cases where last names are found in e-mails or are the same as some first names, and you are already looking for those types of values by using their own respective discoverers.

Switzerland

AHV number

Values identified as AHV numbers have to be exactly 13 digits long. The first three digits must have the value "756". The control digit of the AHV number is then calculated and needs to be correct for a value to be identified as an AHV number.
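A minimal sketch of an AHV check is shown below. It assumes the Swiss AHVN13 convention, in which the number starts with "756" and the last digit is an EAN-13 check digit; neither the prefix handling nor the checksum scheme is confirmed by this page, and the function name is hypothetical.

```python
import re

def looks_like_ahv(value: str) -> bool:
    """Hypothetical helper: True if the value passes the assumed AHV checks."""
    if re.fullmatch(r"[0-9]{13}", value) is None:
        return False
    if not value.startswith("756"):  # assumed AHVN13 prefix
        return False
    # Assumed EAN-13 checksum: weights 1,3,1,3,... over the first 12 digits.
    total = sum(
        int(d) * (3 if i % 2 else 1) for i, d in enumerate(value[:12])
    )
    return (10 - total % 10) % 10 == int(value[12])
```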

General

XML

Values are identified as XML if the discoverer can read them as XML. To be read properly, a value has to pass XML validation.

Keyword

Values have to contain one of the supplied keywords to be matched as sensitive data. This discoverer is an alternative to the Dictionary discoverer when there are only a few keywords. The keywords are supplied as the discoverer's value, as a list of words separated by commas. An example is listed below:

keyword_one,keyword_two,keyword_three
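Keyword matching on a comma-separated value can be sketched as follows. The function names are hypothetical, and the sketch assumes a case-sensitive substring match, which this page does not specify.

```python
def parse_keywords(setting: str) -> list[str]:
    """Split the comma-separated value into keywords, dropping empty entries."""
    return [part.strip() for part in setting.split(",") if part.strip()]

def contains_keyword(value: str, setting: str) -> bool:
    """Hypothetical helper: True if the value contains any supplied keyword."""
    return any(keyword in value for keyword in parse_keywords(setting))
```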

Regular expression

Values have to match the regular expression to be matched as sensitive data. The regular expression is provided as the value and has to be a correct regular expression for it to be used.

Dictionary

Values have to contain one of the keywords in the dictionary to be matched as sensitive data. A dictionary is a file that contains one keyword per line. The file is saved in the custom dictionaries folder within the BizDataX Portal folder, and its name has to be supplied as the discoverer's value. This discoverer is an alternative to the Keyword discoverer when there is a large number of keywords. An example of a dictionary can be downloaded here.
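Reading a dictionary file and matching against it can be sketched as follows. The sketch assumes a UTF-8 text file with one keyword per line and a case-sensitive substring match; the function names and the file path passed in are hypothetical.

```python
from pathlib import Path

def load_dictionary(path: str) -> list[str]:
    """Read one keyword per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]

def matches_dictionary(value: str, keywords: list[str]) -> bool:
    """Hypothetical helper: True if the value contains any dictionary keyword."""
    return any(keyword in value for keyword in keywords)
```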