    Getting started

    Who is PII Tools for?

    • Businesses looking to quantify the privacy risk inside their file shares, devices and cloud storages, using a single powerful tool.

    • Independent system providers who want to integrate and extend their solution with privacy-related metadata.

    • Service providers and consultants who need to audit what private data lives inside, enters or leaves their client systems, in order to act on it.

    This website documents PII Tools, an AI tool for automated detection and analysis of sensitive and personal data across corporate digital assets.

    We built PII Tools to be:

    1. Comprehensive: scans local and remote storages (file shares, Amazon S3, Office 365, emails, Google Drive, SQL databases…), across all major operating systems (Windows, Linux, OSX), and structured and unstructured file formats (including OCR and scanned images…).
    2. Accurate: applies contextual AI to each document's content to achieve state-of-the-art accuracy and minimize false positives.
    3. Fast: uses a parallelized architecture to process large repositories quickly.
    4. Easy to deploy and integrate using a turn-key Docker container, accessible either through a simple web interface (for humans) or via a clean REST API (for machines).
    5. Secure and cloud-free: PII Tools runs on your own hardware, either on-premises or on your cloud platform. It needs no internet and calls no 3rd party services to do its scanning.

    PII Tools architecture

    How do I start?

    1. If you are new to PII Tools, start by reading the section on Installation and deployment.

    2. Read Running a scan on how to submit scanning requests to PII Tools through its web interface or REST API.

    3. Scan reports covers how to access and interpret the output PII Tools generates.

    4. If you need support or custom functionality, reach out to PII Tools support.

    Term glossary

    Term | Meaning
    Document | A digital artefact (file, database table, email…) that may contain personal information. Example: Word, CSV, Excel, PDF, scanned PDF with OCR, JPEG, web server log, Outlook, XML, JSON…
    Storage | A repository containing documents to be scanned. Example: file share, Office 365, AWS S3 bucket, SQL database…
    PII Tools server | Locally deployed server that performs data discovery scans on documents and storages.
    Connector | A software component inside PII Tools that knows how to read documents from a storage.
    Device Agent | An executable file that is run on a file share server or local device. Device agents are thin clients that talk to the PII Tools server.
    Scan | The process of automatically detecting personal information. Scans can be either batch or streamed.
    Batch scan | A large scan that analyzes an entire storage or device at once, by pulling individual documents from it, using data streaming. Example: scanning an employee laptop; scanning an email archive; scanning an S3 bucket.
    Stream scan | Scans a single individual document pushed to the server, returning the scanning results immediately, in real-time. Doesn't access any storages. Example: scanning a piece of submitted text, scanning a single submitted PDF file.
    Inventory index | PII Tools creates a detailed index of all detected personal data for each batch scan. From this index, a drill-down report can be generated for easy reviewing and SAR requests.
    Scan report | A summary report generated from a particular inventory index. Can be in drill-down HTML format for easy reviews, or in machine-readable JSONL format to answer automated SAR requests.
    Web interface, web UI | Users can submit scanning requests and manage scanning results from an integrated (local) web interface.
    REST API | Users looking to integrate PII Tools can also submit scans and generate reports by means of HTTPS requests to a PII Tools server.

    Data persistence and security

    Personal data is by definition sensitive — where and for how long does PII Tools store it?

    • For stream scans, no data is ever persisted. The HTTPS request (whether coming from the web UI or the REST API) is immediately executed, personal information detected and sent back as the request response. See Stream scans.

    • For batch scans, as the storage scan progresses, the detected information is being collected and persisted into an internal database called an "inventory index". Each scan gets its own separate inventory index. This inventory index is used to generate inventory reports once the batch scan finishes. To delete an index of a particular batch scan, call the Delete scan index API, or click the corresponding icon in the web UI.

    Anyone authorized to submit scan requests to a PII Tools server can also view all scans and generate scan reports on that server. There is no concept of multi-tenancy within a single PII Tools instance. If needed, use multiple instances to logically separate users and inventory indexes.

    All data is transmitted encrypted using the HTTPS protocol, such as between PII Tools and a remote device or cloud storage to be scanned. Since the PII Tools server is typically deployed on a local IP (without a public domain), it uses a local self-signed SSL certificate to enable HTTPS.

    No data is transmitted or stored outside the PII Tools server, nor are any external services called. Configuration parameters, such as access credentials to remote cloud storages (see Scan configuration), are kept only for the duration of a scan and never persisted.

    Web interface

    In addition to the programmatic access via REST API, PII Tools also offers scanning capabilities through a user-friendly web interface.

    This web interface is installed automatically when you deploy PII Tools, and runs on the same address and port as the server itself (see Deployment).

    For example, if you deployed PII Tools on a machine with IP 195.201.160.29 and port 19873, open your browser and go to https://195.201.160.29:19873/.

    You should see a welcome screen like this:

    web UI welcome

    The web interface allows you to:

    • launch both stream and batch scans
    • generate and download scan reports
    • browse drill-down reports directly on the server (no download needed)
    • track progress of batch scans and remove old scans

    The parameters exposed in the web UI correspond to (a subset of) the parameters supported by the REST API. This means all operations that can be performed through the web UI can also be performed using REST, but not necessarily vice versa.

    REST API

    Sample stream scanning request against the PII Tools REST API:

    $ curl -k -s -XPOST https://username:password@127.0.0.1:443/v1/scans/stream/ -H 'Content-Type: application/json' -d '
    {
        "storage_parameters": {
            "content": "'$(base64 -w0 /tmp/bank_form.pdf)'",
            "filename": "bank_form.pdf"
        }
    }
    '
    

    This request will generate a response like this:

    {
      "node_path":"file://bank_form.pdf",
      "node_type":"FILE",
      "total_time_ms":2201,
      "content_type":"application/pdf",
      "file_size":47134,
      "analyzed_size":47134,
      "pii_address_examples":["2201 C Street NW I Washinton, DC \n20520"],
      "pii_address_contexts":["Abdul  \nThe Branch Manager                                               Address: 2201 C Street NW I Washinton, DC \n20520 \nBank of America                                                  Phone No"],
      "pii_address_confidences":[1.0],
      "pii_bank_account_examples":["GL28 0219 2024 5014 48 "],
      "pii_bank_account_contexts":["A/c No. GL28 0219 2024 5014 48"],
      "pii_bank_account_confidences":[1.0],
      "pii_name_examples":["Mustafa Abdul", "Mustafa Abdul"],
      "pii_name_contexts":[
        ", From : Name : Mustafa Abdul The Branch Manager Address :",
        "accordingly . Yours faithfully , Mustafa Abdul Dated : 11th Jan ,"
      ],
      "pii_name_confidences":[1.0, 1.0],
      "pii_address_count":1,
      "pii_bank_account_count":1,
      "pii_name_count":2,
      "pii_types":[
        "address",
        "name",
        "bank_account"
      ],
      "pii_severity":"CRITICAL"
    }
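
    The response above is flat: each detected type listed in pii_types has matching pii_<type>_count, pii_<type>_examples and pii_<type>_confidences fields. As a minimal sketch, assuming that naming pattern holds for every type, the fields can be regrouped per type. The helper name summarize_findings is hypothetical and not part of PII Tools.

```python
import json


def summarize_findings(response: dict) -> dict:
    """Group the flat pii_<type>_* fields of a scan result by PII type."""
    summary = {}
    for pii_type in response.get("pii_types", []):
        prefix = f"pii_{pii_type}_"
        summary[pii_type] = {
            "count": response.get(prefix + "count", 0),
            "examples": response.get(prefix + "examples", []),
            "confidences": response.get(prefix + "confidences", []),
        }
    return summary


# Abridged copy of the sample response shown above.
sample = json.loads("""
{
  "node_path": "file://bank_form.pdf",
  "pii_name_examples": ["Mustafa Abdul", "Mustafa Abdul"],
  "pii_name_confidences": [1.0, 1.0],
  "pii_address_count": 1,
  "pii_bank_account_count": 1,
  "pii_name_count": 2,
  "pii_types": ["address", "name", "bank_account"],
  "pii_severity": "CRITICAL"
}
""")

findings = summarize_findings(sample)
for pii_type, details in findings.items():
    print(pii_type, details["count"])
```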
    

    Once the PII Tools service is running, users may issue scanning requests using its REST interface. The requests are described in detail in the Running a scan section and can be submitted from any language and environment, using standard libraries and tooling, such as Java, Python or C#.
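
    For illustration, the curl stream scan request shown earlier could be built in Python roughly as follows. This is a sketch using only the standard library; the function name build_stream_scan_request and the sample text are assumptions made for the example, while the URL path and the storage_parameters fields are taken from the curl example.

```python
import base64
import json
import urllib.request


def build_stream_scan_request(base_url, username, password, filename, content):
    """Build (but do not send) a POST request for /v1/scans/stream/."""
    payload = {
        "storage_parameters": {
            # The document body is sent base64-encoded, as in the curl example.
            "content": base64.b64encode(content).decode("ascii"),
            "filename": filename,
        }
    }
    # HTTP Basic Authentication: base64("username:password").
    credentials = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return urllib.request.Request(
        url=f"{base_url}/v1/scans/stream/",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + credentials.decode("ascii"),
        },
        method="POST",
    )


request = build_stream_scan_request(
    "https://127.0.0.1:443", "username", "password",
    "note.txt", b"Call me at 555-0100",
)
print(request.get_method(), request.full_url)
```

    To actually submit the request, pass it to urllib.request.urlopen together with a TLS context that accepts the server's self-signed certificate.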

    PII Tools uses HTTPS with Basic Authentication. Any non-authenticated requests are rejected. You can edit the username and password in docker-compose.yml (see the Deployment section).

    In order to work in local installations, PII Tools uses a self-signed SSL certificate. Configure your HTTPS client to not check the certification authority, such as with curl -k in the examples in this documentation.
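
    In Python, the rough equivalent of curl -k is a TLS context with certificate verification switched off. The sketch below uses the standard ssl module; the function name make_insecure_context is hypothetical.

```python
import ssl


def make_insecure_context() -> ssl.SSLContext:
    """TLS context that skips certificate verification, like `curl -k`."""
    context = ssl.create_default_context()
    # Order matters: hostname checking must be disabled before CERT_NONE is set.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


context = make_insecure_context()
print(context.verify_mode == ssl.CERT_NONE)
```

    The context can then be passed as urllib.request.urlopen(request, context=context). The traffic stays encrypted, but the identity of the server is no longer checked.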

    All REST requests follow the same structure:

    API URL structure

    • Request headers
      • use standard HTTP methods: GET (to retrieve an object), POST (to create), DELETE
      • parameters are always in JSON format (Content-type: application/json)
    • Protocol https://
    • Domain and port of the PII Tools server as configured during Deployment
    • PII Tools API version; currently v1
    • Parameters of the scanning action to take (see scan configuration)
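
    The URL parts listed above can be assembled as in this small sketch. The helper build_api_url is hypothetical; the host, port and paths are the ones used in the examples of this documentation.

```python
def build_api_url(host: str, port: int, *path_parts: str, version: str = "v1") -> str:
    """Compose https://<host>:<port>/<version>/<path...>."""
    path = "/".join(part.strip("/") for part in path_parts)
    return f"https://{host}:{port}/{version}/{path}"


status_url = build_api_url("127.0.0.1", 443, "_status")
batch_url = build_api_url("127.0.0.1", 443, "scans", "batch", "123")
print(status_url)
print(batch_url)
```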

    The REST API responses are in JSON too (Content-type: application/json), and return an HTTP status according to the success or failure of each operation. PII Tools uses a combination of HTTP status codes and descriptive error messages to give you a more complete picture of what has happened with your request.

    For example, if you forgot to supply the scan configuration, a 400 error is returned:

    $ curl -k -XPOST https://username:password@127.0.0.1:443/v1/scans/batch/123
    
    HTTP/1.1 400 BAD REQUEST
    {"error":"Invalid or missing configuration"}
    
    HTTP status | Meaning | To retry or not to retry?
    2xx | Request was successful. Example: 200 Success | n/a
    4xx | A problem with the request prevented it from executing successfully. | Never automatically retry the request. If the error code indicates a problem that can be fixed, fix the problem and retry the request.
    5xx | The request was properly formatted, but the operation failed on PII Tools' end. | In some scenarios, requests should be automatically retried using exponential backoff.

    Basically, any request that did not succeed will return a 4xx or 5xx error and the JSON response will contain {"error": "<message>"}. The 4xx range means there was a problem with the request, like a missing parameter. The 5xx range means that something went wrong on PII Tools' end.
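
    The retry guidance above could be encoded on the client side along these lines. This is a sketch, not a prescribed client: the function names, the one-second base delay and the doubling factor are assumptions.

```python
import json


def classify_response(status: int, body: str) -> dict:
    """Return outcome, retry advice and error message for an API response."""
    if 200 <= status < 300:
        return {"ok": True, "retry": False, "error": None}
    try:
        parsed = json.loads(body)
        error = parsed.get("error") if isinstance(parsed, dict) else None
    except ValueError:
        # Body was not valid JSON; keep the status but report no message.
        error = None
    # 4xx: fix the request first. 5xx: a retry with backoff may succeed.
    return {"ok": False, "retry": 500 <= status < 600, "error": error}


def backoff_delays(attempts: int, base_seconds: float = 1.0) -> list:
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base_seconds * 2 ** i for i in range(attempts)]


result = classify_response(400, '{"error":"Invalid or missing configuration"}')
print(result)
print(backoff_delays(3))
```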

    Supported scans

    Supported PI types

    The lyrics.txt file is a great litmus test for detection quality. It contains words like "medicine", "sexual" and "healing" used in non-personal context, which will (incorrectly) trigger many rule-based systems. We recommend running this file through any discovery tool you're evaluating and checking the results!

    The following types of personal and sensitive information are supported out of the box:

    Covered data | PII types
    Personal | full name, home address, face, phone number, date of birth, email, first name, last name, street, city, country
    Financial | bank account number, credit card number, routing number
    Sensitive | sexual preferences, political views, race, gender, religious views
    Health | personal health information (PHI), medical records, WHO ICD codes
    National | passport, driving license, SSN, tax ID
    Security | username, password, IP address

    You can also define your own detectors dynamically, using custom rules and regular expressions. See Detector configuration.

    Supported storages

    PII Tools can scan files in streamed environments, where you submit individual files or pieces of text to the PII Tools server and get results back in real-time. For this type of scan, you don't need any connectors. See Running a scan.

    In addition, PII Tools can also scan entire storages, pulling documents for analysis in bulk. Here is the full list of out-of-the-box connectors:

    Storage | scan_type (documentation) | Comment
    Filesystems | device | Both remote and local file systems are scanned using Device Agents.
    File shares | device | File shares, SMB and mounted drives are scanned using Device Agents.
    Devices and work stations | device | Windows, OSX and Linux computers are scanned using Device Agents.
    Dropbox | device | Only locally synced Dropbox folders are supported: use device with root_folder pointed at the Dropbox sync directory.
    Amazon S3 | s3 | Scan AWS S3 buckets.
    Google Drive | gdrive | Scan Google Drive storages.
    Microsoft SQL Server | odbc | Scan MSSQL databases, schemas and tables by using the ODBC driver connector and setting "database": "mssql".
    Office 365: Exchange Online | mgraph-exchange | Scan Microsoft Exchange Online servers and users.
    Office 365: OneDrive | mgraph-onedrive | Scan Microsoft OneDrive storage.
    Office 365: SharePoint Online | mgraph-sharepoint | Scan Microsoft SharePoint sites.
    Office 365: OneNote | mgraph-onenote | Scan Microsoft OneNote notebooks.
    Microsoft Azure Blob | azure-blob | Scan Azure Blob storages.

    Supported file formats

    Use the free PII Tools demo to verify how PII Tools will process your files.

    PII Tools understands the difference between structured files (CSV, Excel, JSON, XML…) and unstructured files (PDF, OST/PST, Word, images, OCR, …) and processes those accordingly, approaching "context" differently. This is to maximize accuracy and minimize false alarms.

    PII Tools can also automatically detect structure (columns) in some types of unstructured documents, automatically falling back to structured analysis for embedded tables and lists.

    For some document format conversions and OCR, PII Tools uses the Apache Tika framework internally. You can find the list of all supported file formats here.

    Supported severity levels

    Not all personal information is created equal: an IP address in a web server log does not carry the same risk as a spreadsheet full of home addresses and credit card numbers.

    Considering data in context allows PII Tools to assess not only the presence, but also the severity of the detected information. Assigning severity levels to files improves the information filtering and review experience.

    PII Tools reports four severity levels, with the following semantics:

    Severity | Description
    NONE | No personal data-related risk identified in this file.
    LOW | Some potentially identifying information detected, such as an isolated IP address or user name. This personal data is also covered by GDPR, but people typically don’t care to protect this type of data.
    HIGH | Detected multiple pieces of information that are reasonably public, such as phone numbers, street addresses or full names. A person would be unhappy if this data was made public explicitly.
    CRITICAL | Passwords, bank account information, credit card numbers; social security and national ids; health, political and sexual information. Direct risk of identity theft, blackmail, financial damage or loss of job.
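
    Because the four levels are ordered, scan results can be filtered by a severity threshold during review. The sketch below assumes each result carries the node_path and pii_severity fields seen in the sample stream scan response; the Severity enum and the helper files_at_or_above are hypothetical client-side names.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels in increasing order of risk."""
    NONE = 0
    LOW = 1
    HIGH = 2
    CRITICAL = 3


def files_at_or_above(results, threshold):
    """Return the paths of results whose severity meets the threshold."""
    return [
        r["node_path"]
        for r in results
        if Severity[r["pii_severity"]] >= threshold
    ]


results = [
    {"node_path": "file://access.log", "pii_severity": "LOW"},
    {"node_path": "file://bank_form.pdf", "pii_severity": "CRITICAL"},
    {"node_path": "file://readme.txt", "pii_severity": "NONE"},
]
urgent = files_at_or_above(results, Severity.HIGH)
print(urgent)
```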

    Installation and deployment

    Code examples in this documentation use the curl command to send HTTPS requests. While curl is great for demonstrations, you can of course issue the same requests using your favourite web library, such as requests for Python or Unirest for Java.

    This section describes how to install PII Tools on your own server, whether on-premises or in the cloud. The service is installed centrally, and uses Device Agent connectors to scan remote storages and devices.

    Installation contains

    As part of your purchase, you should have received:

    1. A license agreement plus one or more license keys.
    2. A docker-compose.yml file that bootstraps the deployment of the central PII Tools server.
    3. A README.txt file containing the username and password for accessing RARE's private Docker registry.
    4. Device Agent executables for scanning local devices for Windows, OSX and Linux (pii-agent-windows.exe, pii-agent-linux and pii-agent-osx executables).
    5. This documentation.

    Hardware requirements

    The PII Tools service can run on any machine that supports Docker, which includes MacOS, Microsoft Windows 10, Amazon Web Services (AWS), Microsoft Azure, IBM Cloud and Linux. This is because PII Tools is deployed by means of a fully configured, turn-key Docker image.

    The PII Tools server requires at least:

    • 8 CPU cores; 16+ cores recommended for better performance
      • adding more CPU cores improves performance significantly thanks to the parallelized architecture
    • 2 GB of free RAM (8 GB on Windows) plus an additional 1 GB RAM per worker
      • e.g. 16 GB RAM for an 8 core Windows machine
    • 6 GB of free disk space

    In addition, for batch scans:

    • 1 GB of server disk space per 1,000,000 files scanned (approximately 1 kilobyte per file, depending on the amount of PII detected)
    • HTTPS connection between the server and the storage to be scanned (file share, S3 bucket, laptop etc)
    • ability to run an executable binary on the endpoint device to be scanned (workstation, laptop, tablet etc)
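
    The sizing rules above reduce to simple arithmetic. The sketch below restates them as two helper functions; the function names are hypothetical and the numbers are the minimums quoted in this section, not measured values.

```python
def estimate_ram_gb(workers: int, windows: bool = False) -> int:
    """Base RAM (2 GB, or 8 GB on Windows) plus 1 GB per worker."""
    base = 8 if windows else 2
    return base + workers


def estimate_disk_gb(files_to_scan: int = 0) -> float:
    """6 GB for the server plus about 1 GB per million scanned files."""
    return 6 + files_to_scan / 1_000_000


print(estimate_ram_gb(8, windows=True))
print(estimate_ram_gb(16))
print(estimate_disk_gb(2_500_000))
```

    The first call reproduces the example from the list above: an 8 core Windows machine running 8 workers needs 16 GB of RAM.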

    The Device Agents for scanning local devices have no dependencies. They are simple executable files (binary, ".exe" on Windows) that are run on the device to be scanned. They only need to be able to connect to a running PII Tools server via HTTPS.

    How it works

    PII Tools runs as a local service, deployed on-premises or inside your cloud platform. Once deployed, users submit HTTPS requests to scan resources (individual files or entire storages), after which the server performs the scan and returns the results.

    One deployed PII Tools service can handle an arbitrary number of scans, and keeps running until explicitly terminated. There is no need to re-deploy PII Tools for each individual scan.

    Conceptually, PII Tools consists of three parts:

    1. A central PII Tools server that does the heavy lifting (document format conversions, PII detection with machine learning, worker parallelization). The server doesn't have any direct access to any of your documents.
    2. (for a local device scan) Device Agent, a small executable program you run on the device to be scanned (laptop, desktop, tablet…). It accesses the documents stored there and sends them to the PII Tools server for processing.
    3. (for a remote cloud scan) Connector that knows how to access documents on remote storages (S3, Azure, Office365…). It uses read-only credentials you supplied in the scan configuration to access documents and process them inside the PII Tools server. See supported storages.

    For stream scans, where a file is pushed to the PII Tools server, no Device Agents or Connectors are necessary.

    Deployment

    PII Tools and its dependencies are packaged as a Docker image provided by RARE Technologies from a secure repository (private Docker registry). You install this image locally, using the provided docker-compose.yml file, which creates a fully configured PII Tools service.

    At no point are any scanned documents sent to RARE Technologies or other parties (see also Data persistence and security). The installed service will be fully "local", and does not require nor talk to any external services or tools.

    1. Install Docker on the machine (server) where you wish to host PII Tools. Docker supports MacOS, Microsoft Windows 10, Amazon Web Services (AWS), Microsoft Azure, IBM Cloud, CentOS, Debian, Fedora and Ubuntu.

    2. Install Docker Compose.

    3. Windows and OSX: Increase the RAM and CPU available in Docker Advanced Settings. As a rule of thumb, allow as many cores as possible, and 8 GB of RAM plus an extra 1 GB of RAM per core. (This is not needed on Linux servers, where virtualization is more efficient and can use all hardware resources by default.)

    4. Run docker login registry.rare-technologies.com:5050 --username <USERNAME> --password <PASSWORD> to log into the private Docker registry of RARE Technologies. <USERNAME> and <PASSWORD> were provided to you as part of the purchase in README.txt (see Installation contains). If you authenticated successfully, you'll see a Login Succeeded message in your console.

    5. Optionally, edit the docker-compose.yml file provided to you as part of the purchase. This YAML file contains image names & global configuration of the PII Tools service:

      Editable parameters

      • Set API_USERNAME and API_PASSWORD according to your preferences. These will be the HTTPS credentials used to authenticate all REST API requests.
      • Set NUM_WORKERS to the number of worker processes (CPU cores) you wish to use for parallelization (default: 4; recommended: number of cores of the server you're installing PII Tools on).
      • Change ports section (hosts and ports) to where you want PII Tools to bind to. Defaults:
        • run the REST server on localhost (127.0.0.1:443)
        • run the Device Agent server on localhost (127.0.0.1:1789)

      The defaults are 127.0.0.1 (localhost) for security reasons, but most of the time you'll want PII Tools to bind an external interface, so the server is visible from outside machines. We recommend changing both 127.0.0.1 to 0.0.0.0, which will make PII Tools bind to all available interfaces (IP addresses) of the machine where you install it.

      Unless ports 443 and 1789 are already in use, we recommend keeping them at this default value.

    6. Run docker-compose -f docker-compose.yml up -d. This process may take a while (5-10 minutes), but is only done once, at the PII Tools server installation time.
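
    Before editing the parameters in step 5, it can help to check how many cores the host has and whether the default ports are taken. The following is a sketch with hypothetical helper names; it only inspects the local machine and does not read or write docker-compose.yml.

```python
import os
import socket


def suggested_num_workers() -> int:
    """Number of CPU cores, falling back to the documented default of 4."""
    return os.cpu_count() or 4


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Try to bind the port; failure means it is taken or not permitted.

    Binding ports below 1024 may need elevated privileges on some systems,
    in which case this reports False even if nothing is listening.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


print("NUM_WORKERS suggestion:", suggested_num_workers())
for port in (443, 1789):
    print(port, "free" if is_port_free(port) else "in use or not permitted")
```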

    To test that the installation was successful and PII Tools is running, run this command:

    $ curl -k -XGET https://username:password@127.0.0.1:443/v1/_status
    

    After which you should see:

    {
        "active_inventory_indexes": 0,
        "total_batch_scans": 0,
        "uptime_days": "0d 0h 3m",
        "host": "127.0.0.1",
        "port": "443"
    }
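
    A monitoring script might want the uptime from this status response as a duration rather than a string. The sketch below assumes uptime_days always follows the "<days>d <hours>h <minutes>m" pattern seen in the sample; the helper parse_uptime is hypothetical.

```python
import re
from datetime import timedelta


def parse_uptime(value: str) -> timedelta:
    """Parse a string like '0d 0h 3m' into a timedelta."""
    match = re.fullmatch(r"(\d+)d (\d+)h (\d+)m", value)
    if match is None:
        raise ValueError(f"unexpected uptime format: {value!r}")
    days, hours, minutes = (int(group) for group in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes)


# Copy of the sample status response shown above.
status = {
    "active_inventory_indexes": 0,
    "total_batch_scans": 0,
    "uptime_days": "0d 0h 3m",
    "host": "127.0.0.1",
    "port": "443",
}
uptime = parse_uptime(status["uptime_days"])
print(uptime.total_seconds())
```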
    

    These six steps above will:

    • Install all dependencies and PII Tools itself.
    • Launch the Analytics web user interface. Access it in your internet browser at https://127.0.0.1:443 by default (see above for how to configure a different host, port, username or password).
    • Launch the PII Tools REST service. Use the curl example above to run your first API request (again, replace the host, port, username and password according to the config values you set above).

    At this point, the service is running and ready to use. Congratulations!

    new installation screenshot

    Software maintenance

    To check the service status:

    $ curl -k -XGET https://username:password@127.0.0.1:443/v1/_status
    
    {
        "active_inventory_indexes": 0,
        "total_batch_scans": 0,
        "uptime_days": "0d 0h 4m",
        "host": "127.0.0.1",
        "port": "443"
    }
    

    To terminate PII Tools, execute this command on the machine that hosts the PII Tools server:

    $ docker-compose -f docker-compose.yml stop
    
    Stopping pii_tools         ... done
    Stopping inventory_indexes ... done
    

    PII Tools operates as a long-running service and does not require any maintenance.

    Terminating PII Tools with the docker-compose stop command shown above stops its Docker containers.

    To restart the PII Tools service, use the docker-compose -f docker-compose.yml restart command.

    From time to time, RARE Technologies may release a new version of PII Tools with upgrades and bug fixes. If your license allows for it, this upgrade is made available to you by means of a new Docker image.

    The recommended way to install an upgrade is similar to a restart: tear down the existing PII Tools server as described above, and then use the new image to deploy the upgraded version.

    Authenticating connectors

    Some connectors, such as Office 365, Google Drive or Amazon S3, require authorizing PII Tools in order to generate access credentials needed to scan the data stored inside.

    To streamline the process of authorizing PII Tools and obtaining the necessary credentials, we prepared the step-by-step instructions with screenshots below. But keep in mind that in principle, you can obtain the necessary parameters any other way. These instructions are just a guideline for your convenience. PII Tools only needs the access credentials as input in order to run a scan, no matter where you got them from.

    Microsoft Office 365

    Microsoft Graph is Microsoft's API for accessing data stored on Microsoft Office 365 services, such as Exchange Online, OneDrive, OneNote and SharePoint Online.

    In order for PII Tools to scan data stored on these Microsoft services, you'll need the following access credentials. This section describes how to obtain them in detail:

    • client ID (client_id),
    • client secret (client_secret)
    • tenant ID (tenant_id)

    In a nutshell, PII Tools needs to be registered by an administrator in the Microsoft Application Registration Portal. This creates the client_id and client_secret for PII Tools. Next, PII Tools needs to be authorized by the administrator to access specific company data. The tenant_id is the ID of the organization whose data is to be accessed by PII Tools, i.e. your company.

    Prerequisites

    • A Microsoft Office 365 account with administrator privileges.
    • PII Tools deployed on a server accessible from your local computer. See Deployment. We will refer to this server as https://<pii-tools-server-ip-address-and-port>/ below.

    Registering PII Tools in Application Registration Portal

    To authorize PII Tools to access your company data, there's a helper tool at https://<pii-tools-server-ip-address-and-port>/mgraph_auth. You will need its URL in step 4 below.

    1. Go to https://apps.dev.microsoft.com/ and log in as an administrator.

    2. Click on Add an app in the top right corner.

    3. On the Register your application form:

      • Set Application name to "PII Tools".
      • Leave Let us help you get started unchecked.
      • Click on Create.
    4. On the PII Tools Registration form:

      • Note the Application Id. This is your client_id.
      • In Application Secrets click on Generate New Password.
      • Note the generated password. It is displayed only once, so be sure to write it down. This is your client_secret.
      • In Platforms click on Add Platform.
      • Click on Web.
      • Set the Redirect URLs to https://<pii-tools-server-ip-address-and-port>/mgraph_auth.
      • In Microsoft Graph Permissions click on Application Permissions - Add and select the following permissions:
        • Directory.Read.All (required for OneDrive and SharePoint)
        • Files.Read.All (required for OneDrive and SharePoint)
        • Mail.Read.All (required for Exchange)
        • Sites.Read.All (required for OneDrive and SharePoint)
        • User.Read.All (required for Exchange and OneDrive)
      • You can select a subset of the permissions if you are not going to use all available connectors. For example, you can exclude Mail.Read.All if you don't want to scan the Exchange Online data.
      • Scroll down to the bottom of the page and click on Save.

    Authorizing PII Tools to access company data

    1. Open https://<pii-tools-server-ip-address-and-port>/mgraph_auth and enter the client_id and client_secret obtained in step 4 of the previous section. Click on Submit.
    2. You get redirected to the Microsoft Portal. Log in with your administrator account.
    3. The list of permissions that PII Tools requests is displayed. Click on Accept.
    4. You get redirected back to https://<pii-tools-server-ip-address-and-port>/mgraph_auth and a summary of the credentials is displayed.

    You are now ready to scan your Microsoft Office 365 data. See Running a scan.

    Re-authorizing PII Tools

    The steps described above are idempotent, meaning the whole process can be safely done repeatedly.

    If you wish to adjust the PII Tools permissions in the future, simply re-authorize PII Tools:

    1. Go to https://apps.dev.microsoft.com/, open the PII Tools app record, modify the list of permissions, and save.
    2. Re-authorize PII Tools, i.e. repeat the four steps described in the previous section.

    Security notes

    The client_secret is required for PII Tools to authenticate against the Microsoft Graph API and needs to be provided when initializing an Office 365 scan (Exchange, OneDrive, or SharePoint). However, the client_secret is never stored on the PII Tools server itself, for security reasons. It is only sent to the Microsoft Graph Server once, to generate an authentication token for a scan using a secure protocol (HTTPS). If you lose your Office 365 client_secret, PII Tools cannot help you retrieve it.
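
    To illustrate what exchanging the client_secret for a token looks like, here is a sketch of a token request in the OAuth 2.0 client credentials flow of the Microsoft identity platform. Whether PII Tools uses exactly this flow internally is an assumption; the function name and the placeholder credentials are hypothetical. The request is only built, not sent.

```python
import urllib.parse
import urllib.request


def build_graph_token_request(tenant_id, client_id, client_secret):
    """Build (but do not send) a client-credentials token request."""
    form = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # The .default scope requests the permissions granted to the app.
        "scope": "https://graph.microsoft.com/.default",
    }
    return urllib.request.Request(
        url=f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token",
        data=urllib.parse.urlencode(form).encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


token_request = build_graph_token_request(
    "example-tenant-id", "example-client-id", "example-client-secret"
)
print(token_request.full_url)
```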

    Google Drive

    To scan a Google Drive storage, you'll need to obtain the following OAuth credentials:

    • client_id
    • client_secret
    • one of token or refresh_token

    In order to obtain these credentials, you (the administrator of PII Tools) must take these two steps, explained in more detail below:

    1. register the PII Tools application in the Google APIs
    2. grant the application access to the files to be scanned

    Prerequisites

    • administrator access to the Google account with the GDrive to be scanned.
    • PII Tools deployed on a server accessible from your local computer. See Deployment. We will refer to this server as https://<your-pii-tools-server-domain-name-and-port>/.

    Due to the limitations (security safeguards) in Google OAuth, your PII Tools instance must be accessible through a domain name, such as pii-tools.company.com, i.e. the service is accessible on a URL like https://pii-tools.company.com:4443/ and not only an IP address like https://10.0.0.1:4443/.

    For your convenience, we also host the same Google Drive authentication service at https://demo.pii-tools.com/gdrive_auth, which is a server controlled by us, RARE Technologies. You may use that service for obtaining access tokens for doing a Google Drive scan in case your own PII Tools server doesn't have a domain name.

    Registering PII Tools in Google APIs

    1. Go to https://console.developers.google.com/apis/dashboard. Click on Select a project in the top bar and then click on NEW PROJECT.
    2. Enter "PII Tools" as the Project Name. You can leave any other fields at the defaults. Then click on CREATE.
    3. Next, go to your dashboard and click on ENABLE APIS AND SERVICES.
    4. Search for "Google Drive API", open it and click on ENABLE.
    5. Select Credentials in the sidebar and click on CREATE CREDENTIALS.
    6. First, go to the OAuth consent screen tab, set the Product name shown to users to "PII Tools", leave any other fields at the default values, and click on Save.
    7. Go to the Credentials tab, click on Create credentials and select OAuth client ID.
    8. Set Application type to "Web application", set the Name to "PII Tools" and add https://<your-pii-tools-server-domain-name-and-port>/gdrive_auth (or https://demo.pii-tools.com/gdrive_auth) to the Authorized redirect URIs. Click on Create.
    9. Your client_id and client_secret will be displayed. Be sure to note these down. You'll need them in the next part.

    Granting access and obtaining access tokens

    1. Go to https://<your-pii-tools-server-domain-name-and-port>/gdrive_auth (or https://demo.pii-tools.com/gdrive_auth; see the note above), enter your client_id and client_secret, and click Submit.
    2. You will be redirected to Google. Select the Google account whose data you want to access (it may or may not be the same account with which you registered the PII Tools application in the previous section). Then click Allow in the confirmation dialogue.
    3. You are redirected back to https://<your-pii-tools-server-domain-name-and-port>/gdrive_auth and all your credentials are displayed, including the token and refresh_token.
    4. You can now execute the Google Drive scan either using the REST API or using the web user interface at https://<your-pii-tools-server-domain-name-and-port>/configure_batch_scan/gdrive.

    Security notes

    The client_id, client_secret, token and refresh_token are required for PII Tools to authenticate against the Google API and must be provided when you launch a Google Drive scan. For security reasons, they are never stored on the PII Tools server itself. Consequently, you must provide them every time you run a scan. If you lose your Google Drive credentials, PII Tools cannot help you retrieve them, and you must regenerate them using the above steps.
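
Since the credentials have to be supplied with every scan, a client script needs somewhere to read them from. The following is a minimal sketch, assuming the four values are kept in environment variables. The variable names are hypothetical and not defined by PII Tools; only the resulting keys (client_id, client_secret, token, refresh_token) come from this documentation.

```python
import os

# Hypothetical environment variable names (an assumption for this sketch);
# the dictionary values are the storage_parameters keys for a gdrive scan.
ENV_TO_PARAM = {
    "PII_GDRIVE_CLIENT_ID": "client_id",
    "PII_GDRIVE_CLIENT_SECRET": "client_secret",
    "PII_GDRIVE_TOKEN": "token",
    "PII_GDRIVE_REFRESH_TOKEN": "refresh_token",
}


def gdrive_storage_parameters(environ=os.environ):
    """Collect the four Google Drive credentials into a storage_parameters dict."""
    missing = [name for name in ENV_TO_PARAM if not environ.get(name)]
    if missing:
        # Fail early: a scan cannot authenticate without all four values.
        raise KeyError("missing credentials: " + ", ".join(sorted(missing)))
    return {param: environ[name] for name, param in ENV_TO_PARAM.items()}
```

The returned dictionary can be used as the storage_parameters value of a gdrive scan configuration.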

    Microsoft Azure Blob

    To scan an Azure Blob storage, the account_name and account_key are needed.

    In order to obtain these credentials:

    1. Log into the Azure Portal.
    2. Choose Storage accounts in the sidebar menu and then select the blob storage to be scanned.
    3. Choose Access keys from the left-hand sub-menu. Find your account_name under Storage account name and your account_key under key1: Key.

    Device Agents

    Device agents (DAs) are thin clients that scan a filesystem (PC, Windows, MacOSX, Linux, laptop, file shares…). Each DA runs locally as a small executable (a single .exe file) and communicates with a running PII Tools server over the network. One PII Tools server can be associated with many DAs.

    Device agents are long-running processes that can be used for a single scan, or repurposed across multiple scans or scheduled repeat scans.

    Installing DA

    To install a DA, copy the appropriate binary for your operating system (Windows, Linux, OSX) to the machine you want to scan. These device agent binaries were provided to you as a part of your purchase.

    Running DA

    1. On the machine you want to scan, double click the DA executable file (for example, pii-agent-windows.exe for Windows users). This will launch the device agent.
    2. The agent will automatically open a new window in your default browser, allowing you to configure it. If the window does not open automatically, copy the link shown in the pii-agent-windows.exe window and paste it into your browser:

    device agent console

    1. On the configuration page in your browser, fill in these three fields and then press Submit:
      • Base folder - only allow scans inside this folder. Example: C:/Users. Attempts to scan locations outside this folder on this machine will fail. Example: scanning D:/ will not be allowed if the Base folder is C:/.
      • Hostname of Device Agent server - hostname (IP address) of the main PII Tools server (for example 24.53.168.9). This PII Tools server must be reachable from the machine running the DA.
      • Port of Device Agent server - port of the Device Agent server. 1789 by default. See Deployment.
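
The Base folder rule can be illustrated with a short sketch. This is not the device agent's actual implementation, only a lexical check under the assumption that Windows-style paths are compared; the function name is hypothetical.

```python
from pathlib import PureWindowsPath


def is_inside_base_folder(base_folder: str, target: str) -> bool:
    """Return True if target equals base_folder or lies somewhere beneath it.

    The comparison is purely lexical: ".." segments and symbolic links are
    not resolved, so a real check would need to normalize paths first.
    """
    base = PureWindowsPath(base_folder)
    path = PureWindowsPath(target)
    try:
        # relative_to() raises ValueError when path is not under base.
        path.relative_to(base)
    except ValueError:
        return False
    return True
```

With a Base folder of C:/, a target of D:/ is rejected by this check, matching the example above, while C:/Users/alice with a Base folder of C:/Users is accepted.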

    device agent configuration page

    After you press Submit, a correctly configured agent shows a page containing a token, which you will use when submitting scans against this device agent:

    device agent success

    If something goes wrong or the agent cannot connect to the server, you'll see an error page. In this case, you should fix the error and try again.

    device agent fail

    Running DA scans

    Run scans against the PII Tools server as described in Scan configuration.

    You can have multiple device agents associated with a single PII Tools server, or even with a single device. Use the token created above to identify which device agent you want to scan.

    Stopping DA

    To terminate the device agent running on this machine, simply close the executable window (e.g. pii-agent-windows.exe; click X in the top right corner) and the browser window. Nothing else is necessary.

    device agent close

    If you close the DA window while a scan is running, the scan will be interrupted.

    After terminating the Device agent, no more scans will be possible against this machine. To reenable scans on this device, you must follow the above steps and create a new token.

    Running a scan

    Scanning documents for sensitive and personal data is the main functionality of PII Tools. This section contains information on how scans work and how to configure and process scanning requests using a REST API.

    To run a scan using the web interface, click the "Launch new scan" button in the top-right corner of the "Analytics" tab, and follow the instructions in the right-hand side panel.

    new scan screenshot

    You launch a new scan by submitting its parameters to the /scans/batch or /scans/stream endpoint, or clicking the corresponding buttons in the web interface.

    A scan configuration defines what is to be scanned (input), using what PII detectors, and what to do with the results (output): see Scan configuration.

    Multiple scans can be submitted to a single PII Tools instance and run concurrently. Each scan gets its own ID, which you can use to check the scanning progress and retrieve the scanning report at the end.

    Conceptually, PII Tools supports two types of scans:

    1. A batch scan, which runs asynchronously in pull mode, actively fetching documents from the storage to be scanned (local directory, remote S3 bucket, email archive, database…). Instances of discovered personal data from each document are stored within an inventory index, from which a scan report is generated once the scan is complete.

    2. A stream scan, which runs in push mode, accepting a single document or piece of text on input. A stream scan is synchronous and returns any discovered personal data right away, in real time. With stream scanning, no data is stored locally within PII Tools.

    crawler_pool

    Once a scan is launched, PII Tools immediately starts running its detectors on the input data (see Scan configuration below on how to configure the scan parameters). The scanning is parallelized for performance, using a distributed pool of workers as configured during deployment. In this way, multiple files are being analyzed at any one time, using multiple PII detectors.

    Scan configuration

    A scan configuration is a JSON object that defines what is to be scanned (input), using what detectors, and what to do with the results (output).

    In its simplest form, without any of the optional parameters, a full configuration for a stream scan looks like this:

    {
        "storage_parameters": {
            "content": "Contents of notes.txt, in base64 encoding.",
            "filename": "notes.txt"
        }
    }
    

    And for a local device scan:

    {
        "scan_type": "device",
        "storage_parameters": {
            "token": "24539"
        },
        "root_folder": "C:/Users"
    }
    

    And for a remote cloud scan:

    {
        "scan_type": "s3",
        "storage_parameters": {
            "aws_secret_access_key": "--== AWS_SECRET_ACCESS_KEY ==--",
            "aws_access_key_id": "--== AWS_ACCESS_KEY_ID ==--",
            "bucket": "BUCKET_NAME"
        },
        "root_folder": "some_folder_in_bucket"
    }
    

    And for a Microsoft SQL Server scan:

    {
        "scan_type": "odbc",
        "storage_parameters": {
            "server": "pii-test.database.windows.net:1433",
            "database": "mssql",
            "username": "user",
            "password": "pwd"
        },
        "root_folder": "database_name"
    }
    

    Input configuration

    Example input configuration for a batch scan, scanning all files in the S3 bucket acme_backups under /backups/2018 while ignoring files ending in txt, doc or docx:

    {
        "scan_type": "s3",
        "storage_parameters": {
            "aws_access_key_id": "AKIA1234567890123456",
            "aws_secret_access_key": "abCD1234567/qB6",
            "bucket": "acme_backups"
        },
        "root_folder": "/backups/2018",
        "reject_filenames": ".*(txt|doc|docx)$"
    }
    

    Example input configuration for a local device scan of C:\Users, with the Device Agent running on localhost and accepting only ZIP files:

    {
        "scan_type": "device",
        "storage_parameters": {
            "token": "34588"
        },
        "root_folder": "C:/Users/",
        "accept_filenames": ".*(zip)$"
    }
    

    This is the list of parameters that define the scanning input:

    Parameter | Type | Description | Default
    scan_type | String | Type of storage to scan (see below). |
    storage_parameters | Object | Access credentials for the particular storage type. |
    root_folder | String | (optional) Only scan files under this directory, or tables in this database. | storage root
    detectors | Array | (optional) List of detector names to use in this scan. If empty, use all available detectors. |
    reject_filenames | String | (optional) Skip all files whose filename (including path) matches this regular expression. Case insensitive. | ^$
    accept_filenames | String | (optional) Skip all files whose filename (including path) doesn't match this regular expression. Case insensitive. | .*
    download_max_bytes | Integer | (optional) Download at most this many bytes from each file. | 5000000 (5 MB)
    analyze_max_text | Integer | (optional) Analyze at most this many characters of extracted plain text per file. | 10000 (10 kB)
    analyze_max_rows | Integer | (optional) Analyze at most this many rows from tables (in spreadsheets, databases etc). | 100
    analyze_max_columns | Integer | (optional) Analyze at most this many columns from tables (in spreadsheets, databases etc). | 100
    pdf_resolution | Integer | (optional) Resolution for processing image PDFs. | 50
    pdf_max_pages | Integer | (optional) Process at most this many pages from image PDFs. | 5

    Only scan_type and storage_parameters are mandatory.
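
To preview which files a pair of accept_filenames and reject_filenames patterns would let through, the rules can be approximated locally. This is a rough sketch, not PII Tools' own matching code: it assumes the patterns are applied with Python-style regular expressions anchored at the start of the full path (re.match), which the documentation does not specify. The function name is hypothetical.

```python
import re


def should_scan(path: str, accept_filenames: str = ".*", reject_filenames: str = "^$") -> bool:
    """Approximate the accept/reject rules for one file path (directories included).

    Assumption: patterns are matched from the start of the path and are
    case insensitive, as described in the parameter table.
    """
    accept = re.compile(accept_filenames, re.IGNORECASE)
    reject = re.compile(reject_filenames, re.IGNORECASE)
    if reject.match(path):
        return False  # matches the reject pattern: skipped
    return accept.match(path) is not None  # must also match the accept pattern
```

For example, with reject_filenames set to .*(txt|doc|docx)$ the path /backups/2018/Notes.TXT is skipped under this sketch, and with accept_filenames set to .*(zip)$ only paths ending in zip are kept.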

    Root folder

    The root_folder parameter is interpreted differently based on the type of scan:

    1. For file storage scans (s3, gdrive, device, sftp etc): only scan files under this directory.
    2. For database scans (odbc, mssql etc):
      • "root_folder": "" (default): Scan all tables under all databases.
      • "root_folder": "database_name": Scan all tables under a specific database.
      • "root_folder": "database_name/table_name": Scan tables named table_name under a specific database.
      • "root_folder": "database_name/schema_name/table_name": Scan the specified table under the specific schema and database.
    3. For Microsoft Office 365 scans (mgraph-*), see the documentation of the particular scan types below.
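
The database variants of root_folder listed above can be summarized in code. The sketch below only mirrors the documented path shapes; the function name and the returned dictionary layout are hypothetical, and the way PII Tools parses the value internally may differ.

```python
def describe_db_root_folder(root_folder: str = "") -> dict:
    """Split a database root_folder into its database, schema and table parts.

    None means "not restricted", i.e. everything at that level is scanned.
    """
    parts = [part for part in root_folder.split("/") if part]
    if len(parts) > 3:
        raise ValueError("expected at most database/schema/table")
    scope = {"database": None, "schema": None, "table": None}
    if parts:
        scope["database"] = parts[0]
    if len(parts) == 2:
        # database_name/table_name: tables with this name, in any schema
        scope["table"] = parts[1]
    elif len(parts) == 3:
        # database_name/schema_name/table_name: one specific table
        scope["schema"], scope["table"] = parts[1], parts[2]
    return scope
```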

    See Supported Storage Connectors for the full list of supported storage connectors.

    Specifying which detectors to use

    Example: launch an AWS S3 scan, using only the face, password and name detectors:

    curl -k -XPOST https://username:password@127.0.0.1:443/v1/scans/batch/s3_scan -H 'Content-Type: application/json' -d'
    {
        "scan_type": "s3",
        "storage_parameters": {
            "aws_access_key_id": "AKIA1234567890123456",
            "aws_secret_access_key": "abCD1234567/qB6",
            "bucket": "contract_backups"
        },
        "detectors": ["face", "password", "name"]
    }'
    

    To specify which detectors to use in a batch scan, define the "detectors": ["name_1", "name_2"] parameter in the scan configuration. The available names can be retrieved via GET /v1/detectors (see list all existing detectors GET endpoint).

    Scan type stream

    storage_parameters Type Description
    content String raw base64-encoded document content
    filename (optional) String file name
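
As an illustration, the stream configuration can be assembled programmatically. This is a minimal sketch that mirrors the simplest stream configuration shown earlier; the function name is hypothetical.

```python
import base64
import json


def build_stream_config(raw_bytes: bytes, filename: str = "") -> str:
    """Return a JSON scan configuration for a stream scan of raw_bytes."""
    storage_parameters = {
        # The content field carries the document as base64 text.
        "content": base64.b64encode(raw_bytes).decode("ascii"),
    }
    if filename:
        storage_parameters["filename"] = filename  # optional
    return json.dumps({"storage_parameters": storage_parameters})
```

The resulting string is the request body for the stream scan endpoint described under Stream scans.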

    Scan type device

    storage_parameters Type Description
    token String Token for the Device Agent to scan. See Device agents.

    See Device Agents for how to install agents and scan local and remote filesystems and file shares.

    Scan type s3

    storage_parameters Type Description
    bucket String S3 bucket to scan.
    aws_access_key_id String AWS access key ID for the bucket.
    aws_secret_access_key String AWS secret for the bucket.

    Scan type gdrive

    Scan files in Google Drive storage. Please see Authenticating connectors for how to obtain the credentials.

    storage_parameters Type Description
    client_id String Client ID.
    client_secret String Client secret key.
    token String Access token.
    refresh_token String Refresh token.

    The root_folder has to be set either to:

    • root to scan the entire storage, or
    • a folder ID, to scan the contents of a particular folder.

    The folder ID can be retrieved from the URL where the folder can be accessed in Google Drive by taking the string after the last forward slash. For example, in https://drive.google.com/drive/u/2/folders/1bzcnvs3UCr9t_yWvWYcPSUXGrMna9F79, the folder ID is 1bzcnvs3UCr9t_yWvWYcPSUXGrMna9F79.
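
The extraction rule above (take the string after the last forward slash) can be sketched as follows. The helper name is hypothetical; dropping query strings such as sharing parameters is an added convenience, not something this documentation requires.

```python
from urllib.parse import urlparse


def gdrive_folder_id(folder_url: str) -> str:
    """Return the last path segment of a Google Drive folder URL."""
    # urlparse separates the path from any query string or fragment.
    path = urlparse(folder_url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]
```

Pass the returned value as root_folder in a gdrive scan configuration.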

    Scan type odbc

    storage_parameters Type Description
    server String Host and port where the database server is running.
    database String Name of DBMS.
    username String Username for SQL Server.
    password String Password for the specified username.

    Supported database types:

    • mssql: SQL Server 17 and Azure SQL.

    To be able to connect to MS SQL Server databases, you may need to allow remote access to the IP address where PII Tools Server is running. For example, on Azure, this can be done via the Azure portal:

    mssql_azure

    Scan type azure-blob

    Scan files in Microsoft Azure Blob storage. Please see Authenticating connectors for how to obtain the necessary credentials.

    storage_parameters Type Description
    account_name String Account name for a particular Azure Blob storage.
    account_key String Secret key for the account.
    container String (optional) Container to be scanned. If not specified, all containers in the storage will be scanned.

    The root_folder can optionally be set to a prefix within the container. The root_folder value is ignored when scanning all containers (i.e., when container is not specified).

    Scan type mgraph-exchange

    Scan emails in Microsoft Exchange Online. Please see Authenticating connectors for how to obtain the credentials.

    storage_parameters Type Description
    client_id String PII Tools client ID
    client_secret String PII Tools client secret
    tenant_id String Organization's tenant ID

    The root_folder can optionally be set to a user's email address. That way only the emails for one specific user will be scanned. Otherwise, the emails for all users will be scanned.

    Scan type mgraph-onedrive

    Scan files in Microsoft OneDrive. Please see Authenticating connectors for how to obtain the credentials.

    storage_parameters Type Description
    client_id String PII Tools client ID
    client_secret String PII Tools client secret
    tenant_id String Organization's tenant ID

    The root_folder can be one of the following:

    • users - scan drives for all users
    • users/{user-principal-name} - scan drives for a single user
    • groups - scan drives for all user groups
    • groups/{group-name} - scan drives for groups with the given name
    • sites/{site-identifier} - scan drives for a given site

    root_folder examples:

    • users/honza@raretechnologies.onmicrosoft.com
    • groups/PII tools data/
    • sites/raretechnologies.sharepoint.com:/sites/PIIToolsSite

    When scanning an entire site, the site URL translates to site-identifier as follows:

    site-identifier = {site-host}:/{site-relative-path}

    For example, the site URL https://raretechnologies.sharepoint.com/sites/PIIToolsSite corresponds to root_folder = sites/raretechnologies.sharepoint.com:/sites/PIIToolsSite (mind the colon after the host name).
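
The URL-to-identifier translation can be expressed as a small helper. This is a sketch of the formula above, with a hypothetical function name.

```python
from urllib.parse import urlparse


def site_identifier(site_url: str) -> str:
    """Build {site-host}:/{site-relative-path} from a site URL."""
    parsed = urlparse(site_url)
    relative_path = parsed.path.strip("/")
    # Note the colon between the host name and the relative path.
    return f"{parsed.netloc}:/{relative_path}"


# root_folder for scanning the drives of one site
root_folder = "sites/" + site_identifier(
    "https://raretechnologies.sharepoint.com/sites/PIIToolsSite"
)
```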

    Scan type mgraph-sharepoint

    Scan lists in a Microsoft Sharepoint site. Please see Authenticating connectors for how to obtain the credentials.

    storage_parameters Type Description
    client_id String PII Tools client ID
    client_secret String PII Tools client secret
    tenant_id String Organization's tenant ID

    The root_folder must be set to the site-identifier of the Sharepoint site to be scanned.

    The site URL translates to site-identifier as follows:

    site-identifier = {site-host}:/{site-relative-path}

    For example, the site URL https://raretechnologies.sharepoint.com/sites/PIIToolsSite corresponds to root_folder = raretechnologies.sharepoint.com:/sites/PIIToolsSite (mind the colon after the host name).

    Note that Sharepoint site files and notebooks can be accessed using the mgraph-onedrive and mgraph-onenote connectors. Set the root_folder to sites/{site-identifier}.

    Scan type mgraph-onenote

    Scan OneNote notebooks in MS Office 365. Please see Authenticating connectors for how to obtain the credentials.

    storage_parameters Type Description
    client_id String PII Tools client ID
    client_secret String PII Tools client secret
    tenant_id String Organization's tenant ID

    The root_folder can be one of the following:

    • users - scan notebooks for all users
    • users/{user-principal-name} - scan notebooks for a single user
    • groups - scan notebooks for all user groups
    • groups/{group-name} - scan notebooks for groups with the given name
    • sites/{site-identifier} - scan notebooks for a given site

    root_folder examples:

    • users/honza@raretechnologies.onmicrosoft.com
    • groups/PII tools data/
    • sites/raretechnologies.sharepoint.com:/sites/PIIToolsSite

    When scanning an entire site, the site URL translates to site-identifier as follows:

    site-identifier = {site-host}:/{site-relative-path}

    For example, the site URL https://raretechnologies.sharepoint.com/sites/PIIToolsSite corresponds to root_folder = sites/raretechnologies.sharepoint.com:/sites/PIIToolsSite (mind the colon after the host name).

    Detector configuration

    Parameter Type Description Default
    custom_detectors Object (optional) Custom detectors to use during this scan. {}

    The custom_detectors field contains a mapping between detector names (any string, e.g. my_detector) and their JSON configurations. See Custom detectors for request examples.

    Output configuration

    Example output configuration, indexing additional metadata alongside the scan:

    {
        "metadata": {
            "location": "Bristol offices",
            "client": "Acme Corporation",
            "scanned_by": "John DPO"
        }
    }
    

    Output configuration allows you to index additional metadata, identifying the resource, location or purpose of the scan. You can add arbitrary key-value pairs here, to help you organize and manage scans.

    Parameter Type Description Default
    metadata Object (optional) Additional metadata to index alongside the scan. Will become a part of the inventory report. {}
    max_examples Integer (optional) Maximum number of PII examples to index per file and PII type. 10

    Batch scans

    Batch scans are long-running scans against an entire folder, device or storage (database, cloud document storage). The API endpoints below show how to launch a scan, track its progress and generate a report for finished scans.

    Internally, each running batch scan indexes the detected information into a database, called "inventory index". See also Data persistence and security.

    Once the scan has completed, you can request its results in a zipped HTML format (see HTML reports) or a more computer-friendly structured dump of all indexed inventory records (see Inventory dump).

    Launch batch scan

    Launch a batch scan of S3 bucket contract_backups under the scan id s3_contracts_march2018, against a PII Tools server that's running on 127.0.0.1:

    $ curl -k -XPOST https://username:password@127.0.0.1:443/v1/scans/batch/s3_contracts_march2018 -H 'Content-Type: application/json' -d'
    {
        "scan_type": "s3",
        "storage_parameters": {
            "aws_access_key_id": "AKIA1234567890123456",
            "aws_secret_access_key": "abCD1234567/qB6",
            "bucket": "contract_backups"
        }
    }'
    

    And this is the expected response:

    {"success":true}
    

    POST /scans/batch/{scan_id}

    Launch a batch scan, using the provided scan configuration. Runs asynchronously. The request will return immediately; see Batch status for checking the scan progress.

    scan_id is the identifier (name) of this scan. The name must:

    • be less than 100 characters long,
    • consist only of lowercase letters (a-z), digits (0-9), underscore _ and hyphen -,
    • not start with _ or -.
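
For clients that prefer Python over curl, the launch request might be prepared as sketched below. The helper names are hypothetical; the /v1 prefix and HTTP basic authentication are taken from the curl example above. The sketch validates the scan ID against the naming rules and only builds the request, so nothing is sent until you pass it to urllib.request.urlopen (with an SSL context that suits your certificate setup).

```python
import base64
import json
import re
import urllib.request

# First character: lowercase letter or digit; then up to 98 more characters
# from a-z, 0-9, "_" and "-" (so fewer than 100 characters in total).
SCAN_ID_RE = re.compile(r"[a-z0-9][a-z0-9_-]{0,98}")


def is_valid_scan_id(scan_id: str) -> bool:
    """Check a scan ID against the documented naming rules."""
    return SCAN_ID_RE.fullmatch(scan_id) is not None


def build_batch_request(base_url, scan_id, config, username, password):
    """Prepare (but do not send) a POST request that launches a batch scan."""
    if not is_valid_scan_id(scan_id):
        raise ValueError(f"invalid scan_id: {scan_id!r}")
    credentials = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/scans/batch/{scan_id}",
        data=json.dumps(config).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + credentials.decode("ascii"),
        },
        method="POST",
    )
```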

    Batch status

    Check the progress status of the scan with id s3_contracts_march2018:

    $ curl -k -XGET https://username:password@127.0.0.1:443/v1/scans/s3_contracts_march2018
    

    Request response:

    {
        "files_scanned": 312688,
        "files_skipped": 0,
        "files_failed": 5,
        "tables_scanned": 0,
        "tables_failed": 0,
        "hours_elapsed": 7.31,
        "status": "FINISHED",
        "error": ""
    }
    

    GET /scans/{scan_id}

    Query for status of a batch scan with the given scan ID.

    Returns

    Parameter Type Description
    files_scanned Integer Number of successfully scanned files.
    files_skipped Integer Number of files for which the scanning was skipped. This can happen for binary files when the file is too large (over download_max_bytes) AND the analysis cannot be done on partially downloaded content. An example would be a large JPEG image.
    files_failed Integer Number of files for which the scanning failed.
    tables_scanned Integer Number of successfully scanned tables.
    tables_failed Integer Number of tables for which the scanning failed.
    hours_elapsed Float Number of hours the scan has been running so far.
    status String One of "SCANNING", "AGGREGATING", "TERMINATING", "FAILED", "FINISHED" (see below).
    error String Error message. Only available if status is "FAILED".

    Status reference

    • SCANNING - Input data is being scanned for PII.
    • AGGREGATING - Scanning has finished and the results are being aggregated. This takes roughly one minute per 100,000 objects scanned.
    • TERMINATING - User has requested the scan to be terminated. Scanning will finish for already discovered objects. This should take less than a minute. After that the state switches to AGGREGATING.
    • FAILED - The scan has finished with a failure. The error field contains a detailed error message. Note that scans manually terminated by the user are considered FAILED.
    • FINISHED - The scan has finished successfully.
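
A client typically polls the status endpoint until the scan reaches one of the two final states. The loop below is a sketch under that assumption. fetch_status is a hypothetical callable that you supply; it should perform the status request and return the parsed JSON response as a dictionary.

```python
import time

# FINISHED and FAILED are the only states a scan never leaves.
TERMINAL_STATES = {"FINISHED", "FAILED"}


def wait_for_scan(fetch_status, poll_seconds=30.0, sleep=time.sleep):
    """Call fetch_status() repeatedly until the scan has finished or failed.

    Returns the final status dictionary. The sleep function is injectable
    so the loop can be exercised without real waiting.
    """
    while True:
        status = fetch_status()
        if status["status"] in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
```

When the returned status is FAILED, the error field of the response carries the details.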

    Download report

    Download the .zip archive with HTML report for the scan with id s3_contracts_march2018. You should extract report_s3_contracts_march2018.zip and open report_s3_contracts_march2018/index.html in your browser:

    $ curl -k -XGET https://username:password@127.0.0.1:443/v1/scans/report/s3_contracts_march2018 -H 'Content-Type: application/json' -d' { "output": "html"} ' -OJ
    

    Same thing but download in JSON format. The report gets saved as report_s3_contracts_march2018.jsonl in the current directory:

    $ curl -k -XGET https://username:password@127.0.0.1:443/v1/scans/report/s3_contracts_march2018 -H 'Content-Type: application/json' -d' { "output": "json"} ' -OJ
    

    GET /scans/report/{scan_id}

    Once the scan completes, you may download its results in two formats:

    Parameter Value Description
    output html Interactive drill-down report; see HTML reports
    output json Raw JSON dump of all indexed records; see Inventory dump

    The response will be a ZIP archive of either HTML pages or a single JSON file, with the ZIP name equal to the scan ID.

    Terminate scan

    Terminate the batch scan with ID s3_contracts_march2018:

    $ curl -k -XPOST https://username:password@127.0.0.1:443/v1/scans/terminate/s3_contracts_march2018
    

    POST /scans/terminate/{scan_id}

    Request termination of a running scan.

    This should take less than a minute as the scan terminates cleanly, during which the scan will be in the TERMINATING state (see status reference). After that the scanning results get aggregated, which takes roughly one minute per 100,000 objects scanned. Therefore, it may take a few minutes for the scan to terminate following the termination request. The final status of a terminated scan is set to FAILED.
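
As a back-of-the-envelope illustration of these figures, the expected wait can be estimated from the number of objects scanned so far. The function is a hypothetical helper that simply adds the two documented approximations; actual timings depend on your deployment.

```python
def estimated_termination_minutes(objects_scanned: int, clean_stop_minutes: float = 1.0) -> float:
    """Rough upper estimate: clean stop plus one minute per 100,000 objects."""
    aggregation_minutes = objects_scanned / 100_000
    return clean_stop_minutes + aggregation_minutes
```

For a scan that has processed 312,688 files, this gives a little over four minutes.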

    Delete scan

    Delete all data for the batch scan with ID s3_contracts_march2018:

    $ curl -k -XDELETE https://username:password@127.0.0.1:443/v1/scans/s3_contracts_march2018
    

    DELETE /scans/{scan_id}

    Once you no longer need the results of a scan, we recommend deleting it to remove the persisted sensitive data and free up disk space.

    List all scans

    To list the IDs of all existing batch scans (inventory indexes):

    $ curl -k -XGET https://username:password@127.0.0.1:443/v1/scans/batch
    

    Request response:

    ["s3_contracts_march2018", "s3_contracts_april2018", "s3_contracts_may2018"]
    

    GET /scans/batch

    List IDs of all existing inventory indexes.

    You can use those IDs to delete the indexes of scans you no longer need (see Delete scan).

    Stream scans

    Example of running a stream scan:

    $ curl -k -s -XPOST https://username:password@127.0.0.1:443/v1/scans/stream/ -H 'Content-Type: application/json' -d '
    {
        "storage_parameters": {
            "content": "'$(base64 -w0 /tmp/bank_form.pdf)'",
            "filename": "bank_form.pdf"
        }
    }
    '
    

    And this is the expected response (on a single JSON line; re-formatted here for clarity):

    {
      "node_path":"file://bank_form.pdf",
      "node_type":"FILE",
      "total_time_ms":2201,
      "content_type":"application/pdf",
      "file_size":47134,
      "analyzed_size":47134,
      "pii_address_examples":["2201 C Street NW I Washinton, DC \n20520"],
      "pii_address_contexts":["Abdul  \nThe Branch Manager                                               Address: 2201 C Street NW I Washinton, DC \n20520 \nBank of America                                                  Phone No"],
      "pii_address_confidences":[1.0],
      "pii_bank_account_examples":["GL28 0219 2024 5014 48 "],
      "pii_bank_account_contexts":["A/c No. GL28 0219 2024 5014 48"],
      "pii_bank_account_confidences":[1.0],
      "pii_name_examples":["Mustafa Abdul", "Mustafa Abdul"],
      "pii_name_contexts":[
        ", From : Name : Mustafa Abdul The Branch Manager Address :",
        "accordingly . Yours faithfully , Mustafa Abdul Dated : 11th Jan ,"
      ],
      "pii_name_confidences":[1.0, 1.0],
      "pii_address_count":1,
      "pii_bank_account_count":1,
      "pii_name_count":2,
      "pii_types":[
        "address",
        "name",
        "bank_account"
      ],
      "pii_severity":"CRITICAL"
    }
    

    POST /scans/stream/

    Scan a given file and return the detected PII right away.

    To run a stream scan, specify scan_type as stream and include the raw file content in the content field of storage_parameters, encoded as base64. For more parameters, see Scan configuration.

    Unlike a batch scan, the request will block until the response is ready (synchronous). In case the file to be scanned is large, it may be easier to scan it using the asynchronous batch scan, to avoid timeouts.

    Returns

    Response in JSONL (JSON lines) format. Each line represents all detected metadata for one object.

    If the input was a single file, there will be only one line in the response. If the input was an archive with multiple files, there will be multiple JSON lines, one for each of the archive's inner files.

    The returned metadata fields are:

    • "pii_types": <Array[str]> – a set of all detected PII types
    • "pii_severity": <str> – overall severity of the detected PII (see Severity)
    • "file_size": <int> – total file size in bytes
    • "analyzed_size": <int> – number of bytes analyzed within this file
    • "content_type": <str> – automatically detected content type
    • "node_path": <str> – path of node (here, filename)
    • "node_type": <str> – type of node (FILE or DIRECTORY)
    • "pii_*_count": <int> – number of PII instances of this PII type detected in this file
    • "pii_*_examples": <Array[str]> – a list of detected PII examples of each PII type
    • "pii_*_confidences": <Array[float]> – a list of confidence scores for each detection
    • "pii_*_contexts": <Array[str]> – a list of contexts for each detected PII instance (for easier reviewing)
    • "*_errors": <Array[str]> – a list of errors encountered, if any, while processing this file
    • "total_time_ms": <int> – time taken to process this file, in milliseconds

    An asterisk * in the above list is a placeholder for a concrete PII type: the response will contain pii_name_examples, pii_name_contexts, pii_phone_examples, pii_password_contexts etc.
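
Because the per-type fields share a naming pattern, a client can regroup them by PII type. The following sketch parses a JSONL response and collects the examples, contexts, confidences and count for each type. The function name and output layout are hypothetical.

```python
import json
from collections import defaultdict

SUFFIXES = ("examples", "contexts", "confidences", "count")


def group_pii_by_type(jsonl_text: str) -> list:
    """Parse a stream-scan response and regroup pii_<type>_<suffix> fields."""
    results = []
    for line in jsonl_text.split("\n"):
        if not line.strip():
            continue  # ignore blank lines between records
        record = json.loads(line)
        by_type = defaultdict(dict)
        for key, value in record.items():
            if not key.startswith("pii_"):
                continue
            # "pii_bank_account_examples" -> ("bank_account", "examples")
            pii_type, _, suffix = key[len("pii_"):].rpartition("_")
            if pii_type and suffix in SUFFIXES:
                by_type[pii_type][suffix] = value
        results.append({"node_path": record.get("node_path"), "pii": dict(by_type)})
    return results
```

Fields without a type component, such as pii_types and pii_severity, are left out of the grouping and remain available on the original record.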

    Relaunch a scan

    Retrieve information from an existing scan with id s3_scan:

    curl -k -XGET https://username:password@127.0.0.1:443/v1/scans/s3_scan
    

    Use the response to pre-populate a new scan:

    {
      "config": {
        "scan_type": "s3",
        "root_folder": "",
        "connector": {
          "aws_access_key_id": "AKIAITS4CCYPGOIJSLEA",
          "aws_secret_access_key": "1234abcd",
          "bucket": "pii-tools-public",
          "region": "eu-central-1"
        },
        "detectors": [
          "email",
          "face",
          "first_name",
          "last_name",
          "city",
          
        ],
        "processor": {
          
        }
      },
      "error": "",
      "files_failed": 14,
      "files_scanned": 994,
      "files_skipped": 6,
      "status": "FINISHED",
      "tables_failed": 0,
      "tables_scanned": 0,
      "time_elapsed": "0d 0h 4m 20s"
    }
    

    For convenience, PII Tools supports "relaunching" a scan. This enables you to launch a new scan with the exact same parameters as an existing scan.

    When using the web interface, click the "Relaunch scan" icon. This icon is in the "Actions" column next to each existing scan.

    relaunch scan screenshot

    Endpoint:

    GET https://username:password@127.0.0.1:443/v1/scans/{scan_id}

    To achieve this functionality using the REST API, first retrieve the config of an existing scan from the backend. See the example request above.

    The relevant parameters can be read from the config field in the response. Use these parameters to pre-populate values for a new scan.
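
One possible way to turn such a response into a new scan configuration is sketched below. It rests on an assumption this documentation does not confirm: that the connector object in the response corresponds to storage_parameters in a scan request. Check the retrieved values before launching, and treat the function name as hypothetical.

```python
def config_for_relaunch(status_response: dict) -> dict:
    """Derive a new scan configuration from an existing scan's status response.

    Assumption: the response's "connector" object maps onto the request's
    "storage_parameters". Adjust if your server reports it differently.
    """
    old = status_response["config"]
    new_config = {
        "scan_type": old["scan_type"],
        "storage_parameters": dict(old.get("connector", {})),
    }
    # Copy optional settings only when the original scan set them.
    if old.get("root_folder"):
        new_config["root_folder"] = old["root_folder"]
    if old.get("detectors"):
        new_config["detectors"] = list(old["detectors"])
    return new_config
```

The result can then be submitted under a new scan ID in the same way as any other batch scan.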

    SAR Analytics

    PII Tools indexes all discovered metadata from finished scans internally and allows you to search, filter and export selected records.

    This is especially useful for collecting information in order to answer Subject Access Requests (SAR), and for identifying affected and high-risk files in general.

    analytics screenshot

    The analytics is comprised of three levels, each with a distinct purpose:

    • Level 1: List and filter existing scans.
    • Level 2: Find matching objects (files) by detected personal data, file location, severity.
    • Level 3: View detected personal data of a single object.

    Each level can be accessed via the web interface, using the search bar at the top of the "Analytics" tab, or programmatically with the REST API.

    Analytics Dashboard

    To use Analytics from the PII Tools web dashboard, simply open the page at https://username:password@127.0.0.1:443. Replace the username, password, host and port according to your installation.

    All analytics functionality is accessible through the "Analytics" tab in the left menu.

    When you log in, you'll see a page that lists all scans. If you have many scans, use the pagination at the bottom to navigate between pages, or use the search bar on top and enter "Scan ID" to look for specific scans.

    For example, click the search bar on top, select Scan ID from the drop-down menu, and type fileshare + ENTER to list all scans whose name contains fileshare.

    level 1 screenshot

    To list all objects that contain a specific piece of personal information (such as a person name, phone, address, bank account etc), select the corresponding type in the drop-down menu, and then type the value you wish to search for.

    Some types also support querying by the count of detected instances. For example, to find all objects that contain at least three addresses, click on the Search bar on top and select PII, Personal, Home Address, >, type 2 and press ENTER.

    Similarly, to search for objects that contain a credit card number, select PII, Financial, Credit card number and EXISTS.

    Use the special "Anything" option in the drop-down to search in everything. This is a convenient way to locate the desired information in any metadata field.

    Analytics REST API

    The Analytics API can be used to search over scans and return matching objects. Each response includes the matched results as a list of JSON objects, and can optionally include aggregate statistics over all the matched results.

    Endpoint

    POST /v1/analytics

    Run a search and return matched objects. Optionally also return aggregate statistics.

    Note that the method is POST (not GET), because the parameter payload can be large and we don't want huge URLs.

    Input (JSON)

    Field | Type | Description | Example
    query | String | Concrete query using the query language described below | "FROM * WHERE passport EXISTS"
    output | String | Output format: either json or jsonl | "json"
    aggregate | Boolean | If true, return aggregate statistics over the query result | false
    (internal) _debug | Boolean | If true, retrieve internal query information | false
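
    As a rough illustration of the input fields above, the following Python sketch assembles (but does not send) a request to the analytics endpoint using only the standard library. The helper name analytics_request and the base URL are placeholders; authentication and TLS settings (the equivalent of curl's -k flag) depend on your installation and are not handled here.

```python
import json
import urllib.request


def analytics_request(base_url: str, query: str, output: str = "json",
                      aggregate: bool = False) -> urllib.request.Request:
    """Build a POST request for the analytics endpoint without sending it."""
    if output not in ("json", "jsonl"):
        raise ValueError("output must be 'json' or 'jsonl'")
    body = json.dumps({"query": query, "output": output, "aggregate": aggregate})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/analytics",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = analytics_request("https://127.0.0.1:443", "FROM * LIMIT 10 OFFSET 0",
                        aggregate=True)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```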

    Returns

    A list of all matched objects in output format.

    Each returned object contains several fields:

    Field | Type | Description | Example
    scan_id | String | ID of the scan in which this object appears | my_gdrive_scan
    node_type | String | Type of object | ARCHIVE
    node_id | String | ID of object | /a/b/file.txt
    path | String | Full path to object | gdrive://<1hy>/r/AW_resume2010.pdf
    data | JSON | All metadata about this object |
    aggregation_stat | JSON | (if "aggregate": true) Aggregation statistics |
    (internal) original_query | JSON | (if "_debug": true) Original query (parsed) |
    (internal) scan_ids | List<String> | (if "_debug": true) List of resolved scan ids (without *) | ["london_1", "london_2"]
    (internal) limits | List<Integer> | (if "_debug": true) Limit & Offset values | [20, 0]
    (internal) translated_query | JSON | (if "_debug": true) Translated query (ES) |
    (internal) sort | String | (if "_debug": true) Column for sorting | pii_passport_count
    (internal) timings | Dict[String, Float] | (if "_debug": true) Time of queries (scan & aggregation) | "timings": {"agg": 0.7665057182312012, "scan": 0.41928648948669434}

    Query language

    PII Tools accepts a special SQL-like language to query the inventory index database. The result of a query is either a list of matching scans (level 1) or a list of matching files (level 2), with support for pagination and sorting.

    Some examples to illustrate the main idea (full query language definition below):

    • "Give me all objects from all scans that contain any passport data":
      • FROM * WHERE passport EXISTS (result = a list of all matching files, as JSON metadata objects)
    • "Give me scans that contain "London" in their scan_id":
      • FROM *London* (result = a list of scans; note the missing WHERE clause)
    • "Give me all scans":
      • FROM *

    The formal grammar is in Extended Backus-Naur Form (EBNF):

    start: FROM from_body [WHERE filter_condition] [SORT sort_body] [LIMIT limit_body]
    
    from_body: scan_id ("," scan_id)*
    scan_id: [a-zA-Z0-9_*-]+
    
    filter_condition: term (OR term)*
    term: factor (AND factor)*
    factor: attr_cond | "(" filter_condition ")"
    attr_cond: attr CMP value | attr EXISTS
    attr: [a-zA-Z0-9_-]+
    value: \'.+?\' | SIGNED_NUMBER
    
    limit_body: UNUMBER [OFFSET offset_body]
    offset_body: UNUMBER
    sort_body: attr
    
    EXISTS: "EXISTS"
    LIMIT: "LIMIT"
    OFFSET: "OFFSET"
    FROM: "FROM"
    WHERE: "WHERE"
    SORT: "SORT"
    AND: "AND"
    OR: "OR"
    CMP: "=" | "<" | "<=" | ">" | ">=" | "CONTAINS"
    WILDCARD: "*"
    UNUMBER: /[0-9]+/
    

    In practical terms, this looks like:

    FROM some_scans [WHERE (attr OP [value])+] [SORT attr] [LIMIT int_value [OFFSET int_value]]
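
    To make the shape of a query more concrete, here is a small Python sketch that assembles a query string from its parts. The function build_query is a hypothetical helper, not part of PII Tools; it only checks scan ids and the sort attribute against the character classes given in the grammar above and passes the WHERE condition through unchanged.

```python
import re

SCAN_ID = re.compile(r"[a-zA-Z0-9_*-]+")  # scan_id rule from the grammar
ATTR = re.compile(r"[a-zA-Z0-9_-]+")      # attr rule from the grammar


def build_query(scans, where=None, sort=None, limit=None, offset=None):
    """Assemble 'FROM ... [WHERE ...] [SORT ...] [LIMIT ... [OFFSET ...]]'."""
    if not scans:
        raise ValueError("at least one scan_id (or '*') is required")
    for scan in scans:
        if not SCAN_ID.fullmatch(scan):
            raise ValueError(f"invalid scan_id: {scan!r}")
    parts = ["FROM " + ", ".join(scans)]
    if where:
        parts.append("WHERE " + where)
    if sort:
        if not ATTR.fullmatch(sort):
            raise ValueError(f"invalid sort attribute: {sort!r}")
        parts.append("SORT " + sort)
    if limit is not None:
        parts.append(f"LIMIT {int(limit)}")
        if offset is not None:
            parts.append(f"OFFSET {int(offset)}")
    elif offset is not None:
        raise ValueError("OFFSET is only valid together with LIMIT")
    return " ".join(parts)


print(build_query(["*"], where="passport EXISTS", sort="severity", limit=20, offset=0))
print(build_query(["*London*"]))
```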
    

    Let's look at the individual query parts with examples:

    FROM section

    Which scans should we consider for the matching? This can be:

    • One scan_id: FROM my_scan ...
    • A wildcard * that means that you want to query over all available scans: FROM * ...
    • A wildcard as part of scan_id: FROM *london* ... This means "search in all scans that contain london in their scan_id".
    • Several scan_ids separated by commas: FROM my_scan_1, my_scan_2 ...

    WHERE section

    Query part responsible for filtering. You should use this to choose objects according to metadata criteria.

    You can filter using any available metadata attribute; see the inventory dump for the possible attributes.

    We support these operators:

    • common: (, ), OR, AND, EXISTS
    • numeric: =, >, <, >=, <=
    • string: CONTAINS

    Examples of the WHERE part:

    • Only consider objects that contain the string Jeremy anywhere in the detected names:
      • WHERE name CONTAINS 'Jeremy'
    • Only consider objects that contain a passport or SSN number:
      • WHERE passport EXISTS OR ssn EXISTS
    • Objects with CRITICAL severity:
      • WHERE severity CONTAINS 'CRITICAL'
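
    The WHERE examples above can also be composed programmatically. The Python helpers below (condition, any_of, all_of) are hypothetical names for illustration. How quotes inside a string value should be escaped is not documented here, so this sketch simply rejects values that contain a single quote.

```python
NUMERIC_OPS = {"=", "<", "<=", ">", ">="}


def condition(attr, op, value=None):
    """Render one attribute condition such as "name CONTAINS 'Jeremy'"."""
    if op == "EXISTS":
        return f"{attr} EXISTS"
    if op == "CONTAINS":
        text = str(value)
        if not text or "'" in text:
            raise ValueError("string values must be non-empty and free of single quotes")
        return f"{attr} CONTAINS '{text}'"
    if op in NUMERIC_OPS:
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("numeric operators need a numeric value")
        return f"{attr} {op} {value}"
    raise ValueError(f"unsupported operator: {op!r}")


def any_of(*conditions):
    """Combine conditions with OR, wrapped in parentheses."""
    return "(" + " OR ".join(conditions) + ")"


def all_of(*conditions):
    """Combine conditions with AND, wrapped in parentheses."""
    return "(" + " AND ".join(conditions) + ")"


where = all_of(
    any_of(condition("passport", "EXISTS"), condition("ssn", "EXISTS")),
    condition("severity", "CONTAINS", "CRITICAL"),
)
print("WHERE " + where)
```

    The explicit parentheses keep the grouping unambiguous when OR and AND are mixed in one filter.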

    SORT section

    Part responsible for ordering the results. You can specify any attribute for sorting. For example, to sort by severity: SORT severity

    If you don't specify any field for sorting, severity, uid will be used by default.

    LIMIT [OFFSET] section

    The query part responsible for pagination. OFFSET starts counting at 0.

    Always specify both LIMIT and OFFSET with each query! If you don't limit the result size, you'll get all results at once, which can be large enough to overwhelm both the server and the browser.

    Examples:

    • Give me the first object: LIMIT 1 OFFSET 0
    • Give me the second object: LIMIT 1 OFFSET 1
    • Give me the first 20 objects: LIMIT 20 OFFSET 0
    • Give me objects from 50 to 99: LIMIT 50 OFFSET 50
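
    Paging through a large result set follows directly from the LIMIT and OFFSET arithmetic above. In the Python sketch below, page_clause and iter_results are hypothetical helpers, and run_query stands for whatever function you use to send a query string and return the list found under result.

```python
def page_clause(page: int, page_size: int) -> str:
    """Return the LIMIT/OFFSET clause for a zero-based page number."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    return f"LIMIT {page_size} OFFSET {page * page_size}"


def iter_results(run_query, base_query: str, page_size: int = 100):
    """Yield all results of base_query, fetching one page at a time."""
    page = 0
    while True:
        batch = run_query(f"{base_query} {page_clause(page, page_size)}")
        if not batch:
            return
        yield from batch
        if len(batch) < page_size:  # a short page means there is nothing left
            return
        page += 1


print(page_clause(0, 20))
print(page_clause(1, 50))
```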

    Query statistics

    Example request:

    curl -k -XPOST https://username:password@127.0.0.1:443/v1/analytics -H 'Content-Type: application/json' -d' {"output": "json", "query": "FROM * LIMIT 10 OFFSET 0", "aggregate": true} '
    

    Example response:

    {
      "aggregation_stat": {
        "category": {
          "Financial": 26,
          "Health": 27,
          "National ID": 3,
          "Personal": 3382,
          "Security ID": 91,
          "Sensitive": 12
        },
        "doctype": {
          "application/gzip": 21,
          "application/json": 11,
          "application/json-lines": 2,
          "application/mbox": 6,
          "application/msword": 15,
          "application/octet-stream": 8,
          "application/pdf": 192,
          "application/rar": 1,
          "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": 3,
          "application/x-bzip2": 2,
          "application/x-rar": 2,
          "application/x-tar": 11,
          "application/xml": 15,
          "application/zip": 9,
          "image/jpeg": 555,
          "image/png": 15,
          "text/csv": 173,
          "text/plain": 1693,
          "text/xml": 6,
          "unknown/unknown": 79
        },
        "language": {
          "null": 2819
        },
        "number_of_scans": 10,
        "objects_size_gb": 0.08937763143330812,
        "severity": {
          "CRITICAL": 91,
          "HIGH": 925,
          "LOW": 622,
          "NONE": 1181
        },
        "storage": {
          "gdrive": 1805,
          "s3": 1014
        },
        "total_objects": 2819
      },
      "result": [
         
      ]
    }
    

    If you set "aggregate": true, the response will also contain the aggregated statistics for the query.

    In the frontend, these statistics are used to render the facet bar to the left of the results panel. This is the coloured panel with information on "PII Severity", "PII Category", "Language" etc.
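
    The same statistics can be post-processed on the client. As a minimal sketch, the Python function below (severity_shares is a hypothetical name) turns the severity counts from the example response into fractions of total_objects, similar to what a facet bar would display.

```python
def severity_shares(aggregation_stat: dict) -> dict:
    """Map each severity level to its fraction of all matched objects."""
    total = aggregation_stat["total_objects"]
    if total == 0:
        return {level: 0.0 for level in aggregation_stat["severity"]}
    return {level: count / total
            for level, count in aggregation_stat["severity"].items()}


# Counts copied from the example response above.
stats = {
    "severity": {"CRITICAL": 91, "HIGH": 925, "LOW": 622, "NONE": 1181},
    "total_objects": 2819,
}
shares = severity_shares(stats)
for level, share in shares.items():
    print(f"{level:<8} {share:6.1%}")
```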

    stats screenshot

    Examples

    Level 1: Scan queries

    Sample Level 1 request: get 10 scans.

    curl -k -XPOST https://username:password@127.0.0.1:443/v1/analytics -H 'Content-Type: application/json' -d' {"output": "json", "query": "FROM * LIMIT 10 OFFSET 0"} '
    

    Sample Level 1 response:

    {
      "result": [
        {
          "browse_url": "/generate_browsable_report/gd_8_fl",
          "download_url": "/_download_report/gd_8_fl",
          "error": "",
          "files_failed": 16,
          "files_scanned": 877,
          "files_skipped": 6,
          "objects_failed": 16,
          "objects_per_hour": 6770,
          "objects_scanned": 877,
          "objects_skipped": 6,
          "remove_scan_url": "/_remove_scan/gd_8_fl",
          "scan_id": "gd_8_fl",
          "scan_type": "gdrive",
          "start_time": "2018-10-24 12:24:31 ",
          "status": "FINISHED",
          "tables_failed": 0,
          "tables_scanned": 0,
          "terminate_scan_url": "/_terminate_scan/gd_8_fl",
          "time_elapsed": "0d 0h 7m 58s"
        },
        {
          "browse_url": "/generate_browsable_report/gd_7",
          "download_url": "/_download_report/gd_7",
          "error": "",
          "files_failed": 0,
          "files_scanned": 1,
          "files_skipped": 0,
          "objects_failed": 0,
          "objects_per_hour": 7,
          "objects_scanned": 1,
          "objects_skipped": 0,
          "remove_scan_url": "/_remove_scan/gd_7",
          "scan_id": "gd_7",
          "scan_type": "gdrive",
          "start_time": "2018-10-24 12:24:13 ",
          "status": "FINISHED",
          "tables_failed": 0,
          "tables_scanned": 0,
          "terminate_scan_url": "/_terminate_scan/gd_7",
          "time_elapsed": "0d 0h 8m 15s"
        },
        
      ]
    }
    

    level 1 screenshot

    Use Level 1 queries to list matching scans. See Query language for all parameters.
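
    Level 1 records are plain JSON and easy to post-process. As one example, this Python sketch converts the time_elapsed string into a timedelta so that scans can be compared by duration. The format is inferred from the sample values above ("0d 0h 7m 58s"), and parse_elapsed is a hypothetical helper.

```python
import re
from datetime import timedelta

ELAPSED = re.compile(r"(\d+)d (\d+)h (\d+)m (\d+)s")


def parse_elapsed(text: str) -> timedelta:
    """Parse strings such as '0d 0h 7m 58s' into a timedelta."""
    match = ELAPSED.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unexpected time_elapsed format: {text!r}")
    days, hours, minutes, seconds = map(int, match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


# Two records abridged from the sample Level 1 response.
scans = [
    {"scan_id": "gd_8_fl", "time_elapsed": "0d 0h 7m 58s"},
    {"scan_id": "gd_7", "time_elapsed": "0d 0h 8m 15s"},
]
slowest = max(scans, key=lambda scan: parse_elapsed(scan["time_elapsed"]))
print(slowest["scan_id"], parse_elapsed(slowest["time_elapsed"]))
```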

    Level 2: Object queries

    Sample Level 2 query: from all scans, give me the first two objects that contain the name Jack.

    curl -k -XPOST https://username:password@127.0.0.1:443/v1/analytics -H 'Content-Type: application/json' -d "{\"query\": \"FROM * WHERE name CONTAINS 'Jack' LIMIT 2 OFFSET 0\", \"output\": \"json\"}"
    

    Sample Level 2 response:

    {
      "result": [
        {
          "data": {
            "analyzed_size": 6668,
            "children_time_ms": 4534,
            "content_type": "application/x-tar",
            "file_size": 6668,
            "pii_address_contexts": [
              "Evans, Bush-Cheney Recount Fund,\n       301 Congress Ave.  Suite 200, Austin TX 78701; and\n (3) fax to  Lou",
              "express to:\n Donald L. Evans\n 301 Congress Avenue, Suite 200\n Austin, TX 78701\n\nPlease do not send more",
              "Abreo\nAndrew Kalotay Associates, Inc.\n61 Broadway, Ste 2205\nNew York NY 10006\nPhone: (212) 482 0900\nFax",
              "the invoice to my attention 121 SW Salmon St. Portland OR 97214.  \n\nPlease put on the reference"
            ],
            "pii_address_count": 4,
            "pii_address_examples": [
              "301 Congress Ave.  Suite 200, Austin TX 78701",
              "301 Congress Avenue, Suite 200\n Austin, TX 78701",
              "61 Broadway, Ste 2205\nNew York NY 10006",
              "121 SW Salmon St. Portland OR 97214"
            ],
            "pii_email_contexts": [
              "oding: 7bit\nX-From: cordia@cordia.com\nX-To: \"Kenneth L. L",
              "inski@enron.com\nTo: datren.williams@enron.com, shirley.crenshaw@e",
              "0 -0800 (PST)\nFrom: alan.comnes@enron.com\nTo: rbw@mrwassoc.co",
              "3 -0700 (PDT)\nFrom: hrgim@enron.com\nTo: e.taylor@enron.",
              "From: Buyers Guide <saveresults@abgmail.activeresearch.com>\nX-To: mark.taylor@",
              "=2\n\nYour user ID is mark.taylor@enron.com\nYour password is ob",
              "c: smara@enron.com, jeff.dasovich@enron.com\nMime-Version: 1.0\nC",
              "hrgim@enron.com\nTo: e.taylor@enron.com\nSubject: RE: login ",
              "From: Buyers Guide <saveresults@abgmail.activeresearch.com>\nX-To: mark.taylor@",
              "iveresearch.com\nTo: mark.taylor@enron.com\nSubject: DVD Player"
            ],
            "pii_email_count": 40,
            "pii_email_examples": [
              "cordia@cordia.com",
              "datren.williams@enron.com",
              "alan.comnes@enron.com",
              "hrgim@enron.com",
              "saveresults@abgmail.activeresearch.com",
              "mark.taylor@enron.com",
              "jeff.dasovich@enron.com",
              "e.taylor@enron.com",
              "saveresults@abgmail.activeresearch.com",
              "mark.taylor@enron.com"
            ],
            "pii_name_contexts": [
              "letter from Bush/Cheney Campaign Chairman Don Evans and a response form .",
              "Sara , Greg Whalley and Andy Zipper just called me and are",
              "on 11/15/2000 11:28 AM ----- Stephanie Sever 11/15/2000 11:25 AM To :",
              "04:44:12 PM To : &quot; Vincent Kaminski &quot; cc : Subject :",
              "my login and password . Michael Taylor Coal and Emissions Trading Enron",
              "contribution , you may contact Jack Oliver or Jeanne Johnson Phillips at",
              "and Associates URGENT From : Lou Cordia Date : November 13 ,",
              ". What Gore/Lieberman Campaign Chairman Richard Daley is orchestrating is as outrageous",
              "trouble logging on , give Sheri Thomas a call -- she has",
              "in stealing the election from John Kennedy . Point blank , Al"
            ],
            "pii_name_count": 34,
            "pii_name_examples": [
              "Don Evans",
              "Andy Zipper",
              "Stephanie Sever",
              "Vincent Kaminski",
              "Michael Taylor",
              "Jack Oliver",
              "Lou Cordia",
              "Richard Daley",
              "Sheri Thomas",
              "John Kennedy"
            ],
            "pii_password_contexts": [
              "established for the following users:\n\nSara Shackelton\nUser ID: SSHACKLE\npassword: enron4\n\nMarcus Nettelton\nUser ID: MNETTEL\nPassword: enron5\n\nPlease note these",
              "User ID: SSHACKLE\npassword: enron4\n\nMarcus Nettelton\nUser ID: MNETTEL\nPassword: enron5\n\nPlease note these are case sensitive.\n\nLet me know if",
              "2\n\nYour user ID is mark.taylor@enron.com\nYour password is oberon\n\nFor more great decision guides, visit this link:\nhttp://www",
              "2\n\nYour user ID is mark.taylor@enron.com\nYour password is oberon\n\nFor more great decision guides, visit this link:\nhttp://www",
              "MTAYLOR5 (Non-Privileged).pst\n\nUsername: michael.e.taylor@enron.com\nPassword: 01131979\n\n -----Original Message-----\nFrom: \tTaylor, Michael E  \nSent:\tWednesday, September 19"
            ],
            "pii_password_count": 5,
            "pii_password_examples": [
              "password: enron4",
              "Password: enron5",
              "password is oberon",
              "password is oberon",
              "Password: 01131979"
            ],
            "pii_phone_contexts": [
              "3) fax to  Lou Cordia (703/212-9128) a copy of each signed",
              "New York NY 10006\nPhone: (212) 482 0900\nFax: (212) 482 0529\nemail",
              "Phone: (212) 482 0900\nFax: (212) 482 0529\nemail: leslie.abreo@kalotay.com"
            ],
            "pii_phone_count": 3,
            "pii_phone_examples": [
              "(703/212-9128",
              "(212) 482 0900",
              "(212) 482 0529"
            ],
            "pii_severity": "CRITICAL",
            "pii_types": [
              "password",
              "address",
              "name",
              "phone",
              "email"
            ]
          },
          "node_id": "/enron_emails.tar.gz",
          "node_type": "ARCHIVE",
          "object_id": 4384,
          "path": "s3://pii-tools//small/test/enron_emails.tar.gz",
          "scan_id": "s3_small_test"
        },
        
      ]
    }
    

    Use Level 2 requests to retrieve a list of all objects (files, SQL tables, emails) that match a specific query.
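
    Each Level 2 result carries the scan_id and object_id needed for a Level 3 lookup. The Python sketch below collects these pairs and formats the corresponding request path; the helper names are hypothetical, and the sample response is abridged from the one shown above.

```python
from urllib.parse import quote


def object_refs(response: dict) -> list:
    """Return (scan_id, object_id) pairs for every object in a Level 2 response."""
    return [(item["scan_id"], item["object_id"])
            for item in response.get("result", [])]


def get_object_path(scan_id: str, object_id) -> str:
    """Format the Level 3 request path for one object."""
    return f"/v1/get_object/{quote(str(scan_id), safe='')}/{object_id}"


level2_response = {
    "result": [
        {
            "data": {"pii_severity": "CRITICAL", "pii_types": ["password", "name"]},
            "node_id": "/enron_emails.tar.gz",
            "node_type": "ARCHIVE",
            "object_id": 4384,
            "scan_id": "s3_small_test",
        }
    ]
}
for scan_id, object_id in object_refs(level2_response):
    print(get_object_path(scan_id, object_id))
```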

    level 2 screenshot

    Level 3: Object metadata queries

    Sample Level 3 request: metadata for object_id=67 from my_scan:

    curl -k -XGET https://username:password@127.0.0.1:443/v1/get_object/my_scan/67
    

    Sample Level 3 response:

    {
      "data": {
        "analyzed_size": 6668,
        "children_time_ms": 5748,
        "content_type": "application/gzip",
        "file_size": 6668,
        "pii_address_contexts": [
          "the invoice to my attention 121 SW Salmon St. Portland OR 97214.  \n\nPlease put on the reference",
          "Abreo\nAndrew Kalotay Associates, Inc.\n61 Broadway, Ste 2205\nNew York NY 10006\nPhone: (212) 482 0900\nFax",
          "Evans, Bush-Cheney Recount Fund,\n       301 Congress Ave.  Suite 200, Austin TX 78701; and\n (3) fax to  Lou",
          "express to:\n Donald L. Evans\n 301 Congress Avenue, Suite 200\n Austin, TX 78701\n\nPlease do not send more"
        ],
        "pii_address_count": 4,
        "pii_address_examples": [
          "121 SW Salmon St. Portland OR 97214",
          "61 Broadway, Ste 2205\nNew York NY 10006",
          "301 Congress Ave.  Suite 200, Austin TX 78701",
          "301 Congress Avenue, Suite 200\n Austin, TX 78701"
        ],
        "pii_email_contexts": [
          "-\n\n\n\"Leslie Abreo\" <leslie.abreo@kalotay.com> on 08/08/2000 04:4",
          "\"Vincent Kaminski\" <vkamins@enron.com>\ncc:  \nSubject: FAS",
          "12) 482 0529\nemail: leslie.abreo@kalotay.com\n\nVisit AKA's websit",
          "login in\n\nUsername: michael.e.taylor@enron.com\nPassword: 01131979\n",
          "=2\n\nYour user ID is mark.taylor@enron.com\nYour password is ob"
        ],
        "pii_email_count": 5,
        "pii_email_examples": [
          "leslie.abreo@kalotay.com",
          "vkamins@enron.com",
          "leslie.abreo@kalotay.com",
          "michael.e.taylor@enron.com",
          "mark.taylor@enron.com"
        ],
        "pii_name_contexts": [
          "AC0S-4USV7S . Thank you , Alan Comnes Enron Corp FAS 133 Hedge",
          "04:44:12 PM To : &quot; Vincent Kaminski &quot; cc : Subject :",
          "sent earlier . Regards , Leslie Abreo Andrew Kalotay Associates , Inc.",
          "----- From : Taylor , Michael E Sent : Wednesday , September",
          "my login and password . Michael Taylor Coal and Emissions Trading Enron",
          "and Associates URGENT From : Lou Cordia Date : November 13 ,",
          ". What Gore/Lieberman Campaign Chairman Richard Daley is orchestrating is as outrageous",
          "in stealing the election from John Kennedy . Point blank , Al",
          "Kennedy . Point blank , Al Gore is not as honorable as",
          "is not as honorable as Richard Nixon was in 1960 and Gerald"
        ],
        "pii_name_count": 18,
        "pii_name_examples": [
          "Alan Comnes",
          "Vincent Kaminski",
          "Leslie Abreo",
          "Michael E",
          "Michael Taylor",
          "Lou Cordia",
          "Richard Daley",
          "John Kennedy",
          "Al Gore",
          "Richard Nixon"
        ],
        "pii_password_confidences": [
          1.0,
          1.0
        ],
        "pii_password_contexts": [
          "ppt\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nRE: login in\n\nUsername: michael.e.taylor@enron.com\nPassword: 01131979\n\n -----Original Message-----\nFrom: \tTaylor, Michael E  \nSent:\tWednesday, September 19",
          "2\n\nYour user ID is mark.taylor@enron.com\nYour password is oberon\n\nFor more great decision guides, visit this link:\nhttp://www"
        ],
        "pii_password_count": 2,
        "pii_password_examples": [
          "Password: 01131979",
          "password is oberon"
        ],
        "pii_phone_contexts": [
          "New York NY 10006\nPhone: (212) 482 0900\nFax: (212) 482 0529\nemail",
          "Phone: (212) 482 0900\nFax: (212) 482 0529\nemail: leslie.abreo@kalotay.com",
          "3) fax to  Lou Cordia (703/212-9128) a copy of each signed"
        ],
        "pii_phone_count": 3,
        "pii_phone_examples": [
          "(212) 482 0900",
          "(212) 482 0529",
          "(703/212-9128"
        ],
        "pii_severity": "CRITICAL",
        "pii_types": [
          "password",
          "address",
          "name",
          "phone",
          "email"
        ]
      },
      "node_id": "/enron_emails.tar.gz",
      "node_type": "FILE",
      "object_id": 67,
      "path": "gdrive://<1hyLzYfWLl9T9xIt9uSiprnKY5NGxN6It>/test/enron_emails.tar.gz",
      "scan_id": "my_scan"
    }
    

    Level 3 requests retrieve all available metadata about a single object.

    The objects (files, tables, emails…) are uniquely identified by their scan_id + object_id. See Level 2 queries for how to get ids of matching objects.

    level 3 screenshot

    API endpoint:

    GET /v1/get_object/<scan_id>/<object_id>

    Retrieve full metadata for the given file, uniquely identified by its scan id + object id.

    Input

    Field | Type | Description
    scan_id | String | Scan identifier
    object_id | String | Object identifier

    Output:

    Object metadata with status 200 if all OK, or {"error": "error text"} and a corresponding HTTP status in case of failure.
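
    A client can treat the two outcomes described above uniformly. The following Python sketch assumes you already have the HTTP status code and the decoded JSON body; unwrap_object_response is a hypothetical helper name.

```python
def unwrap_object_response(status: int, body: dict) -> dict:
    """Return the object metadata, or raise if the API reported a failure."""
    if status == 200 and "error" not in body:
        return body
    message = body.get("error", "unknown error")
    raise RuntimeError(f"get_object failed (HTTP {status}): {message}")


metadata = unwrap_object_response(200, {"object_id": 67, "scan_id": "my_scan"})
print(metadata["scan_id"], metadata["object_id"])
```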

    Custom detectors

    You can define your own custom patterns to discover with each scan.

    Examples of custom patterns include organization-specific information such as "student ID" or "social tax number". These patterns are called custom detectors, and when matched, will appear in the scanning results alongside the other out-of-the-box detectors.

    Unlike the built-in detectors that use machine learning, the custom detectors are simpler, using regular expressions to define what to match ("instance regexp") and what must appear near the instance for the match to be valid ("context regexp").

    In the web interface, use the "Custom detectors" tab in the left menu. For adding/deleting custom detectors programmatically, see the REST API endpoint documentation below.

    custom detector screenshot

    Example of a custom PII detector for a student id consisting of ID followed by six digits:

    "student_id": {
        "instance_regexps": ["\\bID[0-9]{6}\\b"],
        "context_regexps": ["student"],
        "severity": "LOW",
        "ignore_case": true
    }
    

    How it works

    1. Each custom detector is run alongside the standard out-of-the-box detectors on the text of each scanned object. Images are ignored and do not affect custom detectors.

    2. When a potential PII candidate instance is found matching any of the instance_regexps rules, its context (surrounding text, column headers) is checked using the context_regexps rules. Unless at least one of context_regexps matches, the candidate is discarded.

    3. If a candidate instance passes the context check, this PII instance is indexed just like any other PI, and will appear in the Scan report. The severity you provided (e.g. LOW in the example above) will be combined with the severity of other PIs detected in this object, to assign the final severity for the entire object.
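
    The two-step check described above (instance match, then context match) can be approximated in a few lines of Python. This is only an illustration of the idea, not the actual PII Tools implementation: the size of the context window (40 characters here) is an assumption, and Python's re module may differ in details from the regexp dialect used by the product.

```python
import re


def find_custom_pii(text, instance_regexps, context_regexps=(),
                    ignore_case=True, window=40):
    """Return instance matches whose surrounding text passes the context check."""
    flags = re.IGNORECASE if ignore_case else 0
    hits = []
    for pattern in instance_regexps:
        for match in re.finditer(pattern, text, flags):
            # Surrounding text only; the window size is an assumed value.
            before = text[max(0, match.start() - window):match.start()]
            after = text[match.end():match.end() + window]
            context = before + " " + after
            if not context_regexps or any(
                    re.search(ctx, context, flags) for ctx in context_regexps):
                hits.append(match.group(0))
    return hits


detector = {
    "instance_regexps": [r"\bID-[0-9]{6}\b"],
    "context_regexps": ["student"],
    "ignore_case": True,
}
print(find_custom_pii("Student record: ID-012345 enrolled in 2019", **detector))
print(find_custom_pii("Invoice ID-012345 was paid", **detector))
```

    The second call finds a candidate instance but discards it, because no context regexp matches the surrounding text.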

    Custom detector parameters

    Parameter | Type | Description | Default
    instance_regexps | List of String | Candidate PIs must match at least one regexp in this list. | - (mandatory parameter)
    context_regexps | List of String | Candidate contexts must match at least one regexp in this list. No context checking if empty. | []
    severity | String | Severity level to assign to each hit. One of LOW, HIGH, CRITICAL. | -
    ignore_case | Boolean | Ignore text upper/lower case when matching. One of true, false. | true

    Add a custom detector

    Add a new detector named student_id:

    curl -k -XPUT https://username:password@127.0.0.1:443/v1/detectors -H 'Content-Type: application/json' -d'
    {
        "student_id": {
            "instance_regexps": ["\\bID-[0-9]{6}\\b"],
            "context_regexps": ["student"],
            "severity": "LOW",
            "ignore_case": true
        }
    }'
    

    You can define new custom detectors using either the web interface, or programmatically using the REST API.

    API endpoint

    PUT /v1/detectors

    Output:

    {"success" : true} if all went OK, or {"error": "error text"} otherwise.

    You enter custom detectors during scan configuration, using the optional detectors field.

    See the example to the right for a REST API example. This detector will look for words like ID-012345 inside any file. The pattern is ID- followed by 6 digits, delimited by word boundaries on both sides, so that words like PID-012345 or ID-0123456 won't match.

    In addition, we require that the word student appear nearby; otherwise the match is discarded. Note that we didn't put word boundaries around student here, so that words like "student", "students", "student's" etc will pass the context check too.

    Since we set ignore_case to true, letter casing is ignored. All of id-, ID- and Id- will match, and any of Student, STUDENTS etc will pass the context check.
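
    When generating detector definitions from code, it helps to validate them before sending. The Python sketch below builds the JSON body for the detector discussed above; detector_payload is a hypothetical helper, and the regexps are compiled with Python's re module purely as a sanity check, which may not match the server's regexp dialect exactly.

```python
import json
import re

SEVERITIES = ("LOW", "HIGH", "CRITICAL")


def detector_payload(name, instance_regexps, severity,
                     context_regexps=(), ignore_case=True) -> str:
    """Return the JSON body describing one custom detector."""
    if not instance_regexps:
        raise ValueError("instance_regexps is mandatory and must not be empty")
    if severity not in SEVERITIES:
        raise ValueError(f"severity must be one of {SEVERITIES}")
    for pattern in (*instance_regexps, *context_regexps):
        re.compile(pattern)  # raises re.error on a malformed pattern
    return json.dumps({
        name: {
            "instance_regexps": list(instance_regexps),
            "context_regexps": list(context_regexps),
            "severity": severity,
            "ignore_case": ignore_case,
        }
    }, indent=4)


payload = detector_payload("student_id", [r"\bID-[0-9]{6}\b"], "LOW",
                           context_regexps=["student"])
print(payload)
```

    Note how json.dumps doubles the backslashes, producing the same "\\b" escapes as in the hand-written request.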

    List all existing detectors

    Get a list of all available detectors:

    curl -k -XGET https://username:password@127.0.0.1:443/v1/detectors
    

    Response:

    {
      "custom": [
        {
          "name": "employee_id",
          "ignore_case": true,
          "instance_regexps": [
            "[A-z][A-z][0-9][0-9][0-9][A-z][A-z]"
          ],
          "context_regexps": [
            "\\bperson\\b"
          ],
          "severity": "LOW"
        }
      ],
      "regular": [
        {
          "category": "Personal",
          "name": "email",
          "severity": "LOW"
        },
        {
          "category": "Personal",
          "name": "face",
          "severity": "LOW"
        },
        
      ]
    }
    

    Endpoint

    GET /v1/detectors

    Output

    Field | Type | Description
    regular | JSON | List of built-in detectors
    custom | JSON | List of user-defined custom detectors

    Delete a custom detector

    Delete the custom detector with id my_detector:

    curl -k -XDELETE https://username:password@127.0.0.1:443/v1/detectors -H 'Content-Type: application/json' -d'{"name": ["my_detector"]}'
    
    {
      "success": true
    }
    

    Endpoint

    DELETE /v1/detectors

    Input

    Field | Type | Description
    name | List | List of detector names that you want to remove

    Output: JSON with {"success" : true} if all OK, or {"error": "error text"} if something goes wrong.

    Scan reports

    Scanning results can be accessed in two ways:

    1. An interactive HTML drill-down report, meant to be reviewed by humans.
    2. A structured dump of all detected personal records, meant to be processed by computers, for example when answering DSAR requests.

    Both types of report can be downloaded using the Download report API, or the corresponding web UI buttons in the "Actions" menu under each scan.

    In addition, the web user interface allows you to browse the drill-down reports online, without having to download the zipped HTML files locally.

    web UI inventory

    HTML report

    HTML reports from batch scans are drill-down web pages at three successively finer levels of resolution:

    1. Summary page (index.html)
      • Summarizes overall PII statistics by file type (PDF, CSV, archive etc), PII type and Severity.
    2. Listing page
      • Files and directories that match search criteria, grouped by location.
      • Filter by severity, file type and PII type.
      • Listing is a table that provides metadata about the matching file: file name, location, size, file type, severity, PII types.
    3. File page
      • Details about the PII detected in a particular file, with PII instances highlighted in context.

    The report can be downloaded as a ZIP archive of HTML pages using the Download report API.

    Summary Report

    Inventory dump

    The inventory dump is in the JSON lines format, a plaintext format where each line contains personal information about one scanned object (file or directory).

    To access the detected information in a raw, structured, computer-friendly way, without the HTML formatting and summaries, download the inventory dump using the Download report API.

    The JSON output format (schema) is the same for both stream and batch scans:

    • information about one file or directory per line; see below for the record schema
    • each FILE record contains metadata about the file scanned, such as the time taken to process the file, severity of PII detected in the file content, and examples of PII detected
    • DIRECTORY records contain a summary of all personal metadata found inside any files in that directory
    • DIRECTORY records do not contain the content_type, analyzed_size and file_size fields, only FILE records do.

    Example of one JSON line (reformatted for easier reading):

    {
        "pii_severity": "LOW",
        "pii_types": ["face"],
        "content_type": "image/jpeg",
        "analyzed_size": 6400,
        "file_size": 6400,
        "node_path": "file://test//profile-pic.jpeg",
        "node_type": "FILE",
        "pii_face_confidences": [1.0],
        "pii_face_contexts": [null],
        "pii_face_count": 1,
        "pii_face_examples": [[59, 51, 112, 112]],
        "total_time_ms": 153
    }
    

    JSON record schema (see an example to the right):

    • "pii_types": <Array[str]> – a set of all detected PII types
    • "pii_severity": <str> – overall severity of the detected PII (see Severity)
    • "pii_*_count": <int> – how many PII instances of this PII type were detected in this file? (see below for all possible returned types that replace *)
    • "pii_*_examples": <Array[str]> – a list of detected PII examples of each PII type
    • "pii_*_confidences": <Array[float]> – a list of confidence scores for each detection
    • "pii_*_contexts": <Array[str]> – a list of contexts for each detected PII instance (for easier reviewing)
    • "file_size": <int> – total file size in bytes
    • "analyzed_size": <int> – number of bytes analyzed within this file
    • "content_type": <str> – automatically detected content type
    • "node_path": <str> – path of node (here, filename)
    • "node_type": <str> – type of node (FILE or DIRECTORY)
    • "*_errors": <Array[str]> – a list of errors encountered while processing this file
    • "total_time_ms": <int> – time taken to process this file
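
    Because the inventory dump holds one JSON record per line, it can be processed as a stream. The Python sketch below reads records line by line and lists the FILE records that contain a given PII type, using only fields from the schema above. The helper names are hypothetical, and the sample line is the example record collapsed onto one line.

```python
import json


def iter_records(lines):
    """Yield one decoded record per non-empty line of a JSON lines dump."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def files_with_type(records, pii_type: str):
    """Yield (node_path, count) for FILE records that contain pii_type."""
    for record in records:
        if record.get("node_type") == "FILE" and pii_type in record.get("pii_types", []):
            yield record["node_path"], record.get(f"pii_{pii_type}_count", 0)


dump_lines = [
    '{"pii_severity": "LOW", "pii_types": ["face"], "content_type": "image/jpeg", '
    '"file_size": 6400, "node_path": "file://test//profile-pic.jpeg", '
    '"node_type": "FILE", "pii_face_count": 1, "total_time_ms": 153}',
    "",
]
matches = list(files_with_type(iter_records(dump_lines), "face"))
print(matches)
```

    With a real dump, pass an open file object instead of the in-memory list so that large inventories are never loaded at once.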

    Available PII types

    These are the concrete PII types we detect. You can substitute the value in the PII Type column for the * placeholder in the record schema above.

    PII Category PII Type Example instance Note
    Financial credit_card 3547011095740842
    Financial bank_account RS39 2712 7251 5923 5161 28
    Financial routing_number 111000012
    Sensitive race Asian
    Sensitive gender Female
    Sensitive religious_views about consciousness are generally shunned as psudo-scientific heretics by the hard science community. Conciousness is a meta-physical or philosophical concept.</p>\n\n<p>"I think, therefore I am." is the only proof that consciousness exists that I am aware of. Therefore, you cannot even prove that a person other', "a program that simulates the results of consciousness?</p>\n\n<p>I don't believe that you can program conscious AI, nor could you prove that you have done so. Consciousness isn't something that can ever be marketed. You can only market the AI on the basis of it's
    Sensitive sexual_preference It's only recently that I've come out to myself as being bisexual and learning to not just tolerate it but honor it.
    Personal name Sean Connery Full name
    Personal address San Raton, California 99109 Full address
    Personal face [59, 51, 112, 112] Face bounding box coordinates
    Personal date_of_birth 1962
    Personal phone 408.555.1296
    Personal email john.arnold@enron.com
    Personal street 1930 Second St Only available for structured data (CSV, XLS, SQL etc)
    Personal city Adams Only available for structured data (CSV, XLS, SQL etc)
    Personal country USA Only available for structured data (CSV, XLS, SQL etc)
    Personal country_code SN Only available for structured data (CSV, XLS, SQL etc)
    Personal first_name Garth Only available for structured data (CSV, XLS, SQL etc)
    Personal last_name Stofko Only available for structured data (CSV, XLS, SQL etc)
    Medical health Patient Information Name: Monica Latte Patient ID: 0000-44444 Birth Date: 04/04/1950 Gender: Female Marital Status: Divorced Problems: DIABETES MELLITUS (ICD-250.) HYPERTENSION, BENIGN ESSENTIAL (ICD-401.1) Medications: PRINIVIL TABS 20 MG (LISINOPRIL) 1 po qd Last Refill: #30 x 2 : Carl Savem MD (08/27/2010) HUMULIN INJ 70/30 (INSULIN REG & ISOPHANE (HUMAN)) 20 units ac breakfast Last Refill: #600 u x 0 : Carl Savem MD
    Medical health_id 1234-123-123-AZ
    Medical icd G44.311 World Health Organization ICD codes (version 9, 10, 11)
    Security ip 25.27.159.60
    Security username UserID: MNETTEL
    Security password password: enron4
    National driving_licence 609-53-5588
    National passport CX2345678
    National tax_id 988-88-8889 National Tax ID or equivalent
    National ssn 296-12-3298 Social security number or equivalent
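    To illustrate the substitution, the sketch below builds the four pii_* field names for a given PII type and maps the pii_types of a record back to the categories in the table. The category mapping is copied from the table above; the names PII_CATEGORIES, field_names and categories_in are hypothetical helpers, not part of the PII Tools API.

```python
# PII Category -> PII Types, transcribed from the table above.
PII_CATEGORIES = {
    "Financial": ["credit_card", "bank_account", "routing_number"],
    "Sensitive": ["race", "gender", "religious_views", "sexual_preference"],
    "Personal": [
        "name", "address", "face", "date_of_birth", "phone", "email",
        "street", "city", "country", "country_code", "first_name", "last_name",
    ],
    "Medical": ["health", "health_id", "icd"],
    "Security": ["ip", "username", "password"],
    "National": ["driving_licence", "passport", "tax_id", "ssn"],
}

# Reverse lookup: PII type -> category.
TYPE_TO_CATEGORY = {
    pii_type: category
    for category, pii_types in PII_CATEGORIES.items()
    for pii_type in pii_types
}


def field_names(pii_type):
    """Substitute a PII type for the * placeholder in the schema field names."""
    suffixes = ("count", "examples", "confidences", "contexts")
    return [f"pii_{pii_type}_{suffix}" for suffix in suffixes]


def categories_in(record):
    """Return the sorted categories of all PII types listed in a record."""
    types = record.get("pii_types", [])
    return sorted({TYPE_TO_CATEGORY.get(t, "Unknown") for t in types})


print(field_names("ssn"))
print(categories_in({"pii_types": ["face", "ssn"]}))
```

    Types that do not appear in the table are reported as "Unknown" rather than raising an error, which keeps the sketch usable if the list of detected types grows.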