    Getting started

    Who is PII Tools for?

    • CISO, InfoSec, Security, Legal & Privacy teams, who need to quantify privacy risk inside endpoints, emails, file shares, databases and cloud storages.

    • MSPs, service providers and consultants who need to audit customer data and manage breach incidents.

    • Data management platforms that want to enhance their solution with our powerful AI technology for PII discovery and redaction.

    This website documents PII Tools, an AI solution for automated discovery and analysis of sensitive and personal data across corporate digital assets.

    We built PII Tools to be:

    1. Secure: PII Tools runs on your hardware, either on-prem or in your cloud. Data never leaves your environment; PII Tools doesn't call any 3rd parties and can run air-gapped.
    2. Accurate: Actionable results with unmatched accuracy, thanks to PII Tools' proprietary AI algorithms.
    3. Comprehensive: Scans local and cloud storages, emails and databases. Handles structured data, unstructured data and images.
    4. Fast: A highly scalable architecture processes big data quickly.
    5. Easy to deploy and integrate: A turn-key Docker container, accessible through both a modern web interface (for humans) and a REST API (for machines).

    PII Tools architecture

    How do I start?

    1. If you are new to PII Tools, start by reading the section on Installation and deployment.

    2. Read Running a scan on how to submit scanning requests to PII Tools through its web interface or REST API.

    3. Scan reports covers how to access and interpret the output PII Tools generates.

    4. For product support or suggestions, reach out to PII Tools support.

    Term glossary

    Term                   Meaning
    Document               A digital artefact (file, database table, email…) that may contain personal information. Example: Word, CSV, Excel, PDF, scanned PDF with OCR, JPEG, web server log, Outlook, XML, JSON…
    Storage                A repository containing documents to be scanned. Example: file share, Office 365, AWS S3 bucket, SQL database…
    PII Tools server       Your locally deployed server that performs data discovery scans on documents and storages.
    Connector              A software component inside PII Tools that knows how to read documents from a particular type of storage. Example: End Device connector, SharePoint connector, MS SQL connector.
    Device Agent           An executable file that is run on a file share server or local device, enabling the scanning of its content.
    Scan                   The process of automatically detecting personal information. Scans can be either batch or streamed.
    Batch scan             A large scan that analyzes an entire storage or device at once, by pulling individual documents from it. Example: scanning an employee laptop; scanning an email archive; scanning an S3 bucket.
    Stream scan            Scans a single document pushed to the server, returning the scanning results synchronously, in real time. Doesn't access any storages. Example: scanning one PDF document, one Word document, one email.
    Inventory index        PII Tools maintains a detailed index of all personal data detected across all batch scans. From this inventory, you can generate drill-down reports or run PII analytics for SAR requests.
    Scan report            A summary report generated from a particular inventory index. Can be in drill-down HTML format for easy reviews, or in machine-readable JSONL format to answer automated SAR requests.
    Web interface, web UI  Users can submit scanning requests and manage scanning results from an integrated (local) web interface.
    REST API               Users looking to integrate PII Tools can also submit scans and generate reports by means of HTTPS requests to a PII Tools server.

    Data persistence and security

    Personal data is by definition sensitive — where and for how long does PII Tools store it?

    • For stream scans, no data is ever persisted. The HTTPS request (whether coming from the web UI or the REST API) is executed immediately, and the detected personal information is sent back as the request response. See Stream scans.

    • For batch scans, as the scan progresses, the detected information is being collected and persisted into an internal database called the "inventory index". This inventory index is used to generate reports and answer analytics queries. To permanently delete all information associated with a particular batch scan, call the Delete scan index API, or click the trash can icon in the web UI next to the scan under "Actions".

    • The original file content is never stored (mirrored) inside PII Tools. Only the detected PII is stored in the inventory index for batch scans.

    Anyone authorized to submit scan requests to a PII Tools server can also view all scans and generate scan reports on that server.

    All data is encrypted in transit using the HTTPS protocol, for example between PII Tools and a remote device or cloud storage being scanned. Since the PII Tools server is typically deployed on a local IP (without a public domain), it uses a local self-signed SSL certificate to enable HTTPS.

    No data is transmitted or stored outside the PII Tools server, nor are any external services called. Configuration parameters, such as access credentials to remote cloud storages (see Scan configuration), are kept internally inside PII Tools until the corresponding scan is deleted.

    Web interface

    In addition to the programmatic access via REST API, PII Tools also offers scanning capabilities through a user-friendly web interface.

    This web interface is installed automatically when you deploy PII Tools, and runs on the same address and port as the server itself (see Deployment).

    For example, if you deployed PII Tools on a machine with IP 195.201.160.29 and REST port 443, open your browser and go to https://195.201.160.29:443.

    You should see a welcome screen like this:

    web UI welcome

    The web interface allows you to:

    The parameters exposed in the web UI correspond to (a subset of) parameters supported by the REST API. This means all operations that can be performed through the web UI can also be performed using REST, but not necessarily vice versa.

    REST API

    Sample stream scanning request against the PII Tools REST API:

    $ curl -k -s --user username:password -XPOST https://127.0.0.1:443/v3/stream_scan -H 'Content-Type: application/json' -d'
    {
        "filename": "bank_form.pdf",
        "content": "'$(base64 -w0 /tmp/bank_form.pdf)'"
    }'
    

    This request will generate a response like this:

    {
        "status": "SCANNED",
        "processing": {
            "_time": 0.2773430347442627,
            "_time_children": 0.2770969867706299,
            "_time_self": 0.0002460479736328125,
            "language": "en",
            "language_confidence": 1.0,
            "severity": "3-CRITICAL"
        },
        "pii": [
            {
                "confidence": 1.0,
                "pii": "Mustafa Abdul",
                "context": ", From : Name : Mustafa Abdul The Branch Manager Address :",
                "pii_category": "Personal",
                "pii_type": "name",
                "position": 105
            },
            {
                "confidence": 1.0,
                "pii": "2201 C Street NW I Washington, DC 20520",
                "context": "Abdul  \nThe Branch Manager                                 Address: 2201 C Street NW I Washington, DC 20520 \nBank of America                                 Phone No",
                "pii_category": "Personal",
                "pii_type": "address",
                "position": 181
            },
            {
                "confidence": 1.0,
                "pii": "GL28 0219 2024 5014 48 ",
                "context": "Account Transfer  \nA/c No. GL28 0219 2024 5014 48",
                "pii_category": "Financial",
                "pii_type": "bank_account",
                "position": 387
            }
        ],
        "storage": {
            "content_type": "application/pdf",
            "doctype": "pdf",
            "filename": "bank_form.pdf",
            "filesize": 47375,
            "location": "bank_form.pdf"
        },
        "errors": []
    }
    

    Once the PII Tools service is running, users may issue programmatic scanning requests using its REST interface. The requests are described in detail in the Running a scan section and can be submitted from any language and environment, using standard libraries and tooling, such as Java, Python or C#.
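
    As a rough illustration of consuming the sample response above from Python, the sketch below groups the detected values by pii_category. The field names come from the sample response; the function name summarize_pii and the min_confidence threshold are assumptions made for this example, not part of PII Tools.

```python
from collections import defaultdict

def summarize_pii(response, min_confidence=0.5):
    """Group detected PII by category, skipping hits below min_confidence.

    min_confidence is a hypothetical client-side filter, not an API parameter.
    """
    grouped = defaultdict(list)
    for hit in response.get("pii", []):
        if hit.get("confidence", 0.0) >= min_confidence:
            # Values may carry trailing whitespace, as in the sample response.
            grouped[hit["pii_category"]].append((hit["pii_type"], hit["pii"].strip()))
    return dict(grouped)

# Abbreviated copy of the sample stream scan response shown above.
sample_response = {
    "status": "SCANNED",
    "pii": [
        {"confidence": 1.0, "pii": "Mustafa Abdul",
         "pii_category": "Personal", "pii_type": "name"},
        {"confidence": 1.0, "pii": "2201 C Street NW I Washington, DC 20520",
         "pii_category": "Personal", "pii_type": "address"},
        {"confidence": 1.0, "pii": "GL28 0219 2024 5014 48 ",
         "pii_category": "Financial", "pii_type": "bank_account"},
    ],
    "errors": [],
}

summary = summarize_pii(sample_response)
for category, hits in summary.items():
    print(category, hits)
```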

    PII Tools uses HTTPS with Basic Authentication. Any non-authenticated requests are rejected. You can edit the username and password in docker-compose.yml (see the Deployment section).

    To keep working even in local air-gapped installations, PII Tools uses a self-signed SSL certificate. Configure your HTTPS client not to check the certification authority, as with curl -k in the examples in this documentation.
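
    For clients written in Python, the curl stream scan request could be approximated with the standard library alone. This is a minimal sketch: the helper names build_stream_scan_request and insecure_context are made up for this example, and disabling certificate verification mirrors curl -k, so it is only appropriate against your own PII Tools server with its self-signed certificate.

```python
import base64
import json
import ssl
import urllib.request

def build_stream_scan_request(base_url, username, password, filename, content):
    """Build (but do not send) a POST /v3/stream_scan request with Basic Authentication."""
    payload = json.dumps({
        "filename": filename,
        # File content is sent base64-encoded, as in the curl example.
        "content": base64.b64encode(content).decode("ascii"),
    }).encode("utf-8")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{base_url}/v3/stream_scan",
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )

def insecure_context():
    """TLS context that skips certificate checks, the equivalent of curl -k."""
    context = ssl.create_default_context()
    context.check_hostname = False      # must be disabled before verify_mode
    context.verify_mode = ssl.CERT_NONE
    return context

request = build_stream_scan_request(
    "https://127.0.0.1:443", "username", "password",
    "bank_form.pdf", b"placeholder bytes standing in for the PDF",
)
# To actually send it against a running server:
# with urllib.request.urlopen(request, context=insecure_context()) as reply:
#     result = json.load(reply)
```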

    Overview

    All REST requests follow the same structure:

    API URL structure

    • Request headers
      • use standard HTTP methods: GET (to retrieve an object), POST (to create), PUT (to update), DELETE (to delete)
      • parameters are always in JSON format (Content-type: application/json)
    • Protocol https://
    • Domain and port of the PII Tools server as configured during Deployment
    • PII Tools API version; currently v3
    • Parameters of the scanning action to take (see scan configuration)
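
    The URL structure above can be sketched as a small helper. The function name api_url is hypothetical, and the v3 prefix is taken from the curl examples in this documentation; adjust it if your server reports a different API version.

```python
from urllib.parse import urlencode

API_VERSION = "v3"  # version prefix used by the curl examples in this documentation

def api_url(host, port, path, **query):
    """Compose protocol, host, port, API version, path and optional query string."""
    url = f"https://{host}:{port}/{API_VERSION}/{path.lstrip('/')}"
    if query:
        # urlencode percent-escapes special characters such as '*'.
        url += "?" + urlencode(query)
    return url

print(api_url("127.0.0.1", 443, "/scans/1234"))
print(api_url("127.0.0.1", 443, "/scans/", name_pattern="*est*"))
```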

    The REST API responses are in JSON too (Content-type: application/json), and will return an HTTP status according to the success/failure of each operation. PII Tools uses a combination of HTTP status codes and descriptive error messages to give you a more complete picture of what has happened with your request.

    For example, if you request a non-existent resource, a 404 error is returned:

    $ curl -k -XGET https://username:password@127.0.0.1:443/v3/scans/1234
    
    HTTP/1.1 404 NOT FOUND
    {
        "_success": false,
        "error": "Parameter error: Scan with id 1234 not found."
    }
    
    HTTP status, its meaning, and whether to retry:

    • 2xx
      • Meaning: Request was successful. Example: 200 Success.
    • 4xx
      • Meaning: A problem with the request prevented it from executing successfully.
      • Retry: Never automatically retry the request. If the error code indicates a problem that can be fixed, fix the problem and then retry the request.
    • 5xx
      • Meaning: The request was properly formatted, but the operation failed on PII Tools' end.
      • Retry: In some scenarios, requests should be automatically retried using exponential backoff.

    Basically, any request that did not succeed will return a 4xx or 5xx error and the JSON response will contain the {"error": "<message>"} field. The 4xx range means there was a problem with the request, such as a missing parameter. The 5xx range indicates an internal PII Tools error.
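
    One way a client might apply this retry guidance is sketched below: 5xx responses are retried with exponentially growing delays, everything else is returned immediately. The names should_retry and call_with_backoff, the attempt limit and the base delay are all assumptions for illustration; PII Tools does not prescribe specific values.

```python
import time

def should_retry(status):
    """Only 5xx responses are candidates for an automatic retry."""
    return 500 <= status <= 599

def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-5xx status or attempts run out.

    send is any callable returning a (status, body) tuple.
    The delay doubles after each failed attempt: base_delay, 2x, 4x, ...
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if not should_retry(status) or attempt == max_attempts - 1:
            return status, body
        sleep(base_delay * 2 ** attempt)

# Usage with a fake sender and a fake sleep, so nothing is sent and nothing blocks.
delays = []
replies = iter([(503, "busy"), (503, "busy"), (200, "ok")])
result = call_with_backoff(lambda: next(replies), sleep=delays.append)
print(result, delays)
```

    A 4xx reply is returned on the first attempt, which matches the rule that such requests must be fixed rather than repeated.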

    Main REST endpoints

    This is a list of the main REST endpoints. For details and examples, see the main sections below.

    Endpoint                                  Purpose
    GET /status                               Get service overview status.
    GET /scans/                               Get a list of all batch scans.
    GET /scans/?name_pattern=*est*            Get a list of all batch scans matching a name pattern.
    POST /scans/                              Launch a new batch scan.
    GET /scans/<scan_id>                      Get detailed metadata info for a scan.
    PUT /scans/<scan_id>                      Update a scan, for example pause or rename a scan.
    DELETE /scans/<scan_id>                   Delete a scan.
    GET /scans/<scan_id>/objects/<object_id>  Get detailed metadata info for a file.
    GET /scans/<scan_id>/objects?format=X     Download scan report in {audit, json, jsonl, csv, xlsx, html} format.
    POST /stream_scan                         Launch a stream scan, real-time scanning API.
    POST /analytics                           Run analytics over all scans and objects that match a query, download in one of {facets, csv, xlsx, xlsx_simple, html, json, jsonl, audit} formats.
    GET /analytics/_field_mapping             Get mapping between query keys and backend names.
    GET /detectors/                           Get all built-in and custom detectors.
    GET /detectors/builtin                    Get all built-in detectors.
    GET /detectors/custom                     Get all custom detectors.
    POST /detectors/custom                    Create a new custom detector.
    GET /detectors/custom/<detector_id>       Get an existing custom detector.
    PUT /detectors/custom/<detector_id>       Update an existing custom detector.
    DELETE /detectors/custom/<detector_id>    Delete a custom detector.
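
    As a small illustration of the report download endpoint listed above, a client could validate the requested format before building the path. The function name report_path is hypothetical; the set of formats is copied from the endpoint list.

```python
# Report formats listed for GET /scans/<scan_id>/objects?format=X
REPORT_FORMATS = {"audit", "json", "jsonl", "csv", "xlsx", "html"}

def report_path(scan_id, fmt):
    """Return the endpoint path for downloading a scan report in the given format."""
    if fmt not in REPORT_FORMATS:
        raise ValueError(f"unsupported report format: {fmt!r}")
    return f"/scans/{scan_id}/objects?format={fmt}"

print(report_path(1234, "jsonl"))
```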

    Supported scans

    Supported PII types

    The lyrics.txt file is a great litmus test for detection quality. It contains words like "medicine", "sexual" and "healing" used in non-personal context, which will (incorrectly) trigger many rule-based systems. PII Tools correctly ignores it as a false positive. We recommend running this file on any discovery tool you're evaluating, to check the results!

    The following types of personal and sensitive information are supported out of the box:

    Covered data  PII types
    Personal      full name, home address, face, phone number, date of birth, email, first name, last name, city, country
    Financial     bank account number, credit card number, routing number
    Sensitive     sexual preferences, race, gender, religious views
    Health        Medicare IDs, personal health information (PHI), medical records, WHO ICD codes
    National      passport and ID card scans, passport numbers, driving license, SSN, personal tax ID
    Security      username, password, IP address

    You can also define your own detectors dynamically, using custom rules and regexps. See Custom Detectors.

    Supported storages

    In addition to Stream scans, PII Tools can scan entire storages. Here is the full list of PII Tools storage connectors available out-of-the-box:

    Storage                        scan_type          Comment
    File shares                    device             File shares, SMB and mounted drives are scanned using Device Agents.
    Filesystems                    device             Both remote and local file systems are scanned using Device Agents.
    Devices and work stations      device             Windows, OSX and Linux computers are scanned using Device Agents.
    Dropbox                        device             Only locally synced Dropbox folders are supported: use device with root_folder pointed at the Dropbox sync folder.
    Amazon S3                      s3                 Scan AWS S3 buckets.
    Google Drive                   gdrive             Scan Google Drive storages, using either a refresh token or a service account.
    Microsoft SQL Server           odbc               Scan MS SQL databases, schemas and tables. Versions 2008, 2008R2, 2012, 2014, 2016, 2017 and Azure SQL.
    Oracle                         odbc               Scan Oracle databases, schemas and tables. Supports both pluggable databases (PDB, Oracle 12c+) and 11g.
    Postgres                       odbc               Scan Postgres and Amazon RDS databases, schemas and tables.
    MySQL                          odbc               Scan MySQL and MariaDB databases and tables.
    Office 365: Exchange Online    mgraph-exchange    Scan Microsoft Exchange Online mailboxes and users.
    Office 365: OneDrive           mgraph-onedrive    Scan Microsoft OneDrive storages.
    Office 365: SharePoint Online  mgraph-sharepoint  Scan Microsoft SharePoint Online sites.
    Microsoft Azure Blob           azure-blob         Scan Azure Blob storages.

    Supported file formats

    Use the free PII Tools trial to verify how PII Tools will process your particular files.

    PII Tools supports more than 400 file formats, including structured files (CSV, Excel, JSON, XML…) and unstructured files (PDF, email, Word, images, OCR, …). It will analyze files of different types accordingly, using the appropriate context parser, to maximize accuracy.

    For some document format conversions, PII Tools uses the Apache Tika framework internally. You can find the list of all supported file formats here.

    Supported archive formats include PST, MBOX, ZIP, RAR, TAR.

    Supported severity levels

    Not all personal information is created equal: an IP address in a web server log does not carry the same risk as a spreadsheet full of names, home addresses and credit card numbers.

    Considering data in context allows PII Tools to assess not only the presence, but also the severity of the detected information. Assigning severity levels to files improves the information filtering and review experience.

    PII Tools will automatically classify each document into one of four severity levels:

    Severity  Description
    NONE      No personal data-related risk identified in this file.
    LOW       Some potentially identifying information detected, such as an isolated IP address or user name. This personal data is also covered by GDPR, but people typically don’t care to protect this type of data.
    HIGH      Sensitive data that a person would be unhappy to see made public. HIGH risk is also assigned when PII Tools detects a lot of PII, even if low risk, indicating a PII dump at risk of breach.
    CRITICAL  Direct risk of identity theft, blackmail, financial damage or loss of job.
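
    When reviewing results programmatically, the four levels can be treated as an ordered scale. The sketch below is one possible approach, not PII Tools code: it accepts both plain labels ("HIGH") and prefixed labels like the "3-CRITICAL" value seen in the sample stream scan response. The document dictionaries and the names severity_rank and at_least are assumptions made for this example.

```python
# Severity levels in increasing order of risk, as described above.
SEVERITY_ORDER = ["NONE", "LOW", "HIGH", "CRITICAL"]

def severity_rank(label):
    """Map 'CRITICAL' or '3-CRITICAL' to its position on the severity scale."""
    name = label.split("-", 1)[-1].upper()
    return SEVERITY_ORDER.index(name)

def at_least(documents, threshold):
    """Keep only documents whose severity is at or above the threshold."""
    minimum = severity_rank(threshold)
    return [doc for doc in documents if severity_rank(doc["severity"]) >= minimum]

# Hypothetical per-document results, invented for illustration.
documents = [
    {"location": "access.log", "severity": "LOW"},
    {"location": "bank_form.pdf", "severity": "3-CRITICAL"},
    {"location": "notes.txt", "severity": "NONE"},
    {"location": "hr_export.xlsx", "severity": "HIGH"},
]
flagged = at_least(documents, "HIGH")
print([doc["location"] for doc in flagged])
```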

    Installation and deployment

    Code examples in this documentation use the curl command to send HTTPS requests. While curl is great for demonstrations, you can of course issue the same requests using your favourite web library, such as requests for Python or Unirest for Java.

    This section describes how to install PII Tools on your own server, whether on-premises or in your cloud.

    The installation process is simple and involves two main steps:

    1. Edit the service configuration file: set your desired username, password etc.
    2. Launch the service from its virtual image.

    The installation is done by your own IT team, requires a working network connection to download the virtual image, and takes 15-30 minutes.

    Installation contains

    As part of your purchase, you should have received:

    1. A license agreement plus one or more license keys.
    2. A docker-compose.yml file that bootstraps the deployment of your private PII Tools server.
    3. A README.txt file containing the username and password for accessing RARE's private Docker registry.
    4. This documentation.

    Hardware requirements

    The PII Tools service can run on any machine that supports Docker, which includes MacOS, Microsoft Windows 10, Amazon Web Services (AWS), Microsoft Azure, IBM Cloud and Linux. This is because PII Tools is deployed by means of a fully configured, turn-key Docker image.

    The PII Tools server requires:

    • CPU cores
      • 4 cores absolute minimum
      • 32 cores recommended for best performance
      • Adding more CPU cores improves performance significantly, thanks to PII Tools' parallelized architecture
    • Free RAM
      • 6 GB of RAM plus an additional 1 GB RAM per scan worker absolute minimum
      • 64 GB RAM recommended for best performance
    • Free disk space
      • 8 GB of free disk space absolute minimum – plus 30 GB of free disk space per every 1,000,000 files in your scanned inventory
      • 1 TB recommended for best performance
    • Network
      • A fast HTTPS connection between the server and the storage to be scanned: your file share, S3 bucket, laptop, endpoint device, etc.
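
    The absolute minimums above reduce to simple arithmetic, sketched here for capacity planning. The function names are hypothetical, and treating the 30 GB per 1,000,000 files as proportional (rather than rounded up per started million) is an assumption.

```python
def minimum_ram_gb(scan_workers):
    """6 GB base plus 1 GB per scan worker (absolute minimum)."""
    return 6 + scan_workers

def minimum_disk_gb(inventory_files):
    """8 GB base plus 30 GB per 1,000,000 files in the scanned inventory."""
    return 8 + 30 * inventory_files / 1_000_000

# Example: 4 scan workers and an inventory of 2 million files.
print(minimum_ram_gb(4), "GB RAM")
print(minimum_disk_gb(2_000_000), "GB disk")
```

    These are floors only; the recommended figures for best performance (32 cores, 64 GB RAM, 1 TB disk) are considerably higher.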

    The Device Agents for scanning local devices have no dependencies. They are simple executable files (".exe" and ".msi" on Windows, "binary" on Linux and OSX) that are run on the device to be scanned. They only need to be able to connect to a running PII Tools server via HTTPS.

    How it works

    PII Tools and its dependencies are packaged as a Docker image provided by RARE Technologies from a secure Docker registry. You install this image locally, using the instructions below.

    At no point are any scanned documents sent to RARE Technologies or any 3rd parties. See also Data persistence and security. The installed service will be fully "local", and does not require nor talk to any external services or tools.

    Once installed, PII Tools can even run in air-gapped mode, without any internet access.

    After deployment, users submit requests to PII Tools via PII Tools' built-in web interface or via its REST API.

    Once deployed, PII Tools keeps running until explicitly terminated. There is no need to re-deploy PII Tools for each individual scan.

    Conceptually, PII Tools consists of three parts:

    1. A central PII Tools server that does the heavy lifting (document format conversions, PII detection with machine learning, worker parallelization). The server doesn't have any direct access to any of your documents.
    2. (for a local device scan) Device Agent, a small executable program you run on the device to be scanned: fileshare, laptop, desktop… It accesses the documents stored there and sends them to the PII Tools server for processing.
    3. (for a remote cloud scan) Connector that knows how to access documents on remote storages (S3, Azure, Office365, SQL Database…). It uses read-only API credentials you supplied in the scan configuration to access and scan data from the remote storage. See supported storages.

    For stream scans ("Quick scan" in the web interface), where you upload a single file to the PII Tools server for real-time analysis, no Device Agents or Connectors are necessary.

    Deployment

    1. Install Docker on the machine (server) where you wish to host PII Tools. Docker supports MacOS, Microsoft Windows 10, Amazon Web Services (AWS), Microsoft Azure, IBM Cloud, CentOS, Debian, Fedora and Ubuntu.

    2. Install Docker Compose.

    3. Windows and OSX: Increase the RAM and CPU available in Docker Advanced Settings. As a rule of thumb, allow as many cores as possible, and 8 GB of RAM plus an extra 1 GB of RAM per core. (This is not needed on Linux servers, where virtualization is more efficient and can use all hardware resources by default.)

    4. Run docker login registry.pii-tools.com --username <USERNAME> --password <PASSWORD> to log into the private Docker registry of PII Tools. <USERNAME> and <PASSWORD> were provided to you as part of your license purchase in README.txt (see Installation contains). If you authenticated successfully, you'll see a Login Succeeded message in your console.

    5. Edit the docker-compose.yml configuration file provided to you as part of your purchase with a text editor. This YAML file contains critical instructions for PII Tools configuration:

    • Set LICENSE_KEY to your license key. PII Tools won't function without a valid license key.
    • Set USERNAME and PASSWORD according to your preferences. These will be the username and password you use to log in to the web interface or issue API requests.
    • Set NUM_SCAN_WORKERS to the number of worker processes (CPU cores) you wish to use for parallelization (default: 4; recommended: #CPU cores minus two, e.g. 6 for an 8-core machine).
    • Change HOST, REST_PORT and AGENT_PORT to the IP and ports you want PII Tools server to run on. Defaults:
      • run the web service on localhost, port 443 (127.0.0.1:443)
      • run the Device Agent server on localhost, port 1789 (127.0.0.1:1789)

    The defaults are HOST=127.0.0.1 (localhost) for security reasons, but most of the time you'll want PII Tools to bind to an external IP, so the server is visible from outside machines. We recommend changing 127.0.0.1 to 0.0.0.0, which will make PII Tools bind to all available interfaces (IP addresses) of the machine where you install PII Tools.

    Unless ports 443 and 1789 are already in use, we recommend keeping these default values.

    Save the edited configuration file without changing its file name (docker-compose.yml), and exit the text editor.

    6. Run docker-compose -f docker-compose.yml up -d. This process may take a while (15-30 minutes, depending on your internet connection speed), but is only done once, at the PII Tools server installation time.

    To test that the installation was successful and the REST API is active, run this command:

    $ curl -k -XGET https://username:password@127.0.0.1:443/v3/status
    

    After which you should see:

    {
        "uptime": "0d 5h 26m",
        "version": "3.0.0",
        "customer_name": "ACME CORP",
        "license_type": "enterprise",
        "expires": "2022/01/02",
        "hostname": "0.0.0.0",
        "rest_port": 443,
        "agent_port": 1789,
        "num_rest_workers": 15,
        "num_scan_workers": 4,
        "rest_worker_timeout": 60,
        "scan_worker_timeout": 60,
        "total_scans": 0,
        "unfinished_scans": 0
    }
    

    The six steps above will:

    • Install all dependencies and PII Tools itself.
    • Launch the Analytics web user interface. Access it in your internet browser at https://127.0.0.1:443 by default (see above for how to configure a different host, port, username or password).
    • Launch the PII Tools REST service. Use the example above to run your first API request (again, replace the host, port, username and password according to the config values you set above).

    At this point, the service is running and ready to use. Congratulations!
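
    A monitoring script could run a few sanity checks on the /status response shown above. This is only a sketch: the field names come from the sample response, while the function name check_status, the reading of "expires" as a YYYY/MM/DD date, and counting the expiry day itself as still valid are assumptions.

```python
from datetime import date, datetime

def check_status(status, today):
    """Derive a few health indicators from a /status response dictionary."""
    # Assumption: "expires" uses the YYYY/MM/DD layout seen in the sample.
    expires = datetime.strptime(status["expires"], "%Y/%m/%d").date()
    return {
        "version": status["version"],
        "license_valid": today <= expires,
        "days_left": (expires - today).days,
        "idle": status["unfinished_scans"] == 0,
    }

# Abbreviated copy of the sample /status response.
sample_status = {
    "version": "3.0.0",
    "expires": "2022/01/02",
    "num_scan_workers": 4,
    "total_scans": 0,
    "unfinished_scans": 0,
}
report = check_status(sample_status, today=date(2021, 12, 3))
print(report)
```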

    new installation screenshot

    Software maintenance

    To stop PII Tools without erasing your inventory (non-destructive stop), execute this command on the machine that hosts the PII Tools server:

    $ # Stops service; no data is lost.
    $ docker-compose -f docker-compose.yml stop
    
    Stopping pii_tools         ... done
    Stopping inventory         ... done
    

    PII Tools operates as a long-running service and does not require any maintenance. After each product upgrade, you might wish to run docker system prune to remove images of old releases, in order to reclaim disk space.

    To terminate PII Tools, simply stop its Docker container using the command above.

    You can resume later, without losing any indexed data, by running step 6) from Deployment again:

    $ docker-compose -f docker-compose.yml up -d
    

    To terminate PII Tools and wipe all inventory indexes (warning! this action will erase all data from PII Tools and cannot be undone!), run docker-compose -f docker-compose.yml down --volumes instead. Use this to reset PII Tools to a clean, fresh installation.

    Support

    When submitting a support request, please be clear in your description of the problem: What result did you observe? What did you expect instead? Ideally with screenshots. This helps us resolve your request faster.

    PII Tools support may ask you for your service log. Capture the service log with the following command, zip the created file acmecorp_2020_01_21.log and attach it to your support request:

    $ docker-compose logs --no-color > acmecorp_2020_01_21.log
    

    Please also include your service health info, which you get by clicking the notepad icon shown here:

    check PII Tools version

    Product upgrade

    To check your current service version, click the ⓘ button in the top-right corner of the web UI, or run this REST request:

    $ curl -k -XGET --user username:pwd https://127.0.0.1:443/v3/status
    
    {
        "uptime": "0d 18h 2m",
        "version": "3.0.0",
        "customer_name": "ACME CORP",
        "license_type": "enterprise",
        "expires": "2022/01/02",
        "hostname": "0.0.0.0",
        "rest_port": 443,
        "agent_port": 1789,
        "num_rest_workers": 15,
        "num_scan_workers": 4,
        "rest_worker_timeout": 60,
        "scan_worker_timeout": 60,
        "total_scans": 0,
        "unfinished_scans": 0
    }
    

    From time to time, RARE Technologies may release a new version of PII Tools with upgrades and bug fixes. If your license allows for it, this upgrade is made available to you by means of a new Docker image and an updated docker-compose.yml configuration file.

    To install an upgrade (optional), read its release notes carefully. If you wish to proceed:

    1. (optional) Back up your custom detectors and exclusions using the Export function.

    2. Do the release notes call for wiping your inventory?

      • Yes: Stop the service with docker-compose -f docker-compose.yml down --volumes. Warning! This will reset PII Tools to a clean, fresh, empty installation. Any existing scans, custom detectors and exclusions will be permanently removed.
      • No: Stop the service with docker-compose -f docker-compose.yml down. Your existing inventory (existing scans, custom detectors, exclusions) will be carried forward to the upgraded version unaffected.
    3. Edit the new docker-compose.yml configuration file from the new upgrade, re-applying your settings from step 5) in Deployment.

    4. Relaunch the service with docker-compose -f docker-compose.yml up -d. The restarted service will use the upgraded version. To verify you're indeed running the new version, open the PII Tools web UI and click the ⓘ button in the top-right corner.

    5. (optional) Re-import your state backup from step 1: Import function.

    That's it, your upgraded version is now active, congratulations!

    check PII Tools version

    If you wish, you may remove old Docker images after each upgrade, to reclaim disk space. The Docker images of older releases are no longer needed and can be safely removed with the command docker system prune.
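
    To confirm an upgrade from a script rather than the web UI, the "version" field of /status can be compared before and after. The sketch below assumes versions are dot-separated integers like the "3.0.0" in the sample response; the function names and the "3.1.0" value are invented for illustration.

```python
def parse_version(version):
    """Turn '3.0.0' into (3, 0, 0) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))

def is_upgraded(status_before, status_after):
    """True if the /status version after the upgrade is newer than before."""
    return parse_version(status_after["version"]) > parse_version(status_before["version"])

before = {"version": "3.0.0"}   # from the sample /status response
after = {"version": "3.1.0"}    # hypothetical newer release
print(is_upgraded(before, after))
```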

    Authenticating connectors

    Some connectors, such as Office 365, Google Drive or Amazon S3, require authorizing PII Tools in order to scan the data stored inside.

    To streamline the process of authorizing PII Tools and obtaining the necessary credentials, we prepared the step-by-step instructions with screenshots below. But keep in mind that in principle, you can obtain the necessary parameters any other way. These instructions are just a guideline for your convenience. PII Tools only needs the access credentials as input in order to run a scan, no matter where you got them from.

    Microsoft Office 365

    Microsoft Graph is Microsoft's API for accessing data stored on Microsoft Office 365 services, such as Exchange Online, OneDrive, and SharePoint Online.

    In order for PII Tools to scan data inside Office 365, you'll need the following access credentials. This section describes how to obtain them in detail:

    • client ID (client_id),
    • client secret (client_secret)
    • tenant ID (tenant_id)

    In a nutshell, PII Tools needs to be registered by an administrator in the Microsoft Azure Registration Portal. This creates the client_id and client_secret for PII Tools. tenant_id is the ID of the organization whose data is to be scanned by PII Tools, i.e. your company.

    Prerequisites

    • A Microsoft Office 365 account with administrator privileges.
    • PII Tools deployed on a server accessible from your local computer. See Deployment. We will refer to this server as https://<pii-tools-server-ip-address-and-port>/ below.

    Registering PII Tools

    1. Go to https://portal.azure.com/#blade/Microsoft_AAD_RegisteredApps/ApplicationsListBlade and log in as an administrator.

    2. Click on New registration in the top left corner.

    3. On the Register an application form:

      • Set Name to "PII Tools".
      • Fill in https://<pii-tools-server-ip-address-and-port>/mgraph_auth into the Redirect URI, replacing <pii-tools-server-ip-address-and-port> with your PII Tools server IP address and port. For example, if you installed PII Tools at 175.28.1.10 and port 443, fill in https://175.28.1.10:443/mgraph_auth here.
      • Click on Register.
    4. On the Overview page of the newly created application:

      • Take note of the Application (client) ID. This is your client_id.
      • Take note of the Directory (tenant) ID. This is your tenant_id.
      • Next, click "View API permissions".
    5. On the PII Tools - API permissions page

      • Click on Add a permission.
      • In the pop up, select Microsoft Graph and then Application permissions (not "Delegated permissions"!).
      • Select the following permissions, by entering each permission into the Type to search box and then clicking the checkbox to the left of the permission to add it:
        • Directory.Read.All (required for OneDrive and SharePoint Online)
        • Files.Read.All (required for OneDrive and SharePoint)
        • Mail.Read (required for Exchange)
        • Sites.Read.All (required for OneDrive and SharePoint Online)
        • User.Read.All (required for Exchange and OneDrive)
        • When done adding these 5 permissions, click the Add permissions button at the bottom of the screen.
      • You can also select only a subset of the permissions if you are not going to use all available connectors. For example, you can exclude Mail.Read if you're not going to scan Exchange Online data.
        • You'll be able to adjust these permissions at any time in the future, by revisiting this Azure Portal page and changing the settings.
      • Scroll down to the bottom of the page and click on Grant admin consent for <my organization>.
    6. Go to the Certificates & secrets page in the left menu and:

      • Click New client secret near the bottom of the screen. A sub-window with Description and Expiration will pop up.
      • Enter mgraph API secret into Description.
      • Select Expires: Never.
      • Click Add to confirm.
      • Take note of the generated Value: this is your client_secret.

    Congratulations. You are now ready to scan your Microsoft Office 365 data, using the client_id, tenant_id and client_secret obtained above. See Running a scan.

    Security notes

    The client_secret is required for PII Tools to authenticate against the Microsoft Graph API and needs to be provided when initializing an Office 365 scan (Exchange, OneDrive, or SharePoint). If you lose your Office 365 client_secret, PII Tools cannot help you retrieve it.

    Google Drive

    To scan a Google Drive storage, you'll need to obtain one of the following OAuth credentials:

    1. client_id, client_secret and refresh_token (to scan a single GDrive account)
    2. JSON credentials for GSuite service account (for domain-wide scanning)

    Authenticate GDrive using tokens

    To obtain the refresh_token credentials, you (the administrator of PII Tools) must take these two steps, explained in more detail below:

    1. register the PII Tools application in the Google APIs
    2. grant the application access to the files to be scanned

    Follow the instructions at http://www.howtosolvenow.com/how-to-get-refresh-token-for-google-drive-api/ to generate a refresh_token for the desired account. When prompted for permission scope, enter https://www.googleapis.com/auth/drive.readonly. This will allow PII Tools to read (and nothing but read) data from the target drive.

    Authenticating GDrive using a service account

    Service accounts are more convenient than tokens in case you are the domain administrator, and wish to scan Google Drives of multiple users. Instead of generating a token for each user account, which can be tedious, you can set up one service account to impersonate any user in your domain.

    To set up a service account and delegate authority, follow the official Google steps at https://developers.google.com/identity/protocols/OAuth2ServiceAccount#delegatingauthority. The only permission scope required by PII Tools is https://www.googleapis.com/auth/drive.readonly.

    Microsoft Azure Blob

    To scan an Azure Blob storage, the account_name and account_key are needed.

    In order to obtain these credentials:

    1. Log into the Azure Portal.
    2. Choose Storage accounts in the sidebar menu and then select the blob storage to be scanned.
    3. Choose Access keys from the left-hand side sub-menu. Find your account_name under Storage account name and your account_key under key1: Key.

    Device Agents

    Device agents (DAs) are thin clients that scan a filesystem (PC, Windows, MacOSX, Linux, laptop, file shares…). Each DA runs locally as a small program (a single binary file, .exe or .msi) on the target device, and communicates with a running PII Tools server over the network. One PII Tools server can be associated with many devices.

    Device agents are long-running processes that can be used for a single scan, or repurposed across multiple scans or scheduled repeat scans.

    Installing DA

    To install a DA, copy the appropriate binary for the device's operating system (Windows, Linux, OSX) to the machine you want to scan, either manually or in bulk using Active Directory if you have many computers (see the instructions below).

    These device agent binaries can be downloaded from your PII Tools dashboard:

    device agent download

    The installation will require four parameters.

    1. Base Folder is a folder path that restricts which parts of this machine PII Tools may scan, such as C:\ or %homepath% or /home/jake/public. When launching a new agent scan, only scans inside this Base Folder directory will succeed; any scans outside this directory will automatically fail. Leave Base Folder empty to allow scanning of any location on this device (no restriction).
    2. Token is the unique identifier of this device. The device will be visible under this name in the PII Tools dashboard. For example, you can set the token to this device's IP address (e.g. 192.168.20.1), or to any other name that's meaningful to your organization (e.g. HR department: Mike's laptop). The maximum token length is 255 characters.
    3. REST port and Host are the REST_PORT and HOST parameters from your PII Tools installation. This is how your agent knows which PII Tools server to connect to. These two parameters are the same across all agents.
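
    The Base Folder rule can be pictured as a path-containment check. The sketch below is only an illustration in Python, not the agent's actual implementation: the helper name is_inside_base_folder is hypothetical, and Windows-style paths are assumed.

```python
from pathlib import PureWindowsPath


def is_inside_base_folder(scan_path: str, base_folder: str) -> bool:
    """Return True if scan_path lies inside base_folder (hypothetical helper)."""
    if not base_folder:
        # An empty Base Folder means "no restriction" on this device.
        return True
    try:
        # relative_to() raises ValueError when scan_path is not under base_folder.
        PureWindowsPath(scan_path).relative_to(PureWindowsPath(base_folder))
        return True
    except ValueError:
        return False


print(is_inside_base_folder("C:/Users/jenny/Documents", "C:/Users/"))
print(is_inside_base_folder("D:/", "C:/Users/"))
print(is_inside_base_folder("D:/", ""))
```

    The first and third calls print True and the second prints False, mirroring the rule that scans outside the Base Folder fail while an empty Base Folder allows any location.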

    Windows Installation

    To install a Device Agent on a Windows machine, double-click the pii-agent-windows.msi installer you downloaded here, and follow the installation instructions on your screen.

    MSI configuration

    1. Base Folder is a folder path that restricts which parts of this machine PII Tools may scan, such as C:\ or %homepath% or /home/jake/public. When launching a new agent scan, only scans inside this Base Folder directory will succeed; any scans outside this directory will automatically fail. Leave Base Folder empty to allow scanning of any location on this device (no restriction).
    2. Token is the unique identifier of this device. The device will be visible under this name in the PII Tools dashboard. For example, you can set the token to this device's IP address (e.g. 192.168.20.1), or to any other name that's meaningful to your organization (e.g. HR department: Mike's laptop). The maximum token length is 255 characters.
    3. REST port and Host are the REST_PORT and HOST parameters from your PII Tools installation. This is how your agent knows which PII Tools server to connect to. These two parameters are the same across all agents.
    4. Run on startup: Select this if you'd like the Device Agent to run automatically in the background on machine startup, for all users. You'll need Windows administrator privileges to enable this option.

    The installation will create a Desktop shortcut on your device. Running this shortcut will automatically launch the agent, without any need of further configuration. Leave the agent running to allow scanning of this device.

    Remote Windows Installation

    In some environments, you may want to install Device Agents on a large number of Windows machines at once (for example using Active Directory), instead of going through the installation manually on each machine.

    In this case, you can use the MSI installer package with the "quiet" (headless) option, and install the agent remotely to multiple machines at once.

    The headless installation command is:

    msiexec /quiet /package "pii-agent-windows.msi" BASE_FOLDER="C:\" SERVER_REST_PORT="443" SERVER_HOSTNAME="127.0.0.1" TOKEN="My laptop" RUN_ON_STARTUP="0"
    
    • The pii-agent-windows.msi installer file can be downloaded from the PII Tools dashboard.
    • The quiet option enables silent installation, without any user prompts.
    • RUN_ON_STARTUP: Choose 0 to not run on startup; 1 to run on startup for all users; 2 to run on startup for the installing user only.
    • The rest of the parameters have the same meaning as above.
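
    When rolling the agent out to many machines, the headless command can be generated per device so that each one gets its own token. The following Python sketch only assembles the command string shown above; the function name build_headless_install_command and the example machine names are hypothetical, and rejecting tokens that contain double quotes is an assumption made to keep the quoting simple.

```python
def build_headless_install_command(token, hostname, rest_port=443,
                                   base_folder="C:\\", run_on_startup=0,
                                   msi_path="pii-agent-windows.msi"):
    """Assemble the msiexec command for one device (hypothetical helper)."""
    if len(token) > 255:
        raise ValueError("token must be at most 255 characters long")
    if '"' in token:
        # Assumption: avoid tokens that would break the quoting below.
        raise ValueError("token must not contain double quotes")
    if run_on_startup not in (0, 1, 2):
        raise ValueError("RUN_ON_STARTUP must be 0, 1 or 2")
    return (
        f'msiexec /quiet /package "{msi_path}" '
        f'BASE_FOLDER="{base_folder}" '
        f'SERVER_REST_PORT="{rest_port}" '
        f'SERVER_HOSTNAME="{hostname}" '
        f'TOKEN="{token}" '
        f'RUN_ON_STARTUP="{run_on_startup}"'
    )


# One command per machine, each with a unique token.
for machine in ["HR laptop 1", "HR laptop 2"]:
    print(build_headless_install_command(machine, "127.0.0.1", run_on_startup=1))
```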

    Launch DA on device startup

    If you want to scan the same device repeatedly, we recommend launching the device agent on machine startup and leaving it running in the background. This way the same token stays associated with the device, and you can easily (re)launch scans on it in the future.

    To launch a Device Agent on startup, add this command to your machine(s) startup process:

      # For Linux
      ./pii-agent-linux cli --hostname 175.201.160.29 --port 443 --token "my machine 1" --base-folder "/home"
    
      # For Windows
      # See also the MSI installer above, which has an option "Run PII Agent on startup".
      pii-agent-windows.exe cli --hostname 175.201.160.29 --port 443 --token "my machine 1" --base-folder "C:\"
    
      # For Mac OSX
      ./pii-agent-osx cli --hostname 175.201.160.29 --port 443 --token "my machine 1" --base-folder "/Users/"
    

    This will launch the Device Agent on the given device, without having to configure its parameters manually. Use a unique token on each device.

    In case you installed the agent on Windows using the MSI installer, launch the created shortcut on startup, no further parameters required.

    The agent will remain running, waiting for scanning instructions from the PII Tools server.

    Launching DA manually

    As an alternative to the options above, you can also launch a Device Agent on the target machine manually. This option can be used on all operating systems: Windows, Linux and OSX.

    1. Copy the DA executable to the machine you want to scan. Double click the executable file (for example, pii-agent-windows.exe for Windows users) to launch the device agent configuration screen.

    2. The agent will automatically open a new window in your default browser, allowing you to configure the agent. If the window does not open automatically, copy the link shown in the pii-agent-windows.exe window and paste it into your internet browser:

    device agent console

    3. On the configuration page in your browser, fill in these four fields and then press Submit:
      • Token that will identify this device (e.g. Jenny's workstation).
      • Base folder - only allow scans inside this folder. Attempts to scan locations outside this folder on this machine will fail. Example: scanning D:/ will not be allowed if the Base folder is C:/Users/.
      • Hostname of your PII Tools server - hostname (IP address) of the main PII Tools server (for example 24.53.168.9). This PII Tools server must be reachable from the machine running the DA.
      • REST port of your PII Tools server - REST port of your PII Tools server (for example 443). See Deployment.

    device agent configuration page

    Press Submit. If you configured the agent correctly, you'll see a success page and you can start scanning this device:

    device agent success

    If something goes wrong or the agent cannot connect to the server, you'll see an error page. In this case, fix the error and try again.

    device agent fail

    Running DA scans

    Run scans against a running device agent from the PII Tools server as described in Running a scan. Use the token specified above to identify which device agent you want to scan.

    You can have multiple device agents associated with a single PII Tools server, or even with a single device. All tokens must be unique though – two agents must never share the same token.

    Stopping DA

    To terminate a running device agent, simply close the executable and its window (e.g. for pii-agent-windows.exe, click X in the top right corner).

    device agent close

    If you close the DA window while a scan is running, the scan will be interrupted and marked as "FAILED".

    After terminating the Device Agent, no more scans will be possible against this machine. To re-enable scans on this device, you must follow the above steps to re-launch the Device Agent.

    Running a scan

    Scanning documents for sensitive and personal data is the main functionality of PII Tools. This section contains information on how scans work and how to configure and process scanning requests using a REST API.

    To run a scan using the web interface, click the "Launch new scan" button in the top-right corner of the "Analytics" tab, and follow the instructions in the right-hand side panel.

    new scan screenshot

    To launch a new scan through the REST API, POST its parameters to the /scans or /stream_scan endpoint; the corresponding buttons in the web interface do the same.

    A scan configuration defines what is to be scanned (input), using what PII detectors, and what to do with the results (output): see Scan configuration.

    Multiple scans can be submitted to a single PII Tools instance and run concurrently. Each scan gets its own scan name and scan ID, which you may use to check the scanning progress and retrieve the scanning report at the end.

    Conceptually, PII Tools supports two types of scans:

    1. A batch scan, which runs asynchronously in pull mode, actively fetching documents from the storage to be scanned (local directory, remote S3 bucket, email archive, database…). Instances of discovered personal data from each document are stored within an inventory index, from which a scan report is generated once the scan is complete.

    2. A stream scan, which runs in push mode, accepting a single document or piece of text on input. A stream scan is synchronous and returns any discovered personal data right away, in real time. With stream scanning, no data is stored locally within PII Tools.

    crawler_pool

    Once a scan is launched, PII Tools immediately starts running its detectors on the input data. The scanning is parallelized for performance, using a distributed pool of scan workers as configured during deployment. In this way, multiple files are being analyzed concurrently.

    Scan configuration

    A scan configuration is a JSON request payload that defines what is to be scanned (input), using what detectors, and what to do with the results (output).

    In its simplest form, without any of the optional parameters, a full configuration for a stream scan looks like this:

    {
        "filename": "notes.txt",
        "content": "Contents of notes.txt, in base64 encoding."
    }
    

    or for an email:

    {
        "storage_parameters": {
            "content": "Contents of email.eml, in base64 encoding.",
            "filename": "email.eml",
            "cleanup_email": true
        }
    }
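
    Both stream examples above expect the document content in base64. As a minimal Python sketch of preparing such a payload (the helper name build_stream_scan_payload and the sample text are hypothetical; only the filename and content fields come from the configuration above):

```python
import base64
import json


def build_stream_scan_payload(filename: str, raw_bytes: bytes) -> str:
    """Return a JSON stream-scan configuration with base64-encoded content."""
    payload = {
        "filename": filename,
        # base64 produces ASCII-only text, which is safe to embed in JSON.
        "content": base64.b64encode(raw_bytes).decode("ascii"),
    }
    return json.dumps(payload)


print(build_stream_scan_payload("notes.txt", b"Meeting notes, nothing sensitive."))
```

    Encoding the raw bytes, rather than decoded text, keeps binary formats such as PDF or JPEG intact.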
    

    For a Device Agent scan:

    {
        "scan_type": "device",
        "storage_parameters": {
            "token": "24539"
        },
        "root_folder": "C:/Downloads/"
    }
    

    For an S3 cloud scan:

    {
        "scan_type": "s3",
        "storage_parameters": {
            "aws_secret_access_key": "--== AWS_SECRET_ACCESS_KEY ==--",
            "aws_access_key_id": "--== AWS_ACCESS_KEY_ID ==--",
            "bucket": "BUCKET_NAME"
        },
        "root_folder": "some/path/inside_bucket/"
    }
    

    For a Microsoft SQL Server database scan (set root_folder to an empty string to scan all databases and tables):

    {
        "scan_type": "odbc",
        "storage_parameters": {
            "server": "pii-test.database.windows.net:1433",
            "db_type": "mssql",
            "username": "user",
            "password": "pwd"
        },
        "root_folder": "my_database/my_table"
    }
    

    For an Oracle database scan (set root_folder to an empty string to scan all schemas and tables):

    {
        "scan_type": "odbc",
        "storage_parameters": {
            "server": "175.201.160.29:1521/ORCLPDB1",
            "db_type": "oracle_12c",
            "username": "user",
            "password": "pwd"
        },
        "root_folder": "MY_SCHEMA/MY_TABLE"
    }
    

    Available scan parameters

    Example input configuration for a batch scan, scanning all files in the S3 bucket acme_backups under /backups/2018 while ignoring files ending in txt, doc or docx:

    {
        "scan_type": "s3",
        "storage_parameters": {
            "aws_access_key_id": "AKIA1234567890123456",
            "aws_secret_access_key": "abCD1234567/qB6",
            "bucket": "acme_backups"
        },
        "root_folder": "/backups/2018",
        "reject_filenames": ".*(txt|doc|docx)$"
    }
    

    Example input configuration for a device scan of C:\Users of agent 34588, accepting only ZIP files:

    {
        "scan_type": "device",
        "storage_parameters": {
            "token": "34588"
        },
        "root_folder": "C:/Users/",
        "accept_filenames": ".*(zip)$"
    }
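
    The effect of reject_filenames and accept_filenames can be previewed locally before launching a scan. The sketch below uses Python's re module and assumes the pattern is matched case-insensitively from the start of the full path; PII Tools' own regular expression engine may differ in details, and the helper name passes_filename_filters is hypothetical.

```python
import re


def passes_filename_filters(path, accept_filenames=".*", reject_filenames="^$"):
    """Return True if a file at `path` would be kept by both filters."""
    accept = re.compile(accept_filenames, re.IGNORECASE)
    reject = re.compile(reject_filenames, re.IGNORECASE)
    if reject.match(path):
        # Matches the reject pattern: the file is skipped.
        return False
    # The file is kept only if it also matches the accept pattern.
    return accept.match(path) is not None


# Patterns taken from the two example configurations above.
print(passes_filename_filters("/backups/2018/notes.TXT",
                              reject_filenames=".*(txt|doc|docx)$"))
print(passes_filename_filters("C:/Users/jenny/archive.zip",
                              accept_filenames=".*(zip)$"))
```

    Here the first file is skipped (False) and the second is kept (True).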
    

    This is the list of available parameters you may use when launching a batch or stream scan:

    Parameter | Type | Description | Available | Default
    scan_name | String | Scan will appear under this name in the inventory. | batch | mandatory
    scan_type | String | Type of storage to scan (see below). | batch | mandatory
    storage_parameters | Object | Access credentials for the particular storage type. | batch | mandatory
    root_folder | String | (optional) Only scan files under this location. Storage-specific. | batch | "" (scan everything)
    content | String | Raw base64-encoded document content. | stream | mandatory
    filename | String | File name of the file being scanned. | stream | mandatory
    cleanup_email | Bool | (optional) Automatically detect email headers and signatures in emails, and then exclude them from PII analysis. | batch and stream | false
    use_ocr | Bool | (optional) Run OCR on documents and images? Can lead to much slower processing. | batch and stream | false
    scan_views | Bool | (optional) Also scan SQL views? Affects only database scans. | batch | false
    detectors | List[String] | (optional) List of detector names to use in this scan. | batch and stream | all available detectors
    reject_filenames | String | (optional) Skip all files whose filename (including path) matches this regular expression. Case insensitive. | batch | ^$ (skip nothing)
    accept_filenames | String | (optional) Skip all files whose filename (including path) doesn't match this regular expression. Case insensitive. | batch | .* (skip nothing)
    max_age | Integer | (optional) Incremental scans: Skip files with "last modified" time older than this many seconds. | batch | no age restriction
    min_age | Integer | (optional) Incremental scans: Skip files with "last modified" time newer than this many seconds. | batch | no age restriction
    download_max_bytes | Integer | (optional) Download at most this many bytes from file. | batch and stream | 5000000 (5 MB)
    analyze_max_text | Integer | (optional) Analyze at most this many characters from extracted plain text per file. | batch and stream | 10000 (10 kB)
    analyze_max_rows | Integer | (optional) Analyze at most this many rows from tables (in spreadsheets, databases etc). | batch and stream | 100
    row_batch_size | Integer | (optional) Analyze table rows in batches of this many rows. | batch and stream | 100
    pdf_resolution | Integer | (optional) DPI resolution for processing PDFs as images. | batch and stream | 50
    max_images | Integer | (optional) Process at most this many pages as images, for example from PDFs. | batch and stream | 5
    max_dir_depth | Integer | (optional) Don't descend into directories deeper than this. | batch | 20
    archive_passwords | List[String] | (optional) List of passwords to try on encrypted archives. | batch and stream | []
    apply_exclusions | Bool | (optional) Apply active exclusion rules to the scan output. | stream | true
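
    Because max_age and min_age are expressed in seconds, it is easy to get the direction of the window wrong. The following Python sketch converts a window given in days into the two parameters and mirrors the skip rule from the table; both helper names are hypothetical, and the check is an illustration rather than PII Tools' internal logic.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def age_window_config(newer_than_days=None, older_than_days=None):
    """Translate a window in days into max_age / min_age (in seconds)."""
    config = {}
    if newer_than_days is not None:
        # Files older than this are skipped.
        config["max_age"] = int(newer_than_days * SECONDS_PER_DAY)
    if older_than_days is not None:
        # Files newer than this are skipped.
        config["min_age"] = int(older_than_days * SECONDS_PER_DAY)
    return config


def would_be_scanned(last_modified, now, max_age=None, min_age=None):
    """Apply the skip rule to one file; timestamps are in epoch seconds."""
    age = now - last_modified
    if max_age is not None and age > max_age:
        return False
    if min_age is not None and age < min_age:
        return False
    return True


# Incremental weekly scan: only files modified within the last 7 days.
print(age_window_config(newer_than_days=7))
```

    For a weekly incremental scan this yields {"max_age": 604800}, which can be merged into the batch scan configuration.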

    Root folder

    The root_folder parameter in batch scans is interpreted based on the type of scan:

    1. For file storage scans (s3, gdrive, device etc): only scan files under this directory.
    2. For database scans (MS SQL, Oracle etc):
      • "root_folder": "" (default): Scan all tables under all databases.
      • "root_folder": "database_name": Scan all tables under a specific database.
      • "root_folder": "database_name/table_name": Scan tables named table_name under a specific database.
      • "root_folder": "database_name/schema_name/table_name": Scan the specified table under the specific schema and database.
    3. For Microsoft Office 365 scans, see the documentation of the particular scan types below.
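
    For database scans, the meaning of root_folder depends on how many slash-separated parts it has. The Python sketch below spells out that mapping exactly as listed above; the function name describe_db_root_folder is hypothetical, and a None value simply means "all".

```python
def describe_db_root_folder(root_folder: str) -> dict:
    """Split a database root_folder into database / schema / table parts."""
    parts = [part for part in root_folder.split("/") if part]
    if len(parts) == 0:
        # Empty root_folder: all tables under all databases.
        return {"database": None, "schema": None, "table": None}
    if len(parts) == 1:
        return {"database": parts[0], "schema": None, "table": None}
    if len(parts) == 2:
        return {"database": parts[0], "schema": None, "table": parts[1]}
    if len(parts) == 3:
        return {"database": parts[0], "schema": parts[1], "table": parts[2]}
    raise ValueError("expected at most database/schema/table")


print(describe_db_root_folder("my_database/my_table"))
```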

    See Supported Storage Connectors for the full list of supported storage connectors.

    Specifying which detectors to use

    Example: launch an AWS S3 scan, using only the face, password and name detectors:

    curl -k -XPOST --user username:pwd https://127.0.0.1:443/v3/scans -H 'Content-Type: application/json' -d'
    {
        "scan_type": "s3",
        "scan_name": "My first scan",
        "storage_parameters": {
            "aws_access_key_id": "AKIA1234567890123456",
            "aws_secret_access_key": "abCD1234567/qB6",
            "bucket": "contract_backups"
        },
        "root_folder": "",
        "detectors": ["face", "password", "name"]
    }'
    

    To specify which detectors to use in a batch scan, define the "detectors": ["name_1", "name_2"] parameter in the scan configuration. The available names can be retrieved via GET /v3/detectors (see list all existing detectors GET endpoint).
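
    The same request can be prepared from a script. The Python sketch below only builds the POST request for /v3/scans with the standard library and does not send it; the function name build_scan_request is hypothetical, the credentials are placeholders, and HTTP Basic authentication is assumed because that is what curl's --user option sends by default.

```python
import base64
import json
import urllib.request


def build_scan_request(base_url, username, password, config):
    """Build (but do not send) a POST /v3/scans request."""
    credentials = base64.b64encode(
        f"{username}:{password}".encode("utf-8")
    ).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/v3/scans",
        data=json.dumps(config).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
        method="POST",
    )


config = {
    "scan_type": "s3",
    "scan_name": "My first scan",
    "storage_parameters": {
        "aws_access_key_id": "PLACEHOLDER_KEY_ID",
        "aws_secret_access_key": "PLACEHOLDER_SECRET",
        "bucket": "contract_backups",
    },
    "root_folder": "",
    "detectors": ["face", "password", "name"],
}
request = build_scan_request("https://127.0.0.1:443", "username", "pwd", config)
print(request.get_method(), request.full_url)
```

    Passing the request to urllib.request.urlopen would submit it. Note that curl's -k flag above skips TLS certificate verification, which urlopen does not do by default.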

    Storage-specific parameters

    Scan type device

    storage_parameters Type Description
    token String Token for the Device Agent to scan. See Device agents.
    tokens List[String] List of tokens for multiple Device Agents to scan. Each device scan will appear as a separate item in your inventory. The suffix "-token" will be automatically appended to each of these individual scan names, in order to differentiate them in the dashboard.

    See Device Agents for how to install agents and scan local and remote filesystems and file shares.

    Scan type s3

    storage_parameters Type Description
    bucket String S3 bucket to scan.
    aws_access_key_id String AWS access key ID for the bucket.
    aws_secret_access_key String AWS secret for the bucket.

    Scan type gdrive

    Scan files in Google Drive storage. Please see Authenticating connectors for how to obtain the credentials.

    With GDrive, root_folder has to be set either to:

    • root to scan the entire Google Drive storage, or
    • folder ID to scan the contents of particular folder.

    The folder ID can be retrieved from the URL where the folder can be accessed in Google Drive by taking the string after the last forward slash. For example, in https://drive.google.com/drive/u/2/folders/1bzcnvs3UCr9t_yWvWYcPSUXGrMna9F79, the folder ID is 1bzcnvs3UCr9t_yWvWYcPSUXGrMna9F79.
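
    Extracting the folder ID can be automated. A minimal Python sketch, assuming the folder URL has the shape shown in the example above (the function name gdrive_folder_id is hypothetical):

```python
from urllib.parse import urlparse


def gdrive_folder_id(folder_url: str) -> str:
    """Return the part of the URL path after the last forward slash."""
    # Using the parsed path ignores any query string or fragment.
    path = urlparse(folder_url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


url = "https://drive.google.com/drive/u/2/folders/1bzcnvs3UCr9t_yWvWYcPSUXGrMna9F79"
print(gdrive_folder_id(url))
```

    The returned value can be used directly as root_folder.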

    Google Drive offers two different ways of scanning: using a refresh token, or using a service account.

    GDrive using refresh token
    storage_parameters Type Description
    client_id String Client ID.
    client_secret String Client secret key.
    refresh_token String Refresh token.
    GDrive using service account
    storage_parameters Type Description
    service_account String Service account credentials, as JSON string.
    delegated_subject String GSuite user to impersonate during the scanning. If not specified, scan the service account itself (no impersonation).

    Scan type odbc

    storage_parameters Type Description
    server String Host and port where the database server is running.
    db_type String Type of database (see below).
    username String Username for the database server.
    password String Password for the specified username.

    Supported db_type types:

    • mssql: SQL Server (version 2008, 2008R2, 2012, 2014, 2016, 2017 and Azure SQL).
    • oracle_12c: Oracle 12c and later database.
    • oracle_11g: Oracle 11g and earlier database.
    • postgres: PostgreSQL database, version 8 and later.
    • mysql: MySQL or MariaDB database, version 5.1 and later.

    To be able to connect to a database, you may need to allow remote access to the IP address where PII Tools Server is running. For example, for Azure MS SQL, this can be done via the Azure portal:

    mssql_azure

    Set root_folder to the desired database, schema and table within your database installation. The supplied username must have at least read-access to the selected tables.

    Scan type azure-blob

    Scan files in Microsoft Azure Blob storage. Please see Authenticating connectors for how to obtain the necessary credentials.

    storage_parameters Type Description
    account_name String Account name for a particular Azure Blob storage.
    account_key String Secret key for the account.
    container String (optional) Container to be scanned. If not specified, all containers in the storage will be scanned.

    The root_folder can optionally be set to a prefix within the container. The root_folder value is ignored when scanning all containers (i.e., when container is not specified).
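
    The interaction between container and root_folder is easy to overlook, since a root_folder given without a container is silently ignored. The Python sketch below builds an azure-blob configuration and turns that case into an explicit error; the function name azure_blob_scan_config is hypothetical, and raising an error is a design choice of this sketch, not PII Tools behavior.

```python
def azure_blob_scan_config(scan_name, account_name, account_key,
                           container=None, root_folder=""):
    """Build a batch scan configuration for Azure Blob storage."""
    if root_folder and container is None:
        # Without a container, root_folder would be ignored by the scan.
        raise ValueError("root_folder has no effect unless container is set")
    storage_parameters = {
        "account_name": account_name,
        "account_key": account_key,
    }
    if container is not None:
        storage_parameters["container"] = container
    return {
        "scan_type": "azure-blob",
        "scan_name": scan_name,
        "storage_parameters": storage_parameters,
        "root_folder": root_folder,
    }


print(azure_blob_scan_config("Blob audit", "my_account", "PLACEHOLDER_KEY",
                             container="invoices", root_folder="2018/"))
```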

    Scan type mgraph-exchange

    Scan emails in Microsoft Exchange Online. Please see Authenticating connectors for how to obtain the credentials.

    storage_parameters Type Description
    client_id String PII Tools client ID
    client_secret String PII Tools client secret
    tenant_id String Organization's tenant ID

    The root_folder of Exchange Online scans can be set to one of:

    • Empty string "": will scan all emails for all users.
    • user_id: scan emails for one specific user. Example: john@my_company.onmicrosoft.com.
    • user_id/folder_id: Scan emails for one specific user in a specific folder, and all its subfolders. Examples: john@my_company.onmicrosoft.com/sentitems, john@my_company.onmicrosoft.com/inbox.
    • user_id/ArchiveMsgFolderRoot: Scan emails inside the In-Place Archive mailbox. The In-Place Archive mailbox is an extra Exchange Online feature, not available in all Office 365 plans.

    Scan type mgraph-onedrive

    Scan files in Microsoft OneDrive. Please see Authenticating connectors for how to get the Office 365 access credentials.

    storage_parameters Type Description
    client_id String PII Tools client ID
    client_secret String PII Tools client secret
    tenant_id String Organization's tenant ID

    The root_folder must be one of the following:

    • users - scan drives for all users
    • users/{user-principal-name} - scan drives for a single user
    • groups - scan drives for all user groups
    • groups/{group-name} - scan drives for groups with the given name
    • sites - scan all documents inside your organization's "root" site, and all its subsites
    • sites/{site-identifier} - scan all documents for a given site, and all its subsites

    root_folder examples:

    • users/john@acmecorp.onmicrosoft.com
    • groups/My Group/
    • sites/acmecorp.sharepoint.com:/sites/MySite

    When scanning a site, the site URL translates to site-identifier as follows:

    site-identifier = {site-host}:/{site-relative-path}

    For example, the site URL https://acmecorp.sharepoint.com/sites/MySite corresponds to root_folder = sites/acmecorp.sharepoint.com:/sites/MySite (mind the colon after the host name).
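
    The translation from site URL to site-identifier can be scripted. A minimal Python sketch of the rule above (the function name site_identifier is hypothetical):

```python
from urllib.parse import urlparse


def site_identifier(site_url: str) -> str:
    """Convert a site URL into {site-host}:/{site-relative-path}."""
    parsed = urlparse(site_url)
    # Note the colon between the host name and the site-relative path.
    return f"{parsed.netloc}:/{parsed.path.strip('/')}"


identifier = site_identifier("https://acmecorp.sharepoint.com/sites/MySite")
print("sites/" + identifier)
```

    This prints sites/acmecorp.sharepoint.com:/sites/MySite, the root_folder value from the example above.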

    Scan type mgraph-sharepoint

    Scan all documents inside a Microsoft SharePoint Online site, and all its subsites. Please see Authenticating connectors for how to get the Office 365 access credentials.

    storage_parameters Type Description
    client_id String PII Tools client ID
    client_secret String PII Tools client secret
    tenant_id String Organization's tenant ID

    The root_folder must be set to the site-identifier of the SharePoint site to be scanned.

    The site URL translates to site-identifier as follows:

    site-identifier = {site-host}:/{site-relative-path}

    For example, your site URL https://acmecorp.sharepoint.com/sites/MySite corresponds to root_folder = acmecorp.sharepoint.com:/sites/MySite (mind the colon after the host name).

    Batch scans

    Batch scans are long-running scans against an entire folder, device or storage (database, cloud document storage). The API endpoints below show how to launch a scan, track its progress and generate a report for finished scans.

    Internally, each running batch scan indexes the detected information into a database, called "inventory index". See also Data persistence and security.

    Once the scan has completed, you can download its results in multiple report formats (drill-down HTML, Excel, CSV, JSON).

    For forensic purposes, you can also download an Audit log of all scanned objects, including their exact access timestamps and location.

    To set up a repeat scan that will automatically launch at regular intervals (daily, weekly, monthly etc), see the Scheduler.

    Launch batch scan

    Launch a batch scan of S3 bucket contract_backups under the scan name S3 backups, against a PII Tools server that's running on 127.0.0.1, REST port 443:

    $ curl -k -XPOST --user username:password https://127.0.0.1:443/v3/scans -H 'Content-Type: application/json' -d'
    {
        "scan_type": "s3",
        "scan_name": "S3 backups",
        "storage_parameters": {
            "aws_access_key_id": "AKIA1234567890123456",
            "aws_secret_access_key": "abCD1234567/qB6",
            "bucket": "contract_backups"
        }
    }'
    

    POST /scans

    Launch a batch scan, using the provided scan configuration. Runs asynchronously. The request will return immediately; see Batch status for checking the scan progress.

    The response will contain scan_id assigned to this newly launched scan. Use this scan ID in all REST API operations related to this scan: when querying the scan progress, deleting the scan, etc.

    Batch status

    Check the progress status of the scan with id 7:

    $ curl -k -XGET --user username:password https://127.0.0.1:443/v3/scans/7
    

    Request response:

    {
        "_request_seconds": 0.062,
        "_success": true,
        "config": {
            "scan_name": "s3 scan",
            "scan_type": "s3",
            "root_folder": "",
            "storage_parameters": {
                "aws_access_key_id": "…",
                "aws_secret_access_key": "…",
                "bucket": "my_bucket"
            }
        },
        "end_time": "2019-07-25 14:44:27.046453",
        "last_object": "my_bucket/archives/archive.rar//archive/subdir/resume.xml",
        "objects_per_hour": 46836.0,
        "objects_scanned": 991,
        "objects_skipped": 10,
        "pii_tools_version": "3.0.0",
        "scan_id": "7",
        "scan_name": "s3 scan",
        "scan_type": "s3",
        "start_time": "2019-07-25 14:43:10.106867",
        "status": "FINISHED",
        "status_message": "Scan completed successfully.",
        "time_elapsed": "0d 0h 1m 16s"
    }
    

    GET /scans/{scan_id}

    Query for status of a batch scan with the given scan ID.

    Returns

    Parameter             Type Description
    status String Scan status. One of "RUNNING", "TERMINATING", "PAUSED", "FINISHED", "FAILED" (see below).
    status_message String Additional information associated with the scan status.
    last_object String Location of the last object scanned so far. Used to show scan progress while the scan is under way.
    config Object Original config used to launch the scan. Use to re-launch the same scan, or to verify the scan settings.
    objects_scanned Integer Number of successfully scanned files.
    objects_skipped Integer Number of files for which the scanning was skipped. This can happen for binary files when the file size is too large (over download_max_bytes) AND the analysis cannot be done on a partially downloaded content only. An example would be a large JPEG image.
    objects_failed Integer Number of files for which scanning failed.
    start_time String Date and time the scan started.
    end_time String Date and time the scan ended. Applies only to scans that already finished.
    time_elapsed String How long the scan has been running so far, for example "0d 0h 1m 16s".
    error String Error message. Only available if status is "FAILED".

    Status reference

    • RUNNING - Scan in progress.
    • PAUSED - Scan is paused.
    • TERMINATING - Scan is ending, cleaning up.
    • FINISHED - Scan finished successfully.
    • FAILED - Scan failed. The error field contains a detailed error message. Note that scans manually terminated by the user are considered FAILED.
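
    A client typically polls this endpoint until the scan reaches a final state. The sketch below shows one way to do that in Python. It is an illustration only: fetch_status is a hypothetical callable (not part of PII Tools) that performs GET /scans/{scan_id} and returns the decoded JSON response, and the polling interval and poll limit are arbitrary choices.

```python
import time
from typing import Callable, Dict

# Final states, taken from the status reference above.
TERMINAL_STATUSES = {"FINISHED", "FAILED"}


def wait_for_scan(fetch_status: Callable[[], Dict],
                  poll_seconds: float = 5.0,
                  max_polls: int = 1000) -> Dict:
    """Poll until the scan is FINISHED or FAILED, then return the last response.

    `fetch_status` is a hypothetical helper that calls GET /scans/{scan_id}
    and returns the JSON body as a dict.
    """
    for _ in range(max_polls):
        response = fetch_status()
        if response["status"] in TERMINAL_STATUSES:
            return response
        time.sleep(poll_seconds)
    raise TimeoutError("scan did not reach a final status in time")
```

    A PAUSED scan is not treated as final here, so the loop keeps waiting until the scan is resumed and finishes, or until max_polls is exhausted.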

    Download report

    Download the drill-down HTML report for scan id 13 into the current directory:

    $ curl -k -XGET --user username:password 'https://127.0.0.1:443/v3/scans/13/objects?format=html' -OJ
    

    Same thing but download in JSON-LINES format:

    $ curl -k -XGET --user username:password 'https://127.0.0.1:443/v3/scans/13/objects?format=jsonl' -OJ
    

    GET /scans/{scan_id}/objects?format=fmt

    You may download scan results in multiple formats. See Scan reports for their description:

    format value Description
    html Interactive drill-down HTML report.
    names Report of "Affected Persons".
    audit Audit log for this scan, including a timestamp for each accessed object.
    csv Detailed PII report as CSV.
    jsonl Detailed PII report as JSON-LINES (one JSON object per line).
    json Detailed PII report as one huge JSON object. Not recommended because of RAM footprint; use jsonl instead.
    xlsx Detailed PII report as an Excel spreadsheet.
    xlsx_simple Simplified PII report as an Excel spreadsheet.

    You can download reports even while a scan is in progress. The report will contain partial results.

    To download an aggregated report from multiple scans, submit multiple comma-separated scan_ids, e.g. GET /scans/1,5,20/objects?format=jsonl.
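
    When scripting report downloads, it can help to build the URL in one place. The following Python sketch assembles the path for a single scan or for an aggregated report; the function name and the base_url argument are assumptions made for this example, while the format values come from the table above.

```python
from typing import Iterable, Union

# Format values listed in the table above.
REPORT_FORMATS = {"html", "names", "audit", "csv", "jsonl", "json", "xlsx", "xlsx_simple"}


def report_url(base_url: str,
               scan_ids: Iterable[Union[int, str]],
               fmt: str = "jsonl") -> str:
    """Build the download URL for one scan, or an aggregated report over several."""
    if fmt not in REPORT_FORMATS:
        raise ValueError(f"unsupported report format: {fmt!r}")
    ids = ",".join(str(scan_id) for scan_id in scan_ids)
    if not ids:
        raise ValueError("at least one scan id is required")
    return f"{base_url.rstrip('/')}/scans/{ids}/objects?format={fmt}"
```

    For example, report_url("https://127.0.0.1:443/v3", [1, 5, 20]) yields the aggregated JSON-LINES URL for scans 1, 5 and 20.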

    Pause and resume scan

    Pause a running batch scan with ID 55:

    $ curl -k -XPUT --user username:pwd https://127.0.0.1:443/v3/scans/55 -H 'Content-Type: application/json' -d'{"status": "PAUSED"}'
    

    PUT /scans/{scan_id}

    Pause a running scan with the {"status": "PAUSED"} payload, or resume a paused scan with the {"status": "RUNNING"} payload.

    Trying to pause a scan that is not running, or run a scan that is not paused, will return an error response with no effect on the scan.
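
    A client can check the transition locally before sending the request. This is a minimal Python sketch of the two transitions described above; the function name is hypothetical and the check only mirrors the documented behavior, it does not replace the server-side validation.

```python
# Allowed transitions, as described above: RUNNING -> PAUSED and PAUSED -> RUNNING.
ALLOWED_TRANSITIONS = {("RUNNING", "PAUSED"), ("PAUSED", "RUNNING")}


def status_change_payload(current_status: str, target_status: str) -> dict:
    """Return the PUT /scans/{scan_id} body, or raise if the change is not allowed."""
    if (current_status, target_status) not in ALLOWED_TRANSITIONS:
        raise ValueError(
            f"cannot change scan status from {current_status} to {target_status}"
        )
    return {"status": target_status}
```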

    Delete scan

    Delete all data for the batch scan with ID 13:

    $ curl -k -XDELETE --user username:pwd https://127.0.0.1:443/v3/scans/13
    

    DELETE /scans/{scan_id}

    Once you don't need the results of a scan any more, it is recommended you delete it to get rid of its persisted sensitive data, free up disk space and speed up analytics.

    List all scans

    To list all existing batch scans (inventory indexes):

    $ curl -k -XGET --user username:password https://127.0.0.1:443/v3/scans/
    

    GET /scans/

    List all existing scans. Each listed scan is in the format described in Batch status.

    Duplicate a scan

    For convenience, PII Tools supports functionality for duplicating a scan. This enables you to launch a new scan with the exact same parameters as an existing scan, so you don't have to configure it from scratch again.

    When using the web interface, click the "Duplicate scan" icon. This icon is in the "Actions" column next to each existing scan.

    duplicate scan screenshot

    API Endpoint

    Retrieve information from an existing batch scan with id 13:

    curl -k -XGET --user username:password https://127.0.0.1:443/v3/scans/13
    

    Use the config parameter from the response to pre-populate and POST a new scan.

    To achieve this functionality using the REST API, first retrieve the config of an existing scan with GET /v3/scans/{scan_id}. The relevant parameters can be read from the config field in the response. Use these parameters to pre-populate POST request parameters and launch a new scan with POST /v3/scans/.
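
    The two-step flow can be sketched in Python as follows. The helper below only prepares the body for the second request from the response of the first one; its name and the optional rename are assumptions made for this example.

```python
import copy
from typing import Optional


def duplicate_scan_payload(scan_response: dict, new_name: Optional[str] = None) -> dict:
    """Build a POST /v3/scans/ body from the response of GET /v3/scans/{scan_id}."""
    # Copy the config so the original response is left untouched.
    payload = copy.deepcopy(scan_response["config"])
    if new_name is not None:
        payload["scan_name"] = new_name
    return payload
```

    The returned dictionary is then sent as the JSON payload of POST /v3/scans/.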

    Resume a scan

    Sometimes scans fail, for various reasons – a broken network connection, the scanned device goes offline, server restarts, etc. For this case, PII Tools includes functionality for resuming a scan conveniently. This saves time because you don't have to scan again from scratch.

    To resume a batch scan, click the "Resume scan" icon under "Actions":

    resume scan screenshot

    How does resuming a scan work, behind the scenes?

    1. Create a new, empty scan. This will be the "resumed scan".
    2. Copy scan results of all files that scanned successfully in the original scan (before it failed) into this new scan.
    3. In the new scan, continue scanning the remaining files plus re-scan files that FAILED in the original scan.
    4. Once the resumed scan completes, you can safely delete the original (failed) scan if you wish. This will free up disk space and speed up analytics.
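
    The bookkeeping in steps 2 and 3 can be pictured with a small Python model. This only illustrates the description above and is not PII Tools' actual implementation; the function name, the input shapes and the choice to re-scan every object whose status is not SCANNED are assumptions.

```python
def plan_resume(all_objects, original_results):
    """Split objects into results to copy and objects that still need scanning.

    `all_objects` lists every object location in the storage.
    `original_results` maps a location to its status in the original scan;
    objects the original scan never reached are simply absent.
    """
    # Step 2: results of successfully scanned objects are copied over.
    copied = [loc for loc in all_objects if original_results.get(loc) == "SCANNED"]
    # Step 3: everything else (never reached, or FAILED) is scanned in the new scan.
    to_scan = [loc for loc in all_objects if original_results.get(loc) != "SCANNED"]
    return copied, to_scan
```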

    Continue scanning from a FAILED scan:

    API Endpoint

    curl -k -XPOST --user username:password https://127.0.0.1:443/v3/scans/13
    

    POST /scans/{scan_id}

    Launch a new batch scan and continue scanning from an existing scan scan_id. Runs asynchronously. The request will return immediately; see Batch status for checking the scan progress.

    The response will contain scan_id assigned to this newly launched scan. Use this scan ID in all REST API operations related to this scan: when querying the scan progress, deleting the scan, etc.

    Stream scans

    Scan a single PDF file:

    $ curl -k -s --user username:password -XPOST https://127.0.0.1:443/v3/stream_scan -H 'Content-Type: application/json' -d'
    {
        "filename": "bank_form.pdf",
        "content": "'$(base64 -w0 /tmp/bank_form.pdf)'"
    }'
    

    This request will generate a JSON response similar to this:

    {
        "status": "SCANNED",
        "processing": {
            "_time": 0.2773430347442627,
            "_time_children": 0.2770969867706299,
            "_time_self": 0.0002460479736328125,
            "language": "en",
            "language_confidence": 1.0,
            "severity": "3-CRITICAL"
        },
        "pii": [
            {
                "confidence": 1.0,
                "pii": "Mustafa Abdul",
                "context": ", From : Name : Mustafa Abdul The Branch Manager Address :",
                "pii_category": "Personal",
                "pii_type": "name",
                "position": 105
            },
            {
                "confidence": 1.0,
                "pii": "2201 C Street NW I Washington, DC 20520",
                "context": "Abdul  \nThe Branch Manager                                 Address: 2201 C Street NW I Washington, DC 20520 \nBank of America                                 Phone No",
                "pii_category": "Personal",
                "pii_type": "address",
                "position": 181
            },
            {
                "confidence": 1.0,
                "pii": "GL28 0219 2024 5014 48 ",
                "context": "Account Transfer  \nA/c No. GL28 0219 2024 5014 48",
                "pii_category": "Financial",
                "pii_type": "bank_account",
                "position": 387
            }
        ],
        "storage": {
            "content_type": "application/pdf",
            "doctype": "pdf",
            "filename": "bank_form.pdf",
            "filesize": 47375,
            "location": "bank_form.pdf"
        },
        "errors": []
    }
    

    POST /stream_scan

    Scan a given file and return the detected PII right away.

    To run a stream scan, encode the file content into Base64 encoding and include the encoded string as the content parameter.

    For selecting which PII detectors to use in the scan and additional tuning parameters, see Scan configuration. If you don't specify detectors, all available detectors will be used (including custom detectors, if any).

    Unlike a batch scan, the request will block until the response is ready (synchronous). In case the file to be scanned is large, or an archive or mailbox, use the asynchronous batch scan instead to avoid timeouts.
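
    For clients that do not use a shell, the request body can be prepared in Python. This sketch covers only the two fields shown in the curl example (filename and content); the function name is an assumption made for this example.

```python
import base64
import json


def stream_scan_body(filename: str, content: bytes) -> str:
    """Return the JSON body for POST /stream_scan for the given file content."""
    # The file content is sent as a Base64-encoded string.
    encoded = base64.b64encode(content).decode("ascii")
    return json.dumps({"filename": filename, "content": encoded})
```

    The resulting string is sent with the Content-Type: application/json header, as in the curl example.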

    Returns

    The returned metadata fields are:

    • "status": <str> – Scan status of this file. One of PENDING, SCANNING, SKIPPED, SCANNED, FAILED.
    • "pii": <Array[Object]> – List of all detected PII. Each hit includes the actual detected instance, its context, confidence and position in the original document.
    • "storage": <Object> – The file's metadata taken from the original storage, such as its file size, location, owner, permissions, last modified date etc. Different data storages offer different metadata.
    • "processing": <Object> – Additional non-PII file attributes inferred from its content, such as the document's language or severity level.
    • "errors": <Array[Object]> – List of errors that occurred while scanning this file. If a file was SKIPPED or FAILED, you'll find the reason here.
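
    A response with these fields can be condensed into a short summary. The Python sketch below counts hits per pii_type and reads the severity from the processing object, using the field names from the sample response; the function name and the min_confidence filter are assumptions made for this example.

```python
from collections import Counter


def summarize_scan_result(result: dict, min_confidence: float = 0.0) -> dict:
    """Count detected PII by type and collect the status, severity and errors."""
    hits = [hit for hit in result.get("pii", []) if hit["confidence"] >= min_confidence]
    return {
        "status": result["status"],
        "severity": result.get("processing", {}).get("severity"),
        "pii_counts": dict(Counter(hit["pii_type"] for hit in hits)),
        "errors": result.get("errors", []),
    }
```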

    Scheduler

    To create a scheduled scan from the API, use a standard Launch Batch Scan POST request with an extra schedule parameter:

    $ curl -k -XPOST --user username:password https://127.0.0.1:443/v3/scans -H 'Content-Type: application/json' -d'
    {
        "scan_type": "s3",
        "scan_name": "S3 backups",
        "storage_parameters": {
            "aws_access_key_id": "AKIA1234567890123456",
            "aws_secret_access_key": "abCD1234567/qB6",
            "bucket": "contract_backups"
        },
        "schedule": {
            "start": "2020-05-17 15:00",
            "repeat": "monthly",
            "end": "2021-01-01 21:15"
        }
    }'
    

    PII Tools allows for scheduling scans to run in the future. This is useful for:

    1. Deferred scans: instead of launching a scan now, launch it at a specified time and date.
    2. Recurring scans: Have a scan run repeatedly at a specified date and time. For example run daily, weekly, monthly etc.

    To view or delete your existing schedules, go to the Scheduler tab in the left-hand menu:

    scheduler

    To create a new schedule, fill in the Schedule scan section of the Launch New Scan or Create New Schedule window:

    schedule scan

    Parameter          Type Description
    start String Mandatory. Date and time to first run the scheduled scan. Example: "2020-05-17 4:00".
    repeat String Mandatory. How often to run the scan; one of the values below. Example: "quarterly".
      • "never": Run just once, at the time and date specified in start. Effectively a "deferred scan".
      • "daily": Run every day at the time specified in start.
      • "weekly": Run once a week at the same time and day of the week as start. For example, if start is a Sunday 4:00, the scan will run every Sunday at 4am.
      • "monthly": Run once a month at the same time, day of the week and week of the month as start. For example, if start is the third Sunday of the month at 4:00, the scan will run on the third Sunday of each month at 4am.
      • "quarterly": Same as monthly, but run every third month.
      • "yearly": Run once a year, on the same date and time as specified in start.
    end String Optional. The schedule stops after this date and no more scans are run. If not specified, scans will run indefinitely. Example: "2021-05-17 11:00".
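
    The schedule parameter can be validated on the client before it is submitted. The Python sketch below checks the repeat value and the date strings, and also shows how the "monthly" rule (same weekday and week of the month as start) can be derived from a start date. The date format string is inferred from the examples in this section, and the function names and the check that end lies after start are assumptions made for this example.

```python
from datetime import datetime
from typing import Optional, Tuple

REPEAT_VALUES = {"never", "daily", "weekly", "monthly", "quarterly", "yearly"}
TIME_FORMAT = "%Y-%m-%d %H:%M"  # assumed from examples such as "2020-05-17 15:00"
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def build_schedule(start: str, repeat: str, end: Optional[str] = None) -> dict:
    """Return a schedule object for the scan request, raising on invalid input."""
    if repeat not in REPEAT_VALUES:
        raise ValueError(f"unsupported repeat value: {repeat!r}")
    start_dt = datetime.strptime(start, TIME_FORMAT)  # raises ValueError if malformed
    schedule = {"start": start, "repeat": repeat}
    if end is not None:
        if datetime.strptime(end, TIME_FORMAT) <= start_dt:
            raise ValueError("end must be later than start")
        schedule["end"] = end
    return schedule


def monthly_slot(start: str) -> Tuple[int, str]:
    """Return (n, weekday): start falls on the n-th such weekday of its month."""
    start_dt = datetime.strptime(start, TIME_FORMAT)
    return (start_dt.day - 1) // 7 + 1, WEEKDAYS[start_dt.weekday()]
```

    For instance, monthly_slot("2020-05-17 4:00") identifies the start date as the third Sunday of its month, which matches the "monthly" example in the table.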

    Any newly created scan that has the "Schedule scan" section filled in will automatically become a scheduled scan.

    To turn a regular existing scan into a scheduled scan:

    1. Click its "Duplicate Scan" button on the Analytics tab.
    2. Fill in the desired schedule.
    3. Hit the "Add schedule" button at the bottom of the form.

    Conversely, to run an existing scheduled scan out-of-order, as a regular scan right now:

    1. Click its "Run scan now" action button on the Scheduler tab.
    2. Leave the "Schedule scan" section empty.
    3. Hit the "Start scanning" button at the bottom of the form.

    SAR Analytics

    PII Tools indexes all discovered file metadata internally which allows you to search, filter and export selected records by concrete PII, file size, file name, file owner etc. This is especially useful for collecting information in order to answer GDPR Data Subject Access Requests (SAR), and for identifying affected and high-risk files for auditing.

    analytics screenshot

    The reported file metadata includes detailed information on:

    • each detected PII instance
    • the context of each detected PII instance
    • the position of each PII instance
    • the detection confidence of each PII instance
    • severity classification of the entire scanned file
    • additional storage metadata of each scanned file (e.g. its size, location, owner, permissions, etc)

    Analytics Dashboard

    To use Analytics from the PII Tools web dashboard, go to the Analytics tab.

    You'll see a page that lists all your scans, both running and completed. In case you have many scans, use the pagination buttons at the bottom to navigate between pages. Or use the search bar on top and enter "Scan name" to look up files from a specific scan.

    For example, click the search bar on top, select Scan name from the drop-down menu, and type fileshare + ENTER. The view will change, showing you files from all scans where the scan name contains the word fileshare.

    level 1 screenshot

    To list all objects that contain specific personal information, select the metadata field you want to match in the drop-down menu, and then type the value you wish to search for.

    Examples:

    • Select Person name, type John Smith, and press ENTER. The web view will change to show all files that contain the name "John Smith".

    • To search for objects that contain a credit card number, select PII, Financial, Credit card number and EXISTS.

    • Some metadata fields also support querying by the count of detected PII instances. For example, to find all files that contain more than two home addresses, click inside the Search bar on top and select PII, Personal, Home Address, >, type 2 and press ENTER.

    level 2 screenshot

    For each displayed file, you can inspect the actual PII by clicking the "Show detailed report" button under Actions:

    level 3 screenshot

    Analytics REST API

    Run an analytics query from the REST API, download the result as CSV:

    $ curl -k -XPOST --user username:pwd https://127.0.0.1:443/v3/analytics -H 'Content-Type: application/json' -OJ -d'
    {
      "output": "csv",
      "async": false,
      "query": {
        "scan_ids": ["1"],
        "scan_name_patterns": ["*"],
        "or_clauses": [
            [
                ["any", "CONTAINS", "john"],
                ["severity", "CONTAINS", "CRITICAL"]
            ]
        ],
        "sort": "enqueued",
        "limit": 20,
        "offset": 0
      }
    }'
    

    The Analytics API can be used to search over scans and return a list of matching files programmatically. This list is returned in any of the supported formats: HTML drill-down, CSV, JSON, JSON-LINES, Excel spreadsheet or Audit log.

    Endpoint

    POST /analytics

    Run analytics search and return matched objects, in the selected response format.

    Note that the method is POST (not GET), because the parameter payload can be potentially large and we avoid huge URLs for technical reasons.

    Input (JSON)

    Field             Type Description Example
    query Object Query that selects desired files across the entire inventory index. See below. {}
    output String Export output format: one of {json, jsonl, csv, html, xlsx, xlsx_simple, audit}. See Scan reports.
    async Boolean If true, return an HTML page that refreshes periodically until the generated report is ready. If false, wait until the report is fully generated and return it directly as the response. false

    The query parameter specifies fine-grained criteria for object matching. See the sample query on the right for an example. query supports the following fields:

    query key       Type Description Example
    scan_ids List[String] List of scan ids to search in. If not specified, search in all scans. "scan_ids": ["1"]
    scan_name_patterns List[String] List of scan names to search in. If not specified, search in all scans. Special * wildcard character will match any substring. "scan_name_patterns": ["*"]
    or_clauses List[List[List[String]]] A list of search filters. A file will be matched if at least one of the OR clauses matches. See example on the right.
    sort String How to sort the response. One of {object_id, status, enqueued, ended, severity, doctype, language, location, filename, filesize, last_modified}. status
    limit Integer Pagination: maximum number of matched and sorted files to return. 20
    offset Integer Pagination: index of the first matched and sorted file to return. 0

    The search uses a combination of one or more OR clauses:

    • A file matches and will appear in the result if at least one of the OR clauses matches.
    • Each OR clause is a combination of one or more AND clauses. If all AND clauses match, the whole OR clause matches.
    • AND clauses are of the form (metadata_key, operator, value) or (metadata_key, EXISTS). Any PII instance or storage parameter is a valid metadata_key. The full list of supported metadata keys can be retrieved via GET /v3/analytics/_field_mapping.

    Supported AND operators are:

    • EXISTS: match if the given key exists in the object
    • CONTAINS: match if the given key contains the search value
    • CONTAINS_CASE: same as CONTAINS but case-sensitive
    • EQUALS: match if the given key matches exactly the search value
    • EQUALS_CASE: same as EQUALS but case-sensitive
    • >, <, =, <=, >=: match if the integer value of the given key (for example, the count of detected PII instances) satisfies the comparison with the search value

    For example, or_clauses = [[["name", "CONTAINS", "John"], ["file_age", ">", "5"]]] contains a single OR clause, which is composed of two AND clauses. It will match all files that contain the name "John" AND are older than 5 hours.
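
    The OR-of-ANDs semantics can be modeled locally in a few lines of Python. This is a sketch of the matching rules described above, not the server's implementation: it assumes each file's metadata is a flat dictionary, treats CONTAINS and EQUALS as case-insensitive (as implied by their _CASE variants), and does not model special keys such as "any". The function names are hypothetical.

```python
import operator

NUMERIC_OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq,
               "<=": operator.le, ">=": operator.ge}


def and_clause_matches(clause, metadata):
    """Evaluate one AND clause: (key, operator, value) or (key, "EXISTS")."""
    key, op = clause[0], clause[1]
    if key not in metadata:
        return False
    if op == "EXISTS":
        return True
    actual, wanted = metadata[key], clause[2]
    if op in NUMERIC_OPS:
        return NUMERIC_OPS[op](float(actual), float(wanted))
    actual, wanted = str(actual), str(wanted)
    if op in ("CONTAINS", "EQUALS"):  # case-insensitive variants
        actual, wanted = actual.lower(), wanted.lower()
    if op in ("CONTAINS", "CONTAINS_CASE"):
        return wanted in actual
    if op in ("EQUALS", "EQUALS_CASE"):
        return actual == wanted
    raise ValueError(f"unsupported operator: {op}")


def file_matches(or_clauses, metadata):
    """A file matches if every AND clause of at least one OR clause matches."""
    return any(all(and_clause_matches(clause, metadata) for clause in or_clause)
               for or_clause in or_clauses)
```

    Note the three levels of nesting: the outer list holds OR clauses, each OR clause is a list of AND clauses, and each AND clause is a list of two or three strings.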

    Returns

    A list of all matched objects in output format:

    • jsonl: Return all matched objects in JSON-LINES format (one object per line).
    • json: Return all matched objects in JSON format (all objects in one huge JSON array). Takes up a lot of RAM; prefer jsonl instead, it's more efficient.
    • csv: Return all matched objects in CSV format.
    • xlsx: Return all matched objects in Excel XLSX format.
    • xlsx_simple: Return all matched objects in simplified Excel XLSX format.
    • audit: Return all matched objects in audit CSV format.
    • html: Return all matched objects as an interactive drill-down report.

    Each returned object contains several fields, including detected PII, its context, severity and storage metadata; see Scan report for the description of the returned file metadata.

    Retrieve File Metadata

    Get all indexed metadata for one file:

    $ curl -k -s --user username:pwd -XGET https://127.0.0.1:443/v3/scans/1/objects/1
    

    Example response:

    {
        "scan_id": "1",
        "object_id": "1",
        "scan_name": "s3 small",
        "status": "SCANNED",
        "ended": "2019-07-25 14:43:12.704326",
        "enqueued": "2019-07-25 14:43:10.822782",
        "errors": [],
        "pii": [
            {
                "confidence": 1.0,
                "context": ", From : Name : Mustafa Abdul The Branch Manager Address :",
                "pii": "Mustafa Abdul",
                "pii_category": "Personal",
                "pii_type": "name",
                "position": 105
            },
            {
                "confidence": 1.0,
                "context": "Account Transfer  \nA/c No. GL28 0219 2024 5014 48",
                "pii": "GL28 0219 2024 5014 48 ",
                "pii_category": "Financial",
                "pii_type": "bank_account",
                "position": 418
            }
        ],
        "processing": {
            "_time": 1.592280626296997,
            "_time_children": 1.5919265747070312,
            "_time_self": 0.0003540515899658203,
            "language": "en",
            "language_confidence": 1.0,
            "severity": "3-CRITICAL"
        },
        "storage": {
            "content_type": "application/pdf",
            "doctype": "pdf",
            "filename": "bank_form.pdf",
            "filesize": 47134,
            "last_modified": 1543349581.0,
            "location": "my_bucket/bank_form.pdf",
            "owner": "johndoe",
            "storage_type": "s3"
        }
    }
    

    It is also possible to retrieve metadata for a single object, given its id.

    API endpoint:

    GET /v3/scans/<scan_id>/objects/<object_id>

    Retrieve full metadata for the given file, uniquely identified by its scan id + object id.

    Input

    Field Type Description
    scan_id String Scan identifier. Note that this is the scan id, not scan name.
    object_id String Object identifier as it appears in reports.

    Output:

    Object metadata with status 200 if all OK, or {"error": "error text"} and a corresponding HTTP status in case of a failure.

    Each returned object contains several fields, including detected PII, its context, severity and storage metadata; see Scan report for the description of the returned file metadata.

    Custom detectors

    You can define your own custom patterns to discover with each scan, in addition to the built-in detectors that come out of the box with PII Tools.

    Examples of custom patterns include organization-specific information such as "student ID" or "contract number". These patterns are called custom detectors, and when matched, will appear in the scanning results alongside other detections.

    Unlike the built-in detectors that use machine learning, custom detectors are simpler: they use regular expressions to define what to match ("instance regexp"), plus what must appear near the instance for the match to be valid ("context regexp").

    In the web interface, use the "Custom detectors" tab in the left menu. For adding/deleting custom detectors programmatically, see the REST API endpoint documentation below.

    custom detector screenshot

    Example of a custom PII detector for a 6-digit student id:

    {
      "pii_type": "student_id",
      "pii_category": "other",
      "instance_regexps": ["\\bID[0-9]{6}\\b"],
      "context_regexps": ["student"],
      "severity": "LOW",
      "ignore_case": true
    }
    

    How it works

    1. Each custom detector is run alongside the standard out-of-the-box detectors on the text of each scanned object. Images are ignored and do not affect custom detectors.

    2. When a potential PII candidate instance is found matching any of the instance_regexps rules, its context (surrounding text, column headers) is checked using the context_regexps rules. Unless at least one of context_regexps matches, the candidate is discarded.

    3. If a candidate instance passes the context check, this PII instance is indexed just like any other PII, and will appear in the Scan report. The severity you provided (e.g. LOW in the example above) will be combined with the severity of other PII detected in this object, to assign the final severity for the entire object.
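
    The instance and context checks from steps 2 and 3 can be approximated with Python's re module. This is a rough local model, not the PII Tools engine: the context here is a fixed window of characters around the match (the window parameter is an assumption made for this example), and Python's regular expression syntax may differ in details from the one used by the server.

```python
import re


def find_custom_pii(text, detector, window=40):
    """Return instance matches whose surrounding text passes the context check.

    `window` is the number of characters taken on each side of a match to
    form its context; this is an assumption made for this sketch.
    """
    flags = re.IGNORECASE if detector.get("ignore_case", True) else 0
    context_patterns = [re.compile(p, flags) for p in detector.get("context_regexps", [])]
    hits = []
    for pattern in detector["instance_regexps"]:
        for match in re.finditer(pattern, text, flags):
            context = text[max(0, match.start() - window):match.end() + window]
            # An empty context list means no context checking at all.
            if not context_patterns or any(p.search(context) for p in context_patterns):
                hits.append({
                    "pii": match.group(0),
                    "pii_type": detector["pii_type"],
                    "position": match.start(),
                })
    return hits
```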

    Custom detector parameters

    Parameter Type Description Default
    pii_type String Name of the detector. Use lowercase_with_underscores. -
    pii_category String PII category. Other
    instance_regexps List[String] Candidate PII instances must match at least one regexp in this list. - (mandatory parameter)
    context_regexps List[String] Candidate contexts must match at least one regexp in this list. No context checking if empty. []
    severity String Severity level to assign to each hit. One of LOW, HIGH, CRITICAL. -
    ignore_case Boolean Ignore text upper/lower case when matching. true

    Add a custom detector

    Add a new detector named student_id:

    curl -k -XPOST --user username:pwd https://127.0.0.1:443/v3/detectors/custom -H 'Content-Type: application/json' -d'
    {
      "pii_type": "student_id",
      "pii_category": "Other",
      "instance_regexps": ["\\bID-[0-9]{6}\\b"],
      "context_regexps": ["student"],
      "severity": "LOW",
      "ignore_case": true
    }'
    

    You can define new custom detectors using either the web interface, or programmatically using the REST API.

    API endpoint

    POST /v3/detectors/custom

    See the example to the right for a sample REST API call. This example detector will look for patterns like ID-012345 inside any file. The pattern is ID- followed by 6 digits, delimited by word boundaries on either side, so that strings like PID-012345 or ID-0123456 won't match.

    In addition, we require that the word student appear nearby; otherwise the match is discarded. Note that we didn't put word boundaries around student here, so that words like "student", "students", "student's" etc. will pass the context check too.

    Since we set ignore_case to true, letter casing is ignored. Any of id-, ID- or Id- will match, and any of Student, STUDENTS etc. will pass the context check.

    After you've created your custom detector, use it in REST API scans by entering its pii_type name into the optional detectors field during scan configuration.
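
    Before submitting a detector, it can be checked against the parameter table on the client side. The Python sketch below is one possible pre-flight check; the function name is hypothetical, the defaults are taken from the parameter table, and the regular expressions are compiled with Python's re module, whose syntax may differ in details from the server's.

```python
import re

SEVERITIES = {"LOW", "HIGH", "CRITICAL"}


def validate_detector(detector: dict) -> dict:
    """Check a detector definition and fill in the documented defaults."""
    pii_type = detector.get("pii_type", "")
    if not re.fullmatch(r"[a-z0-9]+(_[a-z0-9]+)*", pii_type):
        raise ValueError("pii_type must use lowercase_with_underscores")
    if not detector.get("instance_regexps"):
        raise ValueError("instance_regexps is mandatory and must not be empty")
    if detector.get("severity") not in SEVERITIES:
        raise ValueError("severity must be one of LOW, HIGH, CRITICAL")
    # Defaults from the parameter table; explicit values take precedence.
    result = {"pii_category": "Other", "context_regexps": [], "ignore_case": True, **detector}
    for pattern in result["instance_regexps"] + result["context_regexps"]:
        re.compile(pattern)  # raises re.error on an invalid regular expression
    return result
```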

    List all existing detectors

    Get a list of all custom detectors:

    curl -k -XGET --user username:pwd https://127.0.0.1:443/v3/detectors/custom
    

    Response:

    {
        "_request_seconds": 0.012,
        "_success": true,
        "custom_detectors": [
            {
                "context_regexps": [
                    "student"
                ],
                "context_window": 5,
                "id": "3",
                "ignore_case": true,
                "instance_regexps": [
                    "\\bID-[0-9]{6}\\b"
                ],
                "pii_category": "Other",
                "pii_type": "student_id",
                "severity": "LOW",
                "threshold_fullmatch_lower": 0.0,
                "threshold_fullmatch_upper": 1.0,
                "threshold_mismatch_lower": 0.0,
                "threshold_mismatch_upper": 1.0,
                "threshold_partialmatch_lower": 0.0,
                "threshold_partialmatch_upper": 1.0
            }
        ]
    }
    

    Get a list of all custom detectors.

    Endpoint

    GET /v3/detectors/custom

    Output

    Field Type Description
    custom_detectors List[Object] List of all user-defined custom detectors.

    Delete a custom detector

    Delete the custom detector with id 3:

    curl -k -XDELETE --user username:pwd https://127.0.0.1:443/v3/detectors/custom/3
    
    {
        "_request_seconds": 0.046,
        "_success": true
    }
    

    Permanently delete a custom detector.

    Endpoint

    DELETE /v3/detectors/custom/{id}

    Input

    Field Type Description
    id String Id of the custom detector to remove.

    Output: JSON with 200 status if all OK, or {"error": "error text"} if something went wrong.

    Migrate custom detectors

    If you need to transfer your custom detectors between different PII Tools installations, export them from one PII Tools instance and import into another.

    The export is a single .json file which you can conveniently move between installations; see Export / import.

    Exclusions

    Some PII detections may be undesirable – either because they're wrong (false positives), or because that particular PII instance is not relevant to the current review.

    For example, during a breach incident investigation, you may want to hide known employee names, so that only the breached customer names appear in your reports.

    Such undesirable PII detections can be hidden from reports on a case-by-case basis, in a process called exclusions.

    exclusions

    How it works

    1. Each exclusion consists of a rule and a note. The rule is a regular expression ("regex") applied to each PII instance and context, in all scans and all files. If the rule matches, the PII is not displayed in reports.
      • Optionally, you can also fill in a note for each exclusion. This note is not used for matching, nor is it displayed anywhere. Its use is solely as your internal note, such as Employee name, don't show this to customers --John 28/5/20, to keep things tidy.
    2. All exclusions are applied at the time of report generation. That is, the PIIs are still detected during a scan, but excluded PIIs are not displayed later in Analytics and in scan reports.
      • This means that if you change your mind later and delete an exclusion, the PII hidden by that exclusion will reappear in your reports.
    3. To manage exclusions, navigate to the "Exclusions" tab in the left-hand side menu. Here you can create a new exclusion, edit existing, or delete exclusions. You can also add exclusions directly from Analytics; see Add new exclusion.
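
    The report-time filtering described above can be modeled in Python as follows. This is an illustration, not PII Tools' implementation: it assumes a rule hides a hit when the regular expression is found anywhere in the instance or in its context (re.search semantics), and that rules are compatible with Python's re syntax. The function name is hypothetical.

```python
import re


def apply_exclusions(pii_hits, rules):
    """Return only the hits that no exclusion rule matches.

    The stored detections are left untouched; filtering happens on a copy,
    which is why deleting a rule makes hidden PII visible again.
    """
    compiled = [re.compile(rule) for rule in rules]

    def excluded(hit):
        return any(rx.search(hit["pii"]) or rx.search(hit.get("context", ""))
                   for rx in compiled)

    return [hit for hit in pii_hits if not excluded(hit)]
```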

    Add new exclusion

    There are two ways to add exclusions: from an existing detection in Analytics, and from the Exclusions tab.

    From Analytics

    1. Using Analytics, navigate to a file that contains the unwanted PII.
    2. Click the "Exclude" button next to the PII instance to be hidden.
    3. In the menu that appears, select either "Exclude this instance" or "Exclude this instance in this exact context":
      • "Exclude this instance" will hide all PII that matches this instance text. For example, if you "Exclude this instance" on an instance of the PII name John Doe, then John Doe will disappear from the reports for all files, emails and databases.
      • "Exclude this instance in this exact context" will hide all PII that matches not just the instance, but also its exact context. This allows you to hide a name only in one file (one context), while keeping the same name visible in another file (another context).
    4. The dashboard will refresh and you will no longer see the excluded PII. Note that other files may be affected too, in case the new exclusion rule also applies to them.

    From scratch

    1. Navigate to the "Exclusions" tab in the left-hand side menu.
    2. Click the "Create new exclusion" button in the top-right corner.
    3. Enter the desired rule and note.
    4. Click the "Create exclusion" button to submit and store the exclusion.

    Create a new exclusion. The returned id is 18 in this example:

    $ curl -k -XPOST --user username:pwd https://127.0.0.1:443/v3/exclusions -H 'Content-Type: application/json' -d'
    {
        "rule": ".*Branch Manager.*",
        "note": "example note"
    }'
    

    The response will look like this:

    {
       "_request_seconds":0.009,
       "_success":true,
       "id":"18",
       "note":"example note",
       "rule":".*Branch Manager.*"
    }
    

    API endpoint

    To create a new exclusion programmatically:

    POST /v3/exclusions

    The payload accepts JSON with two mandatory parameters:

    Parameter Type Description Default
    rule String Regexp. Any PII whose instance or context matches this regexp will be hidden from reports. -
    note String Any text; for your internal use. -

    See the curl code on the right for one example POST call.
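
    When a rule should hide one exact value, regex metacharacters in that value (such as the dots in an IP address) need escaping, otherwise the rule may match more than intended. The Python sketch below builds such a payload; the function name is hypothetical, and it assumes the server's rule syntax accepts the escapes produced by Python's re.escape.

```python
import re


def exclusion_for_instance(instance: str, note: str = "") -> dict:
    """Build a POST /v3/exclusions body that hides exactly this PII instance text."""
    # Anchor the rule and escape metacharacters so only the literal text matches.
    return {"rule": "^" + re.escape(instance) + "$", "note": note}
```

    For example, a rule built this way for 20.152.182.237 matches that address only, whereas an unescaped dot would also match any other character in that position.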

    Edit exclusion

    API call to update an existing exclusion:

    $ curl -k -XPUT --user username:pwd https://127.0.0.1:443/v3/exclusions/18 -H 'Content-Type: application/json' -d'
    {
        "rule": ".*Branch Manager.*",
        "note": "example note"
    }'
    
    {
       "_request_seconds":0.009,
       "_success":true,
       "id":"18",
       "note":"example note",
       "rule":".*Branch Manager.*"
    }
    

    To edit an existing exclusion:

    1. Navigate to the "Exclusions" tab in the left-hand side menu.
    2. Use the search bar on top to filter down all existing rules to just the ones you wish to edit. You can enter words or parts of text to make your search easier. The search works over both rules and notes.
    3. Click the pencil button under "Actions". A new window will open that allows you to adjust both the rule and the note.
    4. When finished editing, don't forget to press the "Update exclusion" button.

    API endpoint

    To update an existing exclusion programmatically:

    PUT /v3/exclusions/<id>

    The exclusion id is the same ID as returned from GET and POST requests and must be valid (not deleted).

    The PUT payload accepts the same parameters as when creating a new exclusion.
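
    As a rough sketch, the same update can be prepared from Python's standard library. The function name `build_update_request` is hypothetical, and the code assumes HTTP Basic authentication, which is what curl's `--user` option sends by default. The request is only built here, not sent.

```python
import base64
import json
import urllib.request


def build_update_request(base_url, exclusion_id, rule, note, username, password):
    """Prepare (but do not send) a PUT /v3/exclusions/<id> request."""
    body = json.dumps({"rule": rule, "note": note}).encode("utf-8")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/v3/exclusions/{exclusion_id}",
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",  # assumption: Basic auth, as with curl --user
        },
    )


req = build_update_request(
    "https://127.0.0.1:443", "18", ".*Branch Manager.*", "updated note",
    "username", "pwd",
)
print(req.get_method(), req.full_url)
```

    To send the request, pass it to `urllib.request.urlopen` together with an SSL context that suits your certificate setup; the curl examples use `-k` to skip certificate verification.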

    List exclusions

    $ curl -k -XGET --user username:pwd https://127.0.0.1:443/v3/exclusions
    
    {
       "_request_seconds":0.006,
       "_success":true,
       "limit":100,
       "offset":0,
       "rules":[
          {
             "id":"16",
             "note":"Created from 'credit_card_ip.pdf' on Mon, 24 Aug 2020 17:47:17 GMT",
             "rule":"^20.152.182.237$"
          },
          {
             "id":"15",
             "note":"John Doe",
             "rule":"^John Doe$"
          }
       ],
       "total_count":2
    }
    

    You can list your existing exclusions under the "Exclusions" tab in the left-hand side menu.

    exclusions

    For your convenience, there's a search bar on top that allows you to filter exclusions by a word or part of text. Only exclusions with a rule or note that match your search will be displayed.

    API endpoint

    To list existing exclusions programmatically:

    GET /v3/exclusions/

    The response is in JSON format. See the curl example to the right for a sample output.
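
    The listing can also be filtered on the client side, similar to the dashboard search bar. The following is a minimal sketch that works on the sample response from the curl example; `find_exclusions` is a hypothetical helper, and its case-insensitive substring matching is an assumption rather than the dashboard's documented behavior.

```python
import json

# Sample response copied from the curl example for GET /v3/exclusions.
response_text = """
{
   "_request_seconds":0.006,
   "_success":true,
   "limit":100,
   "offset":0,
   "rules":[
      {"id":"16", "note":"Created from 'credit_card_ip.pdf' on Mon, 24 Aug 2020 17:47:17 GMT", "rule":"^20.152.182.237$"},
      {"id":"15", "note":"John Doe", "rule":"^John Doe$"}
   ],
   "total_count":2
}
"""


def find_exclusions(response: dict, text: str) -> list:
    """Return the IDs of exclusions whose rule or note contains `text`."""
    needle = text.lower()
    return [
        item["id"]
        for item in response["rules"]
        if needle in item["rule"].lower() or needle in item["note"].lower()
    ]


response = json.loads(response_text)
print(find_exclusions(response, "john"))  # ['15']
```

    The returned IDs are the same ones used in the PUT and DELETE endpoints. The `limit`, `offset` and `total_count` fields suggest that results are paginated; whether the endpoint accepts matching query parameters is not covered here.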

    Delete exclusion

    $ curl -k -XDELETE --user username:pwd https://127.0.0.1:443/v3/exclusions/1

    To delete an existing exclusion:

    1. Navigate to the "Exclusions" tab in the left-hand side menu.
    2. Use the search bar on top to filter down all existing rules to just the ones you wish to delete. You can enter words or parts of text to make your search easier. The search works over both rules and notes.
    3. To delete an exclusion, click the garbage bin button under "Actions". Confirm the pop-up asking you whether you're sure.

    API endpoint

    To delete an existing exclusion programmatically:

    DELETE /v3/exclusions/<id>

    The exclusion id is the same ID as returned by GET and POST requests.

    Migrate exclusions

    If you need to transfer your exclusions between different PII Tools installations, export them from one PII Tools instance and import into another.

    The export is a single .json file which you can conveniently move between installations; see Export / import.

    Export / import

    PII Tools lets you customize your installation, for example by adding Custom PII detectors and PII Exclusions. These customizations are local to a single installation, but you may sometimes need to migrate them to another PII Tools server.

    Typical reasons for migrating the service state include:

    1. Upgrading to a PII Tools version that is not backward compatible.
    2. Keeping multiple PII Tools servers in sync, including their custom state, for load balancing.
    3. Creating a backup.

    PII Tools supports these workflows through export / import.

    Export

    Export the state of this instance into a single JSON file:

    $ curl -k -XGET -JLO --user username:pwd https://127.0.0.1:443/v3/state
    
    Saved to filename 'pii-export-2020-08-26-12:32:58.json'
    

    To export the state of your instance from your web dashboard, click the "Export" button in the ⓘ information panel.

    export import buttons

    The export will produce a single .json file which contains all the state information. You can store, archive, and later import this file into another instance.

    API endpoint

    To export the state of PII Tools programmatically:

    GET /v3/state

    The response will be a file attachment in the JSON format, which you can store or rename for later use. See on the right for a curl example.

    Import

    $ curl -k -XPOST --user username:pwd https://127.0.0.1:443/v3/state -H 'Content-Type: application/json' -d @'pii-export-2020-08-26-12:32:58.json'
    
    {
       "_request_seconds":0.007,
       "_success":true,
       "custom_detectors":{
          "created":0,
          "updated":0
       },
       "exclusions":{
          "created":0,
          "updated":3
       }
    }
    

    To import custom detectors and exclusions, click the "Import" button in your dashboard and select a previously exported .json file.

    export import buttons

    API endpoint

    To import the state of another PII Tools instance programmatically:

    POST /v3/state

    The POST payload must be a valid export file, in the .json format from Export. See on the right for a curl example.
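
    After an import, the response reports how many custom detectors and exclusions were created or updated. The sketch below reads those counts from the sample response shown in the curl example; `summarize_import` is a hypothetical helper, and treating every object with `created` and `updated` keys as a section is an assumption.

```python
import json

# Sample response copied from the curl example for POST /v3/state.
response_text = """
{
   "_request_seconds":0.007,
   "_success":true,
   "custom_detectors":{"created":0, "updated":0},
   "exclusions":{"created":0, "updated":3}
}
"""


def summarize_import(response: dict) -> dict:
    """Collect (created, updated) counts per section; raise if the import failed."""
    if not response.get("_success"):
        raise RuntimeError("import was not successful")
    return {
        name: (section["created"], section["updated"])
        for name, section in response.items()
        if isinstance(section, dict) and {"created", "updated"} <= section.keys()
    }


summary = summarize_import(json.loads(response_text))
print(summary)  # {'custom_detectors': (0, 0), 'exclusions': (0, 3)}
```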

    Scan reports

    Scanning results can be accessed in three ways:

    1. An interactive HTML drill-down report, meant to be reviewed by humans.
    2. A structured JSON, CSV and Excel-spreadsheet export of all detected personal records, meant to be processed by computers and integrated into compliance platforms.
    3. An interactive HTML report with the names of all affected people. This report is primarily for the breach incident workflow.

    All types of reports can be downloaded using the Download report API, from SAR Analytics, or through the web dashboard button under "Actions" in each scan.

    download actions

    Drill-down report

    These reports are interactive HTML web pages at three successively finer levels of resolution:

    1. Summary page (index.html)
      • Summarizes overall PII statistics by file type (PDF, CSV, archive, etc.), PII type and severity.
    2. Listing page
      • Files and directories that match search criteria, grouped by location.
      • Filter by severity, file type and PII type.
      • Listing is a table that provides metadata about the matching file: file name, location, size, file type, severity, PII types.
    3. File page
      • Details about the PII detected in a particular file, with PII instances highlighted in context.

    The report can be downloaded as a ZIP archive from the web UI, or using the Download report API.

    Summary Report

    JSON report

    Example of one JSON line (reformatted for easier reading):

    {
        "scan_id": "1",
        "object_id": "1",
        "scan_name": "s3 small",
        "status": "SCANNED",
        "ended": "2019-07-25 14:43:12.704326",
        "enqueued": "2019-07-25 14:43:10.822782",
        "errors": [],
        "pii": [
            {
                "confidence": 1.0,
                "context": ", From : Name : Mustafa Abdul The Branch Manager Address :",
                "pii": "Mustafa Abdul",
                "pii_category": "Personal",
                "pii_type": "name",
                "position": 105
            },
            {
                "confidence": 1.0,
                "context": "Account Transfer  \nA/c No. GL28 0219 2024 5014 48",
                "pii": "GL28 0219 2024 5014 48 ",
                "pii_category": "Financial",
                "pii_type": "bank_account",
                "position": 418
            }
        ],
        "processing": {
            "_time": 1.592280626296997,
            "_time_children": 1.5919265747070312,
            "_time_self": 0.0003540515899658203,
            "language": "en",
            "language_confidence": 1.0,
            "severity": "3-CRITICAL"
        },
        "storage": {
            "content_type": "application/pdf",
            "doctype": "pdf",
            "filename": "bank_form.pdf",
            "filesize": 47134,
            "last_modified": 1543349581.0,
            "location": "my_bucket/bank_form.pdf",
            "owner": "johndoe",
            "storage_type": "s3"
        }
    }
    

    To access the detected information in a computer-friendly way, without the HTML formatting and summaries, you can download it in the standard JSON format. This format is convenient for further processing or integration.

    The JSON record schema (see an example to the right):

    Field Type Description
    scan_id String ID of the scan this file belongs to. Uniquely identifies a scan.
    scan_name String Name of the scan this file belongs to. Multiple scans can have the same name.
    object_id String ID of the object. Uniquely identifies each object.
    status String Scan status of this file. One of PENDING, SCANNING, SKIPPED, SCANNED, FAILED.
    pii List[Object] List of all detected PIIs. Each element includes the actual detected instance under pii, its context under context, detection confidence in confidence, instance character offset in the original document as position, and pii_type and pii_category as the type and category classes of the instance.
    storage Object The file's metadata taken from the original storage, such as its file size, location, owner, permissions, last modified date etc. Different data storages offer different metadata.
    processing Object Additional non-PII file attributes inferred from its content. Includes auto-detected language and severity level.
    errors List[Object] List of errors that occurred while scanning this file. If a file was SKIPPED or FAILED, you'll find the reason here.
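
    A short sketch of how such a report might be post-processed, assuming the downloaded file holds one JSON record per line as in the example. The helper `count_pii_by_severity` is hypothetical, and the sample record is abbreviated from the example to the fields the code uses.

```python
import json
from collections import Counter

# One record, abbreviated from the documented example and serialized onto one line.
sample_line = json.dumps({
    "scan_id": "1",
    "object_id": "1",
    "status": "SCANNED",
    "pii": [
        {"pii": "Mustafa Abdul", "pii_type": "name", "pii_category": "Personal"},
        {"pii": "GL28 0219 2024 5014 48 ", "pii_type": "bank_account", "pii_category": "Financial"},
    ],
    "processing": {"language": "en", "severity": "3-CRITICAL"},
    "storage": {"location": "my_bucket/bank_form.pdf"},
})


def count_pii_by_severity(lines):
    """Map each severity level to a Counter of the PII types found at that level."""
    counts = {}
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        severity = record.get("processing", {}).get("severity", "UNKNOWN")
        pii_types = (item["pii_type"] for item in record.get("pii", []))
        counts.setdefault(severity, Counter()).update(pii_types)
    return counts


counts = count_pii_by_severity([sample_line])
print(dict(counts["3-CRITICAL"]))  # {'name': 1, 'bank_account': 1}
```

    For a real report, pass an open file object instead of the one-element list; iterating over a text file yields one line at a time.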

    Excel report

    The xlsx Excel export format is compatible with spreadsheet software such as Microsoft Excel, Google Sheets, or OpenOffice.

    The Excel export will contain one metadata field per line, for each exported object. To group metadata fields by object, sort the spreadsheet by the object_id column.

    Excel export

    Simple Excel report

    This is a simplified Excel report. Each sheet row corresponds to one object (file, email, SQL table…), and contains only summary information about the object location, severity and what types of PII were detected inside. The actual PII instances are not listed.

    Use this format if you don't need as much detail as in the JSON or Full Excel reports.

    CSV report

    Use the csv export format for a flat listing in a widely supported plain-text format. Each CSV row represents one metadata item of one object:

    CSV format column Description
    scan_name Name of the scan this file belongs to.
    scan_type Storage type of the scan (S3 bucket, endpoint device, SQL database, etc).
    object_id Unique identifier for this file.
    category Category of the metadata key: processing, storage, or a PII category like Personal or Financial.
    field Name of the metadata key, e.g. location, credit_card, filesize etc.
    value Value of the metadata key, e.g. 2-HIGH for severity, or 109308 for filesize, or my_bucket/csv/metrics.csv for location on an S3 scan.
    pii_context Context surrounding the PII instance. Only present in PII rows.
    pii_position Character offset of the PII instance in the file. Only present in PII rows.
    pii_confidence Detection confidence of the PII instance. Only present in PII rows.

    This format is very similar to the Excel report format, but in a flat .csv file rather than a formatted .xlsx Excel file.
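
    Because each CSV row carries a single metadata item, rows usually need to be regrouped per object before further processing. The sketch below shows one way to do that with Python's `csv` module. The sample rows, the column order in the header, and the `s3` value for `scan_type` are illustrative assumptions, and `group_by_object` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only; column names follow the table above.
sample_csv = """scan_name,scan_type,object_id,category,field,value,pii_context,pii_position,pii_confidence
s3 small,s3,1,storage,location,my_bucket/bank_form.pdf,,,
s3 small,s3,1,processing,severity,3-CRITICAL,,,
s3 small,s3,1,Personal,name,Mustafa Abdul,"From : Name : Mustafa Abdul The Branch Manager",105,1.0
"""


def group_by_object(csv_text):
    """Regroup one-item-per-row CSV data into one entry per object_id."""
    objects = defaultdict(lambda: {"metadata": {}, "pii": []})
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = objects[row["object_id"]]
        if row["category"] in ("processing", "storage"):
            # Plain metadata row: field/value pair describing the object.
            entry["metadata"][row["field"]] = row["value"]
        else:
            # Any other category is a PII category such as Personal or Financial.
            entry["pii"].append((row["field"], row["value"], row["pii_context"]))
    return dict(objects)


grouped = group_by_object(sample_csv)
print(grouped["1"]["metadata"]["severity"])  # 3-CRITICAL
print(len(grouped["1"]["pii"]))              # 1
```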

    Affected persons

    This report is similar to the interactive drill-down report, but focuses on presenting data from the perspective of individual people.

    The interactive report has three layers:

    1. Summary page (index.html)
      • How many people appear in the data? Who are they?
      • Each person's name is listed, along with information about how many files contain that name.
    2. Listing page
      • For each name, a list of all locations that contain this name.
    3. File page
      • Full details about the PII detected in a particular file, including the name and all other PII information.

    The report can be downloaded as a ZIP archive from the web UI, or using the Download report API.

    affected persons

    Audit log

    An audit report is a detailed listing of all files accessed during a scan, no matter their scanning result. FAILED and SKIPPED files are included too, along with timestamps of access and error messages (if any).

    The report is a CSV file, with one file per line. The CSV columns are as follows:

    Audit format column Description
    scan_name Name of the scan this file belongs to.
    scan_type Storage type of the scan (S3 bucket, endpoint device, SQL database, etc).
    object_id Unique identifier for this file.
    location Full location of this file.
    scan_started When this file was put into the scanning queue.
    scan_ended When the processing of this file was finalized.
    status Scan status of this file. One of PENDING, SCANNING, SKIPPED, SCANNED, FAILED.
    severity Automatically assigned severity level classification for this file.
    note Notes and error messages associated with this file.
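
    A common use of the audit log is to list files that were not scanned successfully. The following sketch filters for SKIPPED and FAILED rows. The sample rows (including the second file and its note) are made up for illustration, the column order is an assumption, and `unscanned_files` is a hypothetical helper.

```python
import csv
import io

# Made-up audit rows for illustration; column names follow the table above.
sample_audit = """scan_name,scan_type,object_id,location,scan_started,scan_ended,status,severity,note
s3 small,s3,1,my_bucket/bank_form.pdf,2019-07-25 14:43:10,2019-07-25 14:43:12,SCANNED,3-CRITICAL,
s3 small,s3,2,my_bucket/broken.zip,2019-07-25 14:43:10,2019-07-25 14:43:11,FAILED,,hypothetical error message
"""


def unscanned_files(csv_text):
    """Return (location, status, note) for every SKIPPED or FAILED file."""
    return [
        (row["location"], row["status"], row["note"])
        for row in csv.DictReader(io.StringIO(csv_text))
        if row["status"] in ("SKIPPED", "FAILED")
    ]


for location, status, note in unscanned_files(sample_audit):
    print(f"{status}: {location} ({note})")
```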

    Available PII types

    These are the concrete personal, sensitive and intimate data types PII Tools can detect:

    PII Category PII Type Example instance Note
    Financial credit_card 3547011095740842 VISA, MASTERCARD, MASTERCARD_NEW, AMEX, CHINA T_UNION, CHINA UNION_PAY, DINERS, DINERS_2, DINERS/ENROUTE, DISCOVER, RUPAY, INTER_PAYMENT, INTER_PAYMENT_2, MAESTRO, DANKORT, MIR, JCB, LASER, SWITCH, TROY, UATP, VERVE, SOLO, FORBRUGSFORENINGEN
    Supported language context: Any.
    Financial bank_account RS39 2712 7251 5923 5161 28
    Supported language context: EN, PT, FR, BR, NL, SA, PL, CS.
    Financial routing_number 111000012 ABA, Sort code, BSB, SWIFT, CA Transit Number
    Supported language context: EN, DE, FR.
    Sensitive race Asian
    Supported language context: EN.
    Sensitive gender Female
    Supported language context: EN, PT, BR, ES, NL, SA, PL, CS.
    Sensitive religious_views about consciousness are generally shunned as psudo-scientific heretics by the hard science community. Conciousness is a meta-physical or philosophical concept.</p>\n\n<p>"I think, therefore I am." is the only proof that consciousness exists that I am aware of. Therefore, you cannot even prove that a person other', "a program that simulates the results of consciousness?</p>\n\n<p>I don't believe that you can program conscious AI, nor could you prove that you have done so. Consciousness isn't something that can ever be marketed. You can only market the AI on the basis of it's
    Supported language context: EN.
    Sensitive sexual_preference It's only recently that I've come out to myself as being bisexual and learning to not just tolerate it but honor it.
    Supported language context: EN.
    Personal name Sean Connery Full name
    Supported language context: Any.
    Personal address San Raton, California 99109 Full address
    Supported language context: Any.
    Personal face [59, 51, 112, 112] Profile picture (person's face) bounding box coordinates
    Supported language context: Any.
    Personal date_of_birth 1962
    Supported language context: EN, PT, BR, DE, FR, ES, IT, NL, SA, TR, RO, PL, CS.
    Personal phone 408.555.1296
    Supported language context: EN, PT, BR, TR, PL, CS, DE, ES, NL, SA.
    Personal email john.arnold@enron.com
    Supported language context: Any.
    Personal city Adams Only available for structured data (CSV, XLS, SQL etc)
    Supported language context: EN, FR, DE, ES, PT, BR, NL, SA.
    Personal country USA Only available for structured data (CSV, XLS, SQL etc)
    Supported language context: EN, PT, BR.
    Personal country_code SN Only available for structured data (CSV, XLS, SQL etc)
    Supported language context: EN.
    Personal first_name Garth Only available for structured data (CSV, XLS, SQL etc)
    Supported language context: EN, FR, PT, BR, NL, SA, PL, CS, TR.
    Personal last_name Stofko Only available for structured data (CSV, XLS, SQL etc)
    Supported language context: EN, FR, PT, BR, NL, DA, PL, CS, TR.
    Medical health Patient Information Name: Monica Latte Patient ID: 0000-44444 Birth Date: 04/04/1950 Gender: Female Marital Status: Divorced Problems: DIABETES MELLITUS (ICD-250.) HYPERTENSION, BENIGN ESSENTIAL (ICD-401.1) Medications: PRINIVIL TABS 20 MG (LISINOPRIL) 1 po qd Last Refill: #30 x 2 : Carl Savem MD (08/27/2010) HUMULIN INJ 70/30 (INSULIN REG & ISOPHANE (HUMAN)) 20 units ac breakfast Last Refill: #600 u x 0 : Carl Savem MD
    Supported language context: EN.
    Medical health_id 1234-123-123-AZ Medicare number or equivalent (USA, Canada, Australia, UK NHS, France CV)
    Supported language context: EN, FR.
    Medical icd G44.311 World Health Organization ICD codes (version 9, 10, 11)
    Supported language context: EN.
    Security ip 25.27.159.60
    Supported language context: EN.
    Security username UserID: MNETTEL
    Supported language context: EN, NL, SA, DE, ES, PT, BR, FR, IT, RO, CS, PL.
    Security password password: enron4
    Supported language context: EN, NL, SA, DE, ES, PT, BR, FR, IT, RO, CS, PL.
    National id_scan scan or photograph (image) Digital scans or camera snapshots of passports and other personal IDs with a machine-readable zone (MRZ). Reported context equals the X,Y coordinates of the passport within the image.
    Supported language context: Any.
    National driving_licence 609-53-5588 US states, Canada, Australia, UK, France
    Supported language context (unstructured): EN, FR, PT, BR, NL, SA, PL.
    Supported language context (structured): EN, FR, PT, BR, NL, SA, PL, RO, ES, DE, TR.
    National passport CX2345678 International passport numbers: EU, USA, Canada, Japan, Korea
    Supported language context: EN, FR, PT, BR, DE, NL, SA, PL, KR, JP, ES, RO, TR, RU.
    National tax_id 988-88-8889 National Tax ID or equivalent (USA TIN, UK UTR, NINO, Australia TFN, Canada SIN, EU VAT, Brazil CPF, German Steuernummer, Spain NIF)
    Supported language context: EN, BR, FR, DE; all EU (VAT).
    National ssn 296-12-3298 Social security number or equivalent (USA SSN, Canada SIN, UK NINO, Australia TFN, France CNI, INSEE, NIR)
    Supported language context: EN, FR.