PII Access Control S3 Object Lambda function

This serverless app helps you control access to PII (Personally Identifiable Information) in valid text files stored in S3. The app deploys a Lambda function that can be attached to an S3 Object Lambda access point.
The Lambda function internally uses Amazon Comprehend to detect PII entities in the text.

App Architecture

Architecture Diagram

The Lambda function is optimized to use Comprehend's ContainsPiiEntities API.

  1. The Lambda function is invoked with a request containing information about the S3 object to get and transform.
  2. The request contains an S3 presigned URL used to fetch the requested object.
  3. The data is split into chunks small enough to be accepted by Comprehend's ContainsPiiEntities API, and the API is called with each chunk.
  4. The responses from all chunks are aggregated.
  5. The Lambda function calls back S3 with the response: either the text data, or an exception if the file contains PII.
  6. If any failure occurs during processing, the Lambda function returns an appropriate error response to S3, which is returned to the original caller.
  7. The Lambda function exits with code 0 (i.e., without error) if no error occurred; otherwise it fails.
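
Steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: `split_into_chunks`, `document_contains_pii`, and the injected `classify` callable are hypothetical names, and the greedy word-packing strategy is an assumption. In a real deployment `classify` would presumably wrap a call to Comprehend's ContainsPiiEntities API and return its `Labels` list (items with `Name` and `Score`).

```python
from typing import Callable, Dict, List, Optional, Set


def split_into_chunks(text: str, max_bytes: int, overlap_words: int) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most
    max_bytes UTF-8 bytes, repeating the last overlap_words words of a
    chunk at the start of the next one (assumed strategy)."""
    words = text.split()
    chunks: List[str] = []
    start = 0
    while start < len(words):
        end, size = start, 0
        while end < len(words):
            # One extra byte for the joining space, except for the first word.
            extra = len(words[end].encode("utf-8")) + (1 if end > start else 0)
            if size + extra > max_bytes:
                break
            size += extra
            end += 1
        if end == start:
            # A single oversized word is emitted alone (hypothetical policy).
            end = start + 1
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break
        # Step back to create the overlap, but always make progress.
        start = max(end - overlap_words, start + 1)
    return chunks


def document_contains_pii(
    text: str,
    classify: Callable[[str], List[dict]],
    max_bytes: int = 50000,
    overlap_words: int = 20,
    threshold: float = 0.5,
    entity_types: Optional[Set[str]] = None,  # None means "all types"
) -> bool:
    """Classify every chunk, keep the best score per entity type, and
    report whether any type of interest reaches the threshold."""
    best: Dict[str, float] = {}
    for chunk in split_into_chunks(text, max_bytes, overlap_words):
        for label in classify(chunk):
            name = label["Name"]
            best[name] = max(best.get(name, 0.0), label["Score"])
    return any(
        score >= threshold and (entity_types is None or name in entity_types)
        for name, score in best.items()
    )
```

Passing the classifier in as a callable keeps the sketch testable without AWS credentials; the overlap between chunks reduces the chance that an entity straddling a chunk boundary is missed.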

Installation Instructions

  1. Create an AWS account if you do not already have one, and log in.
  2. Go to the app's page on the Serverless Application Repository.
  3. Provide the required app parameters (see parameter details below) and click "Deploy".

Parameters

The following parameters can be tuned to get the desired behavior.

Environment variables

The following environment variables can be set on the Lambda function to get the desired behavior:

  1. LOG_LEVEL : Log level for Lambda function logging, e.g., ERROR, INFO, DEBUG, etc. Default: INFO.
  2. UNSUPPORTED_FILE_HANDLING : Handling logic for unsupported files. Valid values are PASS and FAIL. Default: FAIL. If set to FAIL, the function throws UnsupportedFileException when the requested object is of an unsupported type.
  3. IS_PARTIAL_OBJECT_SUPPORTED : Whether to support partial objects. Accessing a partial object through HTTP headers such as byte-range can corrupt the object and/or affect PII detection accuracy. Valid values are TRUE and FALSE. Default: FALSE.
  4. DOCUMENT_MAX_SIZE_CONTAINS_PII_ENTITIES : Maximum document size (in bytes) to be used for calls to Comprehend's ContainsPiiEntities API, which classifies the PII entity types present in the document. Default: 50000.
  5. PII_ENTITY_TYPES : Comma-separated list of PII entity types to be considered for access control. Refer to Comprehend's documentation for the list of supported PII entity types. Default: ALL, which signifies all entity types that Comprehend supports.
  6. SUBSEGMENT_OVERLAPPING_TOKENS : Number of tokens/words to overlap between segments of a document when chunking is needed because of the maximum document size limit. Default: 20.
  7. DOCUMENT_MAX_SIZE : Default maximum document size (in bytes) that this function can process; larger documents cause an exception for too large a document size.
  8. CONFIDENCE_THRESHOLD : The minimum prediction confidence score above which PII classification and detection is considered the final answer. Valid range: 0.5 to 1.0. Default: 0.5.
  9. MAX_CHARS_OVERLAP : Maximum number of characters to overlap between segments of a document when chunking is needed because of the maximum document size limit. Default: 2.
  10. DEFAULT_LANGUAGE_CODE : Default language of the text to be processed. This code is used when interacting with Comprehend. Default: en.
  11. CONTAINS_PII_ENTITIES_THREAD_COUNT : Number of threads to use for calling Comprehend's ContainsPiiEntities API. This controls the number of simultaneous calls made from this Lambda function. Default: 20.
  12. PUBLISH_CLOUD_WATCH_METRICS : Determines whether to publish metrics to CloudWatch. Default: true.
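
As a rough illustration of how a few of these variables and their documented defaults might be read at startup, consider the sketch below. The `Settings` class and `load_settings` function are hypothetical names, not the app's actual code, and the validation shown is an assumption based on the valid values listed above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class Settings:
    log_level: str
    unsupported_file_handling: str
    is_partial_object_supported: bool
    pii_entity_types: Tuple[str, ...]
    confidence_threshold: float
    default_language_code: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read a subset of the documented variables, falling back to the
    defaults listed in this README (hypothetical helper)."""
    env = os.environ if env is None else env

    handling = env.get("UNSUPPORTED_FILE_HANDLING", "FAIL").upper()
    if handling not in ("PASS", "FAIL"):
        raise ValueError("UNSUPPORTED_FILE_HANDLING must be PASS or FAIL")

    threshold = float(env.get("CONFIDENCE_THRESHOLD", "0.5"))
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("CONFIDENCE_THRESHOLD must be between 0.5 and 1.0")

    # Comma-separated list; the single value ALL means every supported type.
    types = tuple(
        t.strip() for t in env.get("PII_ENTITY_TYPES", "ALL").split(",") if t.strip()
    )

    return Settings(
        log_level=env.get("LOG_LEVEL", "INFO"),
        unsupported_file_handling=handling,
        is_partial_object_supported=(
            env.get("IS_PARTIAL_OBJECT_SUPPORTED", "FALSE").upper() == "TRUE"
        ),
        pii_entity_types=types,
        confidence_threshold=threshold,
        default_language_code=env.get("DEFAULT_LANGUAGE_CODE", "en"),
    )
```

Calling `load_settings()` with no argument reads from the process environment; passing a plain dictionary makes the behavior easy to check locally.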

Runtime variables

You can add the following arguments to the S3 Object Lambda access point configuration payload to override the default values used by the Lambda function. These values take precedence over the environment variables. Provide them as a JSON string, as in the following example:

    ...
    "payload": "{\"pii_entity_types\" : [\"CREDIT_DEBIT_NUMBER\"],\"mask_mode\":\"MASK\", \"mask_character\" : \"*\",\"confidence_threshold\":0.6,\"language_code\":\"en\"}"
    ...

Use these parameters to get different behavior from different access point configurations attached to the same Lambda function.

  1. pii_entity_types : List of PII entity types to be considered for access control, e.g. ["SSN","CREDIT_DEBIT_NUMBER"].
  2. confidence_threshold : The minimum prediction confidence score above which PII classification and detection is considered the final answer.
  3. language_code : Language of the text. This is used when interacting with Comprehend.
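
The precedence rule described above could be implemented along these lines. This is a sketch under assumptions: `effective_settings` is a hypothetical helper, and it assumes the payload string arrives in the invocation event under `configuration.payload`, as in the S3 Object Lambda event format.

```python
import json
from typing import Any, Dict

# The runtime parameters documented in this section.
RUNTIME_KEYS = ("pii_entity_types", "confidence_threshold", "language_code")


def effective_settings(event: Dict[str, Any], defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Merge payload overrides on top of environment-derived defaults.
    Keys other than the documented runtime parameters are ignored."""
    raw = event.get("configuration", {}).get("payload") or "{}"
    overrides = json.loads(raw)
    merged = dict(defaults)
    for key in RUNTIME_KEYS:
        if key in overrides:
            merged[key] = overrides[key]
    return merged
```

An event with an empty or missing payload simply yields the defaults, so the same function can serve several access points with different configurations.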

App Outputs

Successful response

If the text file does not contain PII, its content is returned in the GetObject API response. If it does contain PII, the AccessDenied error described below is returned instead.

Error responses

The Lambda function forwards any standard S3 error response it receives while downloading the file from S3.

In addition, the Lambda function can return the following error responses:

| Status Code | Error Code | Error Message | Description |
|---|---|---|---|
| BAD_REQUEST_400 | InvalidRequest | Lambda function has been incorrectly setup | An incorrect configuration that prevents the Lambda function from even starting to handle incoming events |
| BAD_REQUEST_400 | UnexpectedContent | Unsupported file encountered for determining PII | Returned when the caller tries to get a file that is not valid UTF-8 (e.g., an image) and the UNSUPPORTED_FILE_HANDLING variable is set to FAIL |
| BAD_REQUEST_400 | EntityTooLarge | Size of the requested object exceeds maximum file size supported | Returned when the caller tries to get an object larger than the maximum supported file size |
| BAD_REQUEST_400 | RequestTimeout | Failed to complete document processing within time limit | Returned when the Lambda function cannot finish processing the document within the time limit. This can happen because the file is too large or because requests are being throttled by S3 or Comprehend |
| INTERNAL_SERVER_ERROR_500 | InternalError | An internal error occurred while processing the file | Any other error that occurred while processing the object |
| FORBIDDEN_403 | AccessDenied | Document Contains PII | The requested document has been inferred to contain PII |
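
To make the error path concrete, the sketch below builds the arguments for S3's WriteGetObjectResponse call from an invocation event. `error_response_kwargs` and the `ERRORS` mapping are hypothetical; the sketch assumes the numeric status codes 400, 403, and 500 behind the symbolic names above, and that the route and token are found under `getObjectContext` in the event.

```python
from typing import Any, Dict, Tuple

# (status code, error code, error message), taken from the table above.
ERRORS: Dict[str, Tuple[int, str, str]] = {
    "pii_found": (403, "AccessDenied", "Document Contains PII"),
    "too_large": (
        400,
        "EntityTooLarge",
        "Size of the requested object exceeds maximum file size supported",
    ),
    "internal": (500, "InternalError", "An internal error occurred while processing the file"),
}


def error_response_kwargs(event: Dict[str, Any], error_key: str) -> Dict[str, Any]:
    """Build keyword arguments for write_get_object_response so that the
    original GetObject caller receives the chosen error."""
    status, code, message = ERRORS[error_key]
    context = event["getObjectContext"]
    return {
        "RequestRoute": context["outputRoute"],
        "RequestToken": context["outputToken"],
        "StatusCode": status,
        "ErrorCode": code,
        "ErrorMessage": message,
    }


# With boto3 available and suitable permissions, the result could be sent as:
#   boto3.client("s3").write_get_object_response(**error_response_kwargs(event, "pii_found"))
```

Keeping the argument construction separate from the network call makes the mapping from table rows to responses easy to verify in isolation.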

Metrics

Metrics are published after each invocation of the Lambda function on a best-effort basis (failures in CloudWatch metric publishing are ignored).

All metrics are published under the namespace ComprehendS3ObjectLambda.

Metrics for processed document

| MetricName | Description | Unit | Dimensions |
|---|---|---|---|
| PiiDocumentsProcessed | Emitted after processing a document that contains PII | Count | S3ObjectLambdaAccessPoint, Language |
| DocumentsProcessed | Emitted after processing any document | Count | S3ObjectLambdaAccessPoint, Language |
| PiiDocumentTypesProcessed | Emitted after processing a document that contains PII, once for each type of PII of interest | Count | S3ObjectLambdaAccessPoint, Language, PiiEntityType |
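
A best-effort publisher for the document metrics above might look like the following sketch. The function names are hypothetical, the value of 1 per processed document is an assumption, and `put_metric_data` stands in for a callable such as the one on a boto3 CloudWatch client.

```python
from typing import Any, Callable, Dict, List

NAMESPACE = "ComprehendS3ObjectLambda"


def document_metrics(access_point: str, language: str, contains_pii: bool) -> Dict[str, Any]:
    """Build a PutMetricData-style payload for one processed document."""
    dimensions = [
        {"Name": "S3ObjectLambdaAccessPoint", "Value": access_point},
        {"Name": "Language", "Value": language},
    ]
    names: List[str] = ["DocumentsProcessed"]
    if contains_pii:
        names.append("PiiDocumentsProcessed")
    data = [
        {"MetricName": name, "Dimensions": dimensions, "Unit": "Count", "Value": 1}
        for name in names
    ]
    return {"Namespace": NAMESPACE, "MetricData": data}


def publish_best_effort(put_metric_data: Callable[..., Any], payload: Dict[str, Any]) -> bool:
    """Try to publish; swallow any failure so that metric problems never
    affect the response to the caller. Returns True on success."""
    try:
        put_metric_data(**payload)
        return True
    except Exception:
        return False


# With boto3 available, this could be wired up as:
#   publish_best_effort(boto3.client("cloudwatch").put_metric_data,
#                       document_metrics("my-access-point", "en", True))
```

Swallowing the exception mirrors the best-effort behavior described in this section; a production version would likely also log the failure.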

Metrics for Comprehend operations

| MetricName | Description | Unit | Dimensions |
|---|---|---|---|
| Latency | The latency of Comprehend DetectPiiEntities API | Milliseconds | Comprehend, DetectPiiEntities |
| Latency | The latency of Comprehend ContainsPiiEntities API | Milliseconds | Comprehend, ContainsPiiEntities |
| ErrorCount | The error count of Comprehend DetectPiiEntities API | Count | Comprehend, DetectPiiEntities |
| ErrorCount | The error count of Comprehend ContainsPiiEntities API | Count | Comprehend, ContainsPiiEntities |

Metrics for S3 operations

| MetricName | Description | Unit | Dimensions |
|---|---|---|---|
| Latency | The latency of S3 WriteGetObjectResponse API | Milliseconds | S3, WriteGetObjectResponse |
| Latency | The latency of downloading a file from a presigned S3 URL | Milliseconds | S3, DownloadPresignedUrl |
| ErrorCount | The error count of S3 WriteGetObjectResponse API | Count | S3, WriteGetObjectResponse |
| ErrorCount | The error count of downloading a file from a presigned S3 URL | Count | S3, DownloadPresignedUrl |

License Summary

This code is made available under the MIT-0 license. See the LICENSE file.