Ricoh eDiscovery

The Four Prerequisites of Automatic Redaction (Complete With Beyoncé Examples)

Posted by Ian R. Sinclair | 8 minute read

Feb 24, 2021 4:33:56 PM


Redaction of privileged content or personal information is commonplace before documents are produced to an external party. Manual redaction has predominated throughout my 10 years in eDiscovery, but, as of late, automatic redaction tools are gaining ground. Yet, in my experience with the current tools, automatic redaction should be relied upon only in limited (or specialized) circumstances and often requires extensive eyes-on quality control.

Would automatic redaction be effective in your situation? The answer is a definitive ‘yes’ if your matter meets the four prerequisites detailed below. The further your situation diverges from these ideal conditions, the more your decision as to whether an automatic redaction tool is appropriate will be dictated by your risk appetite for over-redaction (redacting content that is neither privileged nor personal) and/or under-redaction (failing to redact content that is).

1. High Precision and High Recall Terms

Automatic redaction is effective if redaction is required of terms with high precision and high recall.

Certain types of information are well suited to automatic redaction because the terms that identify them have both high precision and high recall. This is often the case with personal information.

Take, for example, the name “Beyoncé Knowles” or “Beyoncé”. As a search term, “Beyoncé” will likely always refer to the singer (high precision) and will likely capture most, if not all, references to the singer (high recall), especially if coupled with her nicknames of “Queen B” or “Bey”.

High Precision, in this context, means that searching for a term will only provide sought-after results. “Beyoncé” search results will likely only pertain to the desired singer because “Beyoncé” is an uncommon name.

High Recall, in this context, means that searching for a term, or related terms, will capture all references to the target. “Beyoncé” search results will likely capture most references to the desired singer. Pairing “Beyoncé” with “Queen B” would improve recall further, without affecting precision. And while pairing “Beyoncé” with “Bey” or “Knowles” would also improve recall, doing so would likely reduce precision somewhat (due to the risk of references to another Knowles or someone using “Bey” as a term of endearment).

Most names and terms, however, lack one (or both) of high precision or high recall. To my fellow Better Call Saul aficionados, precise terms or names, like “Hamlin Hamlin McGill” and “Kim Wexler”, may have questionable recall because references to ‘HHM’ or to ‘Kim’ will be missed. Meanwhile, high recall terms or names, like “Kim” and “Hamlin”, may lack precision if unintended Kims (likely) or Hamlins (less likely) are captured.

The automatic redaction of high precision terms ensures that the redacted content consists of privileged content or personal information, and not of non-privileged or non-personal information. The automatic redaction of high recall terms ensures that all privileged content or personal information is redacted, and not missed.
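The trade-off can be put in numbers. The following is a minimal sketch in Python, assuming a hypothetical review set in which ten passages truly refer to Kim Wexler; the passage IDs and hit counts are invented for illustration and do not come from any real matter or tool.

```python
def precision_recall(hits, relevant):
    """Return (precision, recall) for a set of search hits.

    hits:     IDs of passages the search term matched.
    relevant: IDs of passages that truly refer to the target.
    """
    hits, relevant = set(hits), set(relevant)
    true_positives = len(hits & relevant)
    precision = true_positives / len(hits) if hits else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: passages 1-10 truly refer to Kim Wexler.
relevant = set(range(1, 11))

# "Kim Wexler" matches only 6 of those passages, and nothing else.
full_name_hits = {1, 2, 3, 4, 5, 6}

# "Kim" matches all 10, plus 5 passages about other Kims.
first_name_hits = set(range(1, 16))

for label, hits in [("Kim Wexler", full_name_hits), ("Kim", first_name_hits)]:
    p, r = precision_recall(hits, relevant)
    print(f"{label}: precision={p:.2f}, recall={r:.2f}")
```

In this invented set, the full name never redacts the wrong passage but misses four of the ten, while the first name alone catches everything at the cost of redacting five passages about other people.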

Unfortunately, privileged content cannot be easily encapsulated using search terms alone. While counsel or law firm names (high precision) and privilege terms (high recall) are commonly employed to identify privileged content, these terms and names are merely signifiers of potentially privileged content. The terms, themselves, are not of importance. Automatic redaction of those terms and names alone would fail to obscure any meaningful content. In other words, high precision and high recall, with respect to redaction, are both absent.

One strategy to capture privileged content is to expand redaction to the line/row, paragraph or page/slide-level on a term-specific basis. Rules are created in the automatic redaction tools whereby the identification of a term could result in an entire paragraph or page being redacted. This strategy still results in under- and over-redaction in a production, but the risk of producing privileged information is dramatically reduced.
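A paragraph-level rule of this kind might look like the sketch below. It is an illustration only, not the behaviour of any particular redaction product: the function name, the `[REDACTED]` label and the assumption that paragraphs are separated by blank lines are all hypothetical.

```python
import re


def redact_paragraphs(text, terms, label="[REDACTED]"):
    """Replace every paragraph that contains any term with a label.

    Assumes paragraphs are separated by one or more blank lines.
    """
    if not terms:
        return text
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(label if pattern.search(p) else p for p in paragraphs)


sample = (
    "Lunch is at noon on Friday.\n\n"
    "Our counsel advised that the contract claim is weak.\n\n"
    "Please thank counsel for the flowers."
)
print(redact_paragraphs(sample, ["counsel"]))
```

The second paragraph (legal advice) is obscured as intended, but so is the third, which merely mentions counsel. That is the over-redaction the rule accepts in exchange for a lower risk of producing privileged content.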

Search terms rarely exhibit both high precision and recall. Typically, recall is favoured over precision. Clients tend to prefer to capture as much privileged content or personal information as possible, though over-redaction is the unavoidable result.

2. Universality

Automatic redaction is effective if the terms always require redaction.

Automatic redaction lacks discretion. The terms provided to the tool will always be redacted, regardless of the context. This is a non-issue if every instance of “Beyoncé”, or every paragraph or sentence containing “Beyoncé”, is to be redacted. However, over-redaction is unavoidable if context matters. For instance, if content relating to Beyoncé’s time with Destiny’s Child was targeted for redaction, there would likely be inadvertent redaction of Lemonade content too. Automatic redaction cannot differentiate between Beyoncé’s work with Destiny’s Child and her solo work, like Lemonade. As such, all “Beyoncé” content would be subject to redaction. Yes, even Homecoming, her Netflix documentary.

Context always matters in identifying privileged content, which means redaction shortcomings are unavoidable with automatic redaction of privileged content. Consider, for example, an employment lawsuit stemming from impropriety at How to Get Away With Murder’s Middleton University (MU). A production may require the redaction of paragraphs referring to “Annalise” or “Keating” to protect potential solicitor-client privilege. Automatic redaction may achieve that end, but benign MU information, as well as some decidedly non-benign (and perhaps relevant) interpersonal drama and crimes, would also be obscured.

3. English-Only Documents

Automatic redaction is effective if the documents consist of English language content.

Many automatic redaction tools are designed with the Latin alphabet in mind. The tools might not be able to identify content composed of characters outside of the Latin alphabet, which is especially true of non-Latin scripts such as those used in Chinese, Japanese and Korean language content. Non-English privileged content or personal information is at high risk of being missed. In fact, even the name of Queen B herself, “Beyoncé”, may have to lose the acute accent when searched.
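For accented Latin characters, one common workaround is to fold accents on both the search term and the text before comparing. The sketch below uses Python's standard `unicodedata` module; the helper names are hypothetical, and it is not a description of how any given redaction tool matches terms. It does nothing for non-Latin scripts.

```python
import unicodedata


def fold_accents(text):
    """Strip combining marks so 'Beyoncé' and 'Beyonce' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def contains_term(text, term):
    """Case- and accent-insensitive substring check."""
    return fold_accents(term).casefold() in fold_accents(text).casefold()


# A plain substring search misses the unaccented spelling.
print("Beyoncé" in "Tickets for Beyonce go on sale")              # False
print(contains_term("Tickets for Beyonce go on sale", "Beyoncé"))  # True
```

Folding improves recall for accented names, but it can also lower precision, since distinct words that differ only by an accent will now match each other.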

4. Text-Only (Machine-Readable) Documents

Automatic redaction is effective if the documents consist exclusively of machine-readable text.

Automatic redaction tools can only identify text for redaction, and not all documents contain text that is readable. Images, graphs, charts and, often, scanned documents cannot be directly targeted for automatic redaction.

Any scanned documents or scanned text present a risk. The quality of each scan will dictate whether the Optical Character Recognition (OCR) was able to sufficiently convert the image into machine-encoded text and, in turn, whether the text can be identified for automatic redaction.

Images, graphs or charts cannot be identified for automatic redaction directly, nor are the current tools able to recognize the existence of images, graphs or charts. Text in graphs or charts will typically be identified, if not part of an image. However, the extent of each automatic redaction will be subject to the same rules as with non-graph or non-chart contexts. Consider a rule that every paragraph containing “litigation” is to be redacted. A chart with a label containing “litigation”, as in ‘Ongoing Litigation’, would see the redaction of the entire label, but not of the chart itself. For the chart to be redacted, a rule would need to stipulate that every page/slide with “litigation” should be fully redacted. Of course, extensive over-redaction would be the result in this example.

Additional Considerations:

  • Document Type: Most automatic redaction tools are optimized for prose-format text, such as emails or word processing documents. Redaction results in other document types are mixed.

For example, automatic redaction in spreadsheets might be limited to individual cells or entire rows; columns or portions of cells cannot always be redacted. A column heading may contain a privileged term, signalling that the whole column contains privileged content. Unfortunately, only the column heading, or the row of column headings, would be redacted.

Even in instances where row redaction is sufficient, the column totals or sums would not be identified for automatic redaction. As such, the value of the redacted cell(s) could still be deciphered.
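The arithmetic behind that leak is simple. Here is a small worked example in Python with invented figures: one cell in a column is redacted, but the column total and the other cells remain visible.

```python
def recover_redacted_value(column_total, visible_values):
    """Infer a single redacted cell from the visible total and other cells."""
    return column_total - sum(visible_values)


# Hypothetical column: four amounts, one of which was redacted.
visible_amounts = [1200, 850, 430]   # cells left unredacted
column_total = 7480                  # total row left unredacted

print(recover_redacted_value(column_total, visible_amounts))  # 5000
```

With a single redacted cell the value is recovered exactly; with several, the total still reveals their combined sum. Redacting the total (or any formula cell that depends on the redacted value) closes the gap.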

  • Document Formatting: Whenever the automatic redaction level is expanded from the term to the line/row/sentence or paragraph or page, document formatting is meaningful. Automatic redaction tools use the document formatting to identify the start and end of pages, paragraphs and lines. The formatting can easily result in an automatic redaction extending too far (over many pages or the entire document) or over too little content (only a partial redaction of a paragraph).
  • Single Criterion Limit: Automatic redaction rules are often limited to a single criterion. Returning to the earlier example, the tools would be unable to limit paragraph redactions to those containing both “Destiny’s Child” AND “Beyoncé”. Rather, the single-criterion limit means that every paragraph containing either “Destiny’s Child” or “Beyoncé” would be automatically redacted, resulting, again, in over-redaction.
  • Missed Terms: Automatic redaction tools often rely upon their own OCR to identify terms and not the text extraction technology of review platforms. This OCR process might be less effective. As such, the automatic redaction tools may fail to identify every instance of every term. In other words, the automatic redaction tool may report fewer “Beyoncé” hits than were identified by your review platform (thus presenting a risk of under-redaction).
  • Labelling Limitations: Clients often request specific labelling to be applied to each redaction, such as “Redacted — Privileged” or “Redacted — Confidential” or “Redacted — S/C Privileged”, to identify the rationale for the redaction. Such labelling may need to be modified to be compatible with automatic redaction tools.

Consider an automatic redaction rule whereby any paragraph containing “privileged” is to be redacted and, in turn, each redaction is to be labelled as “Redacted — Privileged”. This rule would result in an endless loop: the tool would identify each label as requiring redaction, then redact, then once again identify the label as requiring redaction, then redact and so on. As such, the redaction label would need to be eliminated or changed, perhaps to “Redacted - Priv”, to avoid this outcome.
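The loop can be simulated in a few lines. This is a sketch under stated assumptions: a hypothetical tool that re-scans its own output on every pass, with a cap on passes so the demonstration terminates. Plain hyphens stand in for the dashes in the labels.

```python
import re


def redact_pass(paragraphs, term, label):
    """One pass: replace each paragraph containing the term with the label."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    out, changed = [], 0
    for p in paragraphs:
        if pattern.search(p):
            out.append(label)
            changed += 1
        else:
            out.append(p)
    return out, changed


def passes_until_stable(paragraphs, term, label, max_passes=5):
    """Return the pass number on which nothing was redacted, or None if capped."""
    for n in range(1, max_passes + 1):
        paragraphs, changed = redact_pass(paragraphs, term, label)
        if changed == 0:
            return n
    return None


docs = ["This memo is privileged.", "See you Monday."]

# The label itself contains the term, so every pass finds a new hit.
print(passes_until_stable(docs, "privileged", "Redacted - Privileged"))  # None

# The shortened label no longer matches, so the second pass is clean.
print(passes_until_stable(docs, "privileged", "Redacted - Priv"))        # 2
```

Whether a real tool re-scans its own labels depends on the product; the point is only that a label should never contain a term that triggers the rule applying it.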

To recap, automatic redaction is a highly effective tool when (1) high precision and high recall terms require (2) universal redaction and the documents (3) only contain English-language content and (4) were never scanned and contain no images, charts or graphs. The risk of over-redaction and/or under-redaction increases as conditions diverge from this ideal. This does not mean that automatic redaction tools cannot be used in non-ideal circumstances. Rather, the cost savings upfront could offset the reduced effectiveness and the risk of inadvertent disclosure and/or over-redaction complaints. And, importantly, many of these shortcomings can be mitigated through quality control measures. 

Automatic redaction tools hold great promise for cost savings and are effective in many circumstances. But the limitations should not be downplayed (be wary!). Automatic redaction should be used to redact the Beyoncés in your haystack, but careful consideration is needed before relying upon such tools to identify and obscure the privileged content of a Kim Wexler or an Annalise Keating. To avoid future costs and headaches, confront your tolerance for over- or under-redaction (or that of your client), and the risk entailed, at the outset.


Ian Sinclair specializes in eDiscovery and has directed all stages of the discovery process, from initiating legal holds and conducting document collections to presiding over the legal review process. He is dedicated to leveraging technology to improve workflows and diminish costs, and faithfully recommends new software functionality to his clients.


