Dust off your work boots! We’re going berry picking.
Imagine that we have arrived at a blueberry field along the Northumberland Strait in the Maritimes on a comfortable August morning to pick berries. Our objective is to collect every blueberry in the field, as economically as possible. Unfortunately, the blueberries must be picked by hand. No blueberry rakes are available. Thankfully, a winnowing machine, which uses vibration and fans to separate blueberries from twigs and leaves, is at our disposal.
Allow me to present an analogy I formulated to describe the interplay of information governance (IG), document collection and data reduction strategies. In this analogy, the blueberry field represents all data in a party’s possession. The blueberries represent potentially or likely relevant documents, while the twigs and leaves (and anything else collected) represent documents that lack relevant content. And our collective document reduction strategies — from processing and Early Case Assessment to pre-review analysis and managed review — serve as the winnowing machine.
To imitate common difficulties in collection and the consequences that result, I am introducing two impediments to our berry picking: blindfolds and oven mitts. These impediments reflect deficiencies in IG and collection capabilities. More specifically, the blindfold depicts data uncertainty — an unawareness of the ‘what’ and ‘where’ with respect to a party’s data. The oven mitts reflect poor collection capabilities that prevent the identification and isolation of potentially relevant documents without crudely capturing all proximate data.
|Analogy element|eDiscovery counterpart|
|---|---|
|Blueberry field|An organization’s data|
|Blueberries|Potentially, or likely, relevant documents|
|Twigs and leaves|Documents that lack relevant content|
|Winnowing machine|Data reduction strategies|
|Blindfold|Ineffective Information Governance (IG)|
|Oven mitts|An inability to identify potentially relevant documents|
To revisit our objective with greater specificity: we are to return home with baskets containing every reasonably accessible blueberry in the field (high recall) and reasonably few twigs and leaves (high precision). In this analogy, recall is indifferent to the prevalence of twigs and leaves in the baskets, while precision is indifferent to whether berries remain in the field. Many berries left in the field signify low recall; many twigs and leaves in the baskets signify low precision.
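For readers who prefer arithmetic to agriculture, the two measures can be sketched in a few lines of Python. The function names and the basket counts are illustrative assumptions, not figures from any real collection; the formulas are the standard definitions of recall and precision.

```python
def recall(berries_in_basket: int, berries_left_in_field: int) -> float:
    """Share of all berries in the field that made it into the basket."""
    total_berries = berries_in_basket + berries_left_in_field
    return berries_in_basket / total_berries if total_berries else 0.0


def precision(berries_in_basket: int, bycatch_in_basket: int) -> float:
    """Share of the basket's contents that are actually berries."""
    basket_total = berries_in_basket + bycatch_in_basket
    return berries_in_basket / basket_total if basket_total else 0.0


# Hypothetical outing: 800 berries picked, 200 left behind, 200 twigs and leaves.
print(recall(800, 200))     # berries picked / all berries in the field
print(precision(800, 200))  # berries picked / everything in the basket
```

Note that `recall` never looks at the twigs and leaves, and `precision` never looks at the berries left in the field, which is the indifference described above.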
1. Oven Mitts and Blindfolds (Low Precision, Low Recall)
Both impediments are present in this initial scenario. We cannot easily locate the blueberry bushes due to the blindfolds (low recall) and, when we do, it is difficult to isolate the blueberries with oven mitts covering our hands. As such, an abundance of twigs and leaves (as well as bees, lilac, goldenrod and even blueberry maggot flies) accompanies the berries into the baskets (low precision). While the winnowing machine will address the twigs and leaves, and most of the other bycatch, the blindfolds prevent us from determining whether blueberries remain in the field (low recall). We must decide, without validation, whether our baskets of blueberries are sufficient or whether we should clumsily search further.
The eDiscovery analogues:
- Such a collection would be due to the absence of IG policies and a haphazard and/or limited collection, largely driven by uncertainty as to what data exists and where data is stored.
- In practice, data uncertainty is rarely as severe as that caused by the blindfold in the scenario above. However, some degree of impairment is common without robust IG (and, even then, blind spots often persist due to new communication channels or novel cloud-based storage options).
- Like the winnowing machine, data reduction strategies and the review process itself will address collected rubbish. However, the collection of excess data (that which is unlikely to yield relevant content) incurs avoidable cost: hosting the additional volume, plus the extra time and effort needed to winnow down the collection.
- Incomplete or flawed collections are costly if the entire process, from collection to production, must be repeated due to production deficiencies.
2. Blindfolds, No Oven Mitts (Low Recall, High Precision)
In the second scenario, we proceed blindfolded into the blueberry field. Without oven mitts we have the full benefit of our tactile sense and finger dexterity. Few twigs and leaves, or other bycatch, will accompany the berries into the baskets (high precision). In fact, the winnowing machine might be of little benefit due to the care exercised in picking the berries. Unfortunately, the blueberry bushes are still difficult to locate. And, as with the first scenario, we are uncertain as to the volume of blueberries remaining in the field (low recall).
The eDiscovery analogues:
- This type of collection would reflect a targeted or curated collection consisting of select folders of key custodians and/or relying upon highly specific search terms.
- Most of the collected documents will have relevant content (high precision). However, these documents may represent a fraction of the relevant content throughout an organization (low recall).
- Although this could be a problematic collection due to low recall, its defensibility might not be an issue in cases with modest damages or if the collection process was agreed upon in a discovery plan.
- A rich collection (high relevance rate) may reduce the value of pre-review data reduction efforts and of review-stage technologies, like Active Learning.
3. Oven Mitts, No Blindfolds (Low Precision, High Recall)
We proceed to pick blueberries with only oven mitts in the third scenario. The blueberries are easily located without blindfolds and we can confidently determine when every blueberry, or substantially every blueberry, has been captured (high recall). As in the first scenario, many twigs and leaves are in the baskets due to the oven mitts (low precision), but the bycatch is far smaller in relative terms (its absolute volume might be greater, but its share of the baskets would assuredly be smaller) and the more egregious chaff, like lilac and goldenrod, is absent. Once again, the winnowing machine will be relied upon to remove the twigs and leaves.
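The distinction between absolute and relative bycatch is easier to see with numbers. The sketch below uses hypothetical basket counts, invented purely to illustrate the point; `bycatch_share` is an illustrative helper, not an industry metric.

```python
def bycatch_share(berries: int, bycatch: int) -> float:
    """Fraction of the basket that is not blueberries (1 minus precision)."""
    total = berries + bycatch
    return bycatch / total if total else 0.0


# Hypothetical basket counts, chosen only to illustrate the comparison.
scenario_1 = {"berries": 300, "bycatch": 700}   # blindfolds and oven mitts
scenario_3 = {"berries": 950, "bycatch": 1050}  # oven mitts only

share_1 = bycatch_share(**scenario_1)  # 700 of 1,000 items
share_3 = bycatch_share(**scenario_3)  # 1,050 of 2,000 items

# More bycatch in absolute terms in the third scenario...
print(scenario_3["bycatch"] > scenario_1["bycatch"])
# ...yet it makes up a smaller share of what was collected.
print(share_3 < share_1)
```

Under these assumed counts, the third scenario hauls in more twigs and leaves overall, but they form a smaller proportion of the baskets because far more berries were captured.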
The eDiscovery analogues:
- A comparable document collection would see most of the likely relevant content captured, though a sizable volume of non-relevant content would also need to be collected to achieve that high recall.
- A high recall outcome is often favoured over a more limited or targeted collection to lessen the risk of missing relevant content despite higher data volumes. In other words, risk reduction is often favoured over cost minimization.
- The low precision would be addressed through data reduction strategies and the review process itself.
4. No Blindfolds, No Oven Mitts (High Recall and High/Lower Precision)
Lastly, we pick blueberries with the full use of our hands and eyes. We can elect to be meticulous in our picking of the blueberries, ensuring few twigs and leaves are captured (high precision). Or, to be more productive and still maintain a high recall, we can grasp the blueberries more rapidly and accept that some twigs and leaves will be seized along with them (lower precision than meticulous picking, but higher than with oven mitts). The winnowing machine will be relied upon to separate the berries from the twigs and leaves. And we can confidently cease picking when we are satisfied that every reasonably accessible blueberry has been captured (high recall). Overall, we have the luxury of employing whatever strategy we deem most effective.
The eDiscovery analogues:
- Robust IG allows the determination of the most appropriate collection strategy, on a case-by-case basis. A collection could be targeted (high precision) or more expansive (higher recall and lower, though not low, precision, with reliance upon data reduction and review to increase precision) as is appropriate.
- Collections are often more expansive if undertaken because of a legal dispute or regulatory request. Legal risk is reduced by allowing counsel to make potential relevance determinations, instead of the IT department.
- Precision issues, if any, are once again addressed by data reduction strategies and the review process.
Our objective was to return home with every blueberry in the field (high recall) and to limit the quantity of twigs and leaves (high precision). The goal was better achieved in each subsequent scenario, with the fourth scenario, in which we can deliberately devise a collection strategy, being optimal. The next best scenario, and perhaps the more representative one, is the third: no blindfold (or only partial impairment), but oven mitts.
Ian Sinclair specializes in eDiscovery and has directed all stages of the discovery process, from initiating legal holds and conducting document collections to presiding over the legal review process. He is dedicated to leveraging technology to improve workflows and diminish costs, and faithfully recommends new software functionality to his clients.