Near-Duplicate Detection


Statistics have shown that a typical collection of documents contains between 25% and 50% near-duplicates: similar documents that differ in formatting, text, or both. These are distinct from duplicates, which are exact copies. Near-duplicates include files with:

  • Textual differences affecting some percentage of the content (by far the most common)

  • Variances in formatting (such as bold or italicized fonts)

  • Different file types (such as an MS Word file converted to PDF)
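To make the idea of "a percentage of textual differences" concrete, here is a minimal sketch of one common way to score how similar two documents are: word shingling with Jaccard similarity. This is an illustration only, not a description of how the Ricoh solution works; the function names and the shingle size of three words are assumptions chosen for the example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in the text.

    Lowercasing and keeping only letters and digits means case and
    punctuation differences do not affect the score.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(text_a, text_b, k=3):
    """Score two texts from 0.0 (nothing shared) to 1.0 (same shingles)."""
    return jaccard(shingles(text_a, k), shingles(text_b, k))
```

Two documents that differ only in case or punctuation score 1.0 under this measure, while a single changed word lowers the score in proportion to the number of shingles it touches. Because the comparison runs on extracted text, the same file saved as Word and as PDF would score as near-identical once both are converted to text.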

The return on investment for near-duplicate detection is compelling. Industry studies have shown that the cost of legal document review has a significant impact on the overall cost of litigation. When near-duplicate documents are not identified and grouped together, there is a significant risk that similar documents (paper or electronic) will be reviewed multiple times by different lawyers, resulting in wasted time, extra cost, and subjective coding inconsistencies. Near-duplicate detection costs pennies per document and allows lawyers to review similar documents in groups, dramatically increasing the speed of document review while lowering the associated costs by 25% or more.

The Ricoh eDiscovery near-duplicate solution, which identifies both duplicate and near-duplicate documents, works with electronic data, scanned/OCR'd collections, or a combination of both to identify and group documents before a full document review. The results are then output in a format suitable for many standard litigation support applications, such as AD Summation, FTI Ringtail, iConect and Concordance.
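As a rough illustration of what identifying and grouping duplicates and near-duplicates before review can look like, the sketch below groups documents in two passes: exact duplicates by hash of the normalized text, then near-duplicates by a pairwise similarity threshold. It is a simplified example under stated assumptions, not the Ricoh implementation: the function name `group_documents`, the 0.8 threshold, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all choices made for this example. A pairwise comparison like this is only practical for small collections.

```python
import hashlib
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def group_documents(docs, threshold=0.8):
    """Group document ids whose text is identical or nearly identical.

    docs: dict mapping a document id to its extracted text.
    threshold: assumed cut-off; pairs scoring at or above it are grouped.
    Returns a list of groups, each a sorted list of ids.
    """
    ids = list(docs)
    norm = {i: normalize(docs[i]) for i in ids}
    parent = {i: i for i in ids}  # union-find forest over document ids

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(y)] = find(x)

    # Pass 1: exact duplicates share the same hash of the normalized text.
    seen = {}
    for i in ids:
        digest = hashlib.sha256(norm[i].encode("utf-8")).hexdigest()
        if digest in seen:
            union(seen[digest], i)
        else:
            seen[digest] = i

    # Pass 2: near-duplicates are pairs whose similarity meets the threshold.
    for pos, a in enumerate(ids):
        for b in ids[pos + 1:]:
            if find(a) == find(b):
                continue  # already in the same group
            if SequenceMatcher(None, norm[a], norm[b]).ratio() >= threshold:
                union(a, b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return [sorted(g) for g in groups.values()]
```

Each resulting group could then be assigned to a single reviewer, so that similar documents are coded together and consistently.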

 
