Ricoh eDiscovery

Load Files: Encoding Basics and Key Delimiters

Posted by Michael Truelove | 5 minute read

Jun 23, 2020 1:06:42 PM


Load files are specially formatted files that contain links to native documents, images and the OCR/full text of a document. As the name suggests, they are used to "load" documents processed in an eDiscovery tool so the data ends up in a reviewable format.

Just like all computer files, load files are made up entirely of ones and zeros. So how do we go from just two characters to the entirety of human language, plus all the symbols that are used? Today I'll be sharing the different types of encoding, key delimiters and common issues related to delimited files. Understanding how load files are formatted will help you edit them and select a preferred file type.

Types of Encoding

There are two main groups of encoding. The main difference between the two is in the way they encode characters and the number of bits that they use for each.

ASCII (American Standard Code for Information Interchange):

  • The most common format for text files in computers and on the internet.
  • Standard ASCII is a seven-bit code, but each character is stored in eight digits. A single one or zero is a "bit" and a "byte" is made up of eight bits.
  • In standard ASCII the first bit of the byte is always zero, but there are extended ASCII encodings that use all eight bits.

ASCII worked well for a long time, but eventually there was a need to extend the encoding to include non-English languages and CJK (Chinese, Japanese, Korean) characters.

Here is a breakdown of the ASCII codes:

[Image: Excel snapshot of the ASCII code table]
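
To make the bit pattern concrete, here is a minimal Python sketch that prints the hex and binary form of a few ASCII characters. The helper name ascii_bits is made up for this illustration, and the sample characters are arbitrary.

```python
def ascii_bits(ch):
    """Return the eight-digit binary form of a single standard ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not a standard ASCII character")
    return format(code, "08b")


# Print the character, its hex code and its eight binary digits.
for ch in ["A", "a", "0", "\t"]:
    print(repr(ch), format(ord(ch), "02X"), ascii_bits(ch))

# Collect the first bit of every standard ASCII code (0 through 127).
leading_bits = {ascii_bits(chr(code))[0] for code in range(128)}
print(leading_bits)
```

The set of leading bits contains only "0", which is the unused eighth bit that extended ASCII and UTF-8 put to work.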

UTF-8 (8-Bit Unicode Transformation Format):

  • Created to be able to use non-English language and CJK characters.
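  • Uses variable-length encoding: a character takes between one and four bytes.
  • Where standard ASCII uses only seven bits, UTF-8 uses the eighth bit to signal that a character takes more than one byte. The leading bits of the first byte say how many bytes the character uses (two, three or four), and every following byte starts with the bits 10. A character can therefore use 8, 16, 24 or 32 bits.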

UTF-8 has two major advantages. First, if only ASCII characters are used, then there is no difference between UTF-8 and ASCII. Second, because of the variable bit length of characters, it allows the character set to be massive and cover pretty much any character one could want.
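
As a rough illustration of both advantages, the Python sketch below encodes three sample characters in UTF-8 and prints their bytes. The helper name utf8_bits and the choice of characters are assumptions made for this example. In the binary output, a multi-byte character starts with a byte beginning 110, 1110 or 11110, and each following byte begins with 10.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of one character as space-separated binary."""
    return " ".join(format(b, "08b") for b in ch.encode("utf-8"))


# An English letter, the thorn character, and a CJK character.
for ch in ["A", "\u00fe", "\u4e2d"]:
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(format(b, "02X") for b in encoded)
    print(repr(ch), len(encoded), "byte(s):", hex_bytes, "->", utf8_bits(ch))

# For plain ASCII text the two encodings give identical bytes.
same_bytes = "Load file".encode("utf-8") == "Load file".encode("ascii")
print(same_bytes)
```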

Delimited Files

Now that we know how the ones and zeros can represent characters, how do we get them to represent different fields and documents in a load file? This is done by creating a "delimited" file. There are three key delimiters used in these files:

  • Document Delimiter: Tells us we're moving from one document to the next document. This is typically a line break, which is both a Carriage Return and a Line Feed (0D 0A in Hex or 0000 1101 0000 1010 in Binary). Most applications don't allow this delimiter to be changed.
  • Field Delimiter: Used to say that we're moving from one field to another.
  • Quote Delimiter: Required because the Field Delimiter might be used inside the data within the field. For example, a CSV is a form of a delimited file, with a comma as the Field Delimiter and a quote as the Quote Delimiter. Because it's not unusual to have a comma in the data within a field, the quote is necessary to tell the application the comma inside the quote isn't a field delimiter. However, it's not unusual to see quotes in the text in a field of a CSV.
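
Here is a small sketch of the three delimiters at work, using Python's built-in csv module on a made-up two-document CSV. The field values (DOC001, "Smith, John" and so on) are hypothetical sample data.

```python
import csv
import io

# Two documents separated by CR+LF, three fields each.
# The second field of each document contains a comma inside the quotes.
data = 'DOC001,"Smith, John",Email\r\nDOC002,"Doe, Jane",Memo\r\n'

# newline="" keeps the CR+LF line breaks intact for the csv reader.
rows = list(csv.reader(io.StringIO(data, newline="")))

for row in rows:
    print(len(row), row)
```

Each document comes back with three fields because the commas inside the quotes are treated as text, not as Field Delimiters.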

Other possible delimiters

Almost any character can be used as a delimiter, but there are a few other common formats:

  • Tab delimited files which use a tab (09 in Hex or 0000 1001 in Binary).
  • Pipe/caret delimited files which use a pipe (| or 7C in Hex, or 0111 1100 in Binary) and a caret (^, or 5E in Hex, or 0101 1110 in Binary).
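
The hex and binary values above can be checked with a few lines of Python, and the same csv module can read a pipe/caret line. This sketch assumes the pipe is the Field Delimiter and the caret is the Quote Delimiter; the sample line is made up.

```python
import csv
import io

# Print the hex and binary codes of the three delimiter characters.
for name, ch in [("tab", "\t"), ("pipe", "|"), ("caret", "^")]:
    print(name, format(ord(ch), "02X"), format(ord(ch), "08b"))

# One hypothetical document: pipe between fields, caret around each field.
line = "^DOC001^|^Smith, John^|^Email^\r\n"
reader = csv.reader(io.StringIO(line, newline=""), delimiter="|", quotechar="^")
fields = next(reader)
print(fields)
```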

The standard in litigation load files has become the "Concordance" DAT file (named for Concordance, the litigation support software that first used it as a standard). The characters it uses as delimiters are highly unlikely to exist within the text of a field. In fact, the Field Delimiter isn't even a printable character (in Hex it's 14 and in Binary it's 0001 0100). The Quote Delimiter is the character þ, which doesn't exist in the standard ASCII set but does exist in Extended ASCII, which uses all eight bits of the byte. Because of this, the Quote Delimiter is encoded differently in Extended ASCII than it is in UTF-8. In Extended ASCII, it's FE in Hex or 1111 1110 in Binary, whereas in UTF-8, it's C3 BE in Hex or 1100 0011 1011 1110 in Binary.
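
The sketch below builds one DAT-style line in Python and shows the two encodings of the Quote Delimiter. It assumes Latin-1 as the Extended ASCII encoding, and the function name parse_dat_line and the sample values are hypothetical. It is a simplified reader that splits on the Field Delimiter and does not handle line breaks inside a field.

```python
FIELD = "\x14"    # Field Delimiter, hex 14 (not printable)
QUOTE = "\u00fe"  # Quote Delimiter, the thorn character

# The same character encodes to one byte in Latin-1 and two bytes in UTF-8.
print(QUOTE.encode("latin-1").hex().upper())
print(QUOTE.encode("utf-8").hex().upper())


def parse_dat_line(line):
    """Split one DAT line into fields and drop the surrounding quote characters."""
    fields = []
    for raw in line.rstrip("\r\n").split(FIELD):
        if len(raw) >= 2 and raw.startswith(QUOTE) and raw.endswith(QUOTE):
            raw = raw[1:-1]
        fields.append(raw)
    return fields


# Hypothetical field values containing both a comma and double quotes.
values = ["DOC001", 'Smith, John said "ok"', "Email"]
record = FIELD.join(QUOTE + value + QUOTE for value in values) + "\r\n"
print(parse_dat_line(record))
```

Commas and ordinary quotes in the text cause no trouble here because neither one is a delimiter in this format.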

Issues when Field and Quote Delimiters are in the field text

So, how do programs read a delimited file if there are quotes and commas in the field data? Applications start reading from the beginning of the file and will first come upon a Quote Delimiter. This tells the application to ignore any Field Delimiters until after the next Quote Delimiter. Anything that's not another Quote Delimiter will be treated as text in that field. When it hits the next Quote Delimiter, the application knows it's the end of the text for that field and will ignore any further characters until it hits a Field Delimiter. Here are some examples in CSV format:

  • "Jenny said "That's not okay"","Parent",""
    • The system will only save "Jenny said " in the field, then ignore the rest of the text until it hits the next comma, where it would then put "Parent" in the next field.

If quotes and commas exist in a field:

  • "Jenny said "That's not okay, you shouldn't do that"","Parent",""
    • The system will save "Jenny said " in the first field,
    • " you shouldn't do that" in the second field,
    • "Parent" in the third field,
    • and a fourth field will be created but left empty.
      • Some programs will notice the change in field count in the line and give a warning or error.
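
The reading rule described above can be sketched as a toy parser in Python. This is not how any particular review tool or CSV library is implemented; the function name naive_split is made up, and it assumes that text outside quotes is kept until a Quote or Field Delimiter appears.

```python
def naive_split(line, field_delim=",", quote_delim='"'):
    """Toy parser that follows the reading rule described in this post."""
    fields = []
    current = []
    state = "outside"  # one of: outside, quoted, skip
    for ch in line:
        if state == "skip":
            # After a closing quote, ignore everything up to the Field Delimiter.
            if ch == field_delim:
                fields.append("".join(current))
                current = []
                state = "outside"
        elif state == "quoted":
            if ch == quote_delim:
                state = "skip"      # closing quote ends the field text
            else:
                current.append(ch)  # Field Delimiters in quotes are plain text
        elif ch == quote_delim:
            state = "quoted"        # opening quote
        elif ch == field_delim:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


first = naive_split('"Jenny said "That\'s not okay"","Parent",""')
second = naive_split('"Jenny said "That\'s not okay, you shouldn\'t do that"","Parent",""')
print(len(first), first)
print(len(second), second)
```

Under these assumptions the first line produces three fields and the second produces four, so the same document layout yields a different field count as soon as a comma appears after a stray quote.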

Through this example we can see why CSVs are not an ideal load file format: commas and quotes are used in the text of fields all the time and can cause issues when trying to load data. DAT load files limit these errors because their delimiters rarely appear in field text, and are therefore recommended and preferred.

Subscribe to our blog for future tips on how to manually change encoding of a load file and some tricks for editing them. If you have any questions about Delimited File Formats, get in touch with us today.

