Identification Woes

Identification of relevant information is obviously key to prosecuting or defending a case, but in today’s world the volume of data being generated even by small organizations makes that process a difficult one. Worse yet, the volume itself has become a metric many use to determine the quality of the methodology used for culling the dataset.

Why shouldn’t volume be used to determine whether a specific culling method is working properly?

Simply put, volume is only one part of the puzzle. Although volume does equate to dollars moving through the review process, it provides no insight into what is moving through that process. Without the “what,” the parties don’t know whether a search term returned 10,000 key documents or 10,000 Yahoo spam ads. Without that detail, the parties lack a reasonable basis for modifying the term and have little sense of how to modify it to get the best results.

So how do we get from too much data to a reasonable and proportional set of likely relevant documents?

There are many ways. Tools and people can be deployed in various combinations to cull a particular dataset. In every situation, the parties deciding on a methodology need to consider timing, volume, budget, the type(s) of data and the posture of the matter. Not every solution needs to be complex or expensive.

One simple solution when using keywords (as in the example above) is to add a sampling protocol for testing search terms to the process.

Coupling a sampling protocol with standard keywords can give you great insight into the data you are working with, without much overhead in cost or time. Pulling a random set of the documents that hit a specific term familiarizes you with the data and with how that term is actually interacting with it. Understanding your client’s data, and why a term is bringing back particular documents, is key to understanding how, or whether, to revise that term.
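For the technically inclined, here is a minimal sketch of what that kind of sampling might look like, assuming the documents hitting a term have already been exported to a CSV with a document ID and a text snippet (the file name, column names and sample size are all illustrative, not features of any particular review platform):

```python
import csv
import random

def sample_hits(hits_csv_path, sample_size=50, seed=42):
    """Pull a random sample of the documents that hit a given search term.

    Assumes a CSV export with one row per hit, containing at least a
    document ID column ("doc_id") and a text snippet column ("snippet");
    both names are placeholders for whatever the review platform exports.
    """
    with open(hits_csv_path, newline="", encoding="utf-8") as f:
        hits = list(csv.DictReader(f))

    # Fixed seed so the same sample can be re-pulled if its adequacy is later questioned.
    rng = random.Random(seed)
    sample = rng.sample(hits, min(sample_size, len(hits)))

    # Print a short preview of each sampled document for a quick first-pass review.
    for doc in sample:
        print(doc["doc_id"], "-", doc["snippet"][:120])

    return sample

# Hypothetical usage: review a random sample of hits for the term "ACME-X"
# sample_hits("acme_x_hits.csv")
```

Fixing the random seed is a deliberate choice: if the adequacy of the sample is ever disputed, the exact same set of documents can be reproduced and reviewed again.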

For example, we had a client involved in patent litigation over a particular chipset. As is common in the industry, the client had used a codename for the chipset during design, and that codename was suggested as a search term. On its face it seemed like a good term; in practice, however, sampling showed that the codename was also the street name of one of the client’s offices. As a result, every single email with a signature block from that office was returned by the search.

These kinds of issues are not uncommon when terms are drafted in a vacuum (i.e., without access to the documents). It is also not uncommon for parties to agree to search terms before they know what will happen when those terms are applied to the client’s data. This can put a client at a significant disadvantage, forced to renegotiate terms it has already agreed to. Worse, it can push a greater volume of data, much of it irrelevant, into the broader process.

Sampling the data identifies these issues quickly.

Looking only at volume, without sampling, gives the parties little basis to modify the term to capture what they intended. Running the term and seeing the hit count may indicate a problem, but sampling a few documents quickly shows the reviewer what the problem is and how to modify the term to pull in relevant documents. It also provides a reasoned basis should a dispute arise over the modifications made to a particular term.
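To put rough numbers on that reasoned basis, a back-of-the-envelope estimate of how many of a term’s hits are actually relevant can be computed from the sample alone. The sketch below assumes the sample was drawn at random from the term’s hits and uses a simple normal-approximation confidence interval; the figures in the example are hypothetical:

```python
import math

def estimate_precision(relevant_in_sample, sample_size, z=1.96):
    """Estimate the fraction of a term's hits that are actually relevant,
    with a rough 95% normal-approximation confidence interval.

    Assumes the sample was drawn at random from the term's hits, which is
    what the sampling protocol described above is meant to guarantee.
    """
    p = relevant_in_sample / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical example: 3 relevant documents out of a 50-document sample
p, low, high = estimate_precision(3, 50)
print(f"Estimated precision: {p:.0%} (roughly {low:.0%} to {high:.0%})")
# A figure like this, documented alongside the sampled documents, gives the
# parties a concrete, reviewable reason to revise (or keep) the term.
```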

In the end, whether the process is simple or not doesn’t matter; what matters is that the client’s data is understood well enough to articulate a reasonable basis for relying on the process chosen. The same is true for any culling methodology that relies on date ranges, file types, file sources or machine learning.