Identification Woes

Identification of relevant information is obviously key to prosecuting or defending a case, but in today’s world the volume of data being generated even by small organizations makes that process a difficult one. Worse yet, the volume itself has become a metric many use to determine the quality of the methodology used for culling the dataset.

Why shouldn’t volume be used to determine whether a specific culling method is working properly?

Simply put, volume is only one piece of the puzzle. Although volume does equate to dollars moving through the review process, it provides no insight into what is moving through the process. Without the “what,” the parties don’t know whether a search term brought back 10,000 key documents or 10,000 Yahoo spam ads. And without that key detail, the parties likely lack a reasonable basis for modifying the term and probably don’t know how to modify it to get the best results.

So how do we get from too much data to a reasonable and proportional set of likely relevant documents?

There are many ways. Both tools and people can be deployed in various ways to cull a particular dataset. Whatever the methodology, the parties involved need to consider timing, volume, budget, the type(s) of data, and the posture of the matter. Not every solution needs to be complex or expensive.

One simple solution when using keywords (as in the example above) is to add a sampling protocol to the process for testing search terms.

Coupling a sampling protocol with standard keywords can give you great insight into the data you are working with, without much overhead in cost or time. Reviewing a random sample of the documents that hit a specific term will familiarize you with the data and with how that term interacts with it. Understanding your client’s data, and why a term is bringing back particular documents, is key to understanding how (or whether) to revise that term.
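
As a rough illustration, a sampling protocol can be as simple as drawing a reproducible random sample from the documents that hit a term. The sketch below assumes nothing beyond Python’s standard library; the document IDs and sample size are purely hypothetical, and a real workflow would pull hit lists from the review platform:

```python
import random

def sample_hits(doc_ids, sample_size=25, seed=42):
    """Draw a reproducible random sample of the documents hitting a term.

    A fixed seed lets the same sample be re-created later if questions
    arise about how the term was tested.
    """
    rng = random.Random(seed)
    if len(doc_ids) <= sample_size:
        return list(doc_ids)  # small hit counts: just review them all
    return rng.sample(list(doc_ids), sample_size)

# Hypothetical hit list: 10,000 documents returned by a single search term.
hits = [f"DOC-{n:05d}" for n in range(10_000)]
sample = sample_hits(hits)
print(len(sample))  # 25 documents to eyeball before negotiating the term
```

Fixing the random seed means the same sample can be re-drawn later, which helps if the parties ever dispute how a term was tested.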

For example, we had a client involved in patent litigation over a particular chipset. As is common in the industry, the client had used a codename for the chipset during design, and that codename was suggested as a search term. On its face it seemed like a good term; in practice, however, sampling showed the codename was also the street name of one of the client’s offices. As a result, every email with a signature block from that office was returned by the search.

These kinds of issues are not uncommon when terms are drafted in a vacuum (i.e., without access to the documents). It is also not uncommon for parties to agree to search terms before they know what will happen when those terms are applied to the client’s data. This can put a client at a significant disadvantage, forcing renegotiation of terms that have already been agreed to. Worse, it can push a greater volume of data (and particularly irrelevant data) into the broader process.

By sampling the data, these issues can be quickly identified.

Looking at only the volume, without sampling, there’s little basis to properly modify the term to get to what the parties intended. Running the term and seeing the volume may have indicated a problem, but sampling a few documents can quickly educate the reviewer as to what the problem is and how to modify the term to pull in relevant documents. It also provides a reasoned basis should a dispute arise over the modifications made to a particular term.

In the end, whether the process is simple or complex doesn’t matter; what matters is that the client’s data is understood well enough to articulate a reasonable basis for relying on the process chosen. The same is true for any culling methodology utilizing date ranges, file types, file sources, or machine learning.


The word “inherited” brings to mind the classic rich uncle whose lawyer shows up after his death to inform you of the riches you will soon enjoy. Unfortunately, inherited data doesn’t always feel like a gift. Instead, inheriting a case and the associated files (including review database(s)) can be more like cleaning up the red-tagged house of your hoarder uncle after his passing.

Although it can be difficult, there are strategies to make this process easier.

Inheriting a case and all the associated files is never the same experience twice. There are many variables, so the approach to unwinding the morass of information should be flexible. Still, there are a number of points to keep in mind that will help you.

First and foremost, check the dates
Although somewhat unrelated to understanding the information itself, knowing the key dates puts the project in context and controls which aspects of the ingestion need to happen first.

Take an inventory
An inventory is a simple tool that can show where holes exist in your documentation. For example, if you have the first and third sets of responses to interrogatories it makes sense that you should have the second set too. Don’t forget to check the docket to compare against your inventory.
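
As a trivial illustration of the kind of gap-checking an inventory enables, the sketch below (plain Python, with hypothetical set numbers) flags missing items in a numbered sequence such as sets of interrogatory responses:

```python
def find_gaps(set_numbers):
    """Given the set numbers you *do* hold (e.g., responses to
    interrogatories), report any missing sets in the sequence."""
    have = set(set_numbers)
    if not have:
        return []
    return [n for n in range(1, max(have) + 1) if n not in have]

# You hold the first and third sets of interrogatory responses:
print(find_gaps([1, 3]))  # [2] -> request the second set from former counsel
```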

Also, check the documents
Confirm that you have unredacted versions (to the extent appropriate).

Understand the document collection and the database
Turning to the review database, it is important that you understand not only what you have, but how you got there.

  1. How was data identified for collection? Are there collection logs?
  2. How was it loaded (e.g., was global deduplication used; were there date cuts)?
  3. How is the database organized?
  4. Where was information captured (i.e., which fields were used for what)?
  5. What review methodologies were employed?
  6. How were productions tracked? Are there production logs?
  7. How were clawbacks tracked?
  8. What tools were used to process data, and how will that affect new data being loaded into the system going forward?


Understanding the document collection and the database is the key to having full use of the information in the database. If possible, walk through the information above with former counsel and, if applicable, the service provider to understand the database and what documentation exists.

Ideally, include your new support partner
A good support partner can help you through this process and flag potential issues so partnering with the right team is key to ensuring a smooth transition.

Inheriting a case can be a lot of work, but front-loading that effort will make the case run much more smoothly as you get closer to trial.

How to Set up and Organize an Efficient Smart Document Review

There are many methodologies that can be employed in document review, and no single strategy is right for every set of documents. As new technologies emerge, it is important to think critically about how existing strategies can be updated. For example, traditional linear review (document by document in ingested order) doesn’t have to be a “dumb” review. Coupling a traditional review methodology with technology or even simple grouping methods can dramatically improve consistency and quality, reducing the overall cost of the review.

Documents being reviewed in the traditional fashion are naturally grouped in the order in which they are ingested into the database (using the original structure in which data was collected and processed). This standard grouping, while sufficient, doesn’t account for the fact that similar documents may have been ingested at different times or may exist within other custodians’ data. Because related documents are batched apart from one another, it is unlikely the same reviewer will be tasked with reviewing those similar documents. As with any subjective process, different reviewers may reasonably make different coding decisions on those similar documents.

When documents are coded inconsistently it can lead to larger problems in discovery, including the potential for opposing counsel to argue that the process used to identify relevant documents may not have been sufficient.


Metadata Grouping

The simplest method for further grouping documents is to use the metadata associated with those documents. Grouping by date, domains or even the subject line pulls similar data sets together. By batching those groupings together, one reviewer can be assigned similarly themed documents. Not only does this increase speed, because the reviewer is seeing the same or similar documents over and over, but it also makes it less likely that these documents will be coded inconsistently from one another.
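
A minimal sketch of metadata grouping, assuming documents are represented as simple dictionaries with hypothetical field names (`id`, `domain`), might look like this:

```python
from collections import defaultdict

def group_by_metadata(docs, field):
    """Group documents on a single metadata field (e.g., sender domain or
    normalized subject) so similar items can be batched together."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.get(field, "<blank>")].append(doc["id"])
    return dict(groups)

# Hypothetical documents with a sender-domain field extracted at processing:
docs = [
    {"id": "D1", "domain": "yahoo.com",   "subject": "Daily deals"},
    {"id": "D2", "domain": "example.com", "subject": "Q3 forecast"},
    {"id": "D3", "domain": "yahoo.com",   "subject": "More daily deals"},
]
batches = group_by_metadata(docs, "domain")
print(batches["yahoo.com"])  # ['D1', 'D3'] -> assign both to one reviewer
```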


Near Deduplication

Near deduplication is not true deduplication based on hash value. Instead, it is generally a process where documents are indexed and the indexed text is compared for similarities, thereby identifying near-duplicates based on text/content rather than metadata. The user sets the level of similarity required for a document to be grouped with another. Since only the text is compared, documents of different file types, from different date ranges, or from different custodians can still be grouped together. Again, batching based on this grouping can deliver all the benefits described above.
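
The underlying idea can be sketched in a few lines of plain Python using word shingles and Jaccard similarity. Commercial tools use far more sophisticated indexing; the two-word shingles and the 0.5 threshold here are illustrative assumptions only:

```python
def shingles(text, k=2):
    """Break extracted text into overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def similarity(text_a, text_b, k=2):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    return len(a & b) / len(a | b) if (a | b) else 1.0

THRESHOLD = 0.5  # stands in for the user-settable similarity level

doc1 = "quarterly results were strong across all regions this year"
doc2 = "quarterly results were strong across most regions this year"
print(similarity(doc1, doc2) >= THRESHOLD)  # True -> group as near-duplicates
```

Because only extracted text is compared, the metadata of the two documents (file type, date, custodian) never enters the calculation.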


Email Threading

Email threading is the process of pulling email conversations together. The process ties an original email to all the subsequent replies and forwards pertaining to that original email. This grouping allows the reviewer to see all the related documents in order as one conversation.
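
A toy version of threading can be sketched by normalizing subject lines and sorting by sent date. Real engines also rely on message-ID and in-reply-to headers; the field names and emails below are hypothetical:

```python
import re
from collections import defaultdict

PREFIX = re.compile(r"^\s*(re|fw|fwd)\s*:\s*", re.IGNORECASE)

def normalize_subject(subject):
    """Strip any stack of Re:/Fw:/Fwd: prefixes to find the root subject."""
    while PREFIX.match(subject):
        subject = PREFIX.sub("", subject, count=1)
    return subject.strip().lower()

def build_threads(emails):
    """Group emails by root subject and sort each thread by sent date so
    the conversation reads in order."""
    threads = defaultdict(list)
    for msg in emails:
        threads[normalize_subject(msg["subject"])].append(msg)
    return {s: sorted(ms, key=lambda m: m["sent"]) for s, ms in threads.items()}

emails = [
    {"id": "E2", "subject": "RE: Merger timeline",      "sent": "2024-01-03"},
    {"id": "E1", "subject": "Merger timeline",          "sent": "2024-01-02"},
    {"id": "E3", "subject": "Fwd: RE: Merger timeline", "sent": "2024-01-04"},
]
conversation = build_threads(emails)["merger timeline"]
print([m["id"] for m in conversation])  # ['E1', 'E2', 'E3']
```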


Other Grouping

While the grouping methods discussed above might work for many datasets, it is important to consider the data being reviewed to determine if these or other methodologies may work better. For example, a review with mixed file types like email and video files might need to be split and organized in multiple ways.


Regardless of the dataset or budget, clever grouping, or “smart batching,” of datasets can dramatically increase both the speed and quality of a traditional review as well as reduce overall costs.
Contact Harbor today to see how our customizable Document Review Workflows can efficiently organize your reviews, so more consistent coding decisions can be made, significantly lowering your document review costs.

ECA Meets Analytics

For trial lawyers, Early Case Assessment (ECA) has always been the process of quickly synthesizing information from multiple sources to craft an initial case strategy. This process typically involves working closely with the client to identify and interview key witnesses, to review important documents, and to develop preliminary discovery and litigation plans.

The explosion of electronic data in the 1990s had many eDiscovery software companies clamoring to develop ECA tools to better manage the data. These tools allowed legal teams to cull data using keywords, dates, and other file characteristics in an attempt to reduce and/or prioritize the files that require review. These preliminary instruments even provided some insight into potential discovery costs associated with litigating. But their utility was limited, compared to what the industry needed.

This ‘early data assessment’, while valuable, often didn’t fully help lawyers access and analyze information. What data was truly useful, and what was not? Which documents would play an essential role in formulating case strategy? Which were simply irrelevant or redundant?

Keywords proved to be an inefficient means of organizing data. They offered a glimmer of insight, but did not deliver a dynamic way for attorneys to understand a case, or the means to evolve that understanding. Moreover, keyword search alone proved a flawed method of locating potentially relevant files (in terms of both recall and precision): it tended to overlook too many important documents and to “hit” on far too many irrelevant ones. Too wide, too shallow.

Analytics, fortunately, offer the potential to bridge the gap. They enhance a lawyer’s assessment of a case through scientific analysis of the data. ‘Scientific’ being the operative word.

Today, metrics, correlations, associations, occurrences, and algorithms have come to the forefront. (Well, maybe behind the scenes.) When deployed at the ECA stage, analytics can not only inform the development of early case strategy but also provide a more sophisticated means of culling data for review. This makes analytics valuable for estimating and reducing overall discovery costs as well.

What can Analytics do for you?

Standard features found in most eDiscovery analytics tools offer these functions:

1) Conceptual Clustering – Documents are analyzed based on their text, and then complex algorithms group documents together based on their conceptual similarity. Now, related items and topics start to cluster together, for easier observation. Even though the words within the document may be different, clustering will still group documents together, if they are conceptually similar.

Practical Uses of Conceptual Clustering:

    • Find important documents quickly. Use the ‘Concept Wheel’ as a preview index to explore documents. Whether you’re trying to wrap your mind around the data or to look for a ‘smoking gun’, the Concept Clusters will give you a leg up from the very start.
    • Prioritize documents for review. Use of Concept Clusters prioritizes the most important documents and de-prioritizes the less important ones prior to review. In other words, reviewers will get assigned the most important documents first. Likely irrelevant docs fall to the bottom of the pecking order.
    • Assign reviewers with clusters in mind. Some documents may require subject-matter expertise to make proper coding decisions. Using conceptual clusters, technical documents can be assigned to the right reviewers straight off the bat.
    • Batch by Cluster. Clustering will band together conceptually similar documents (also near duplicates) when batching. This way, reviewers will receive similar documents in their batches. Having a batch full of the same type of documents (e.g., 500 emails about fantasy football) will lead to increased efficiency (speed) and more consistent (accurate) coding decisions by reviewers. Clusters deliver similarity, speed, and accuracy.
    • Find similar. Once you’ve located a key document, this function uses analytics to find other documents that are conceptually similar to the one you’re reviewing.
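
To make the clustering idea concrete, here is a deliberately simplified sketch in plain Python. It tags documents using a hand-built concept table, whereas real engines learn conceptual associations statistically (for example, via latent semantic indexing); every name and document below is hypothetical:

```python
from collections import Counter

# Hand-built concept table for illustration only; real engines learn these
# associations from the data rather than from a fixed dictionary.
CONCEPTS = {
    "invoice": "billing", "payment": "billing", "remittance": "billing",
    "chipset": "engineering", "firmware": "engineering", "schematic": "engineering",
}

def dominant_concept(text):
    """Tag a document with its most frequent concept, or None if no
    concept words appear."""
    counts = Counter(CONCEPTS[w] for w in text.lower().split() if w in CONCEPTS)
    return counts.most_common(1)[0][0] if counts else None

docs = {
    "D1": "please send the invoice and confirm payment",
    "D2": "remittance received and invoice closed",
    "D3": "the chipset firmware needs a new schematic",
}
clusters = {}
for doc_id, text in docs.items():
    clusters.setdefault(dominant_concept(text), []).append(doc_id)
print(clusters)  # {'billing': ['D1', 'D2'], 'engineering': ['D3']}
```

Note that D1 and D2 cluster together even though one says “payment” and the other “remittance” – which is the point of conceptual grouping.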

2) Key-Term Expansion – This tool first identifies conceptually related terms found in your content, and then ranks them in order of relevance. The user dictates the status, grade, and order of subjects.

Practical Use:

Start with a keyword. The tool provides a list of similar, or very related, terms. The results allow reviewers to expand the search to include documents containing other near or related terms.

For example, a search for “President Roosevelt” might produce a list such as: Theodore Roosevelt, Teddy Roosevelt, Theodore Roosevelt Jr., Franklin Delano Roosevelt, FDR, Commander-in-Chief, Vice President Roosevelt, Senator Roosevelt, Assemblyman Roosevelt, Eleanor Roosevelt, the Oval Office, Office of the President, POTUS, etc.

When using key-term expansion, a reviewer searching for important documents based on keywords can conduct a much more comprehensive and defensible search. This expansion of terms will produce more meaningful and trustworthy results.
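
A crude stand-in for key-term expansion can be sketched with co-occurrence counting in plain Python. Real tools rank candidates by conceptual relatedness rather than raw counts, and the stop-word list and documents below are purely illustrative:

```python
from collections import Counter

STOP = {"the", "a", "an", "and", "of", "from"}  # tiny stop-word list for the sketch

def expand_term(seed, documents):
    """Rank candidate expansion terms by how often they co-occur with the
    seed keyword across the documents that contain it."""
    counts = Counter()
    for text in documents:
        words = [w for w in text.lower().split() if w not in STOP]
        if seed in words:
            counts.update(w for w in words if w != seed)
    return [term for term, _ in counts.most_common()]

docs = [
    "roosevelt signed the executive order",
    "president roosevelt addressed congress",
    "the president praised roosevelt",
]
candidates = expand_term("roosevelt", docs)
print(candidates[0])  # 'president' co-occurs with the seed most often
```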

3) Conceptual Search – The tool finds documents conceptually related to a known term or phrase. Comparable documents get grouped together by their correlated concept.

Practical Use:

Imagine you’ve located a key phrase or paragraph. Now you want to find similar ones that correspond to it. Concept searching will hunt for and assemble conceptually similar documents – even if they don’t contain the exact term(s) used in the initial search. These are documents that would not be found with keyword searching. At the same time, concept searching reduces false positives caused by homonyms and polysemes. An attorney can quickly zero in on top-priority documents for immediate review.

4) Email Threading – Email threading identifies emails that were once part of the same email thread (or conversation).

Practical Uses:

    • Smart Batching. Assign documents for review by email thread; this way, when a reviewer starts looking at documents they’ll see the original email, then the response, then the next email, etc. to more quickly and accurately understand the content of the conversation. This also helps reduce coding conflicts that can be created when emails from the same thread are spread across multiple reviewers.
    • Inclusive Email Identification – If an email thread goes back and forth 15 times, do you really need to read all 15? Or could you simply open the last email, start at the bottom, and read up? That is the idea behind “Inclusive” or “Unique” emails. Analytics identify the most comprehensive emails in the thread and suppress the redundant ones. This can easily cull 30% of the emails from a data set and reduce review time and cost.
    • Quality Control – Email threads can be used to spot conflicting coding decisions. For example, if two documents are in the same email thread conversation, how is it that one is marked as Responsive and the other is Non-Responsive? A quick search will identify these conflicts.
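
The inclusive-email idea can be sketched with a simple containment test, assuming each reply quotes the messages before it. Real tools normalize quoting, signatures, and headers before comparing; the substring check and sample thread below are toy assumptions:

```python
def inclusive_emails(thread):
    """Return the 'inclusive' emails: those whose text is not fully
    contained in any other email in the thread. Reading only these
    still covers the whole conversation."""
    keep = []
    for i, msg in enumerate(thread):
        contained = any(msg["text"] in other["text"]
                        for j, other in enumerate(thread) if j != i)
        if not contained:
            keep.append(msg["id"])
    return keep

# A three-message thread where each reply quotes everything before it:
thread = [
    {"id": "E1", "text": "Lunch Friday?"},
    {"id": "E2", "text": "Sure, noon works.\n> Lunch Friday?"},
    {"id": "E3", "text": "Booked the table.\n> Sure, noon works.\n> Lunch Friday?"},
]
print(inclusive_emails(thread))  # ['E3'] -- read the last email, bottom-up
```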

5) Near-Duplicate Identification – Deduplication removes documents that are 100% duplicative, but what happens when they’re only 99% similar? Near Dupe (ND) detection identifies documents that share largely the same words, in the same order, and groups them together. This has nothing to do with conceptual similarity – it’s a literal approach to similarity. So, those emails you get every morning from Yahoo Finance that have almost exactly the same text but a few slight differences … they’ll be grouped together.

Practical Uses:

    • Sample the Data – There are times when you don’t need to look at every document, especially when they’re all very similar. Using ND groups, you can select a “representative” document to stand in for other, similar documents. In other words, look at just one document from each ND group, not every single one, to get an idea of the group’s importance.
    • Smart Batching – As described above, it makes sense to assign similar documents to a single reviewer. One way to do this is to make sure all members of a ND group are given to a single reviewer; they’ll be in a better position to spot the differences between documents and to make quick coding decisions.
    • Quality Control – ND groups can be used to spot conflicting coding decisions. For example, if two documents are 99% similar, how is it that one is marked as Responsive and the other is Non-Responsive? A quick search will identify these conflicts.
    • Remove Near Dupes – “Argh,” says the Reviewer. “Almost all of the documents are the same and we’re wasting time looking through them all. Can you remove all of the Near Dupes?” The answer is yes, we can, but you need to be careful. If we remove everything that is 95% similar, who’s to say that something important isn’t included in the 5% that’s different? Bottom line: it’s risky to remove near dupes, so proceed with caution.
    • Propagate Coding Decisions – It may be possible, though risky for the same reasons stated above, to review only the “representative” documents: if the representative is Responsive, then the other documents in the same ND group should be Responsive too.

6) Computer Assisted Review (“Predictive Coding” or “Technology Assisted Review”) – The goal of computer assisted review is to train the analytics tool to make consistent, reliable responsiveness decisions on large sets of data. This can vastly reduce the volume of documents requiring human review for production.

The Harbor Difference

Harbor’s ECA workflow leverages a processing engine that’s fully integrated into our Relativity environment. It reduces the time it takes to get access to the documents, and it provides those documents in a familiar review format. Once our system ingests data, the reviewer has access to a host of traditional features such as keyword search, reporting, and powerful culling strategies that include deduplication and de-NISTing. This workflow also offers advanced options like data visualization, near-duplicate detection, data pivoting, sampling, email threading, clustering, and conceptual searching.

Brainspace powers Harbor’s analytics offering and enables a truly unique analytics experience. It dynamically links multiple views of data that encompass: Overview Dashboard, transparent concept search, timeline, document clusters, communication analysis, and structured data facets.

Visual Analytics

Robust tools reveal the story inside your data by using powerful, interactive visualizations – even with the largest datasets. Our Dashboard, Focus Wheel, and Communication Network Graph all link together dynamically to provide multiple perspectives on any data set or subset.

Transparent Concept Search

Truly transparent concept search gives reviewers in ECA complete control over the power of analytics while maintaining a clear understanding of the results. It takes the guesswork out of concept expansion and delivers a versatile, defensible platform for attorneys.

Communication Analysis

State-of-the-art social network visualization enables users to effortlessly navigate the communication graph. It reveals the content and context of conversations and posts, the direction of information flow, and CC/BCC relationships, and offers powerful yet simple alias consolidation.

Document Classification

Our unique approach to document classification incorporates multiple active-learning methods to accelerate system training, provides depth and recall metrics for planning and cost analysis, and delivers best-in-class matching results. Review less and decrease costs.

Contact Harbor Litigation, today, to see how our customizable ECA workflows can accelerate case understanding, defensibly reduce data sets, and significantly lower review costs.

Managed eDiscovery Services in the Cloud: The Future is Now

Managing eDiscovery in the cloud is in the future for many organizations, but for others it’s already the present. Managed eDiscovery offers law firms, corporations, and government entities the tools to control both costs and processes throughout the eDiscovery lifecycle.

There are four major components to eDiscovery operations: people, processes, software, and hardware. Managed eDiscovery allows your people to implement your processes, utilizing vendor software and hardware to run your operations. When the need arises, you have access to the vendor’s expertise. In some cases, you can license software yourself, and install it on the vendor’s hardware for your use.

In a nutshell, managed eDiscovery gives you your own customized eDiscovery solution without the capital outlays, maintenance, upgrades and personnel commitments required to build it yourself.

The Evolution of Managed Services

In the recent past, most legal departments made a choice between vendor-reliance and building in-house eDiscovery capabilities. When in-house capacity was insufficient, the legal department outsourced overflow to vendors.

Many companies found vendor-reliance unacceptable. Cost predictions were often futile: pricing models, compressed data, and lack of communication frequently led to invoices that far exceeded estimates. Vendor workflows didn’t always mesh with in-house processes, and “black-box” vendor services caused uncertainty and frustration in setting and meeting expectations.

In response, some legal departments sought to build their own internal eDiscovery capability. This approach had the advantage of process and workflow control. In addition, companies were able to realize cost savings, and some law firms managed to create profit centers from their eDiscovery services.

However, the required investment in technology and expertise made in-house eDiscovery too expensive for the majority of companies and firms. Others made business decisions not to go the in-house route to limit risk exposure or to focus on core offerings. Yet companies and firms without robust litigation support departments found themselves at a competitive disadvantage, and largely powerless to exercise any control over escalating eDiscovery costs.

Market Realities are Changing In-House eDiscovery

Even the companies who did build eDiscovery departments are revisiting their in-house
model because of certain market realities:

    • More complexity. The complexity of some eDiscovery processes has increased with a growing diversity of file types, and the increased diligence expected by courts.
    • Fast-growing unstructured data. The volume of unstructured data continues to grow, and much of it is potentially subject to discovery. Some firms have declined altogether to take on the custodial challenges of big data.
    • Rapid technology changes. eDiscovery technology has undergone rapid change. Fast changes require larger and more frequent ongoing investments in updated technology, along with more personnel and training.
    • Security challenges. Recent highly publicized security breaches have increased the focus on cyber-security and caused some firms to look at ways to mitigate risks.
    • Rapid scaling challenges. When a matter grows larger than expected, the eDiscovery team may find it hard to get capital expenditure approvals for rapid scaling. It may be entirely unfeasible to go through the normal channels to purchase additional hardware and software.
    • Wider attorney acceptance of the cloud. More attorneys are accepting cloud-based solutions, along with more mature offerings and enhanced infrastructure to support them.

How Managed eDiscovery Meets Challenges and Opportunities

Managed eDiscovery presents an alternative “hybrid” option for companies who outsource to vendors, as well as for companies with in-house capability. Companies with in-house litigation support departments lose nothing by adding managed services. They still leverage their experience and knowledge on future matters, maintain their existing workflows, and exercise control over their data. And they gain much lower costs without capital investments, the advantage of rapid scaling, and the ability to outsource services if and when they want to.

How Does Managed eDiscovery Work?

Managed eDiscovery is a combination of cloud computing and support services. Cloud computing is a collection of technologies that allow access to computing power through the internet, instead of an organization’s server room. Managed eDiscovery takes primary advantage of two cloud computing technologies: Software as a Service (SaaS) and Infrastructure as a Service (IaaS).

Software as a Service (SaaS)

Any software application accessible as a web page is considered SaaS. SaaS is commonly used in the legal industry for hosted review. In a pure SaaS model, the software is licensed by the vendor, which also takes responsibility for all maintenance, including upgrades, patches, security, and redundancy. If your storage needs spike, your SaaS vendor can ramp up your storage allocation, usually without interrupting existing processes, and you pay for the additional storage only for as long as you need it.

Infrastructure as a Service (IaaS)

IaaS grants customers access to servers, routers, storage, and other computing infrastructure over the internet. These services allow companies to use the internet for scalable storage and processing cycles. The infrastructure is similar to co-locating equipment at an offsite data center, except you don’t have to buy the equipment. Instead, you pay only for what you use, and the environment can be scaled up or down to match the uneven workflow common in eDiscovery.


Typically, an organization uses in-house resources to handle eDiscovery phases through (or up to) collection. After collection, data is uploaded to the service provider’s data center; some vendors offer high-speed FTP (or FTP-like) transfer options, while large data sets are often shipped directly to the data center. Your in-house technicians can take over from there and handle any or all phases from processing through production, including the setup and project management of hosted review databases.

With Managed eDiscovery, your technicians and project managers can log into software hosted in a secure data center and perform as much, or as little, of the actual data manipulation and project management as you choose. The service provider fills in the gaps and provides technical assistance. The software can be licensed by the vendor or by you.


For corporations, Managed eDiscovery allows attorneys to push all matters through the company’s workflows in a centralized location, collaborating with outside counsel wherever they’re physically located. Data can easily be harvested once and then used in multiple matters, replicating privilege and redaction calls where appropriate.

Additionally, organizations may find it easier to budget for Managed eDiscovery, as capital expenditures typically require more layers of approval and more advance notice than an expense budget. It’s also easier to manage and predict costs and return on investment with the monthly billing of Managed eDiscovery, instead of the startup costs, depreciation, and labor associated with buying and maintaining your own hardware and software.

Perhaps most importantly, Managed eDiscovery reduces stress on your internal systems and the people who maintain them.

Harbor Litigation Solutions Managed eDiscovery

  • Complete control by your in-house operations team
  • Greater than 57% savings over building and maintaining the hardware and software yourself
  • Ability to ramp-up quickly with no capital outlays
  • Ability to scale-down when matters come offline
  • Bank-level data security
  • Done-for-you software upgrades and patches
  • Assistance when you need it
  • Not volume based – i.e. no per-GB fees
  • Easily re-use attorney work product across matters
  • Focus on core competencies, not hardware and software