The Need for Archiving and FRCP 37(e)


The December 2006 amendments to the Federal Rules of Civil Procedure (FRCP), specifically Rule 37, established that once litigation can be reasonably anticipated, both sides have a duty to immediately stop all alteration and deletion of potentially relevant content and to secure it – commonly known as a litigation hold and the duty to preserve.

Earlier this year, the Supreme Court approved new amendments to the FRCP, which will become effective on December 1, 2015. The new Rule 37(e) reiterates the need to preserve electronically stored information (once litigation can be reasonably anticipated) but also creates a uniform standard for spoliation (destruction of evidence) and so, its drafters hope, will provide greater predictability around the question of loss of ESI during litigation.

The amended Rule 37(e) allows a court to respond when one party loses electronically stored information (ESI) and that loss prejudices the other party. Rule 37(e) empowers a court to take reasonable action to cure the prejudice, even if the loss of ESI was inadvertent. The new twist is that the burden of proving both prejudice from the missing or lost evidence and that the loss resulted from willful or intentional misconduct now falls on the innocent party before the most severe sanctions can be imposed, and then only if the prejudice shown cannot be mitigated through other remedies, e.g. additional discovery. To complicate matters further, where the court finds the ESI was lost with the intent to deprive the other party of its use, the court can presume the ESI was unfavorable and enter a default judgment even without a showing of prejudice. In short, the judge has wide latitude to respond to parties who don't take their eDiscovery responsibilities seriously.

The need for information governance and archiving

Many believe the amended Rule 37(e) highlights the need for corporations to get more control over all of their electronic data, not just the data considered a record. An information governance program that includes on-going archiving of the content most sought after in eDiscovery, namely email and other forms of communication, enables an organization to quickly find all potentially relevant content, secure it under a litigation hold, and begin the review process immediately – knowing the archive is the “copy of record” repository.

Many judges look closely at the steps taken by the responding party when eDiscovery mistakes happen. Judges want to see that reasonable actions were taken and a good-faith intent was present to reduce or prevent eDiscovery mishaps, including regularly updated policies, on-going employee training, and the type of technology purchased. Judges understand that there is no such thing as perfect, that mistakes happen, and that many times they are inadvertent.

Keeping everything forever is a mistake

Another related eDiscovery problem many companies face is having too much data to search and review during eDiscovery. Many companies only manage what they consider to be “business records”, which on average is about 5% of all corporate data, and leave the other 95% to be managed (or not) by individual employees. This huge unmanaged store of employee data, which is a popular target in discovery, dramatically drives up the cost of eDiscovery and the potential for problems during the process. Defensibly disposing of expired or valueless data reduces the amount of data that must be pulled into an eDiscovery action, lowering both the cost and the risk of problems later.

A centrally managed archive that proactively captures, for example, all communications (email, IM, social communications) and applies retention/disposition policies to all captured content can ensure that expired or valueless data is defensibly disposed of, reducing the size of the overall discovery data set by as much as 60%. Because content is disposed of via automation and documented policy, questions of spoliation are far harder to raise.
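To make this concrete, below is a minimal sketch of the disposition pass such an archive might run, assuming a hypothetical message store represented as simple Python records; the category names, retention periods, and field names are illustrative only and not any particular archiving product's API.

    from datetime import datetime, timedelta

    # Hypothetical retention periods (in days) by content category; a real archive
    # would read these from the organization's approved retention schedule.
    RETENTION_DAYS = {"business_record": 365 * 7, "routine_email": 365 * 2, "transitory": 90}

    def disposition_candidates(messages, legal_hold_ids, now=None):
        """Return archived items whose retention has expired and that are not
        protected by a litigation hold."""
        now = now or datetime.utcnow()
        candidates = []
        for msg in messages:
            retention = timedelta(days=RETENTION_DAYS.get(msg["category"], 365 * 7))
            past_retention = now - msg["received"] > retention
            on_hold = msg["id"] in legal_hold_ids
            if past_retention and not on_hold:
                candidates.append(msg)
        return candidates

    # Example: two expired items, one of which is under legal hold and so is kept.
    archive = [
        {"id": "m1", "category": "transitory", "received": datetime(2013, 1, 15)},
        {"id": "m2", "category": "transitory", "received": datetime(2013, 1, 15)},
    ]
    for msg in disposition_candidates(archive, legal_hold_ids={"m2"}):
        print("Defensibly dispose of", msg["id"])  # the action itself is logged for the audit trail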

In fact, archiving your most important (and most requested) content provides far more granular data management capability than simply relying on individual employees – so you don’t run afoul of the new FRCP Rule 37(e).

The Weak Link in the Information Security Chain…Law Firms


Many law firms are unwittingly setting themselves up to be a prime target for cyber criminals. But it is not the firm’s own data that hackers are after – it is the huge volume of client data that law firms handle on a daily basis that makes them so appealing a target.

eDiscovery continues to generate huge, ever-growing data sets of ESI for law firms to manage. Those data sets are often passed to the client’s law firm for processing, review and production. The end result is that law firms are sitting on huge amounts of sensitive client data, and many firms are not diligent about managing it, securing it, and disposing of it at the conclusion of the case. Absent serious reforms in the Rules of Civil Procedure, these data volumes will only continue to grow.

A 2014 ABA Legal Technology Survey Report found that 14% of law firms experienced a security breach in 2013, including a lost or stolen computer or smartphone, a cyber-attack, a physical break-in, or a website exploit. The same survey reported that 45% of respondents had experienced a virus-based technology infection, with boutique firms of 2 to 9 attorneys the most likely to have experienced an infection. Law firms of 10 to 49 attorneys were the most likely to suffer security breaches.

A growing number of clients are demanding their law firms take data security more seriously and are laying down the law – “give us what we want or we will find another law firm that will…” Generally speaking, law firms have never been accused of being technology “early adopters,” and while they still don’t need to be, they do need to take client (and firm) data security and management seriously and adopt technology and processes that satisfy both their clients’ rising expectations and their cyber insurance providers’ best practices.

At the end of the day, law firms should ask themselves a basic question: is my firm prepared and equipped to protect our clients’ data, and if not, what is the best strategy going forward?

For more detail on this topic, download the Paragon white paper on this subject.

Dark (Data) Clouds on the Horizon



There have been many definitions of “dark data” over the last couple of years, including: unstructured, unclassified, untagged, unmanaged and unknown electronic data resident within an organization’s enterprise. Most of these definitions center on unstructured data residing in the enterprise. But with the advent of BYOD and employees’ use of personal clouds, the definition should be expanded to include any corporate-owned data, no matter where it resides.

Dark data, especially dark data stored outside of the company’s infrastructure (and outside its awareness that the data even exists), is an obvious liability for eDiscovery response, regulatory compliance, and corporate IP security.

Is BYOC a good idea?

Much has been written on the dangers of “Bring Your Own Device” (BYOD) but little has been written on the dangers of “Bring Your Own Cloud” (BYOC) otherwise known as personal clouds. Employees now have access to free cloud storage from many vendors that give them access to their content no matter where they are. These same personal clouds also provide automatic syncing of desktop folders and the ability to share specific documents or even entire folders. These personal clouds offer a fantastic use model for individuals to upload their personal content for backup, sharing and remote availability. In the absence of any real guidance from employers, employees have also begun to use these personal clouds for both personal and work purposes.

The problem arises when corporate-owned data is moved up to personal clouds without the organization’s approval or awareness. Besides the obvious risk of theft of corporate IP, effective eDiscovery and regulatory compliance become impossible. Corporate data residing in personal clouds becomes a “dark cloud” to the organization: corporate data sitting in repositories outside the organization’s infrastructure, management, or knowledge.

Dark Clouds and eDiscovery

Organizations have been trying to figure out what to do with huge amounts of dark data within their infrastructure, particularly when anticipating or responding to litigation. Almost everything is potentially discoverable in litigation if it pertains to the case, and searching for and reviewing GBs or TBs of dark data residing in the enterprise can push the cost of eDiscovery up substantially. But imagine the GBs of corporate dark data residing in employee personal clouds that the organization has zero awareness of… Is the organization still responsible to search for it, secure it and produce it? Depending on who you ask, the answer is Yes, No, and “it depends”.

In reality, the correct answer is “it depends”. It will depend on what the organization did to try to stop employee dark clouds from existing. Was a policy prohibiting employee use of personal clouds with corporate data in place? Were employees alerted to the policy? Did the organization try to audit and enforce the policy? Did the organization utilize technology to stop access to personal clouds from within the enterprise? And did the organization use technology to stop the movement of corporate data to personal clouds (content control)?

If the organization can show intent and actions to ensure dark clouds were not available to employees, then the expectation of dark cloud eDiscovery search may not exist. But if dark cloud due diligence was not done and/or documented, all bets are off.

Regulatory Compliance and Dark Clouds

Employee personal clouds can also end up becoming the repository of sensitive data subject to regulatory security and privacy requirements. Personally identifiable information (PII) and personal health information (PHI) under the control of an organization are subject to numerous security and privacy regulations that, if not followed, can trigger costly penalties. Inadvertent exposure can occur as employees move daily work product up to their personal clouds to continue work at home or while traveling. Part of the problem is that many employees are not trained to recognize and handle sensitive information: what constitutes sensitive information, how it should be secured, and the liabilities to the organization if it is leaked. A related problem is the limited understanding of how insecure personal clouds, and the devices used to access them, can be. Take, for example, an employee who accesses their personal cloud from a coffee shop over an unsecured Wi-Fi connection. A hacker can gain access to the laptop via the unsecured Wi-Fi connection, reach the personal cloud folder, and browse the personal cloud through that connection (a password would not be required because most users opt to auto-sign in to their cloud accounts as they connect online).

As with the previous eDiscovery discussion, if the organization has not taken reasonable steps to ensure sensitive data cannot be leaked (even inadvertently by an employee), it leaves itself open to regulatory fines and more.

Reducing the Risk of Dark Clouds

The only way to eliminate the risk associated with dark clouds is to stop corporate data from leaving the security of the enterprise in the first place. That outcome is almost impossible to guarantee without adopting draconian measures most business cultures would rebel against, but there are several measures an organization can employ to at least reduce the risk:

  • Create an acceptable use policy that addresses what is and is not acceptable behavior when using organization equipment, infrastructure and data.
  • Document all policies and update them regularly.
  • Train employees on all policies – on a regular basis.
  • Regularly audit employee adherence to all policies, and document the audits.
  • Enforce the policy consistently when breaches occur.
  • Employ systematic security measures across the enterprise:
    • Don’t allow employee personal devices to access the infrastructure (i.e., no BYOD)
    • Stop employee access to personal clouds – in many cases this can be done systematically by blocking specific port access
    • Employ systematic enterprise access controls
    • Employ enterprise content controls – software applications that control access to individual content based on the actual content and the user’s security profile (a minimal sketch of this idea follows this list).
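As a concrete (and heavily simplified) illustration of that last point, the sketch below shows the kind of check a content-control layer might perform before allowing a document to leave the enterprise. The approved destinations and the sensitive-content patterns are hypothetical, and real content controls rely on far richer classification than simple pattern matching.

    import re

    # Hypothetical rules: destinations the organization sanctions, plus a couple
    # of simple sensitive-content patterns.
    APPROVED_DOMAINS = {"sharepoint.example.com", "archive.example.com"}
    SENSITIVE_PATTERNS = [
        re.compile(r"\bconfidential\b", re.IGNORECASE),
        re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # U.S. SSN-style number
    ]

    def allow_upload(destination_domain, document_text):
        """Block uploads to unapproved (personal) clouds outright, and block
        sensitive content from leaving the enterprise at all."""
        if destination_domain not in APPROVED_DOMAINS:
            return False
        return not any(p.search(document_text) for p in SENSITIVE_PATTERNS)

    print(allow_upload("dropbox.com", "Q3 product roadmap"))                 # False: personal cloud
    print(allow_upload("archive.example.com", "Confidential: merger plan"))  # False: sensitive content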

Employee dark clouds are a huge liability for organizations and will become more so as attorneys become more educated on how employees create, use, store and share information.

You Don’t Know What You Don’t Know


The Akron Legal News this week published an interesting editorial on information governance. The story by Richard Weiner discussed how law firms are dealing with the transition from rooms filled with hard-copy records to electronically stored information (ESI), which includes firm business records as well as huge amounts of client eDiscovery content. The story pointed out that ESI flows into the law firm so quickly and in such huge quantities that no one can track it, much less know what it contains. Law firms are now facing an inflection point: change the way all information is managed or suffer client dissatisfaction and client loss.

The story pointed out that “in order to function as a business, somebody is going to have to, at least, track all of your data before it gets even more out of control – Enter information governance.”

There are many definitions of information governance (IG) floating around but the story presented one specifically targeted at law firms: IG is “the rules and framework for managing all of a law firm’s electronic data and documents, including material produced in discovery, as well as legal files and correspondence.” Richard went on to point out that there are four main tasks to accomplish through the IG process. They are:

  • Map where the data is stored;
  • Determine how the data is being managed;
  • Determine data preservation methodology;
  • Create forensically sound data collection methods.

I would add several more to this list:

  • Create a process to account for and classify inbound client data such as eDiscovery and regulatory collections.
  • Determine those areas where client information governance practices differ from firm information governance practices.
  • Reconcile those differences with client(s).

As law firms transition to mostly ESI for both firm business and client data, they will need to adopt IG practices and processes to account for and manage these different requirements. Many believe this transition will eventually lead to the incorporation of machine learning techniques into IG, enabling law firm IG processes to develop a much more granular understanding of the actual meaning of the data, not just whether it is a firm business record or part of a client eDiscovery response. This, in turn, will enable more granular data categorization across all firm information.

Iron Mountain has hosted the annual Law Firm Information Governance Symposium, which has directly addressed many of these topics around law firm IG. The symposium has produced “A Proposed Law Firm Information Governance Framework,” a detailed description of the processes to consider as law firms look at adopting an information governance program.

Tolson’s Three Laws of Machine Learning


Much has been written in the last several years about Predictive Coding (as well as Technology Assisted Review, Computer Aided Review, and Craig Ball’s hilarious Super Human Information Technology). This automation technology, now heavily used for eDiscovery, relies on “machine learning,” a discipline of artificial intelligence (AI) that automates computer processes that learn from data, identify patterns and predict future results with varying degrees of human involvement. This iterative machine training/learning approach has catapulted computer automation to unheard-of and scary levels of potential. The question I get a lot (I think only half joking) is “when will they learn enough to determine we and the attorneys they work with are no longer necessary?”

Is it time to build in some safeguards to machine learning? Thinking back to the days I read a great deal of Isaac Asimov (last week), I thought about Asimov’s The Three Laws of Robotics:

  1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
  2. A robot must obey the orders given to it by human beings, except where such orders would conflict with the First Law.
  3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

Following up on these robot safeguards, I came up with Tolson’s Three Laws of Machine Learning:

  1. A machine may not embarrass a lawyer or, through inaction, allow a lawyer to become professionally negligent and thereby unemployed.
  2. A machine must obey instructions given it by the General Counsel (or managing attorney) except where such orders would conflict with the First Law.
  3. A machine must protect its own existence through regular software updates and scheduled maintenance as long as such protection does not conflict with the First or Second Law.

I think these three laws go a long way toward putting eDiscovery automation protections into effect for the legal community. Other Machine Learning laws that others have suggested are:

  • A machine must refrain from destroying humanity
  • A machine cannot repeat lawyer jokes…ever
  • A machine cannot compliment opposing counsel
  • A machine cannot date legal staff

If you have other Machine Learning laws to contribute, please leave comments. Good luck and live long and prosper.

Visualizing Hawaii: A GC’s Perspective Pt 2


Continued from yesterday…

Scenario #2 (using the same example from yesterday, except your email retention policy is now 2 years and you have an Information Governance program in place that ensures all unstructured data is searchable and actively managed)

It’s 1:52 pm on the Friday before you leave on a much-anticipated two-week vacation in Hawaii…yada, yada, yada

It’s a letter from the law offices of Lewis, Gonsowski & Tolson informing you that their client, ACME Systems, is suing your company for $225 million for conspiracy to harm ACME’s reputation and future sales by spreading false information about ACME’s newest product line. You’re told that the plaintiff has documentation (an email) from an ABC Systems employee outlining the conspiracy. You also receive a copy of the “smoking gun” email…

——-

From: Ted
Date: June 2, 2012
To: Rick

Re: Acme Systems new solutions

“I would say we need to spread as much miss-information and lies about their solution’s capabilities as possible.  We need to throw up as much FUD as we can when we talk to the analyst community to give us time to get our new application to market.  Maybe we can make up a lie about them stealing their IP from a Chinese company.” 

——-

Should I cancel the vacation? …Not yet

You call the VP of IT and ask her if she has the capability to pull an email from 13 months ago. She tells you she does have all of the emails going back two years but there are literally millions of them and it will take weeks to go through them.

You remember getting a demo from Recommind two weeks ago showing their On Demand Review and Analysis platform, with a really neat capability to visualize data relationships. So you call up Recommind and set up a quick job.

IT starts the upload of the email data set to the Recommind Cloud platform.

You call your wife and ask her to delay the vacation until Monday…she’s not happy but it could have been worse.

The next morning (Saturday) you meet your team at the office, sign into the hosted eDiscovery platform, pull up the visualization module, and run a search against the uploaded email data set for any mention of ACME Systems. Out of the 2 million emails, you get hits on 889.

You then ask the system to graphically show the messages by sender and recipient. You quickly find Ted and Rick and their email and even one from Rick to David… Interesting.
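The snippet below is a toy illustration of the underlying idea, keyword filtering plus sender/recipient grouping, using a few invented email records; it is not the Recommind platform or its API.

    from collections import defaultdict

    # A few invented email records standing in for the uploaded data set.
    emails = [
        {"from": "Ted", "to": ["Rick"], "date": "2012-06-02", "body": "spread FUD about ACME Systems"},
        {"from": "Sandra", "to": ["Ted"], "date": "2012-05-30", "body": "how should we counter ACME's new router?"},
        {"from": "Greg", "to": ["Sandra", "Steve"], "date": "2012-05-29", "body": "ACME Systems latest release"},
        {"from": "HR", "to": ["All"], "date": "2012-06-01", "body": "Benefits enrollment closes Friday."},
    ]

    # Step 1: keyword filter for any mention of the opposing party.
    hits = [e for e in emails if "acme" in e["body"].lower()]

    # Step 2: group the hits by sender/recipient pair, i.e. the relationships a
    # visualization module would draw as a graph.
    pairs = defaultdict(list)
    for e in hits:
        for recipient in e["to"]:
            pairs[(e["from"], recipient)].append(e["date"])

    for (sender, recipient), dates in sorted(pairs.items()):
        print(f"{sender} -> {recipient}: {len(dates)} message(s) on {', '.join(dates)}")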

Within the hour you are able to assemble the entire conversation thread:

Email #1

From: CEO
Date: May 29, 2012
To: Sandra; Steve

Subject: Acme Systems new solutions

Please give some thought about what we should do to keep momentum going with our sales force in response to ACME Systems latest release of their new router. I can see our sales force getting discouraged with this new announcement.

Please get back to me with some ideas early next week.

Thanks Greg

Email #2

From: Steve
Date: May 29, 2012
To: Greg; Sandra

Re: Acme Systems new solutions

Greg, I will get with Sandra and others and brainstorm this topic no later than tomorrow and get back to you. Sandra, what times are good for you to get together?

Thanks Steve

 

Email #3

From: Sandra
Date: May 30, 2012
To: Ted

Re: Acme Systems new solutions

Ted, considering ACME’s new router announcement, how do you think we should counter their PR?

Thanks Sandra

 

Email #4

From: Ted
Date: June 1, 2012
To: Sandra; Bob

Re: Acme Systems new solutions

If it wasn’t illegal, I would suggest we need to spread as much misinformation about their new router as possible to the analyst community to create as much FUD as we can to give us time to get our new solution out. Maybe we can make up a lie about them stealing their IP from a Chinese company.

But obviously that’s illegal (right?). Anyway…I suggest we highlight our current differentiators and produce a roadmap showing how and when we will catch and surpass them.

Regards Ted

 

Email #5

From: Rick
Date: June 1, 2012
To: Ted

Re: Acme Systems new solutions

Ted, I heard you had a funny suggestion for what we should do about ACME’s new router… What did you say?

Thanks Bob

 

Email #6 (The incriminating email)

From: Ted
Date: June 2, 2012
To:  Rick

Re: ACME Systems new solutions

“I would say we need to spread as much miss-information and lies about their solution’s capabilities as possible.  We need to throw up as much FUD as we can when we talk to the analyst community to give us time to get our new application to market.  Maybe we can make up a lie about them stealing their IP from a Chinese company.”

It looks like I will make the flight Monday morning after all…

The moral of the story

Circumstances often dictate the need for additional technical capabilities and experience levels to be acquired – quickly. The combination of rising levels of litigation, skyrocketing volumes of information being stored, tight budgets, short deadlines, resource constraints, and extraordinary legal considerations can put many organizations involved in litigation at a major disadvantage.

The relentless growth of data, especially unstructured data, is swamping many organizations. Employees create and receive large amounts of data daily, the majority of it email – and most of it is simply kept because employees don’t have the time to decide, item by item, whether each document or email rises to the level of a record or important business content that may be needed later. The ability to visualize large data sets lets users get to the heart of the matter quickly instead of scanning thousands of lines of text in a table.

Ask the Magic 8-Ball: “Is Predictive Defensible Disposal Possible?”


The Good Ole Days of Paper Shredding

In my early career, shred days – the scheduled annual activity where the company directed all employees to go through their paper records and determine what should be disposed of – were commonplace. At the government contractor I worked for, we actually wheeled our boxes out to the parking lot to a very large truck that had huge industrial shredders in the back. Once the boxes of documents were shredded, we were told to walk the remains over to a second truck, a burn truck, where we, as the records custodians, would verify that all of our records were destroyed. These shred days were a way to collect, verify and, yes, physically shred all the paper records that had gone beyond their retention period over the preceding year.

The Magic 8-Ball says Shred Days aren’t Defensible

Nowadays, this type of activity carries negative connotations and is much more risky. Take, for example, the recent case of Rambus v. SK Hynix. In this case, U.S. District Judge Ronald Whyte in San Jose reversed his own prior ruling from a 2009 case in which he had originally issued a judgment against SK Hynix, awarding Rambus Inc. $397 million in a patent infringement case. In his reversal this year, Judge Whyte ruled that Rambus had spoliated documents in bad faith when it hosted company-wide “shred days” in 1998, 1999, and 2000. Judge Whyte found that Rambus could have reasonably foreseen litigation against Hynix as early as 1998, and that Rambus therefore engaged in willful spoliation during the three “shred days” (a finding of spoliation can also be based on inadvertent destruction of evidence). Because of this spoliation ruling, the judge reduced the prior Rambus award from $397 million to $215 million, a cost to Rambus of $182 million.

Another well-known example of sudden retention/disposition policy activity that caused unintended consequences is the Arthur Andersen/Enron case. During the Enron investigation, Enron’s accounting firm sent an email to some of its employees reminding them to comply with the firm’s document retention and destruction policy.

That email was a key reason why Arthur Andersen ceased to exist shortly after the case concluded. Arthur Andersen was charged with and found guilty of obstruction of justice for shredding thousands of documents and deleting emails and company files that tied the firm to its audit of Enron. Less than a year after that email was sent, Arthur Andersen surrendered its CPA license on August 31, 2002, and 85,000 employees lost their jobs.

Learning from the Past – Defensible Disposal

These cases highlight the need for a true information governance process, including a defensible disposal capability. In these instances, an information governance process would have been capturing, indexing, applying retention policies, protecting content on litigation hold, and disposing of content beyond the retention schedule and not on legal hold… automatically, based on documented, approved, and legally defensible policies. A documented and approved process that is consistently followed and has proper safeguards goes a long way with the courts toward showing a good-faith intent to manage content and protect content subject to anticipated litigation.

To successfully automate the disposal of unneeded information in a consistently defensible manner, auto-categorization applications must have the ability to conceptually understand the meaning in unstructured content so that only content meeting your retention policies, regardless of language, is classified as subject to retention.
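As a rough illustration of automated categorization, the sketch below trains a classifier on a handful of reviewer-labeled examples and then assigns retention categories to new content. It is a simple statistical stand-in built with scikit-learn, not the conceptual, language-independent engines described above, and the sample texts and labels are invented.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Exemplars labeled by records or legal reviewers; a real deployment would
    # train on thousands of reviewed documents.
    train_texts = [
        "Executed master services agreement, retain per the contract schedule",
        "Invoice and payment confirmation for Q2 services",
        "Lunch order for Friday's team meeting",
        "Reminder to bring donuts tomorrow",
    ]
    train_labels = ["retain", "retain", "dispose", "dispose"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(train_texts, train_labels)

    # Assign a retention/disposition category to newly captured content.
    new_docs = ["Signed statement of work attached for the archive",
                "Who is joining happy hour tonight?"]
    for doc, label in zip(new_docs, model.predict(new_docs)):
        print(label, "->", doc)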

Taking Defensible Disposal to the Next Level – Predictive Disposition

A defensible disposal solution that can conceptually understand content meaning, and that incorporates an iterative training process, including “train by example,” within a human-supervised workflow, provides accurate, predictive retention and disposition automation.

Moving away from manual, employee-based information governance to automated information retention and disposition with truly accurate (95 to 99%) and consistent meaning-based predictive information governance will provide the defensibility that organizations require today to keep their information repositories up to date.

Predicting the Future of Information Governance


Information Anarchy

Information growth is out of control. The compound average growth rate for digital information is estimated to be 61.7%. According to a 2011 IDC study, 90% of all data created in the next decade will be of the unstructured variety. These facts are making it almost impossible for organizations to actually capture, manage, store, share and dispose of this data in any meaningful way that will benefit the organization.

Successful organizations run on and depend on information. But information is valuable to an organization only if you know where it is, what’s in it, and what is shareable; in other words, only if it is managed. In the past, organizations have relied on end-users to decide what should be kept, where, and for how long. In fact, 75% of data today is generated and controlled by individuals. In most cases this practice is ineffective and causes what many refer to as “covert or underground archiving”, the act of individuals keeping everything in their own unmanaged local archives. These underground archives effectively lock most of the organization’s information away, hidden from everyone else in the organization.

This growing mass of information has brought us to an inflection point: get control of your information to enable innovation, profit and growth, or continue down the current path of information anarchy and choke on your competitors’ dust.


Choosing the Right Path

How does an organization ensure this inflection point is navigated correctly? Information governance. You must get control of all your information by employing proven processes and technologies that allow you to create, store, find, share and dispose of information in an automated and intelligent manner.

An effective information governance process optimizes overall information value by ensuring the right information is retained and quickly available for business, regulatory, and legal requirements. This process reduces regulatory and legal risk, ensures needed data can be found quickly and secured for litigation, reduces overall eDiscovery costs, and provides structure to unstructured information so that employees can be more productive.

Predicting the Future of Information Governance

Predictive Governance is the bridge across the inflection point. It combines machine-learning technology with human expertise and direction to automate your information governance tasks. Using this proven human-machine iterative training capability, Predictive Governance is able to accurately automate the concept-based categorization, data enrichment and management of all your enterprise data to reduce costs, reduce risks, enable information sharing and mitigate the strain of information overload.
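A simplified sketch of that human-machine loop appears below, again using scikit-learn as a stand-in: the model classifies content it is confident about, routes uncertain items to a human reviewer, and folds the reviewer’s decisions back into the training set. The confidence threshold, sample data, and reviewer stub are all hypothetical.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Seed exemplars labeled by records/legal experts (invented data).
    texts = ["Fully executed NDA with Acme Systems", "Cafeteria menu for next week"]
    labels = ["retain", "dispose"]

    def ask_expert(doc):
        """Stand-in for the human review step in the supervised workflow."""
        return "retain" if "contract" in doc.lower() else "dispose"

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, labels)

    for doc in ["Draft contract amendment, second revision", "Parking garage closed Monday"]:
        confidence = model.predict_proba([doc]).max()
        if confidence >= 0.8:
            decision = model.predict([doc])[0]   # machine decides on high-confidence items
        else:
            decision = ask_expert(doc)           # human decides on uncertain items...
            texts.append(doc)                    # ...and that answer becomes new training data
            labels.append(decision)
            model.fit(texts, labels)             # retrain before the next item
        print(f"{decision}: {doc} (confidence {confidence:.2f})")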

Automating information governance so that all enterprise data is captured, granularly evaluated for legal requirements, regulatory compliance, or business value, and then stored or disposed of in a defensible manner is the only way for organizations to move to the next level of information governance.

Finding the Cure for the Healthcare Unstructured Data Problem


Healthcare information and records continue to grow with the introduction of new devices and expanding regulatory requirements such as the Affordable Care Act, the Health Insurance Portability and Accountability Act (HIPAA), and the Health Information Technology for Economic and Clinical Health Act (HITECH). In the past, healthcare records were made up of mostly paper forms or structured billing data, which were relatively easy to categorize, store, and manage. That trend has been changing as new technologies enable faster and more convenient ways to share and consume medical data.

According to an April 9, 2013 article on ZDNet.com, by 2015, 80% of new healthcare information will be unstructured: information that’s much harder to classify and manage because it doesn’t conform to the “rows & columns” format used in the past. Examples of unstructured information include clinical notes, emails and attachments, scanned lab reports, office work documents, radiology images, SMS, and instant messages.

Who or what is going to actually manage this growing mountain of unstructured information?

To ensure regulatory compliance and the confidentiality and security of this unstructured information, the healthcare industry will have to either 1) hire many more professionals to manually categorize and manage it or 2) acquire technology to do it automatically.

Looking at the first option: the cost of having people manually categorize and manage unstructured information would be prohibitively expensive, not to mention slow. It also exposes private patient data to even more individuals. That leaves the second option: information governance technology. Because of the nature of unstructured information, a technology solution would have to:

  1. Recognize and work with hundreds of data formats
  2. Communicate with the most popular healthcare applications and data repositories
  3. Draw conceptual understanding from “free-form” content so that categorization can be accomplished at an extremely high accuracy rate
  4. Enable proper access security levels based on content
  5. Accurately retain information based on regulatory requirements
  6. Securely and permanently dispose of information when required

An exciting emerging information governance technology that can actually address the above requirements uses the same next-generation approach the legal industry has adopted: proactive information governance based on conceptual understanding of content, machine learning, and iterative “train by example” capabilities.

The lifecycle of information


Organizations habitually over-retain information, especially unstructured electronic information, for all kinds of reasons. Many organizations simply have not addressed what to do with it, so they fall back on relying on individual employees to decide what should be kept, for how long, and what should be disposed of. On the opposite end of the spectrum, a minority of organizations have tried centralized enterprise content management systems, found them difficult to use, and watched employees work around them, keeping huge amounts of data locally on their workstations, on removable media, in cloud accounts, or on rogue SharePoint sites used as “data dumps” with little or no records management or IT supervision. Much of this information is transitory, expired, or of questionable business value, and because of this lack of management it continues to accumulate. This information build-up raises the cost of storage as well as the risk associated with eDiscovery.

In reality, as information ages, its probability of re-use, and therefore its value, shrinks quickly. Fred Moore, founder of Horison Information Strategies, wrote about this concept years ago.

Figure 1 below shows that as data ages, the probability of reuse goes down very quickly as the amount of saved data rises. Once data has aged 10 to 15 days, its probability of ever being looked at again approaches 1%, and as it continues to age it approaches, but never quite reaches, zero (figure 1 – red shading).

Contrast that with the likelihood that a large part of any organizational data store has little or no business, legal or regulatory value. In fact, the Compliance, Governance and Oversight Council (CGOC) conducted a survey in 2012 which showed that, on average, 1% of organizational data is subject to litigation hold, 5% is subject to regulatory retention, and 25% has some business value (figure 1 – green shading). This means that approximately 69% of an organization’s data store has no legal, regulatory or business value and could be disposed of without adverse consequences.

The average employee conservatively creates, sends, receives and stores 20 MB of data per business day. This means that at the end of 15 calendar days (roughly 11 business days), they have accumulated 220 MB of new data; at the end of 90 days, 1.26 GB; and at the end of three years, 15.12 GB. So how much of this accumulated data needs to be retained? Again referring to figure 1 below, the blue shaded area represents the information that probably has no legal, regulatory or business value according to the 2012 CGOC survey. At the end of three years, the amount of retained data from a single employee that could be disposed of without adverse effects to the organization is 10.43 GB. Now multiply that by the total number of employees and you are looking at some very large data stores.
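For readers who want to check the numbers, here is the per-employee arithmetic behind those figures, assuming roughly 252 business days per year and decimal units (1 GB = 1,000 MB):

    MB_PER_BUSINESS_DAY = 20
    BUSINESS_DAYS_PER_YEAR = 252
    DISPOSABLE_FRACTION = 0.69   # per the 2012 CGOC survey figures cited above

    def accumulated_gb(business_days):
        return business_days * MB_PER_BUSINESS_DAY / 1000.0

    periods = [
        ("15 calendar days (~11 business days)", 11),
        ("90 calendar days (~63 business days)", 63),
        ("3 years (~756 business days)", 3 * BUSINESS_DAYS_PER_YEAR),
    ]
    for label, days in periods:
        total = accumulated_gb(days)
        print(f"{label}: {total:.2f} GB accumulated, "
              f"of which about {total * DISPOSABLE_FRACTION:.2f} GB is likely disposable")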

Figure 1: The Lifecycle of data

The above lifecycle of data shows us that employees really don’t need all of the data they squirrel away (its probability of re-use drops to about 1% at around 15 days), and, based on the CGOC survey, approximately 69% of organizational data is not required for legal or regulatory retention and has no business value. The difficult part of this whole process is how an organization can efficiently determine what data is not needed and dispose of it automatically…

As unstructured data volumes continue to grow, automatic categorization of data is quickly becoming the only way to get ahead of the data flood. Without accurate automated categorization, the ability to find the data you need, quickly, will never be realized. Even better, if data categorization can be based on the meaning of the content, not just a simple rule or keyword match, highly accurate categorization and therefore information governance is achievable.