A great deal has been written about the GDPR and CCPA privacy laws, both of which include a “right to be forgotten.” The right to be forgotten is an idea that was put into practice in the European Union (EU) in May 2018 with the General Data Protection Regulation (GDPR).
Tolson’s Three Laws of Machine Learning
Much has been written in the last several years about Predictive Coding (as well as Technology Assisted Review, Computer Aided Review, and Craig Ball’s hilarious Super Human Information Technology). This automation technology, now widely used for eDiscovery, relies heavily on “machine learning”, a discipline of artificial intelligence (AI) that automates computer processes that learn from data, identify patterns and predict future results with varying degrees of human involvement. This iterative machine training/learning approach has catapulted computer automation to unheard-of and scary levels of potential. The question I get a lot (only half joking, I think) is “when will they learn enough to determine that we and the attorneys they work with are no longer necessary?”
Is it time to build in some safeguards to machine learning? Thinking back to the days I read a great deal of Isaac Asimov (last week), I thought about Asimov’s Three Laws of Robotics:
- A robot may not injure a human being or, through inaction, allow a human being to come to harm.
- A robot must obey the orders given to it by human beings, except where such orders would conflict with the First Law.
- A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
Following up on these robot safeguards, I came up with Tolson’s Three Laws of Machine Learning:
- A machine may not embarrass a lawyer or, through inaction, allow a lawyer to become professionally negligent and thereby unemployed.
- A machine must obey instructions given it by the General Counsel (or managing attorney) except where such orders would conflict with the First Law.
- A machine must protect its own existence through regular software updates and scheduled maintenance as long as such protection does not conflict with the First or Second Law.
I think these three laws go a long way in putting eDiscovery automation protections into effect for the legal community. Other Machine Learning laws that others suggested are:
- A machine must refrain from destroying humanity
- A machine cannot repeat lawyer jokes…ever
- A machine cannot compliment opposing counsel
- A machine cannot date legal staff
If you have other Machine Learning laws to contribute, please leave comments. Good luck and live long and prosper.
Discovering Dark Data
Dark data, otherwise known as unstructured, unmanaged, and uncategorized information, is a major problem for many organizations (and many don’t even know it). Many organizations don’t have the will, systems or processes in place to automatically index and categorize their rapidly growing unstructured dark data and instead rely on employees to manually manage their own information. This reliance on employees is a no-win situation because employees have neither the incentive nor the time to actively manage their information, so dark data continues to pile up all over the organization. This accumulation of dark data has several obvious problems associated with it:
- Dark data consumes costly storage space and resources – Most medium to large organizations provide terabytes of file share storage space for employees and departments to utilize. Employees drag and drop all kinds of work-related files (and personal files like personal photos, MP3 music files, and personal communications) as well as PSTs and workstation backup files. The vast majority of these files are unmanaged and are never looked at again by the employee or anyone else.
- Dark data consumes IT resources – IT personnel are required to perform nightly backups and DR planning, and to find or restore files that employees could not find.
- Dark data masks security risks – File shares act as “catch-alls” for employees. Sensitive company information regularly finds its way to these repositories. These file shares are almost never secure, so sensitive information like personally identifiable information (PII), protected health information (PHI), and intellectual property can be inadvertently leaked.
- Dark data raises eDiscovery costs – Organizations find themselves trying to figure out what to do with huge amounts of dark data, particularly when they’re anticipating litigation. Almost everything is discoverable in litigation if it pertains to the case and reviewing GBs or TBs of dark data can push the cost of eDiscovery up substantially.
Dark Data…It’s a Good Thing
Many organizations have begun to look at uncontrolled dark data growth and reason that, as Martha Stewart used to say…“it’s a good thing”. They believe they can run big data analytics on it and discover really interesting things that will help them market and sell better. This strategy misses the point of information governance, which is defined as:
“a cross-departmental framework consisting of the policies, procedures and technologies designed to optimize the value of information while simultaneously managing the risks and controlling the associated costs, which requires the coordination of eDiscovery, records management and privacy/security disciplines.”
Data has risks associated with it as well as cost beyond its daily cost of storage. Let’s consider the legal implications of dark data.
Almost everything is discoverable in litigation if it’s potentially relevant to the case. The fact that tens or hundreds of terabytes of unindexed and unmanaged content are sitting on file shares means that those files might hold relevant content, so they may have to be reviewed to determine whether they are relevant to a given legal case. That fact can add hundreds of thousands or millions of dollars of additional cost to a single eDiscovery request. For example, according to a CGOC survey in 2012, on average 1% of data is subject to legal hold, 5% is subject to regulatory retention and 25% has some value to the business, leaving 69% with no real legal, regulatory or business reason to be kept. So for a given 20 TB file share, on average 1% or 200 GB is potentially relevant to a given eDiscovery request. 200 GB of content can conservatively hold 2 million pages that might have to be reviewed to determine relevancy to the case. These same 2 million pages of content would cost $1.5 million to review using standard manual review processes. The big question that has to be asked is how many of these 2 million pages were considered irrelevant to the business and should not have been kept? Considering the same 69% number from the survey mentioned above: 2 million pages * 69% = 1.38 million pages that should have been deleted and would never have had to be reviewed for the case.
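The arithmetic in the example above can be reproduced with a short Python sketch. The survey shares and the 20 TB file share come from the text; the pages-per-GB and cost-per-page rates are not stated anywhere in the post and are simply backed out of its totals (2 million pages in 200 GB, $1.5 million for 2 million pages), so treat them as assumptions rather than benchmarks.

```python
# Figures quoted in the post (CGOC 2012 survey and the 20 TB example).
FILE_SHARE_GB = 20_000        # 20 TB, counting 1 TB as 1,000 GB as the post does
LEGAL_HOLD_SHARE = 0.01       # share of data subject to legal hold
NO_VALUE_SHARE = 0.69         # share with no legal, regulatory or business value

# Assumptions backed out of the post's totals, not independent benchmarks.
PAGES_PER_GB = 10_000         # 2 million pages / 200 GB
REVIEW_COST_PER_PAGE = 0.75   # $1.5 million / 2 million pages


def review_estimate(file_share_gb, relevant_share, pages_per_gb, cost_per_page):
    """Return (relevant GB, pages to review, manual review cost in dollars)."""
    relevant_gb = file_share_gb * relevant_share
    pages = round(relevant_gb * pages_per_gb)
    return relevant_gb, pages, pages * cost_per_page


relevant_gb, pages, cost = review_estimate(
    FILE_SHARE_GB, LEGAL_HOLD_SHARE, PAGES_PER_GB, REVIEW_COST_PER_PAGE
)

# Pages that, per the 69% figure, should never have been kept in the first place.
avoidable_pages = round(pages * NO_VALUE_SHARE)
avoidable_cost = avoidable_pages * REVIEW_COST_PER_PAGE

print(f"Potentially relevant: {relevant_gb:,.0f} GB, {pages:,} pages")
print(f"Manual review cost:   ${cost:,.0f}")
print(f"Avoidable pages:      {avoidable_pages:,} (${avoidable_cost:,.0f})")
```

Under these assumed rates, the 1.38 million avoidable pages correspond to roughly $1 million of the $1.5 million review bill; changing either rate changes the dollar figures but not the proportions.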
Ask your GC if uncontrolled and unmanaged dark data growth is a “good thing”…
Dark data equals higher discovery costs so make dark data visible so that you can find it, manage it, and act on it.
Visualizing Hawaii: A GC’s Perspective Pt 2
Continued from yesterday…
Scenario #2 (using the same example from yesterday, except your email retention policy is now 2 years and you have an Information Governance program in place that ensures all unstructured data is searchable and actively managed)
It’s 1:52 pm on the Friday before you leave on a much-anticipated 2 week vacation in Hawaii…yada, yada, yada
It’s a letter from the law offices of Lewis, Gonsowski & Tolson informing you that their client, ACME Systems, is suing your company for $225 million for conspiracy to harm ACME’s reputation and future sales by spreading false information about ACME’s newest product line. You’re told that the plaintiff has documentation (an email) from an ABC Systems employee outlining the conspiracy. You also receive a copy of the “smoking gun” email…
——-
From: Ted
Date: June 2, 2012
To: Rick
Re: Acme Systems new solutions
“I would say we need to spread as much misinformation and lies about their solution’s capabilities as possible. We need to throw up as much FUD as we can when we talk to the analyst community to give us time to get our new application to market. Maybe we can make up a lie about them stealing their IP from a Chinese company.”
——-
Should I cancel the vacation? …Not yet
You call the VP of IT and ask her if she has the capability to pull an email from 13 months ago. She tells you she does have all of the emails going back two years but there are literally millions of them and it will take weeks to go through them.
You remember getting a demo from Recommind two weeks ago showing their On Demand Review and Analysis platform with a really neat capability to visualize data relationships. So you call up Recommind and set up a quick job.
IT starts the upload of the email data set to the Recommind Cloud platform.
You call your wife and ask her to delay the vacation until Monday…she’s not happy but it could have been worse.
The next morning (Saturday) you meet your team at the office, sign into the hosted eDiscovery platform, pull up the visualization module and run a search against the uploaded email data set for any mention of ACME Systems. Out of the 2 million emails you get hits on 889.
You then ask the system to graphically show the messages by sender and recipient. You quickly find Ted and Rick and their email and even one from Rick to David… Interesting.
Within the hour you are able to assemble the entire conversation thread:
Email #1
From: CEO
Date: May 29, 2012
To: Sandra; Steve
Subject: Acme Systems new solutions
Please give some thought to what we should do to keep momentum going with our sales force in response to ACME Systems’ latest release of their new router. I can see our sales force getting discouraged with this new announcement.
Please get back to me with some ideas early next week.
Thanks Greg
Email #2
From: Steve
Date: May 29, 2012
To: Greg; Sandra
Re: Acme Systems new solutions
Greg, I will get with Sandra and others and brainstorm this topic no later than tomorrow and get back to you. Sandra, what times are good for you to get together?
Thanks Steve
Email #3
From: Sandra
Date: May 30, 2012
To: Ted
Re: Acme Systems new solutions
Ted, considering ACME’s new router announcement, how do you think we should counter their PR?
Thanks Sandra
Email #4
From: Ted
Date: June 1, 2012
To: Sandra; Bob
Re: Acme Systems new solutions
If it wasn’t illegal, I would suggest we need to spread as much misinformation about their new router as possible to the analyst community to create as much FUD as we can to give us time to get our new solution out. Maybe we can make up a lie about them stealing their IP from a Chinese company.
But obviously that’s illegal (right?). Anyway…I suggest we highlight our current differentiators and produce a roadmap showing how and when we will catch and surpass them.
Regards Ted
Email #5
From: Rick
Date: June 1, 2012
To: Ted
Re: Acme Systems new solutions
Ted, I heard you had a funny suggestion for what we should do about ACME’s new router… What did you say?
Thanks Rick
Email #6 (The incriminating email)
From: Ted
Date: June 2, 2012
To: Rick
Re: ACME Systems new solutions
“I would say we need to spread as much misinformation and lies about their solution’s capabilities as possible. We need to throw up as much FUD as we can when we talk to the analyst community to give us time to get our new application to market. Maybe we can make up a lie about them stealing their IP from a Chinese company.”
It looks like I will make the flight Monday morning after all…
The moral of the story
Circumstances often dictate the need for additional technical capabilities and experience levels to be acquired – quickly. The combination of rising levels of litigation, skyrocketing volumes of information being stored, tight budgets, short deadlines, resource constraints, and extraordinary legal considerations can put many organizations involved in litigation at a major disadvantage.
The relentless growth of data, especially unstructured data, is swamping many organizations. Employees create and receive large amounts of data daily, the majority of it email, and most of it is simply kept because employees don’t have the time to decide whether each work document or email rises to the level of a record or important business document that may be needed later. The ability to visualize large data sets provides users the opportunity to get to the heart of the matter quickly instead of looking at thousands of lines of text in a table.
Visualizing Hawaii: A GC’s Perspective or the Case of the Silent Wife
ABC Systems is a mid-size technology company based in the U.S. that designs and manufactures wireless routers…
It’s 1:52 pm on the Friday before you leave on a much-anticipated 2 week vacation in Hawaii. You’re having difficulty not thinking about what the next two weeks hold. You talk yourself into powering through the 176 emails you received since yesterday when you notice your administrative assistant has put an actual letter on your desk while you were daydreaming…
It’s a letter from the law offices of Lewis, Lewis & Tolson informing you that their client, ACME Systems, is suing your company for $225 million for conspiracy to harm ACME’s reputation and future sales by spreading false information about their newest product line. You’re told that the plaintiff has documentation (an email) from an ABC Systems employee outlining the conspiracy. You also receive a copy of the “smoking gun” email…
————
From: Ted
Date: June 2, 2012
To: Rick
Re: ACME Systems new solutions
“I would say we need to spread as much misinformation and lies about their solution’s capabilities as possible. We need to throw up as much FUD as we can when we talk to the analyst community to give us time to get our new application to market. Maybe we can make up a lie about them stealing their IP from a Chinese company.”
————
You’ve got to be kidding me! Once this news gets out the stock will be hit, the board will want an explanation and estimate of potential damage to the company reputation, our channel partners will want to have a legal opinion on the sales in the pipeline, the direct sales force will want a document to give to their potential customers, and the CEO will want estimates of merit etc. as soon as possible…There goes the vacation…and probably my marriage.
Scenario #1
Now what do I do?
- Find out who this “Ted” guy is! (Don’t forget “Rick”)
- Find out who Ted and Rick report to and what department they work in
- Call the VP of IT and give her a heads up on what you are going to be asking for
- Call your outside counsel and alert them as well
- Send an email to the VP of IT (and CC outside counsel) asking her to immediately secure Ted and Rick’s email accounts and any email backup tapes
- Send an email to Ted and Rick (and CC outside counsel) asking them to actively collect and secure under a litigation hold any documents and email that have anything to do with ACME Systems (strange thing is the email system has no one by the name of Ted in it)
- Ask the VP of IT to find the original email from Ted to Rick and any other email messages involved in that conversation thread
- Get on the phone to the CEO and update him
- Call your wife and tell her to cancel the vacation plans
Five minutes after your wife hangs up on you in mid-sentence, the VP of IT calls and informs you that the company has a 90-day email retention policy and recycles backup tapes every 6 months…the original emails don’t exist anymore. And by the way, after speaking to the VP of HR she discovered Ted had left the company 8 months ago. The only hope is that Rick kept local copies of his emails. By this time it’s 5:37 pm and Rick has gone home – with his laptop.
Monday morning Rick is surprised to find several people from legal and IT waiting at his desk when he arrives. It turns out Rick actually archives his email into a PST file instead of letting the system delete it after 90 days. Rick locates his 4.5 GB PST file on his share drive but for some reason it won’t open. Several members of the IT department spend two hours trying to get it open but determine it’s probably corrupted because it’s too big (PSTs have this nasty habit of letting the user keep stuffing files into them even though they’re already too big).
IT sends the PST off to a consultant to see if they can open it. After three weeks and $17,553 you are told it’s completely corrupted and can’t be opened!
During those three weeks you spend $4,300 tracking down Ted who doesn’t remember why he would have written an email like that. He does vaguely remember Jennifer may have been part of that conversation thread. 4.5 hours later combing through Jennifer’s PST, (why does everyone have a PST if we made a point to delete emails after 90 days?) you actually find a forwarded version of the email from Ted…It really does exist!
You determine it will be impossible to assemble the entire conversation thread, so after several months of negotiating with ACME Systems’ attorneys, you settle for $35 million and an apology printed on the front page of the Wall Street Journal…and your wife has stopped talking to you.
Tune in tomorrow to catch up on the further adventures of Ted, Rick, Jennifer, ABC Systems, and the strangely silent wife…
Ask the Magic 8-Ball; “Is Predictive Defensible Disposal Possible?”
The Good Ole Days of Paper Shredding
In my early career, shred days – the scheduled annual activity where the company ordered all employees to wander through all their paper records to determine what should be disposed of – were commonplace. At the government contractor I worked for, we actually wheeled our boxes out to the parking lot to a very large truck that had huge industrial shredders in the back. Once the boxes of documents were shredded, we were told to walk them over to a second truck, a burn truck, where we, as the records custodians, would actually verify that all of our records were destroyed. These shred days were a way to actually collect, verify and yes, physically shred all the paper records that had gone beyond their retention period over the preceding year.
The Magic 8-Ball says Shred Days aren’t Defensible
Nowadays, this type of activity carries some negative connotations with it and is much more risky. Take for example the recent case of Rambus vs SK Hynix. In this case U.S. District Judge Ronald Whyte in San Jose reversed his own prior ruling from a 2009 case where he had originally issued a judgment against SK Hynix, awarding Rambus Inc. $397 million in a patent infringement case. In his reversal this year, Judge Whyte ruled that Rambus Inc. had spoliated documents in bad faith when it hosted company-wide “shred days” in 1998, 1999, and 2000. Judge Whyte found that Rambus could have reasonably foreseen litigation against Hynix as early as 1998, and that therefore Rambus engaged in willful spoliation during the three “shred days” (a finding of spoliation can be based on inadvertent destruction of evidence as well). Because of this recent spoliation ruling, the Judge reduced the prior Rambus award from $397 million to $215 million, a cost to Rambus of $182 million.
Another well-known example of sudden retention/disposition policy activity that caused unintended consequences is the Arthur Andersen/Enron example. During the Enron case, Enron’s accounting firm sent out the following email to some of its employees:
This email was a key reason why Arthur Andersen ceased to exist shortly after the case concluded. Arthur Andersen was charged with and found guilty of obstruction of justice for shredding the thousands of documents and deleting emails and company files that tied the firm to its audit of Enron. Less than 1 year after that email was sent, Arthur Andersen surrendered its CPA license on August 31, 2002, and 85,000 employees lost their jobs.
Learning from the Past – Defensible Disposal
These cases highlight the need for a true information governance process including a truly defensible disposal capability. In these instances, an information governance process would have been capturing, indexing, applying retention policies, protecting content on litigation hold and disposing of content beyond the retention schedule and not on legal hold… automatically, based on documented and approved legally defensible policies. A documented and approved process which is consistently followed and has proper safeguards goes a long way with the courts to show good faith intent to manage content and protect that content subject to anticipated litigation.
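The core rule described above (dispose of content only when it is past its retention period and not on legal hold) can be sketched in a few lines of Python. Everything named here is hypothetical: the `Item` fields, the category names and the retention periods are illustrative assumptions, and a real schedule would come from counsel and records management rather than from code.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Item:
    name: str
    category: str            # hypothetical retention category assigned at capture
    created: date
    on_legal_hold: bool = False


# Hypothetical retention schedule, in days. Illustrative values only.
RETENTION_DAYS = {"email": 730, "contract": 3650}


def disposition(item: Item, today: date) -> str:
    """Return 'hold', 'retain', 'dispose' or 'review' for a single item."""
    if item.on_legal_hold:
        return "hold"        # a litigation hold always overrides the schedule
    period = RETENTION_DAYS.get(item.category)
    if period is None:
        return "review"      # unknown category: never dispose automatically
    expires = item.created + timedelta(days=period)
    return "dispose" if today > expires else "retain"


today = date(2013, 6, 1)
old_email = Item("old email", "email", date(2010, 1, 1))
held_email = Item("held email", "email", date(2010, 1, 1), on_legal_hold=True)
new_email = Item("new email", "email", date(2013, 1, 1))
memo = Item("uncategorized memo", "memo", date(2009, 1, 1))

for item in (old_email, held_email, new_email, memo):
    print(item.name, "->", disposition(item, today))
```

Checking the hold before the schedule, and routing unknown categories to a person rather than to deletion, are the two safeguards in this sketch that matter most for defensibility.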
To successfully automate the disposal of unneeded information in a consistently defensible manner, auto-categorization applications must have the ability to conceptually understand the meaning in unstructured content so that only content meeting your retention policies, regardless of language, is classified as subject to retention.
Taking Defensible Disposal to the Next Level – Predictive Disposition
A defensible disposal solution that can conceptually understand content meaning, and that incorporates an iterative training process, including “train by example,” in a human-supervised workflow, provides accurate predictive retention and disposition automation.
Moving away from manual, employee-based information governance to automated information retention and disposition with truly accurate (95 to 99%) and consistent meaning-based predictive information governance will provide the defensibility that organizations require today to keep their information repositories up to date.
Predicting the Future of Information Governance
Information Anarchy
Information growth is out of control. The compound annual growth rate for digital information is estimated to be 61.7%. According to a 2011 IDC study, 90% of all data created in the next decade will be of the unstructured variety. These facts are making it almost impossible for organizations to actually capture, manage, store, share and dispose of this data in any meaningful way that will benefit the organization.
Successful organizations run on and are dependent on information. But information is valuable to an organization only if you know where it is, what’s in it, and what is shareable, or in other words… managed. In the past, organizations have relied on end-users to decide what should be kept, where and for how long. In fact 75% of data today is generated and controlled by individuals. In most cases this practice is ineffective and causes what many refer to as “covert or underground archiving”, the act of individuals keeping everything in their own unmanaged local archives. These underground archives effectively lock most of the organization’s information away, hidden from everyone else in the organization.
This growing mass of information has brought us to an inflection point; get control of your information to enable innovation, profit and growth, or continue down your current path of information anarchy and choke on your competitor’s dust.
Choosing the Right Path
How does an organization ensure this inflection point is navigated correctly? Information Governance. You must get control of all your information by employing the proven processes and technologies to allow you to create, store, find, share and dispose of information in an automated and intelligent manner.
An effective information governance process optimizes overall information value by ensuring the right information is retained and quickly available for business, regulatory, and legal requirements. This process reduces regulatory and legal risk, ensures needed data can be found quickly and is secured for litigation, reduces overall eDiscovery costs, and provides structure to unstructured information so that employees can be more productive.
Predicting the Future of Information Governance
Predictive Governance is the bridge across the inflection point. It combines machine-learning technology with human expertise and direction to automate your information governance tasks. Using this proven human-machine iterative training capability, Predictive Governance is able to accurately automate the concept-based categorization, data enrichment and management of all your enterprise data to reduce costs, reduce risks, enable information sharing and mitigate the strain of information overload.
Automating information governance so that all enterprise data is captured, granularly evaluated for legal requirements, regulatory compliance, or business value and stored or disposed of in a defensible manner is the only way for organizations to move to the next level of information governance.
Finding the Cure for the Healthcare Unstructured Data Problem
Healthcare information and records continue to grow with the introduction of new devices and expanding regulatory requirements such as The Affordable Care Act, The Health Insurance Portability and Accountability Act (HIPAA), and the Health Information Technology for Economic and Clinical Health Act (HITECH). In the past, healthcare records were made up of mostly paper forms or structured billing data, which were relatively easy to categorize, store, and manage. That has been changing as new technologies enable faster and more convenient ways to share and consume medical data.
According to an April 9, 2013 article on ZDNet.com, by 2015, 80% of new healthcare information will be composed of unstructured information; information that’s much harder to classify and manage because it doesn’t conform to the “rows & columns” format used in the past. Examples of unstructured information include clinical notes, emails & attachments, scanned lab reports, office work documents, radiology images, SMS, and instant messages.
Who or what is going to actually manage this growing mountain of unstructured information?
To ensure regulatory compliance and the confidentiality and security of this unstructured information, the healthcare industry will have to 1) hire a lot more professionals to manually categorize and manage it or 2) acquire technology to do it automatically.
Looking at the first solution; the cost to have people manually categorize and manage unstructured information would be prohibitively expensive not to mention slow. It also exposes private patient data to even more individuals. That leaves the second solution; information governance technology. Because of the nature of unstructured information, a technology solution would have to:
- Recognize and work with hundreds of data formats
- Communicate with the most popular healthcare applications and data repositories
- Draw conceptual understanding from “free-form” content so that categorization can be accomplished at an extremely high accuracy rate
- Enable proper access security levels based on content
- Accurately retain information based on regulatory requirements
- Securely and permanently dispose of information when required
An exciting emerging information governance technology that can actually address the above requirements uses the same next generation technology the legal industry has adopted…proactive information governance technology based on conceptual understanding of content, machine learning and iterative “train by example” capabilities.
The lifecycle of information
Organizations habitually over-retain information, especially unstructured electronic information, for all kinds of reasons. Many organizations simply have not addressed what to do with it, so they fall back on relying on individual employees to decide what should be kept, for how long, and what should be disposed of. On the opposite end of the spectrum, a minority of organizations have tried centralized enterprise content management systems and have found them difficult to use, so employees find ways around them and end up keeping huge amounts of data locally on their workstations, on removable media, in cloud accounts or on rogue SharePoint sites that are used as “data dumps” with little or no records management or IT supervision. Much of this information is transitory, expired, or of questionable business value. Because of this lack of management, information continues to accumulate. This information build-up raises the cost of storage as well as the risk associated with eDiscovery.
In reality, as information ages, its probability of re-use, and therefore its value, shrinks quickly. Fred Moore, Founder of Horison Information Strategies, wrote about this concept years ago.
Figure 1 below shows that as data ages, the probability of reuse goes down…very quickly, even as the amount of saved data rises. Once data has aged 10 to 15 days, its probability of ever being looked at again approaches 1% and as it continues to age approaches but never quite reaches zero (figure 1 – red shading).
Contrast that with the possibility that a large part of any organizational data store has little or no business, legal or regulatory value. In fact the Compliance, Governance and Oversight Council (CGOC) conducted a survey in 2012 that showed that on average, 1% of organizational data is subject to litigation hold, 5% is subject to regulatory retention and 25% has some business value (figure 1 – green shading). This means that approximately 69% of an organization’s data store has no business value and could be disposed of without legal, regulatory or business consequences.
The average employee creates, sends, receives and stores conservatively 20 MB of data per business day. This means that at the end of 15 days (about 11 business days), they have accumulated 220 MB of new data, at the end of 90 days, 1.26 GB of data and at the end of three years, 15.12 GB of data. So how much of this accumulated data needs to be retained? Again referring to figure 1 below, the blue shaded area represents the information that probably has no legal, regulatory or business value according to the 2012 CGOC survey. At the end of three years, the amount of retained data from a single employee that could be disposed of without adverse effects to the organization is 10.43 GB. Now multiply that by the total number of employees and you are looking at some very large data stores.
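The totals quoted above work out if the 20 MB accrues per business day over roughly 11, 63 and 756 business days. Those day counts are an inference from the post's own figures, not something it states, so the Python sketch below labels them as assumptions; the 69% share is the CGOC number cited earlier.

```python
MB_PER_BUSINESS_DAY = 20      # from the post
NO_VALUE_SHARE_PCT = 69       # CGOC 2012 survey figure cited in the post

# Assumed business-day counts, inferred from the post's totals.
BUSINESS_DAYS = {"15 days": 11, "90 days": 63, "3 years": 756}


def accumulated_mb(business_days, mb_per_day=MB_PER_BUSINESS_DAY):
    """Data accumulated by one employee, in MB."""
    return business_days * mb_per_day


def disposable_gb(total_mb, no_value_pct=NO_VALUE_SHARE_PCT):
    """Portion with no legal, regulatory or business value, in GB (1 GB = 1,000 MB)."""
    return total_mb * no_value_pct / 100 / 1000


totals = {label: accumulated_mb(days) for label, days in BUSINESS_DAYS.items()}
three_year_disposable = disposable_gb(totals["3 years"])

for label, mb in totals.items():
    print(f"{label}: {mb:,} MB")
print(f"Disposable after 3 years: {three_year_disposable:.2f} GB per employee")
```

Multiplying `three_year_disposable` by headcount gives the organization-wide figure the paragraph alludes to.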
Figure 1: The Lifecycle of data
The above lifecycle of data shows us that employees really don’t need all of the data they squirrel away (because its probability of re-use drops to 1% at around 15 days) and, based on the CGOC survey, approximately 69% of organizational data is not required for legal or regulatory retention and has no business value. The difficult piece of this whole process is how an organization can efficiently determine what data is not needed and dispose of it automatically…
As unstructured data volumes continue to grow, automatic categorization of data is quickly becoming the only way to get ahead of the data flood. Without accurate automated categorization, the ability to find the data you need, quickly, will never be realized. Even better, if data categorization can be based on the meaning of the content, not just a simple rule or keyword match, highly accurate categorization and therefore information governance is achievable.
Next Generation Technologies Reduce FOIA Bottlenecks
Federal agencies are under more scrutiny to resolve issues with responding to Freedom of Information Act (FOIA) requests.
The Freedom of Information Act provides for the full disclosure of agency records and information to the public unless that information is exempted under clearly delineated statutory language. In conjunction with FOIA, the Privacy Act serves to safeguard public interest in informational privacy by delineating the duties and responsibilities of federal agencies that collect, store, and disseminate personal information about individuals. The procedures established ensure that the Department of Homeland Security fully satisfies its responsibility to the public to disclose departmental information while simultaneously safeguarding individual privacy.
In February of this year, the House Oversight and Government Reform Committee opened a congressional review of executive branch compliance with the Freedom of Information Act.
The committee sent a six-page letter to the Director of Information Policy at the Department of Justice (DOJ), Melanie Ann Pustay. In the letter, the committee questions why, based on a December 2012 survey, 62 of 99 government agencies have not updated their FOIA regulations and processes, as required by Attorney General Eric Holder in a 2009 memorandum. In fact the Attorney General’s own agency has not updated its regulations and processes since 2003.
The committee also pointed out that there were 83,000 FOIA requests still outstanding as of the writing of the letter.
In fairness to the federal agencies, responding to a FOIA request can be time-consuming and expensive if technology and processes are not keeping up with increasing demands. Electronic content can be anywhere including email systems, SharePoint servers, file systems, and individual workstations. Because content is spread around and not usually centrally indexed, enterprise wide searches for content do not turn up all potentially responsive content. This means a much more manual, time consuming process to find relevant content is used.
There must be a better way…
New technology can address the collection problem of searching for relevant content across the many storage locations where electronically stored information (ESI) can reside. For example, an enterprise-wide search capability with “connectors” into every data repository, email, SharePoint, file systems, ECM systems, records management systems allows all content to be centrally indexed so that an enterprise wide keyword search will find all instances of content with those keywords present. A more powerful capability to look for is the ability to search on concepts, a far more accurate way to search for specific content. Searching for conceptually comparable content can speed up the collection process and drastically reduce the number of false positives in the results set while finding many more of the keyword deficient but conceptually responsive records. In conjunction with concept search, automated classification/categorization of data can reduce search time and raise accuracy.
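As a rough illustration of the central-index idea, here is a minimal Python sketch that builds one inverted index over several repositories and runs a keyword search against it. The repository names, document IDs and contents are made up, the "connectors" are reduced to plain dictionaries, and the sketch covers keyword matching only; concept search needs a semantic model that is beyond a few lines of code.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?()\"'"


def build_index(repositories):
    """Map each lowercased word to the set of (repository, doc_id) pairs containing it."""
    index = defaultdict(set)
    for repo, docs in repositories.items():
        for doc_id, text in docs.items():
            for word in text.lower().split():
                index[word.strip(PUNCTUATION)].add((repo, doc_id))
    return index


def search(index, *keywords):
    """Return documents that contain every keyword, regardless of repository."""
    hits = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*hits) if hits else set()


# Hypothetical content standing in for what real connectors would pull.
repositories = {
    "email": {"msg-1": "FOIA request about router contracts."},
    "sharepoint": {"doc-7": "Router contracts, 2012"},
    "fileshare": {"f-3": "Vacation schedule"},
}

index = build_index(repositories)
results = search(index, "router", "contracts")
print(sorted(results))
```

Because every repository feeds the same index, one query returns the email and the SharePoint document together, which is the point of centralizing the index.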
The largest cost in responding to a FOIA request is in the review of all potentially relevant ESI found during collection. Predictive Coding, a technology currently used by attorneys for eDiscovery, can drastically reduce the problem of having to review thousands, hundreds of thousands or millions of documents for relevancy and privacy.
Predictive Coding is the process of applying machine learning and iterative supervised learning technology to automate document coding and prioritize review. This functionality dramatically expedites the actual review process while dramatically improving accuracy and reducing the risk of missing key documents. According to a RAND Institute for Civil Justice report published in 2012, document review cost savings of 80% can be expected using Predictive Coding technology.
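To make the "iterative supervised learning" idea concrete, here is a deliberately tiny Python sketch of a review loop: the machine learns word weights from reviewer-coded examples, surfaces the document it scores highest, asks the reviewer to code it, and retrains. This is a toy relevance-feedback loop under simplifying assumptions, not the algorithm any Predictive Coding product uses; the documents are invented and the `reviewer` function stands in for a human attorney.

```python
from collections import Counter


def tokens(text):
    return text.lower().split()


def train(labeled):
    """Learn word weights: +1 per occurrence in relevant documents, -1 in others."""
    weights = Counter()
    for text, relevant in labeled:
        for word in tokens(text):
            weights[word] += 1 if relevant else -1
    return weights


def score(weights, text):
    """Sum the learned weights of a document's words (unseen words count as 0)."""
    return sum(weights[word] for word in tokens(text))


def review_loop(seed, unreviewed, reviewer, rounds):
    """Each round: retrain, surface the top-scoring document, record the reviewer's call."""
    labeled = list(seed)
    queue = list(unreviewed)
    order = []
    for _ in range(min(rounds, len(queue))):
        weights = train(labeled)
        best = max(queue, key=lambda doc: score(weights, doc))
        queue.remove(best)
        labeled.append((best, reviewer(best)))
        order.append(best)
    return order, labeled


# Hypothetical seed set coded by a reviewer, plus an uncoded pile.
seed = [("router pricing memo", True), ("lunch menu friday", False)]
unreviewed = ["holiday party menu", "router pricing draft", "router firmware notes"]


def reviewer(doc):
    # Stand-in for the human decision; here "relevant" means it mentions routers.
    return "router" in doc


order, labeled = review_loop(seed, unreviewed, reviewer, rounds=3)
print(order)
```

The likely-relevant documents rise to the front of the queue and the party menu sinks to the back, which is the prioritization effect the paragraph describes.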
With the increasing number of FOIA requests swamping agencies, agencies are hard pressed to catch up to their backlogs. The next generation technologies mentioned above can help agencies reduce their FOIA related costs while decreasing their response time.