Archive for January, 2009|Monthly archive page

Searching for Concept Searching at LegalTech New York

Legal vendors far and wide have jumped on the concept search bandwagon after the ruling in Victor Stanley. But if you ask the typical attorney to define concept searching, all she can really tell you is that it isn’t keyword searching.

Part of the problem is that vendors often rely on “black box” technology where users enter a query and get a result, with no understanding of how those results were obtained. The other part is that understanding how those results were obtained often seems to require a degree in paranormal psychology. Which is exactly why I think this will be one of the top three topics at Legal Tech New York next week. (The other two? Early case analysis and figuring out which vendors will still be in business at Legal Tech 2010.)

So can we help define the parameters of this discussion at all? Well, first it is important to note what concept searching is NOT. It is not a content search, which recognizes patterns of text in documents and then displays documents that have similar content. Content search engines match documents with search criteria at the page level based on their content by finding text similar to that in a user-defined database and then displaying the results based on a ranking of similarity. That database can be an ontology of concepts, categories or relations or, in the case of many applications, it can be based on the publicly available WordNet lexical database, an open source thesaurus from Princeton University with over 100,000 English words and associations.

This type of software does not perform any type of interpretive analysis of the document and has historically been used within the review phase of the e-discovery process. True concept search technologies, however, are based on semantic evaluation of the data sets and have typically been used for enterprise-wide searching and, as a result, are better known to legal KM wonks than e-discovery experts. Herb Roitblat, one of the true experts in this field, has an excellent yet simple example of using word patterns to find meanings: the words court and lawyer appearing together mean something legal, while court and basketball together mean something related to sports.
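To make Herb’s court example concrete, here is a toy sketch in Python. It is my own illustration, not how any vendor’s engine works: the two word lists are hand-picked assumptions, whereas a real concept search tool would learn these associations from the documents themselves.

```python
# Toy illustration only: the context word lists below are hypothetical and
# hand-built; real concept search engines derive associations from the data.
SENSE_CONTEXTS = {
    "legal": {"lawyer", "judge", "motion", "ruling"},
    "sports": {"basketball", "referee", "dribble", "score"},
}


def guess_sense(text, contexts=SENSE_CONTEXTS):
    """Guess which sense of an ambiguous word (like "court") a text uses,
    based on which list of context words it overlaps with most."""
    words = set(text.lower().split())
    scores = {sense: len(words & vocab) for sense, vocab in contexts.items()}
    best = max(scores, key=scores.get)
    # If no context word appears at all, decline to guess.
    return best if scores[best] > 0 else None


print(guess_sense("the lawyer walked into court"))
print(guess_sense("the basketball bounced across the court"))
print(guess_sense("the court was empty"))
```

The same word, court, lands in a different bucket depending on its neighbors, which is the whole point of the example. A keyword search for court alone would return all three sentences.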

Another expert in this field is Gene Eames of (at least so far) SPi. Browning Marean and I interviewed Gene on this topic several weeks ago for our E-Discovery Zone broadcast, and I thought it interesting that Gene felt we should look more at the process and the results rather than trying to parse definitions of the technology. He called it focusing on “… a formalized search process – which we sometimes refer to as an ‘iterative criteria refinement process.’ I think this is what Judge Grimm was going for when he said it was common sense that you would sample results of what hit or did not hit on search criteria. This is in opposition to the more casual, ad hoc approach often taken in e-discovery whereby search terms are dreamed up by counsel, handed over to a vendor, blindly run against a collection, exported to a review platform and off we all go.”

Not to mention far more defensible, which is really the point raised by Judge Grimm.
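For the technically curious, the loop Gene describes might look something like the Python sketch below. Everything in it is an assumption of mine for illustration: the four sample documents, the search terms, and the function names. A real review would sample far larger sets and put the samples in front of a human reviewer.

```python
import random


def split_hits(documents, terms):
    """Split document ids into those that hit on any search term and those
    that do not (simple case-insensitive substring matching)."""
    terms = [t.lower() for t in terms]
    hits, misses = [], []
    for doc_id, text in documents.items():
        lowered = text.lower()
        if any(t in lowered for t in terms):
            hits.append(doc_id)
        else:
            misses.append(doc_id)
    return hits, misses


def sample_for_review(doc_ids, k, seed=0):
    """Draw a random sample of up to k document ids for a human to check."""
    rng = random.Random(seed)
    return rng.sample(doc_ids, min(k, len(doc_ids)))


# Hypothetical collection, invented for this example.
documents = {
    "doc1": "Restatement of quarterly earnings",
    "doc2": "Lunch on Friday?",
    "doc3": "Accounting adjustments to earnings per share",
    "doc4": "We need to fix the numbers before the audit",
}

# Round 1: counsel's first guess at search terms.
hits, misses = split_hits(documents, ["earnings"])
print("hits:", hits)
print("sample of non-hits to review:", sample_for_review(misses, 2))

# A reviewer looking at the non-hit sample spots doc4 as relevant,
# so the criteria are refined and the search is run again.
refined_hits, refined_misses = split_hits(documents, ["earnings", "audit"])
print("hits after refinement:", refined_hits)
```

The step that matters is sampling what did not hit. That is where the first set of terms gets exposed as too narrow, and it is the step the ad hoc approach skips.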

I’ll be in New York next week and when I’m not speaking or walking around the exhibit hall talking technology, you can find me at the Anacomp CaseLogistix booth. Stop by. I’d love to hear about your experiences with concept searching.

The Paper Prison

This is the term Ralph Losey used in his comment to last week’s post to describe the lack of basic technical knowledge by attorneys. It can be seen in cases from the Fannie Mae decision with its massive financial consequences (In re Fannie Mae Securities Litigation, _ F.3d _, 2009 WL 21528, 2009 U.S. App. LEXIS 9 (D.C. Cir. Jan. 6, 2009)) to the more recent Covad case (Covad Communications Co. v. Revonet, Inc., 2008 WL 5377698 (D.D.C. Dec. 24, 2008)), where Judge Facciola decries the use in a discovery request of “… ancient boilerplate – designed for discovery in a paper universe”.

Despite the best efforts of people like Ralph, Craig Ball, Browning Marean and George Socha to educate attorneys, there is little likelihood of things changing until, as Ralph said, “… law schools wake up.” One promising sign is the Georgetown Law Center Advanced E-Discovery Training Academy to be held at the Georgetown University Law Center, Gewirz Student Center in Washington D.C. this upcoming Feb 9-13.

This is a great first step to formalized, vendor neutral training by attorneys for attorneys, as we used to say in the ABA’s Law Practice Management Section. But I would still suggest that Ralph is correct and we need more basic legal technology training at the law school level. One good start is the law practice management course at the University of Florida Levin College of Law taught by our old friend Andy Adkins, the Director of the Legal Technology Institute at the same school. Another is a series of podcasts on basic legal technology that Browning and I have just started with the West LegalEd Center.

Most of these people will be at the Legal Tech Conference in New York in two weeks and I’d encourage you to discuss this issue with them if you happen to see them at the show. You’ll find all of these people very approachable and more than willing to discuss this and any other topic on your mind.

I myself will be speaking on several panels at the show and will also be spending some time at the Anacomp booth and blogging from the show. So please come by and say hi.  I’d love to hear your thoughts and get them posted. We need to further this discussion in order to change this paradigm but it will take the effort of the entire legal community.

As Ralph said: “The paper prison is remarkably strong!”

Same As It Ever Was

The ED community has been buzzing since the holidays ended about the recent decision in In re Fannie Mae Securities Litigation, _ F.3d _, 2009 WL 21528, 2009 U.S. App. LEXIS 9 (D.C. Cir. Jan. 6, 2009), where the Office of Federal Housing Enterprise Oversight (“OFHEO”) was required to spend six million dollars, representing nine percent of its total annual budget, just to comply with a subpoena for electronic documents when OFHEO was not actually a party to the underlying action.

How does that happen, you ask? Well, it seems OFHEO had relevant information in an MDL action against Fannie Mae and Freddie Mac, and at some point during a hearing on a Rule 45 subpoena served by the defendants, counsel for OFHEO appeared and agreed to restore a series of backup tapes, search them using terms provided by the plaintiffs and produce any non-privileged email and attachments. The Plaintiffs then came up with 400 key words for searching and the searches yielded over 660,000 documents. And then things got really ugly as OFHEO struggled to engage enough contract attorneys to review the documents, repeatedly asked the trial court for extensions and then failed to meet any of its own extended deadlines.

Now you may ask yourself, “why didn’t they object when the Plaintiffs asked for backup tapes?” or “why didn’t they object when they were presented with 400 search words?” or “why didn’t they object when they retrieved 660,000 documents?” Good questions one and all, but some better questions, and ones the OFHEO attorneys are probably asking themselves now, come from David Byrne. Questions like “how did I get here?” or “how do I work this?” or “my god, what have I done?”

This episode should never have gotten to the point of 400 search words or 660,000 documents. Let’s assume for a moment that the OFHEO attorney had been asked to agree to have his client search and redact five years’ worth of archived paper documents from the agency. Would he have agreed? Of course not. His immediate response would have been “Your Honor, that request could involve millions of pages of documents and may require thousands of man hours to accomplish. I can’t possibly agree to that without investigating how much expense is involved.”

Yet when digital records are involved he is suddenly assumed to have no understanding of the most basic concepts? Why is it that just because computers and digital records are involved an attorney is allowed to say “oh sure, email…we can do that no problem.” and his astounding lack of basic technical education is ignored? There is no unusual amount of intelligence involved in knowing how many GB a PC may hold and how many pieces of paper that can become if printed out. Any attorney with the most rudimentary knowledge about his client’s PCs should be able to say to himself: “the average PC holds roughly 150 GB of data”, “A GB can be roughly 50,000 pages”, “my client has 5,000 employees with computers so that means …ok I’m not that good at math but I think I better stop right here before I agree to anything.”
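For the record, the math that attorney bails out of is not hard. Here it is as a back-of-the-envelope Python sketch, using only the rough figures quoted above. The names are mine, and the result is a ceiling: it assumes every PC is full of printable data, which no real collection would be.

```python
# Rough figures from the paragraph above; all are estimates, not measurements.
GB_PER_PC = 150          # "the average PC holds roughly 150 GB of data"
PAGES_PER_GB = 50_000    # "a GB can be roughly 50,000 pages"
EMPLOYEES = 5_000        # "my client has 5,000 employees with computers"


def estimated_pages(computers, gb_per_pc=GB_PER_PC, pages_per_gb=PAGES_PER_GB):
    """Upper-bound page count if every computer were full of printable data."""
    return computers * gb_per_pc * pages_per_gb


pages_per_pc = estimated_pages(1)
total_pages = estimated_pages(EMPLOYEES)

print(f"{pages_per_pc:,} pages per PC")                        # 7,500,000 pages per PC
print(f"{total_pages:,} pages across {EMPLOYEES:,} PCs")       # 37,500,000,000 pages across 5,000 PCs
```

Even if the real number is a tiny fraction of that ceiling, it is more than enough to make any lawyer stop and ask about cost before agreeing to anything.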

Ralph Losey states on his blog: “I am confident that if the government lawyers for OFHEO had had an experienced e-discovery lawyer with them at the first hearing, they would not have stipulated to the order they did, and all of the disasters that followed could have been avoided. But they did not, and as a result, they were bushwhacked.” Craig Ball, in commenting on the opinion, mentions the “… abysmal lack of expertise respecting keyword search.”

With all due respect to both Ralph and Craig, whose experience and opinions I hold in the highest possible regard, I cannot agree. The threshold question here is not knowledge of e-discovery technology; it’s the lack of the most basic technical knowledge by attorneys. Because where they and many others commenting on this opinion see it as a parable for why we should be using concept searching in e-discovery matters, I see it as just another example of attorneys caught in the old paradigm of working with paper documents and being totally unaware of the most basic technical concepts.

No, the real problem is one that Browning Marean and I have been trying to combat for over a year and that Ralph himself so accurately pointed out in a recent column: legal education involves no computer education. Why? Because legal education still has its own old paradigm. The one that says working with a keyboard is not “professional” and is best done by support staff and hourly employees. You know, secretary types. As I was told by the dean of one leading law school when Browning and I tried to have him endorse the legal technology training initiative we have struggled to get underway for over a year now: “We train architects, not carpenters.”

Great attitude, Dean. You might want to tell those architects that they don’t have to use slide rules any longer.
“Watching The Days Go By”.

HASH discussion continued

For those of you who follow the LitSupport listserv, you know that this discussion has been continuing ad infinitum. I was having trouble following the often highly technical discussions but was bothered by the assertions that MD5 hashing could produce “duplicates”, so I asked the best experts I know in the forensics field, John Simek of Sensei Enterprises and Atty. Craig Ball.

Now I have to admit that even their responses gave me a headache, but here’s a portion of those responses which will shed some light on the issue. John Simek said: “…. is it possible to have two different (and actually usable files) contain different contents with the same MD5 hash value. My response is that anything is possible given enough time and money. But is it probable? I say no.” Craig Ball went so far as to say that the example mentioned in my previous post “… poses a security issue certainly, but it doesn’t meaningfully impact the viability of MD5 hashes in electronic discovery and computer forensics.”

Why? Well as Craig went on to point out:   “So, yes, it’s possible to create two different files with colliding hash values–I demonstrated that over four years ago when I fashioned and published “apparently intelligible” colliding files building on the work of the Chinese cryptographers which Stefan Fleischman identifies in the article to which you point–but to call these “documents” or leap to the conclusion that we can fashion a colliding, intelligible value for a particular hash value is a big stretch.  …   To create an intelligible “document” (in the sense we speak of a writing or image) and then alter that intelligible document to hash match the value for a known NIST NSRL irrelevant file remains well beyond our reach, and I’m persuaded it will, as a practical matter, remain so for some time.  The people posting on the topic aren’t discussing anything that’s new in the last several years and lack a fundamental understanding of how the vulnerability works and how little impact it has on how we use MD5 in CF and EDD, outside of the public key/private key infrastructure.”

Finally, as a further affirmation of those two opinions, Herb Roitblat of OrcaTec posted a comment on the listserv saying … “Nevertheless, these are tricks. If there is anything to be learned from this example, it is that displays of electronically stored information may not faithfully reflect the content of the files that they are displaying. The very same code, which is what is hashed, can be used to display wildly disparate information. Look at the native files.”

Indeed …  look at the native files.  As Craig Ball so succinctly put it:  “The sky is not falling in hashville”.
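For anyone who has never seen what the hashing in this discussion actually looks like, here is a minimal sketch using Python’s standard hashlib module. The file names and contents are invented for the example, and the grouping function is my own illustration of how hash values are typically used to flag duplicates, not any particular vendor’s workflow.

```python
import hashlib
from collections import defaultdict


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the raw bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


def group_by_hash(files):
    """Group file names by the MD5 of their contents."""
    groups = defaultdict(list)
    for name, content in files.items():
        groups[md5_hex(content)].append(name)
    return dict(groups)


# Hypothetical files: two byte-for-byte copies and one with a single
# character changed.
files = {
    "memo_v1.txt": b"Please restore the backup tapes.",
    "memo_copy.txt": b"Please restore the backup tapes.",
    "memo_v2.txt": b"Please restore the backup tapes!",
}

groups = group_by_hash(files)
duplicates = [names for names in groups.values() if len(names) > 1]

for digest, names in groups.items():
    print(digest, names)
print("duplicate sets:", duplicates)
```

Note that the hash is computed on the bytes of the file itself, not on what a viewer happens to display, which is exactly Herb’s point about looking at the native files. Changing a single character gives memo_v2.txt a completely different hash value, while the two identical copies share one.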