Archive for November, 2009

Ron Friedmann In the EDiscovery Zone

Browning Marean and I continue our interviews of e-discovery luminaries in a discussion with Ron Friedmann on everything from changes in the LPO landscape to our current dialogue with Herb Roitblat on the best methods of searching ESI.

This interview, as well as all our others in the series, can be found at the site of our gracious host, TechLaw Solutions.


ED Search Discussion Continues

The dialogue on search technology continues with the following post from yesterday on Ron Friedmann's blog:

Choice of Concept Search Tool in e-Discovery May Matter Less Than You Think [ Litigation Support / e-Discovery ] — Ron @ 1:16 pm

Tom O’Connor and I recently wrote a joint blog post about concept search software for e-discovery. Subsequently, we received comments from Herb Roitblat of OrcaTec, an expert in information management, data mining, statistics, and eDiscovery processes. I share his comments here.

At his docNative Paradigm Blog, Tom posted Herb’s comments on Xerox CategoriX along with Musings on the Best Approach to EDD Search (29 Oct 2009) by Tom and Ron:

I publish here, with permission, additional comments from Herb, who wrote these in response to a message I sent him with my “take aways” from his first comments.

Summary
My summary and interpretation of Herb’s comments below and in the posts at Tom’s blog is that while concept search is a useful tool for e-discovery, the selection of the specific “flavor” of concept search tool matters less than smart application of it. Tool selection needs to be case specific because a “bake-off” among concept search tools only tells you how well a tool does against a specific set of documents. Since it’s not economically feasible to use multiple tools per case, you need to make a reasonable tool selection at the outset of the case. Just as important, you need a reasonable and defensible process (which means documenting tool selection and process). The reasonableness standard depends on the stakes of the case.

Herb and Ron Exchange by E-Mail

Ron: So it sounds like what you are saying is that the difference in e-discovery concept search tools is probably overwhelmed by differences in document sets and in process / control.

Herb: I agree with this, but it has to be said carefully. Clothing does not make the man and high-powered tools do not make the builder, but they do help a good builder do better work. No matter how good your tools are, if they are not used well, you get a questionable result.

Ron: Concept search is not a magic bullet but helps expand the universe of documents to consider because it finds docs with words you would not otherwise think of as search terms.

Herb: It helps you think, but it is not a substitute for thinking. It is, as you say, not a magic bullet, just an amplifier.

Ron: Concept search can also help speed review by clustering similar documents.

Herb: Concept search expands queries to return results that are the best match to the expanded query. Thus, the top results are those that best match the query term and its context. (See the green search on Truevert.com for an example: search for “meat” and you get organic meat, not Omaha Steaks; there the context is given by “green” documents.)
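As a rough illustration of the kind of query expansion Herb describes, here is a minimal sketch; the related-terms table, documents, and scoring are invented for the example and are not drawn from Truevert or any other product.

```python
# Minimal sketch of concept-style query expansion: the query term is expanded
# with related terms, and documents are ranked by how well they match the
# expanded query. The expansion table and documents are made up for illustration.

# Hand-built "concept" table; a real tool would learn these associations.
RELATED_TERMS = {
    "meat": ["organic", "grass-fed", "butcher", "protein"],
}

DOCUMENTS = {
    "doc1": "organic meat from a local butcher",
    "doc2": "steak coupons and discount shipping",
    "doc3": "grass-fed beef is a good source of protein",
}

def expand(query):
    """Return the query term plus its related 'concept' terms."""
    return [query] + RELATED_TERMS.get(query, [])

def score(doc_text, terms):
    """Count how many expanded terms appear in the document."""
    words = set(doc_text.lower().split())
    return sum(1 for t in terms if t in words)

def search(query):
    terms = expand(query)
    ranked = sorted(DOCUMENTS.items(),
                    key=lambda kv: score(kv[1], terms),
                    reverse=True)
    return [(doc_id, score(text, terms)) for doc_id, text in ranked]

if __name__ == "__main__":
    # "meat" ranks doc1 first and still surfaces doc3,
    # even though doc3 never uses the word "meat".
    print(search("meat"))
```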

Ron: I back away from my initial assertion of the need to use multiple tools. I argued that to spur thinking among EDD professionals. Upon further reflection, what I really meant to say is that lawyers should focus more on industrial processes and controls, statistics, and metrics than on software features.

Herb: That’s what I think.

Ron: So that means we have no magic bullets. The legal profession has hard work ahead to industrialize its processes.

Herb: It’s actually not that hard. You just have to be thoughtful about what you are doing. It is not even terribly burdensome if you are realistic about the levels of accuracy that you can really achieve (see below).

Ron: We still don’t seem to have an objective standard by which to judge if a process is ‘good enough’.

Herb: There are lots of ways of deciding whether a process is good enough and lawyers are used to making reasonableness judgments and arguing about them. What are the consequences of different types of errors (e.g., retrieving too many documents, retrieving too few)?

Scientists, by tradition, usually use a standard of .95 confidence. For example, if two treatments are different with 95% confidence, then we accept them as different. That does not tell us how different they are or that the difference is practically important or useful, only that the difference is statistically significant. Scientists often report higher confidence levels than that, but the minimum is usually .95. That tradition has worked well in science where subsequent research can correct the relatively few times when the difference does not really exist, but resulted from sampling (luck of the draw).
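To make the .95 convention concrete, here is a minimal sketch of a standard two-proportion z-test with invented numbers; a p-value below .05 is what “different with 95% confidence” means in this sense, and it says nothing about whether the difference is practically important.

```python
import math

def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

if __name__ == "__main__":
    # Invented example: process A found 420 of 500 sampled responsive docs,
    # process B found 380 of 500. Are they different at the .95 level?
    z, p = two_proportion_z_test(420, 500, 380, 500)
    print(f"z = {z:.2f}, p = {p:.4f}, significant at .95: {p < 0.05}")
```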

As an analogy, if you play slot machines, the things return only about 95 – 98% of the money that gets pumped into them, but that does not mean that some people don’t actually win large amounts. It happens sometimes. The luck of the draw usually returns less than you put in, but sometimes it returns more.

Back to good enough. Engineers typically use confidence levels to tell them how well to build a bridge. They consider the consequences of different kinds of failure (think of the Tacoma Narrows Bridge). NASA uses confidence levels to determine the quality of their systems. Where the consequences are severe, they require higher confidence.

In eDiscovery, we are familiar with proportionality arguments and the like for determining things like cost shifting. The same thing applies here. A bet-the-company litigation may merit a higher level of confidence than a run-of-the-mill litigation. Different types of errors may be weighted differently depending on the consequences of that kind of error.

None of this is hard nor does it require very much mathematical background. I published some tables a while back showing how many documents you should sample if you want to achieve a certain level of confidence and you are willing to accept the possibility of missing a certain proportion of responsive documents.
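Herb’s published tables are not reproduced here, but the generic formula behind tables of this kind (sample size for estimating a proportion at a given confidence level and margin of error) can be sketched as follows; this is the standard textbook calculation, not necessarily the exact one behind Herb’s tables.

```python
import math

# Approximate z-scores for common confidence levels.
Z_SCORES = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}

def sample_size(confidence=0.95, margin_of_error=0.05, p=0.5):
    """Documents to sample to estimate a proportion (e.g., the fraction of
    responsive documents missed) within +/- margin_of_error at the given
    confidence level. p=0.5 is the most conservative assumption."""
    z = Z_SCORES[confidence]
    n = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    return math.ceil(n)

if __name__ == "__main__":
    print(sample_size(0.95, 0.05))   # about 385 documents
    print(sample_size(0.95, 0.02))   # about 2,401 documents
    print(sample_size(0.99, 0.05))   # about 664 documents
```

Note that this simple version ignores the finite-population correction, which would shrink the required sample somewhat for small collections.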

As I think I’ve said, I think that another part of reasonableness is transparency. Be able to describe what you did. A scientific publication is intended to describe enough of the methodology so that another scientist can replicate the observations. I don’t think that you necessarily have to publish to the other side what you did, but you should be able to provide that information if required (think Victor Stanley).

Herb Roitblat On Search: Part 2

Tom, your comments about anyone trying to evaluate ED options are well taken.

How would you evaluate based on the document sets? You know exactly what the documents were that were used last year by TREC legal, but how does that help? I think that the things you need to know include how much variability there is among the documents. How similar are, in this case, the responsive and nonresponsive documents? I’m not sure how you would evaluate that, but that is a variable that has a large impact.

Distinguishing news stories about sports from those about the stock market is easier than distinguishing stories about NYSE from those about NASDAQ.  But how do you measure that in a case like this?

 Many systems use the same underlying tools (WordNet, Lucene, dtSearch), so, as you say, there may not be big differences among them.  But so what?  From an academic point of view that would be disappointing, but from a pragmatic eDiscovery point of view it does not much matter.  They could still differ in text extractors, user interface, and in other tools they provide.  They may differ in the completeness of extracting text from attachments, but once it is extracted, the fact that it came from an attachment does not matter to search.

 Using OCR data (as TREC did), though, does make a difference.  The size of the vocabulary explodes with OCR and this can affect results, especially categorization results.

 I don’t see what the unreliability of human review has to do with the value of concept searching.  It seems like a non-sequitur to me.  The TREC results on Boolean searching are somewhat misleading.  The Boolean searches in earlier years were conducted by people who had had years of experience on these data.  If you know enough, you can find anything with Boolean searches. The 2008 TREC results found that H5 could produce the highest accuracy ever seen in TREC legal with their methodology.  From the 2008 summary report, Figure 1 would seem to show that the Boolean run left behind many documents that were found by other systems (http://trec.nist.gov/pubs/trec17/papers/LEGAL.OVERVIEW08.pdf). 

In Table 1, 5 systems ranked above the reference consensus Boolean run (xrefLo8C).  If your goal is to find more responsive documents, then all of the commercial concept search tools will do better than a Boolean tool, because they expand the query to include more documents (and they do other things).

Both of you seem to think that it would be better to use multiple approaches to find responsive documents. As I said, I think that this is a mistake and impractical, and is probably not the lesson to be learned from these studies. If your goal is to be sure that you find ALL of the responsive documents, then produce all the documents. Give two people the same documents to categorize and there will be some overlap. It is logically true that the set of documents found by either will be larger than the set found by one, but that does not mean that you get better results if you use the union of their decisions; you just get more. If you review documents and I flip a coin, the union of our decisions will be more documents than you identified as responsive, but it won’t be any better a set.
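Herb’s coin-flip point can be checked with a small simulation (all numbers invented): the union of a careful reviewer’s calls and random calls has recall at least as high, but its precision collapses, so the bigger set is not a better set.

```python
import random

random.seed(1)

# Invented collection: 10,000 documents, 1,000 of them truly responsive.
N_DOCS, N_RESPONSIVE = 10_000, 1_000
truth = [i < N_RESPONSIVE for i in range(N_DOCS)]

# A reasonably careful reviewer: finds 80% of responsive docs and
# wrongly flags 2% of nonresponsive docs.
reviewer = [(t and random.random() < 0.80) or (not t and random.random() < 0.02)
            for t in truth]

# A coin flip: tags each document responsive with probability 0.5.
coin = [random.random() < 0.5 for _ in truth]

union = [a or b for a, b in zip(reviewer, coin)]

def precision_recall(calls):
    tp = sum(1 for c, t in zip(calls, truth) if c and t)
    fp = sum(1 for c, t in zip(calls, truth) if c and not t)
    fn = sum(1 for c, t in zip(calls, truth) if not c and t)
    return tp / (tp + fp), tp / (tp + fn)

for name, calls in [("reviewer", reviewer), ("coin flip", coin), ("union", union)]:
    p, r = precision_recall(calls)
    print(f"{name:10s} tagged {sum(calls):5d} docs  precision {p:.2f}  recall {r:.2f}")
```

The union tags several times as many documents as the reviewer alone and gains some recall, but its precision falls far below the reviewer’s, which is exactly Herb’s point.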

Anyway, thanks for writing these thoughtful pieces. I hope that people read them.

As an aside, we started building our own categorizer a week or so ago. It can organize documents into mutually exclusive categories (A or B, but not both) or into overlapping categories (A or not A, B or not B). We achieve accuracies that are comparable to or higher than those in the CategoriX white paper.

Precision    Recall
0.86         0.86
0.67         0.89
0.90         0.88
0.99         0.93

The data are from Reuters news stories (an old TREC collection). The first row is a set of 4 mutually exclusive categories; the rest are of the A / not-A variety. These articles were originally tagged by the Reuters editors.

Topics:

1. PERFORMANCE, MARKETS, GOVERNMENT/SOCIAL, None of the above
2. ACCOUNTS/EARNINGS vs. not ACCOUNTS/EARNINGS
3. CORPORATE/INDUSTRIAL vs. not CORPORATE/INDUSTRIAL
4. SPORTS vs. not SPORTS

 Regards, Herb
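For readers who want to see how precision and recall figures like those in Herb’s table are computed, here is a minimal sketch of the calculation on hand-made labels (not the Reuters data):

```python
# Precision and recall for an A / not-A categorizer, computed from
# invented truth labels and predictions (not the actual Reuters data).
def precision_recall(truth, predicted, category):
    tp = sum(1 for t, p in zip(truth, predicted) if p == category and t == category)
    fp = sum(1 for t, p in zip(truth, predicted) if p == category and t != category)
    fn = sum(1 for t, p in zip(truth, predicted) if p != category and t == category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

if __name__ == "__main__":
    truth     = ["SPORTS", "SPORTS", "not", "not", "SPORTS", "not", "not", "not"]
    predicted = ["SPORTS", "SPORTS", "SPORTS", "not", "not", "not", "not", "not"]
    p, r = precision_recall(truth, predicted, "SPORTS")
    print(f"precision {p:.2f}  recall {r:.2f}")   # precision 0.67  recall 0.67
```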

Herb Roitblat on ED Searches

Herb Roitblat, co-founder and a Principal at OrcaTec, responded to the joint ED Search post that Ron Friedmann and I put up last week. His response is lengthy but well worth reading, so I am dividing it into two posts based on the two sections of the response itself. Part 1 appears below and I will post the second half tomorrow.

Hi, guys.  Thanks for your stimulating discussion about search.  I have some thoughts to share. 

First, Ron, you do have the mathematical chops to understand the CategoriX white paper. It is not mathematical capability that limits your ability to understand what they are saying; they just say it in a manner that you are not accustomed to (and one that is maybe designed to produce some mental dazzle). They use probabilistic latent semantic analysis (PLSA), which is what Recommind uses, to help them categorize documents. Ultimately, what they want to do is compute the probability that a document is in a given category. There are lots of ways of doing that, and the details of how they do it (the fact that they use PLSA, for example) are not particularly relevant to using it.

 They train it on a set of categorized documents and build a model.  Then they compare new documents with this model and classify the new documents into the most likely category according to the model.
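The train-then-classify workflow Herb describes is generic. The sketch below uses a simple nearest-centroid model with made-up training documents, not Xerox’s PLSA, just to show the shape of the process.

```python
from collections import Counter
import math

def vectorize(text):
    """Bag-of-words term counts for a document."""
    return Counter(text.lower().split())

def centroid(docs):
    """Average term-frequency vector for a list of documents."""
    total = Counter()
    for d in docs:
        total.update(vectorize(d))
    return {term: count / len(docs) for term, count in total.items()}

def cosine(v1, v2):
    dot = sum(v1[t] * v2.get(t, 0.0) for t in v1)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def train(labeled_docs):
    """Build one centroid ('model') per category from labeled examples."""
    by_category = {}
    for text, category in labeled_docs:
        by_category.setdefault(category, []).append(text)
    return {cat: centroid(docs) for cat, docs in by_category.items()}

def classify(model, text):
    """Assign the new document to the most similar category."""
    v = vectorize(text)
    return max(model, key=lambda cat: cosine(v, model[cat]))

if __name__ == "__main__":
    training = [
        ("quarterly earnings beat analyst estimates", "EARNINGS"),
        ("company reports record profit and revenue", "EARNINGS"),
        ("the team won the championship game", "SPORTS"),
        ("star player injured before the playoffs", "SPORTS"),
    ]
    model = train(training)
    print(classify(model, "revenue and profit fell this quarter"))  # EARNINGS
```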

Then they use a really complex method to analyze their results. I have no idea why they made it so complicated, but they did. One result is that their reviewers did not agree with each other as often as one would like.

Both TREC and we have reported similar things: the human decisions were not all that reliable. If you want to measure the accuracy of some process, then you need a “truth” standard to compare it with. Here, they had 5 truth standards, and reasonably good, but not outstanding, I think, success with each of them. The different thresholds correspond to how biased the system is to call a document responsive: the higher the threshold, the less likely the system is to call a document responsive. Notice that recall (the proportion of responsive documents that were identified) goes down from threshold = 0.5 to threshold = 0.95, just as one would expect. Notice also that precision (the proportion of identified documents that were responsive) goes up. How you use the threshold will depend on your case strategy: how sure do you want to be that you find all of the responsive documents (low threshold) vs. how sure do you want to be that you find only responsive documents (high threshold)? Then they do some other tests about consistency that don’t make a lot of sense to me and that seem a bit circular for what they are trying to accomplish.
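The threshold effect Herb describes can be seen with a toy set of classifier scores; the scores and labels below are invented for illustration.

```python
# Toy illustration of how raising the responsiveness threshold trades
# recall for precision. Scores and true labels are invented.
DOCS = [  # (model's probability the doc is responsive, truly responsive?)
    (0.97, True), (0.91, True), (0.88, False), (0.83, True), (0.75, True),
    (0.68, False), (0.62, True), (0.55, False), (0.40, False), (0.31, True),
    (0.22, False), (0.10, False),
]

def precision_recall_at(threshold):
    called = [(score, resp) for score, resp in DOCS if score >= threshold]
    tp = sum(1 for _, resp in called if resp)
    total_responsive = sum(1 for _, resp in DOCS if resp)
    precision = tp / len(called) if called else 1.0
    recall = tp / total_responsive
    return precision, recall

for threshold in (0.5, 0.7, 0.9, 0.95):
    p, r = precision_recall_at(threshold)
    print(f"threshold {threshold:.2f}: precision {p:.2f}  recall {r:.2f}")
```

Running this shows precision climbing and recall falling as the threshold moves from 0.5 toward 0.95, which is the trade-off Herb ties to case strategy.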

I don’t think that bakeoffs hold much value. The results depend very much on the dataset and on the questions asked. The TREC folks have found that the ranking of systems does not depend much on the people who do the “truth” relevancy judgments, but the absolute values do. If product A scored above product B on a particular task, I would not be confident that it would also score above B on a different set of data. I think that the main thing that differs between tools is the ease with which you can accomplish your task. Practically any tool can be used to select documents for review. The review process itself is so sloppy that it trumps every other process. If you could think of all of the right words to search for, you could accomplish everything with keyword searching. The problem, as you know, is that it is practically impossible to think of all the right stuff, so you need analytic tools to help. The systems differ in how much help they provide and in the amount of effort it takes to get that help (think of H5 and the long process they force you through).

Thinking that litigants should run more than one eDiscovery tool is, I think, way off base. It is already expensive and time-consuming. Expecting them to double this effort is simply unreasonable. A much better and cheaper approach is to get them to evaluate their process using quality-control methods. This approach is cheap and effective.
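One common quality-control check of the kind Herb alludes to is to sample the documents the process rejected and estimate how many responsive documents were left behind; the sketch below is a generic version of that idea, not necessarily OrcaTec’s workflow.

```python
import math
import random

def elusion_estimate(rejected_docs, sample_size, is_responsive, confidence_z=1.96):
    """Sample the rejected pile and estimate the proportion of responsive
    documents that slipped through, with a rough normal-approximation
    confidence interval. is_responsive(doc) stands in for a human spot-check."""
    sample = random.sample(rejected_docs, sample_size)
    hits = sum(1 for doc in sample if is_responsive(doc))
    p = hits / sample_size
    half_width = confidence_z * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

if __name__ == "__main__":
    random.seed(0)
    # Invented rejected pile: 50,000 docs, about 1% of which are responsive.
    rejected = [{"responsive": random.random() < 0.01} for _ in range(50_000)]
    p, lo, hi = elusion_estimate(rejected, 400, lambda d: d["responsive"])
    print(f"estimated miss rate {p:.3f} (95% CI roughly {lo:.3f} to {hi:.3f})")
```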

 I think that what makes an approach defensible is to be able to describe what you have done and why you believe that the results are reliable.

Transparency and measurement are the keys there.