Herb Roitblat on ED Searches

Herb Roitblat, co-founder and a Principal at OrcaTec, responded to the joint ED Search post that Ron Friedmann and I put up last week. His response is lengthy but well worth reading so I am dividing it into two posts based on the two sections of the response itself. Part 1 appears below and I will post the second half tomorrow. 

Hi, guys.  Thanks for your stimulating discussion about search.  I have some thoughts to share. 

First, Ron, you do have the mathematical chops to understand the Categorix white paper.  It is not mathematical capability that limits your ability to understand what they are saying.  They just say it in a manner that you are not accustomed to (and one that is maybe designed to produce some mental dazzle).  They use probabilistic latent semantic analysis (PLSA), which is what Recommind uses, to help them categorize documents.  Ultimately, what they want to do is compute the probability that a document is in a given category.  There are lots of ways of doing that, and the details of how they do it (the fact that they use PLSA, for example) are not particularly relevant to using it.

They train it on a set of categorized documents and build a model.  Then they compare new documents with this model and classify each new document into the most likely category according to the model.
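
The train-then-classify workflow described above can be sketched in a few lines.  Categorix uses PLSA; as a stand-in, this toy example uses a simple multinomial naive Bayes model (my substitution for illustration, not their method, with made-up documents) to assign each new document to its most likely category:

```python
import math
from collections import Counter, defaultdict

# Hypothetical training set: (tokens, category) pairs.
train = [
    ("price fixing agreement meeting".split(), "responsive"),
    ("settlement pricing discussion".split(), "responsive"),
    ("lunch schedule friday".split(), "nonresponsive"),
    ("holiday party invitation".split(), "nonresponsive"),
]

# Build the model: per-category word counts and category priors.
word_counts = defaultdict(Counter)
cat_counts = Counter()
vocab = set()
for tokens, cat in train:
    word_counts[cat].update(tokens)
    cat_counts[cat] += 1
    vocab.update(tokens)

def log_prob(tokens, cat):
    """log P(cat) + sum of log P(word | cat), with add-one smoothing."""
    lp = math.log(cat_counts[cat] / sum(cat_counts.values()))
    total = sum(word_counts[cat].values())
    for w in tokens:
        lp += math.log((word_counts[cat][w] + 1) / (total + len(vocab)))
    return lp

def classify(tokens):
    """Assign a new document to its most likely category under the model."""
    return max(cat_counts, key=lambda c: log_prob(tokens, c))

print(classify("pricing agreement".split()))  # -> responsive
print(classify("friday party".split()))       # -> nonresponsive
```

The point is the shape of the process, not the particular model: build a model from categorized examples, then score new documents against it and take the most probable category.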

Then they use a really complex method to analyze their results.  I have no idea why they made it so complicated, but they did.  One result is that their reviewers did not agree with one another as often as one would like.

Both TREC and we have reported similar findings: the human decisions were not all that reliable.  If you want to measure the accuracy of some process, you need a “truth” standard to compare it with.  Here, they had 5 truth standards, and reasonably good, but not outstanding, I think, success with each of them.  The different thresholds correspond to how biased the system is toward calling a document responsive.  The higher the threshold, the less likely the system is to call a document responsive.  Notice that recall (the proportion of responsive documents that were identified) goes down from threshold = 0.5 to threshold = 0.95, just as one would expect.  Notice also that precision (the proportion of identified documents that were responsive) goes up.  How you use the threshold will depend on your case strategy: how sure do you want to be that you find all of the responsive documents (low threshold) vs. how sure do you want to be that you find only responsive documents (high threshold)?  Then they do some other tests about consistency that don’t make a lot of sense to me and that seem a bit circular for what they are trying to accomplish.

I don’t think that bakeoffs hold much value.  The results depend very much on the dataset and on the questions asked.  The TREC folks have found that the ranking of systems does not depend much on the people who make the “truth” relevancy judgments, but the absolute values do.  If product A scored above product B on a particular task, I would not be confident that it would also score above B on a different set of data.  I think that the main thing that differs between tools is the ease with which you can accomplish your task.  Practically any tool can be used to select documents for review.  The review process itself is so sloppy that it trumps every other process.  If you could think of all of the right words to search for, you could accomplish everything with keyword searching.  The problem, as you know, is that it is practically impossible to think of all the right stuff, so you need analytic tools to help.  The systems differ in how much help they provide and in the amount of effort it takes to get that help (think H5 and the long process they force you through).

Thinking that litigants should run more than one eDiscovery tool is, I think, way off base.  It is already expensive and time-consuming.  Expecting them to double this effort is simply unreasonable.  A much better and cheaper approach is to get them to evaluate their process using quality control methods.  This approach is cheap and effective.
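
One cheap quality-control method of the kind alluded to above (my illustration, not OrcaTec's specific procedure) is to draw a random sample from the discard pile, have a human re-review it, and estimate the rate of missed responsive documents with a confidence interval.  The true labels here are simulated only so the demo runs end to end:

```python
import math
import random

random.seed(42)

# Hypothetical discard pile: 10,000 docs, ~2% of which are actually
# responsive (unknown in practice; simulated here for the demo).
discard_pile = [random.random() < 0.02 for _ in range(10_000)]

# QC step: sample 400 discarded docs and have a human re-review them.
sample = random.sample(discard_pile, 400)
elusion = sum(sample) / len(sample)  # estimated fraction of responsive docs missed

# 95% confidence interval (normal approximation to the binomial).
margin = 1.96 * math.sqrt(elusion * (1 - elusion) / len(sample))
print(f"estimated elusion: {elusion:.3f} +/- {margin:.3f}")
```

Reviewing 400 sampled documents is far cheaper than running a second tool over the whole collection, which is the economic argument being made here.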

I think that what makes an approach defensible is being able to describe what you have done and why you believe that the results are reliable.

Transparency and measurement are the keys there.



  1. Rob Robinson on

    Incredibly salient comments from Mr. Roitblat – thanks for sharing – and look forward to part two.

    Rob Robinson

  2. Steve Newton on

    The evolution from keyword searching to the intelligent and effective use of analytics tools to (1) reduce the volume of the review set, and (2) provide valuable insights into the prioritization and potential pitfalls of certain document categories is probably one of the most important steps toward containing and controlling the e-discovery monster. Since users (mostly the lawyers) are traditionally slow to embrace and absorb such tools into their discovery process, there will be some excellent opportunities for tech-savvy attorneys to serve as search and analytics consultants to the legal community, to jump the process ahead and allow users to realize the tech benefits sooner rather than later.

