Herb Roitblat On Search: Part 2

Tom, your comments about anyone trying to evaluate ED options are well taken.

How would you evaluate based on the document sets? You know exactly which documents TREC Legal used last year, but how does that help? I think that the things you need to know include how much variability there is among the documents. How similar are, in this case, the responsive and nonresponsive documents? I’m not sure how you would evaluate that, but it is a variable that has a large impact.

Distinguishing news stories about sports from those about the stock market is easier than distinguishing stories about NYSE from those about NASDAQ.  But how do you measure that in a case like this?

 Many systems use the same underlying tools (WordNet, Lucene, dtSearch), so, as you say, there may not be big differences among them.  But so what?  From an academic point of view that would be disappointing, but from a pragmatic eDiscovery point of view it does not much matter.  They could still differ in text extractors, user interface, and in other tools they provide.  They may differ in the completeness of extracting text from attachments, but once it is extracted, the fact that it came from an attachment does not matter to search.

 Using OCR data (as TREC did), though, does make a difference.  The size of the vocabulary explodes with OCR and this can affect results, especially categorization results.
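To make the vocabulary point concrete, here is a toy simulation. It is only a sketch under stated assumptions, not TREC data: the sample sentence, the 5% per-letter error rate, and the helper names `ocr_noise` and `vocabulary` are all made up for illustration, and random letter substitution is a crude stand-in for real OCR errors.

```python
import random

def ocr_noise(text, error_rate, rng):
    """Swap each letter for a random letter with probability error_rate.

    This is a crude, assumed model of OCR errors, used only for illustration.
    """
    letters = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < error_rate:
            out.append(rng.choice(letters))
        else:
            out.append(ch)
    return "".join(out)

def vocabulary(docs):
    """Set of distinct whitespace-separated tokens across all documents."""
    return {word for doc in docs for word in doc.split()}

rng = random.Random(0)  # fixed seed so the run is repeatable
base = "the contract was signed by the board and the contract was filed"

clean_docs = [base] * 200
noisy_docs = [ocr_noise(base, 0.05, rng) for _ in range(200)]  # assumed 5% error rate

clean_vocab = vocabulary(clean_docs)
noisy_vocab = vocabulary(noisy_docs)
print(len(clean_vocab), len(noisy_vocab))
```

The clean copies share one small vocabulary, while every distinct misspelling in the noisy copies becomes a new term that a categorizer has to cope with.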

I don’t see what the unreliability of human review has to do with the value of concept searching. It seems like a non sequitur to me. The TREC results on Boolean searching are somewhat misleading. The Boolean searches in earlier years were conducted by people who had had years of experience on these data. If you know enough, you can find anything with Boolean searches. The 2008 TREC results found that H5 could produce the highest accuracy ever seen in TREC Legal with their methodology. From the 2008 summary report, Figure 1 would seem to show that the Boolean run left behind many documents that were found by other systems (http://trec.nist.gov/pubs/trec17/papers/LEGAL.OVERVIEW08.pdf).

In Table 1, 5 systems ranked above the reference consensus Boolean run (xrefLo8C).  If your goal is to find more responsive documents, then all of the commercial concept search tools will do better than a Boolean tool, because they expand the query to include more documents (and they do other things).
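The query expansion idea can be sketched in a few lines. This is a minimal illustration, not how any commercial tool works: the three documents, the `related` term table, and the function names are hypothetical, and real systems would draw related terms from a thesaurus or from statistical associations rather than a hand-written dictionary.

```python
def boolean_or_search(docs, terms):
    """Return the ids of documents containing at least one of the terms."""
    wanted = {t.lower() for t in terms}
    return {doc_id for doc_id, text in docs.items()
            if wanted & set(text.lower().split())}

def expand(terms, related):
    """Add the related terms listed for each query term (if any)."""
    expanded = set(terms)
    for t in terms:
        expanded |= related.get(t, set())
    return expanded

# Hypothetical mini-collection and related-term table.
docs = {
    1: "the merger agreement was signed",
    2: "the acquisition closed in march",
    3: "lunch menu for friday",
}
related = {"merger": {"acquisition", "takeover"}}

plain = boolean_or_search(docs, ["merger"])
expanded = boolean_or_search(docs, expand(["merger"], related))
print(sorted(plain), sorted(expanded))
```

Because expansion only adds terms to an OR query, the expanded result is always a superset of the plain one. It retrieves more documents; whether the extra ones are responsive is a separate question.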

Both of you seem to think that it would be better to use multiple approaches to find responsive documents. As I said, I think that this is a mistake and impractical, and it is probably not the lesson to be learned from these studies. If your goal is to be sure that you find ALL of the responsive documents, then produce all the documents. Give two people the same documents to categorize and there will be some overlap. It is logically true that the set of documents found by either will be larger than the set found by one, but that does not mean that you get better results if you use the union of their decisions; you just get more. If you review documents and I flip a coin, the union of our decisions will contain more documents than you identified as responsive, but it won’t be any better a set.
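The coin-flip argument is easy to check with a small simulation. The numbers here are assumptions chosen only for illustration (10,000 documents, 10% responsive, a reviewer who finds 70% of responsive documents and wrongly flags 2% of the rest); none of them come from TREC.

```python
import random

def precision_recall(selected, responsive):
    """Precision = correct selections / all selections; recall = correct selections / all responsive."""
    true_pos = len(selected & responsive)
    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / len(responsive) if responsive else 0.0
    return precision, recall

rng = random.Random(42)  # fixed seed so the run is repeatable
doc_ids = range(10_000)

# Assumed ground truth: about 10% of documents are responsive.
responsive = {d for d in doc_ids if rng.random() < 0.10}

# Assumed reviewer: marks 70% of responsive and 2% of nonresponsive documents.
reviewer = {d for d in doc_ids
            if rng.random() < (0.70 if d in responsive else 0.02)}

# The coin flip: marks each document with probability 0.5, ignoring content.
coin = {d for d in doc_ids if rng.random() < 0.5}

union = reviewer | coin

rev_p, rev_r = precision_recall(reviewer, responsive)
uni_p, uni_r = precision_recall(union, responsive)
print(f"reviewer: n={len(reviewer)} precision={rev_p:.2f} recall={rev_r:.2f}")
print(f"union:    n={len(union)} precision={uni_p:.2f} recall={uni_r:.2f}")
```

The union can never have lower recall than the reviewer alone, since it is a superset, but it buys that recall by sweeping in roughly half of the nonresponsive documents, so precision collapses.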

Anyway, thanks for writing these thoughtful pieces. I hope that people read them.

As an aside, we started a week or so ago to build our own categorizer. It can organize documents into mutually exclusive categories (A or B, but not both) or into overlapping categories (A or not A), (B or not B). We achieve accuracies that are comparable to or higher than those in the Categorix white paper.
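The difference between the two modes comes down to the decision rule applied to per-category scores. The sketch below is a generic illustration, not a description of the categorizer mentioned above: the scores, the 0.5 threshold, and the function names are all hypothetical.

```python
def assign_exclusive(scores):
    """Mutually exclusive: choose the single best-scoring category."""
    return max(scores, key=scores.get)

def assign_overlapping(scores, threshold=0.5):
    """Overlapping: make an independent A / not A decision for each category."""
    return {cat for cat, score in scores.items() if score >= threshold}

# Hypothetical scores for one document from some upstream classifier.
scores = {"MARKETS": 0.7, "SPORTS": 0.1, "CORPORATE/INDUSTRIAL": 0.6}

print(assign_exclusive(scores))            # MARKETS
print(sorted(assign_overlapping(scores)))  # ['CORPORATE/INDUSTRIAL', 'MARKETS']
```

In the exclusive case the document lands in exactly one category; in the overlapping case it can land in several or in none, since each category is judged on its own.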

Precision    Recall
0.86         0.86
0.67         0.89
0.9          0.88
0.99         0.93

The data are from Reuters news stories (an old TREC collection). The first row is a set of 4 mutually exclusive categories; the rest are of the A / not A variety. These articles were originally tagged by the Reuters editors.

Topics

1. PERFORMANCE / MARKETS / GOVERNMENT/SOCIAL / None of the above
2. ACCOUNTS/EARNINGS / not ACCOUNTS/EARNINGS
3. CORPORATE/INDUSTRIAL / not CORPORATE/INDUSTRIAL
4. SPORTS / not SPORTS

 Regards, Herb
