TAPoR 2.0

Last Updated: Dec 30, 2014

XTRACT was a tool for lexical collocation developed by Frank Smadja, then of Columbia University. It was designed to use statistical techniques to identify collocations of arbitrary length, and to generate syntactic relationships between words. This tool is no longer available and does not have a web presence.

Author(s): Frank Smadja
Created: Nov 30, 2012
Last Updated: Dec 30, 2014
Background processing: Doesn't run in background
Ease of use: Moderate
Historic tool (developed before 2005): No longer in active development
Type of analysis: Natural language processing
Web usable: Other
June 18, 2013 08:54 PM

XTRACT was a tool for lexical collocation developed in the 1980s and early 1990s. More information on its development and functionality may be found in the following:

Smadja, Frank. "Retrieving Collocations from Text: XTRACT." Computational Linguistics 19.1 (1993): 143-177. Web.

Smadja, Frank. "XTRACT: An Overview." Computers and the Humanities 26.5-6 (1992): 399-413. Web.

September 27, 2013 05:48 PM

Program Overview

XTRACT was a lexical collocation tool developed by Frank Smadja in the early 1990s that used statistical techniques to retrieve and identify collocations in large textual corpora (Smadja, "XTRACT" 399).

Rather than a single tool, XTRACT was a set of tools for locating words in context and making statistical observations to identify collocations (Smadja, "Retrieving Collocations" 150). Smadja promoted XTRACT as useful for any text-based application, such as language generation, retrieving grammatical collocations (Smadja, "Retrieving Collocations" 171), or supporting multilingual lexicography (Smadja, "Retrieving Collocations" 174).

In 1993, Smadja released an expanded and refined version of XTRACT that computed more information and was optimized to extract more functional information (Smadja, "Retrieving Collocations" 150). According to Smadja, the 1993 version of XTRACT worked in three stages:

In the first stage, pairwise lexical relations are retrieved using only statistical information....In the second stage, multiple-word combinations and complex expressions are identified....Finally, by combining parsing and statistical techniques the third stage labels and filters collocations retrieved at stage one. The third stage has been evaluated to raise the precision of Xtract from 40% to 80% with a recall of 94%. (Smadja, "Retrieving Collocations" 145)
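The first stage described in the passage above can be illustrated with a minimal Python sketch. It is not Smadja's implementation: the function names `pair_counts` and `strong_collocates`, the five-word window, and the z-score cutoff are assumptions chosen for illustration, and the toy sentences are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def pair_counts(sentences, window=5):
    """Count how often each (word, collocate) pair co-occurs at each
    relative offset, looking only inside one sentence at a time.
    The window size is an illustrative assumption."""
    counts = defaultdict(Counter)  # (word, collocate) -> Counter of offsets
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])][j - i] += 1
    return counts


def strong_collocates(counts, word, min_strength=1.0):
    """Keep collocates of `word` whose total frequency lies at least
    `min_strength` standard deviations above the mean frequency of all
    of that word's collocates (a simple z-score filter; the cutoff is
    an assumption)."""
    freqs = {c: sum(offsets.values())
             for (w, c), offsets in counts.items() if w == word}
    if len(freqs) < 2:
        return {}
    mu = mean(freqs.values())
    sigma = pstdev(freqs.values())
    if sigma == 0:
        return {}
    scores = {c: (f - mu) / sigma for c, f in freqs.items()}
    return {c: z for c, z in scores.items() if z >= min_strength}


# Hypothetical toy corpus, already split into sentences and tokens.
sentences = [
    ["stock", "market", "fell"],
    ["stock", "market", "rose"],
    ["stock", "market", "closed"],
    ["stock", "prices", "fell"],
]
counts = pair_counts(sentences)
strong = strong_collocates(counts, "stock")
print(strong)
```

In this toy corpus only "market" stands out as a collocate of "stock". Keeping the per-offset counts, rather than a single total, leaves room for later steps that look at where a collocate tends to appear relative to the word.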

Smadja also found that the third stage could "be considered as a retrieval system that retrieves valid collocations from a set of candidates" (Smadja, "Retrieving Collocations" 166).
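Treating the third stage as a retrieval system means it can be scored with precision and recall, the two measures quoted above. The following minimal sketch shows how those measures are computed for a filter over a candidate list. The function name `precision_recall` and all of the data are hypothetical; none of it comes from Smadja's evaluation.

```python
def precision_recall(retrieved, valid):
    """Return (precision, recall) for a set of retrieved items measured
    against the set of items judged valid."""
    retrieved, valid = set(retrieved), set(valid)
    hits = retrieved & valid
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(valid) if valid else 0.0
    return precision, recall


# Hypothetical stage-one candidates and hypothetical human judgments.
candidates = ["make decision", "stock market", "interest rate",
              "the of", "in market"]
valid = {"make decision", "stock market", "interest rate"}

# A hypothetical filter that rejects one bad candidate and keeps the rest.
filtered = [c for c in candidates if c != "the of"]

before = precision_recall(candidates, valid)
after = precision_recall(filtered, valid)
print(before, after)
```

Here the filter raises precision from 0.6 to 0.75 while recall stays at 1.0, because it discards an invalid candidate without losing any valid one.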

XTRACT used statistics to retrieve pairwise lexical relations between words that were correlated within a sentence, and "retain[ed] words (or parts of speech) occupying a position with probability greater than a given threshold" (Smadja, "Retrieving Collocations" 151). XTRACT could also use its collocations to produce other lexicographic output, such as adding syntax (Smadja, "Retrieving Collocations" 161), producing tagged concordances, parsing texts, and labeling sentences (Smadja, "Retrieving Collocations" 162).
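The idea of retaining whatever occupies a position with probability above a threshold can be sketched as follows. This is a simplified illustration, not XTRACT's code: it assumes the concordance lines have already been cut into equal-length token windows aligned on the same word pair, and the function name `frequent_positions`, the threshold value, and the sample windows are all assumptions.

```python
from collections import Counter


def frequent_positions(windows, threshold=0.75):
    """For each position across aligned, equal-length windows, keep the
    most common word if its share of the windows exceeds `threshold`;
    otherwise record None for that position."""
    kept = []
    for column in zip(*windows):
        word, count = Counter(column).most_common(1)[0]
        kept.append(word if count / len(windows) > threshold else None)
    return kept


# Hypothetical aligned windows around one word pair.
windows = [
    ["the", "dow", "jones", "industrial", "average", "rose"],
    ["the", "dow", "jones", "industrial", "average", "fell"],
    ["the", "dow", "jones", "industrial", "average", "closed"],
    ["a", "dow", "jones", "industrial", "average", "gained"],
]
pattern = frequent_positions(windows)
print(pattern)
```

Positions filled by a stable word survive, while positions that vary from line to line are left open, which is how a two-word relation can grow into a longer expression. The same loop would work on part-of-speech tags instead of words if the windows held tags.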

XTRACT's results varied based on the size of the corpus (Smadja, "Retrieving Collocations" 168). Smadja found that the program was not effective for low-frequency words, which put smaller texts at a disadvantage because their words did not have a large enough distribution among their collocates (168). The results also varied based on the content of a corpus, as Smadja's example using Wall Street data shows:

Food is not eaten at Wall Street but rather traded, sold, offered, bought, etc. If the corpus only contains stories in a given domain, most of the collocations retrieved will also be dependent on this domain...in addition to jargonistic words, there are a number of more familiar terms that form collocations when used in different domains. A corpus containing stock market stories is obviously not a good choice for retrieving collocations related to weather reports or for retrieving domain independent collocations such as "make-decision." (Smadja, "Retrieving Collocations" 169)

In Smadja's view, XTRACT produced results with the highest quality and greatest range of collocations, which he credited to its filtering system and syntactic labeling (Smadja, "XTRACT" 411).



Smadja, Frank. "Retrieving Collocations from Text: XTRACT." Computational Linguistics 19.1 (1993): 143-177. Web.

Smadja, Frank. "XTRACT: An Overview." Computers and the Humanities 26.5-6 (1992): 399-413. Web.
