ORTH: orthography (abbreviations and variants)

Figure 27. This is what the CAMBIO/TIDES Fact DB knows about South Korea

The lexicon for a given language is a collection of superentries, which are indexed by the citation form of the word or the phrasal lexical unit (set expression). A superentry includes all the lexemes that have the same base written form, regardless of syntactic category, pronunciation, or sense. Each lexicon entry comprises a number of zones corresponding to the various types of lexical information. The zones containing information for use by an NLP system are: CAT (lexical category), ORTH (orthography: abbreviations and variants), PHON (phonology), MORPH (morphological irregular forms, class or paradigm, and stem variants or “principal parts”), SYN (syntactic features such as attributive for adjectives), SYN-STRUC (indication of sentence- or phrase-level syntactic dependency, centrally including subcategorization) and SEM-STRUC (lexical semantics, meaning representation). Some additional information is added for human consumption in the ANNO (annotations) zone. The following scheme, in a BNF-like notation, summarizes the basic lexicon structure.

ORTHOGRAPHIC-FORM:

lexeme ::=

cat	v	bought v+past
	stem-v	bought v+past
	stem-v
	def	“when A buys T from S, A acquires possession of T previously owned
	ex	“Bill bought a car from Jane”
syn	time-stamp		;the acquirer and the date
syn	syn-class
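
The organization described above (superentries indexed by citation form, each holding lexemes whose information is split into zones) can be illustrated with a minimal Python sketch. This is not the system's actual implementation: the class names `Lexeme`, `Superentry` and `Lexicon`, the method names, and the dictionary layout are all assumptions made for illustration. The citation form "buy" is inferred from the example sentence, and the noun lexeme is hypothetical, added only to show that one superentry can hold lexemes of different categories.

```python
from dataclasses import dataclass, field

# Zone names come from the text; everything else in this layout is assumed.
NLP_ZONES = ("CAT", "ORTH", "PHON", "MORPH", "SYN", "SYN-STRUC", "SEM-STRUC")
HUMAN_ZONES = ("ANNO",)


@dataclass
class Lexeme:
    """One lexeme: a mapping from zone name to that zone's content."""
    zones: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject zone names that the text does not list.
        unknown = set(self.zones) - set(NLP_ZONES) - set(HUMAN_ZONES)
        if unknown:
            raise ValueError(f"unknown zones: {sorted(unknown)}")

    def nlp_view(self):
        """Return only the zones intended for use by an NLP system."""
        return {k: v for k, v in self.zones.items() if k in NLP_ZONES}


@dataclass
class Superentry:
    """All lexemes sharing one base written form, whatever their category or sense."""
    citation_form: str
    lexemes: list = field(default_factory=list)


class Lexicon:
    """A collection of superentries indexed by citation form."""

    def __init__(self):
        self._entries = {}

    def add(self, citation_form, lexeme):
        # Create the superentry on first use, then append the lexeme to it.
        entry = self._entries.setdefault(citation_form, Superentry(citation_form))
        entry.lexemes.append(lexeme)

    def lookup(self, citation_form):
        return self._entries.get(citation_form)


lexicon = Lexicon()
lexicon.add("buy", Lexeme({
    "CAT": "v",
    "MORPH": [("stem-v", "bought", "v+past")],
    "ANNO": {"ex": "Bill bought a car from Jane"},
}))
# Hypothetical second lexeme with the same written form but another category.
lexicon.add("buy", Lexeme({"CAT": "n"}))
```

Calling `lexicon.lookup("buy")` returns a single superentry holding both lexemes, and `nlp_view()` on the verb lexeme omits the ANNO zone, mirroring the split between machine-oriented and human-oriented information.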
