[Ltg] HAIL Seminar *Reminder* (17th July): Professor Pam Peters, Macquarie University
Andrew Lampert
Andrew.Lampert at csiro.au
Thu Jul 12 10:35:25 EST 2007
H.A.I.L. Seminar series
CSIRO ICT Centre
http://www.ict.csiro.au/HAIL/
Title: Computer corpora and language description
Speaker: Professor Pam Peters
Director of Dictionary Research Centre
Department of Linguistics
Macquarie University
Date: Tuesday 17th July 2007 at 11am
Location: CSIRO ICT Centre,
Building E6B, Macquarie University.
See <http://www.ict.csiro.au/HAIL/location.htm> for details.
Video: We usually stream live video of seminars.
At the seminar time (see above), point your browser at:
<http://webcast.nsw.csiro.au/httpfs/ICT/HailSeminar/live.asx>
Abstract
This presentation examines ongoing challenges for automatic analysis of (i) standard general language and (ii) specialised sublanguages, where additional layers of meaning (denotative and connotative) are still crucial for sophisticated NLP systems.
Computational techniques based on small, purpose-designed corpora have been used by linguists since the 1960s to quantify lexical and grammatical elements of standard English, and to support intercomparisons between varieties of English. Corpus frequencies can show numerous syntactic divergences between major varieties such as British and American, as reported in the Longman Grammar of Spoken and Written English (Biber et al. 1999). Interesting differences have likewise been found in quantitative studies of new varieties of English such as those of Australia, New Zealand, Singapore and the Philippines (Hundt, 2006), e.g. syntactic variables such as the patterns of agreement for collective nouns. However, the subtle polysemy of most high-frequency words still requires discretionary analysis, to separate common and distinctive senses of words -- despite the availability of mutual information tools. This problem affects usage of the function words and phrases of English, e.g. "in case" in British and American English, as well as new usages found in ex-colonial Englishes, e.g. Singaporean use of "until". Conjunctions/prepositions like these define the logical relationships between the content-bearing clauses/phrases of the sentence and are the key to their interpretation. Other regional differences, e.g. the British preference for "about" v. the American preference for "around", are more cosmetic. They nevertheless serve to geolocate the text to some extent -- give it a regional tinge which may or may not matter to its writers and readers, and may or may not be reinforced by other more obvious though less frequent regionalisms of the "sidewalk"/"pavement" kind.
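As a rough illustration of the kind of mutual information tool mentioned above, the following sketch scores adjacent word pairs by pointwise mutual information (PMI). The abstract does not say which tools or formulation are meant, so the function name bigram_pmi, the adjacent-pair window and the toy sentence are all assumptions made for this example.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return PMI (log base 2) for each adjacent word pair in tokens.

    Hypothetical helper for illustration only; probabilities are plain
    maximum-likelihood estimates with no smoothing.
    """
    if len(tokens) < 2:
        return {}
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (first, second), count in bigrams.items():
        p_pair = count / n_bigrams
        p_first = unigrams[first] / n_unigrams
        p_second = unigrams[second] / n_unigrams
        # PMI compares the observed pair probability with the probability
        # expected if the two words occurred independently.
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Toy data, invented for this example.
sample = "in case of rain take a coat in case".split()
pmi_scores = bigram_pmi(sample)
print(pmi_scores[("in", "case")])
```

A positive score only signals that "in" and "case" co-occur more often than chance would predict. It does not say which sense of "in case" is in play, which is why the discretionary analysis described above is still needed.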
Computer-based techniques for profiling specialised forms of language, aka sublanguages, also go back several decades in research by information engineers such as Bross, Shapiro and Anderson (1972) on the language of hospital surgeons. The identification of specialised terms and constructions is based on the principle that they occur with much greater frequency in technical texts than in those intended for general reading (e.g. newspapers). A corollary of this is that the corpus needed to profile the terminology of a specialism need not be so large as that needed to support research on the lexis of the standard language (McEnery and Wilson, 1996/2001). Comparative frequency data from general and specialised corpora are effective in identifying the technical terminology of a discipline such as anatomy (Chung, 2003). However, the terminology of different academic disciplines is rather variable in scope, and experimental research has shown that the density and structure of terms are quite different in texts from anatomy and, say, applied linguistics (Chung and Nation 2003). A key conceptual issue is whether to include in the inventory of terms only those which are distinctive to the discipline (the traditional terminological approach), or to embrace also those terms which are special uses of everyday words, e.g. "menu" in computer science (the descriptive terminologist's approach). The latter are essential for comprehensive coverage and professional training, but again they raise problems of polysemy for automatic analysis of corpora.
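The comparative-frequency principle described above can be sketched in a few lines of Python. This is a minimal illustration and not the method used in the cited studies or in the TermFinder project: the function name candidate_terms, the thresholds min_ratio and min_count, the add-one smoothing and the toy corpora are all assumptions made for this example.

```python
from collections import Counter


def candidate_terms(specialised_tokens, general_tokens,
                    min_ratio=5.0, min_count=2):
    """Rank words that are relatively more frequent in a specialised corpus.

    Hypothetical helper for illustration only. Returns (word, ratio) pairs
    sorted from the highest ratio down.
    """
    spec_counts = Counter(specialised_tokens)
    gen_counts = Counter(general_tokens)
    spec_total = len(specialised_tokens)
    gen_total = len(general_tokens)
    vocab_size = len(set(spec_counts) | set(gen_counts))
    candidates = []
    for word, count in spec_counts.items():
        if count < min_count:
            continue
        spec_rate = count / spec_total
        # Add-one smoothing on the general corpus so that words absent
        # from it do not cause a division by zero.
        gen_rate = (gen_counts[word] + 1) / (gen_total + vocab_size)
        ratio = spec_rate / gen_rate
        if ratio >= min_ratio:
            candidates.append((word, ratio))
    return sorted(candidates, key=lambda item: (-item[1], item[0]))


# Toy corpora, invented for this example.
anatomy = "the femur articulates with the tibia and the femur supports the pelvis".split()
general = "the cat sat on the mat and the dog sat with the cat".split()
print(candidate_terms(anatomy, general, min_ratio=2.0))
```

On this toy data "femur" passes the threshold while "the" does not, because "the" is about as frequent in both corpora. An everyday word with a special use, such as "menu", could score low on a ratio like this, which is one way the polysemy problem noted above shows up.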
The presentation will demonstrate the combination of computational and discretionary techniques, involving both linguists/lexicographers and disciplinary specialists, which is currently being used at the Dictionary Research Centre to build online termbanks of specialised expressions for academic disciplines in science and social science at Macquarie University (the TermFinder project).
(References can be found at
http://www.ict.csiro.au/HAIL/Abstracts/2007/PamPeters.htm)
Short resume
Pam Peters is Professor of Linguistics at Macquarie University and Director of its Dictionary Research Centre. She has led the compilation of several kinds of computer corpora at Macquarie, and authored reference books on regional English usage, including the Cambridge Guide to English Usage (2004) and the Cambridge Guide to Australian English Usage (2007).
----------------------------------------
The HAIL Seminars' URL:
http://www.ict.csiro.au/HAIL/
Contacts: Andrew Lampert
Address: CSIRO HAIL Seminars,
c/o Andrew Lampert,
Locked Bag 17,
North Ryde NSW 1670
Phone: (02) 9325 3100
Email: Andrew.Lampert at csiro.au
*Administration*
-----------------------------------------------------------------
* To leave the list, send the message "unsubscribe HailSeminars" to
the list server address <majordomo at nsw.cmis.csiro.au>.
The subject line of your e-mail message will be ignored.