The Hong Kong Cantonese Child Language Corpus (CANCORP) is a longitudinal record of the early language development of eight Cantonese-speaking children, each of whom was observed for one year, starting when they were between one and a half and two years old. Four of the children are male and four are female. The database is deposited both at the Arts Faculty Server of the Chinese University of Hong Kong and at the CHILDES (Child Language Data Exchange System) archive at Carnegie Mellon University.

The corpus grew out of the project "The development of grammatical competence in Cantonese-speaking children", funded by the Hong Kong Research Grants Council from 1991 to 1993. The project was a joint effort of three local universities: the Chinese University of Hong Kong (CUHK), the Hong Kong Polytechnic University (HKPU), and the University of Hong Kong (HKU). The research team consisted of Thomas Hun-tak Lee (principal investigator, CUHK), Colleen Wong (co-investigator, HKPU), Patricia Yuk-hing Man (HKPU), Alice Shuk-yee Cheung (HKPU), Kitty Szeto (HKU), Cathy Sin-Ping Wong (CUHK, Hawaii) and Samuel Cheung-Shing Leung (co-investigator, HKU).

The database contains 171 files coded according to the internationally accepted CHAT format (Codes for the Human Analysis of Transcripts) and tagged with 33 part-of-speech labels. The files contain episodes of conversational exchanges between children and adults, with each utterance represented in Chinese characters and in romanization, together with the corresponding part-of-speech tags.
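
As a rough illustration of how transcripts in this format can be read, the sketch below groups each speaker line with the annotation lines that follow it. It relies only on the general CHAT conventions that header lines begin with "@", speaker (main) tiers with "*", dependent tiers with "%", and continuation lines with a tab. The sample text, the tier name "mor", the tokens (syl1, syl2, ...) and the tags (TAG_A, TAG_B, ...) are invented placeholders, not corpus data or the corpus's actual 33 labels, and parse_chat is a hypothetical helper name.

```python
# Invented CHAT-style excerpt for illustration only; it is NOT taken from
# CANCORP, and the tokens and tags are placeholders.
SAMPLE = """@Begin
@Participants:\tCHI Target_Child, INV Investigator
*INV:\tsyl1 syl2 ?
%mor:\tTAG_A|syl1 TAG_B|syl2 ?
*CHI:\tsyl3
\tsyl4 .
%mor:\tTAG_A|syl3 TAG_C|syl4 .
@End
"""


def parse_chat(text):
    """Group each main tier (*SPK:) with the dependent tiers (%xxx:) after it."""
    # Continuation lines start with a tab; fold them into the previous line.
    text = text.replace("\n\t", " ")
    utterances = []
    for line in text.splitlines():
        if line.startswith("*"):
            # Main tier: speaker code, colon, then the utterance itself.
            speaker, content = line[1:].split(":", 1)
            utterances.append(
                {"speaker": speaker, "text": content.strip(), "tiers": {}}
            )
        elif line.startswith("%") and utterances:
            # Dependent tier: attach it to the most recent utterance.
            name, content = line[1:].split(":", 1)
            utterances[-1]["tiers"][name] = content.strip()
        # Header lines (starting with "@") and blank lines are skipped.
    return utterances


if __name__ == "__main__":
    for utt in parse_chat(SAMPLE):
        print(utt["speaker"], "|", utt["text"], "|", utt["tiers"])
```

For serious work, the CLAN programs distributed with CHILDES are the usual way to process CHAT files; the sketch above is only meant to show the line-by-line layout.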

The data should be of use to anyone interested in early language development, be they linguists, psychologists, philosophers or educationalists. Queries about the corpus should be directed to Thomas Lee (htlee@netvigator.com). Suggestions about the homepage can be sent to Ann Law (aylaw99@yahoo.com).

