A Hong Kong Cantonese Child Language Corpus

1. The corpus

This database contains longitudinal data on the language of 8 Cantonese-speaking children, each recorded for approximately one year. The corpus contains 171 files coded in CHAT format and tagged with a set of 33 word class labels. The children were observed in their interactions with their caretakers, the investigator, and occasionally other adults who chatted with the children during the visits.

Three research students carried out the observations and the recording. Patricia Man recorded Bohuen and Gakei; Alice Cheung recorded Bernard, Tsuntsun and Tinfaan; and Kitty Szeto recorded Johnny, Jenny and Chunyat.

The names of the children and the ages at which recording began and ended are as follows:

NAME             Sex   Age at which recording   No. of files
                       began and ended
Bohuen (wbh)     F     2;03;23 - 3;04;08         27
Gakei (cgk)      F     1;11;01 - 2;09;09         19
Bernard (mhz)    M     1;07;00 - 2;08;06         26
Tsuntsun (ckt)   M     1;05;22 - 2;07;22         25
Tinfaan (ltf)    F     2;02;10 - 3;02;18         16
Johnny (hhc)     M     2;04;08 - 3;04;14         16
Jenny (lly)      F     2;08;10 - 3;08;09         20
Chunyat (ccc)    M     1;10;08 - 2;10;27         22
total                                           171

The file name indicates the name of the child and his/her age at the time of the recording. Each filename is made up of the initials of the child (the first three characters) and his/her age at the time of recording, in terms of Year (1 character), Month (2 characters), and Day (2 characters). All files in the Childes archive have the suffix 'pas'. For instance, the file WBH20323.pas contains tagged utterances of Bohuen (wbh) when she was 2 years 3 months and 23 days old.
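The naming convention can be decoded mechanically. The following is a minimal sketch in Python and is not part of the corpus distribution: the names Recording, parse_filename and chat_age are hypothetical, and the code assumes that every file name consists of three initials followed by five digits, with an optional suffix.

```python
import re
from typing import NamedTuple


class Recording(NamedTuple):
    """Child initials and age decoded from a corpus file name (hypothetical type)."""
    child: str   # three-letter initials, lower-cased
    years: int
    months: int
    days: int


# 3 initials, 1 digit for the year, 2 for the month, 2 for the day,
# followed by an optional suffix such as ".pas".
_NAME = re.compile(r"^([A-Za-z]{3})(\d)(\d{2})(\d{2})(?:\.\w+)?$")


def parse_filename(name: str) -> Recording:
    """Split a file name such as 'WBH20323.pas' into initials and age."""
    match = _NAME.match(name.strip())
    if match is None:
        raise ValueError(f"not a corpus file name: {name!r}")
    initials, years, months, days = match.groups()
    return Recording(initials.lower(), int(years), int(months), int(days))


def chat_age(rec: Recording) -> str:
    """Format the age in the year;month;day notation used in the table above."""
    return f"{rec.years};{rec.months:02d};{rec.days:02d}"


print(chat_age(parse_filename("WBH20323.pas")))  # prints 2;03;23
```

The pattern accepts upper- or lower-case initials and any suffix, so it can be applied to names as they appear in a directory listing; names that do not fit the pattern raise a ValueError and are not guessed at.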
The files of the 8 children and their respective sizes are listed below:

Bohuen (27 files)
WBH20323 PAS   89,489    WBH20926 PAS  106,287
WBH20330 PAS   67,570    WBH21002 PAS   44,733
WBH20402 PAS   37,631    WBH21016 PAS   56,854
WBH20405 PAS   41,410    WBH21023 PAS   58,323
WBH20406 PAS   58,587    WBH21106 PAS   32,651
WBH20413 PAS   41,079    WBH21114 PAS   53,510
WBH20414 PAS   20,251    WBH21128 PAS   59,433
WBH20415 PAS   16,539    WBH30010 PAS   68,317
WBH20506 PAS   34,773    WBH30101 PAS   73,720
WBH20609 PAS   37,001    WBH30220 PAS   88,637
WBH20619 PAS   20,238    WBH30303 PAS   59,221
WBH20703 PAS   37,328    WBH30312 PAS   71,884
WBH20714 PAS   93,592    WBH30408 PAS  113,452
WBH20919 PAS   87,838

Gakei (19 files)
CGK11101 PAS  172,448    CGK20318 PAS   57,860
CGK11108 PAS   70,025    CGK20325 PAS   55,515
CGK11122 PAS   38,073    CGK20408 PAS   84,333
CGK11129 PAS  102,565    CGK20430 PAS  130,782
CGK20008 PAS   63,610    CGK20503 PAS   67,714
CGK20207 PAS   77,541    CGK20711 PAS  116,732
CGK20221 PAS  127,568    CGK20808 PAS  125,984
CGK20228 PAS   58,909    CGK20818 PAS  155,475
CGK20304 PAS  107,718    CGK20909 PAS  186,561
CGK20311 PAS   66,674

Bernard (26 files)
MHZ10700 PAS   70,309    MHZ20115 PAS  188,523
MHZ10800 PAS  145,222    MHZ20129 PAS  143,805
MHZ10814 PAS   48,888    MHZ20212 PAS  140,860
MHZ10828 PAS   46,999    MHZ20226 PAS  180,029
MHZ10904 PAS  191,121    MHZ20309 PAS  160,713
MHZ10918 PAS  135,158    MHZ20328 PAS  200,297
MHZ10925 PAS  197,203    MHZ20407 PAS  169,851
MHZ11010 PAS  169,128    MHZ20421 PAS  202,822
MHZ11023 PAS  198,579    MHZ20504 PAS  164,264
MHZ11106 PAS  168,244    MHZ20519 PAS  158,711
MHZ20003 PAS  176,338    MHZ20604 PAS  164,720
MHZ20016 PAS  184,313    MHZ20618 PAS  242,745
MHZ20101 PAS  224,360    MHZ20806 PAS  198,527

Tsuntsun (25 files)
CKT10522 PAS   29,169    CKT20016 PAS  169,360
CKT10703 PAS   15,793    CKT20108 PAS  209,431
CKT10710 PAS   19,753    CKT20205 PAS  245,170
CKT10800 PAS  178,766    CKT20215 PAS  273,013
CKT10807 PAS  164,152    CKT20303 PAS  223,396
CKT10821 PAS  161,988    CKT20317 PAS  189,870
CKT10907 PAS  189,404    CKT20400 PAS  214,977
CKT10914 PAS   83,333    CKT20414 PAS  192,366
CKT10929 PAS  212,078    CKT20500 PAS  214,916
CKT11030 PAS  200,329    CKT20514 PAS  186,208
CKT11113 PAS  192,529    CKT20618 PAS  210,200
CKT11127 PAS  173,551    CKT20702 PAS  226,511
CKT20009 PAS  241,588

Tinfaan (16 files)
LTF20210 PAS  143,755    LTF20802 PAS  208,942
LTF20302 PAS  138,356    LTF20824 PAS  196,170
LTF20330 PAS  147,790    LTF20907 PAS  176,231
LTF20427 PAS  162,287    LTF21018 PAS  180,737
LTF20518 PAS  117,547    LTF21116 PAS  222,596
LTF20601 PAS  158,564    LTF30020 PAS  206,463
LTF20705 PAS  200,649    LTF30121 PAS  210,528
LTF20720 PAS  183,621    LTF30218 PAS  200,894

Johnny (16 files)
HHC20408 PAS   47,817    HHC20930 PAS  156,411
HHC20503 PAS  188,583    HHC21013 PAS  171,020
HHC20513 PAS  261,319    HHC21108 PAS  219,173
HHC20519 PAS  252,334    HHC30008 PAS  160,351
HHC20610 PAS  195,659    HHC30116 PAS  180,137
HHC20624 PAS  189,396    HHC30216 PAS  167,231
HHC20721 PAS  196,018    HHC30311 PAS  207,616
HHC20808 PAS  151,741    HHC30414 PAS  187,927

Jenny (20 files)
LLY20810 PAS   88,541    LLY30113 PAS  195,725
LLY20822 PAS  108,168    LLY30130 PAS  198,491
LLY20909 PAS   34,691    LLY30213 PAS  197,223
LLY20914 PAS  162,846    LLY30315 PAS  170,454
LLY20928 PAS  188,326    LLY30326 PAS  163,255
LLY21101 PAS  139,403    LLY30422 PAS  182,423
LLY21108 PAS  180,172    LLY30520 PAS  156,676
LLY21129 PAS  206,843    LLY30616 PAS  187,465
LLY30011 PAS  165,590    LLY30725 PAS  186,118
LLY30022 PAS  173,217    LLY30809 PAS  178,805

Chunyat (22 files)
CCC11008 PAS   80,465    CCC20523 PAS  207,903
CCC11100 PAS   31,915    CCC20608 PAS  193,847
CCC11121 PAS   50,132    CCC20624 PAS  170,164
CCC20110 PAS  173,940    CCC20706 PAS  207,702
CCC20117 PAS  100,270    CCC20713 PAS  196,504
CCC20206 PAS  145,920    CCC20800 PAS  200,565
CCC20213 PAS  192,652    CCC20817 PAS  153,548
CCC20307 PAS  185,136    CCC20907 PAS  203,038
CCC20323 PAS  188,144    CCC20923 PAS  176,641
CCC20410 PAS  177,589    CCC21013 PAS  196,346
CCC20507 PAS  143,956    CCC21027 PAS  217,228

2. The background of the 8 Cantonese-speaking children

a) Bohuen and Gakei

Both children were brought up in monolingual Cantonese-speaking working class families.
Bohuen's father worked in the warehouse of a mass transport company and her mother was a part-time piano teacher. The child had a brother who was about two years younger than her. They lived with the child's grandmother and uncle. The child had already started attending a nursery school when data collection started. After school, she was taken care of by her parents and grandmother.

Gakei's father was a technician in an electronics company and her mother was a housewife. They lived with the child's grandmother. Gakei's parents were both born in Hong Kong. The child was not enrolled in a nursery at any time during the period of data collection. She was taken care of entirely by her mother.

b) Bernard, Tsuntsun and Tinfaan

All three were Cantonese-speaking children living in Hong Kong. Tsuntsun and Tinfaan were born in Hong Kong, while Bernard was born in Kent, United Kingdom and was brought back to Hong Kong at the age of 8 1/2 months.

Tsuntsun was the only son of the family. His father was a Census & Survey Officer working in the government and his mother was a secondary school teacher of Chinese and Religious Studies. From birth, he lived in his maternal grandparents' house on weekdays and was taken care of by his grandmother. His parents visited him occasionally on weekday evenings and took him back home on Friday nights to stay over the weekend. They communicated in Cantonese. When Tsuntsun was 1 year 10 months old, his mother went to study for a year in the United Kingdom. He started to attend a nursery at the age of 2 years 1 month.

Bernard was the only son of the family. His father was a lecturer in the Division of Construction and Land Use of the Hong Kong Polytechnic. His mother was a lecturer in the English Language Teaching Unit of the Chinese University of Hong Kong. Bernard's mother brought him back from the United Kingdom at the age of 8 1/2 months.
He was then taken care of by his maternal grandmother at her house until the age of about 1 year 1 month. From that time to the age of 2 years 6 months, he was taken care of by a caretaker on weekdays. He communicated in Cantonese, though his parents occasionally introduced some English terms to him. He started to attend nursery play-groups at the age of 2 years 6 months.

Tinfaan was the youngest child in the family. She had a sister who was four years older. Her father was an engineer working in the government and her mother was a piano teacher teaching at home. For the first one and a half years after her birth, she was taken care of mostly by a Filipino helper while her mother worked as a school music teacher. After her mother stopped working in school, Tinfaan was mostly taken care of by her mother, except when her mother had to give piano lessons or go out, when Tinfaan was looked after by the Filipino helper. She communicated in Cantonese except when speaking to the Filipino helper, with whom she used 'something English-like' (as described by her mother). She started to attend kindergarten at the age of 2 years 9 months.

c) Johnny, Jenny and Chunyat

All three children were born in Hong Kong and were brought up in monolingual Cantonese-speaking families. They had not started going to a nursery during the period of data collection.

Jenny was the youngest child in the family. She had a brother who was ten years older and a sister who was four years older. Jenny's father was a businessman and her mother was a housewife. The family employed a Filipino helper, who spoke some Cantonese and English to the children.

Johnny was the youngest child in the family. He had a sister who was seven years older. His father was an engineer and his mother was a typist. The family employed a Thai helper, who spoke Cantonese to the children.

Chunyat was the only son in the family.
His father was a merchant and his mother taught English in a secondary school. They lived with the child's maternal grandparents.

3. Tags

Below is a summary list of the syntactic categories used in coding the corpus. The romanizations are based on the Cantonese romanization scheme of the Linguistic Society of Hong Kong (LSHK) (see Matthews and Yip 1994: 400-401).

    Category                              e.g.
 1. adj   = adjective                     hung4
 2. advf  = focus adverb                  zung6, dou1, jau6, zoi3
 3. advi  = adverb of intensity           hou3, gei2, gam3, zan1
 4. advm  = adverb of manner              maan6maan6dei2, ma4ma4dei2
 5. advs  = sentential adverb             bat1jyu4, gam2(joeng2), jat1cai4
 6. asp   = aspectual marker              zo2, zyu6, gan2, gwo3, hoi1
 7. aux   = auxiliary / modal verb        jing1goi1, hang2, ho2ji5, wui, sai2
 8. cl    = classifier                    go3, zek3, bun2, bui1, di1
 9. com   = comparative morpheme          gwo3 (as in dai6 gwo3), di1 (as in hung4 di1)
10. conj  = connective                    dan6hai6, tung4maai4, waak6ze2
11. corr  = correlative                   jut6...jut6, jau6...jau6, gam2...gam3, jat1...jat1
12. ctc   = clitic                        dak1, dou3
13. det   = determiner                    nei1, go2, dai6
14. dir   = directional verb              lok6, soeng5, ceot1, jap6, lai4
15. ex    = expressive utterance          baai1baai3, zou2san4
16. gen   = genitive marker               ge3
17. ins   = emphatic inserted marker      gwai2 (as in hou3 gwai2 leng3)
18. nn    = noun                          ping4gwo2, ba4ba1
19. nnloc = locative noun phrase          soeng6mien6, leoi3mien6
20. nnpr  = pronoun                       ngo5, nei3, keoi3
21. nnpp  = proper name                   tin1faan4, zeon3zeon3
22. neg   = negative morpheme             m4, mai6, mou5
23. prt   = post-verbal particle          faan1, sai3, can1, maai4, gwo3, ha2
24. prep  = preposition                   tung4maai2, hai2, bei2
25. q     = quantifier                    jat1, saam1, sap6, gei2, mui5
26. rfl   = reflexive pronoun             zi6gei2
27. sfp   = sentence final particle       &la3, &ga1 &ma3, &ge3 &le1
28. vd    = ditransitive verb             bai2, bei2
29. verg  = ergative verb                 dit3
30. vf    = function verb                 hai6, jau5, hai2
31. vi    = intransitive verb             siu3
32. vt    = transitive verb               teoi1
33. wh    = wh words                      mat1, mat1je5, dim2, dim2gaai2, dim2jeong2

4. Chinese characters

a) The corpus has three versions, all for use in the DOS environment. The Chinese version requires Eten 3.5 or later. As the data contain Cantonese characters which are not found in the standard GB or Big-5 character sets, we have created user fonts to represent those Cantonese characters which are in common use in Hong Kong, but not in China or Taiwan. Anyone using the Chinese version of the corpus will need to copy the following files (which come with the corpus) to their Eten subdirectory:

    usrfont.15m
    usrfont.24m

b) The romanized version is derived from the Chinese tagged corpus by means of a conversion program based on a dictionary. Since a character may have different pronunciations (due to language variation or context), the romanized data files sometimes give more than one romanized form for a single character, separated by '^', a convention suggested by Brian MacWhinney. For example, the Cantonese morpheme for 'you' can have an alveolar lateral initial or an alveolar nasal initial, so it is rendered as 'lei^nei' in the romanized data. The romanized corpus contains the categorial tags below each romanized utterance, but it does not contain English glosses. In time, we hope to find resources that will enable us to disambiguate the romanized forms and provide English glosses.

Both the Chinese version and the romanized version (without Chinese characters) are available by ftp from the following site:

    ftp address: humanum.arts.cuhk.hk
    - for the Chinese-only corpus:
      /usr2/ftp/pub/Faculty/lee_thomas/Canton_Corpus/Cantonese
    - for the romanizations-only corpus:
      /usr2/ftp/pub/Faculty/lee_thomas/Canton_Corpus/Romanized

c) The CHAT version now in the Childes archive incorporates the Chinese characters on a '%can' tier, with the romanizations on the main tier.
This amalgamation was done first by Brian MacWhinney, whose help and advice in the final stages of the corpus preparation are gratefully acknowledged, and then checked by the research team. This version has passed the CHECK test for format consistency.

5. Acknowledgments

The creation of this corpus was made possible by a three-year grant to Thomas Hun-tak Lee (Chinese University of Hong Kong), Colleen H Wong (Hong Kong Polytechnic University), and Samuel Leung (University of Hong Kong) [RGC earmarked grant CUHK 2/91]. The project was supported by two studentships from the Hong Kong Polytechnic awarded to Patricia Man and Alice Cheung, and a studentship from the University of Hong Kong awarded to Kitty Szeto. In addition, funding for the later stages of the project was provided by a direct grant from the Faculty of Arts, Chinese University of Hong Kong, a grant from the Freemason's Fund for East Asian Studies, and research assistantships from the Hong Kong Polytechnic University. The support of these funding agencies is hereby acknowledged.

Further details are given in the following report, which should be cited if data from this corpus are used:

Lee, Thomas H.T., Colleen H Wong, Samuel Leung, Patricia Man, Alice Cheung, Kitty Szeto, and Cathy S P Wong, "The Development of Grammatical Competence in Cantonese-speaking Children", Report of RGC earmarked grant 1991-94.
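Note on the '^' convention described in section 4(b): users who process the romanized data programmatically may want to separate the alternative readings of a form. The following is a minimal sketch in Python, not part of the corpus tools. The function names alternatives and expand are hypothetical, and the code assumes that forms on a romanized line are separated by whitespace, which this documentation does not state. The sample line is made up for illustration from forms quoted in this document and is not taken from the corpus.

```python
from typing import List


def alternatives(form: str) -> List[str]:
    """Return every romanization given for one form, e.g. 'lei^nei'."""
    return form.split("^")


def expand(line: str) -> List[List[str]]:
    """Split a romanized line into forms, each with its list of alternatives.

    Assumption: forms are separated by whitespace.
    """
    return [alternatives(form) for form in line.split()]


# Made-up sample line, not a corpus utterance.
print(expand("lei^nei siu3"))  # prints [['lei', 'nei'], ['siu3']]
```

A form without '^' yields a one-element list, so callers can treat ambiguous and unambiguous forms in the same way.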