A Hong Kong Cantonese Child Language Corpus

1. The corpus

This database contains longitudinal data on the language of 8 Cantonese-speaking children, each recorded for approximately one year. The corpus contains 171 files coded in CHILDES format and tagged with a set of 33 word class labels. The children were observed in their interactions with their caretakers, the investigator, and occasionally other adults who chatted with the children during the visits.

Three research students carried out the observations and the recording. Patricia Man recorded Bohuen and Gakei; Alice Cheung recorded Bernard, Tsuntsun and Tinfaan; and Kitty Szeto recorded Johnny, Jenny and Chunyat.

The names of the children and the ages during which they were recorded are as follows:

NAME             Sex   Age at which recording   No. of files
                       began and ended
Bohuen (wbh)     F     2;03;23 - 3;04;08        27
Gakei (cgk)      F     1;11;01 - 2;09;09        19
Bernard (mhz)    M     1;07;00 - 2;08;06        26
Tsuntsun (ckt)   M     1;05;22 - 2;07;22        25
Tinfaan (ltf)    F     2;02;10 - 3;02;18        16
Johnny (hhc)     M     2;04;08 - 3;04;14        16
Jenny (lly)      F     2;08;10 - 3;08;09        20
Chunyat (ccc)    M     1;10;08 - 2;10;27        22
total                                           171

The file name indicates the name of the child and his/her age at the time of the recording. Each filename is made up of the initials of the child (the first three characters) followed by his/her age at the time of recording, given as Year (1 character), Month (2 characters), and Day (2 characters). All files have the suffix 'tag'. For instance, the file WBH20323.tag contains tagged utterances of Bohuen (wbh) when she was 2 years 3 months and 23 days old.
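The filename convention described above can be decoded mechanically. The following is a minimal Python sketch, not part of the corpus distribution: the names `CHILDREN`, `Recording`, `parse_filename` and `age_label` are illustrative assumptions, and the initials are taken from the table of children in this document.

```python
import re
from typing import NamedTuple

# Initials and names as given in the table of children in this document.
CHILDREN = {
    "wbh": "Bohuen", "cgk": "Gakei", "mhz": "Bernard", "ckt": "Tsuntsun",
    "ltf": "Tinfaan", "hhc": "Johnny", "lly": "Jenny", "ccc": "Chunyat",
}


class Recording(NamedTuple):
    """Child name and age at recording (hypothetical helper type)."""
    child: str
    years: int
    months: int
    days: int


# Three letters of initials, then Year (1 digit), Month (2), Day (2),
# optionally followed by the '.tag' suffix.
_NAME = re.compile(r"^([a-z]{3})(\d)(\d{2})(\d{2})(?:\.tag)?$", re.IGNORECASE)


def parse_filename(filename: str) -> Recording:
    """Decode a corpus filename such as 'WBH20323.tag'."""
    match = _NAME.match(filename.strip())
    if match is None:
        raise ValueError(f"not a corpus filename: {filename!r}")
    initials, years, months, days = match.groups()
    child = CHILDREN.get(initials.lower())
    if child is None:
        raise ValueError(f"unknown child initials: {initials!r}")
    return Recording(child, int(years), int(months), int(days))


def age_label(rec: Recording) -> str:
    """Format the age in the year;month;day notation used in the table."""
    return f"{rec.years};{rec.months:02d};{rec.days:02d}"


if __name__ == "__main__":
    rec = parse_filename("WBH20323.tag")
    print(rec.child, age_label(rec))
```

The sketch accepts names with or without the suffix and in either case, since the file listing shows the names in upper case without a dot.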
The files of the 8 children and their respective sizes are listed below:

Bohuen (27 files)
WBH20323 TAG   48,310    WBH20926 TAG   58,567
WBH20330 TAG   36,851    WBH21002 TAG   24,547
WBH20402 TAG   20,993    WBH21016 TAG   30,688
WBH20405 TAG   23,270    WBH21023 TAG   32,193
WBH20406 TAG   33,430    WBH21106 TAG   18,256
WBH20413 TAG   22,811    WBH21114 TAG   29,467
WBH20414 TAG   11,570    WBH21128 TAG   33,004
WBH20415 TAG    9,842    WBH30010 TAG   36,238
WBH20506 TAG   19,630    WBH30101 TAG   39,535
WBH20609 TAG   20,679    WBH30220 TAG   46,243
WBH20619 TAG   11,922    WBH30303 TAG   31,646
WBH20703 TAG   21,648    WBH30312 TAG   36,715
WBH20714 TAG   52,696    WBH30408 TAG   60,664
WBH20919 TAG   49,106

Gakei (19 files)
CGK11101 TAG   88,455    CGK20318 TAG   32,461
CGK11108 TAG   38,193    CGK20325 TAG   31,182
CGK11122 TAG   21,503    CGK20408 TAG   45,931
CGK11129 TAG   55,229    CGK20430 TAG   68,227
CGK20008 TAG   35,575    CGK20503 TAG   36,682
CGK20207 TAG   42,112    CGK20711 TAG   59,831
CGK20221 TAG   68,347    CGK20808 TAG   65,528
CGK20228 TAG   31,564    CGK20818 TAG   79,915
CGK20304 TAG   56,490    CGK20909 TAG   95,250
CGK20311 TAG   37,122

Bernard (26 files)
MHZ10700 TAG   43,806    MHZ20115 TAG  108,119
MHZ10800 TAG   82,730    MHZ20129 TAG   83,027
MHZ10814 TAG   30,205    MHZ20212 TAG   80,311
MHZ10828 TAG   29,063    MHZ20226 TAG  104,314
MHZ10904 TAG  109,848    MHZ20309 TAG   92,704
MHZ10918 TAG   77,601    MHZ20328 TAG  115,580
MHZ10925 TAG  110,265    MHZ20407 TAG   94,178
MHZ11010 TAG   99,495    MHZ20421 TAG  116,225
MHZ11023 TAG  109,863    MHZ20504 TAG   91,050
MHZ11106 TAG   95,845    MHZ20519 TAG   91,565
MHZ20003 TAG   99,218    MHZ20604 TAG   88,006
MHZ20016 TAG  105,301    MHZ20618 TAG  135,700
MHZ20101 TAG  127,255    MHZ20806 TAG  114,159

Tsuntsun (25 files)
CKT10522 TAG   18,620    CKT20016 TAG  100,568
CKT10703 TAG   10,142    CKT20108 TAG  127,686
CKT10710 TAG   12,637    CKT20205 TAG  146,398
CKT10800 TAG  104,711    CKT20215 TAG  159,272
CKT10807 TAG   97,010    CKT20303 TAG  132,050
CKT10821 TAG   94,224    CKT20317 TAG  109,179
CKT10907 TAG  110,450    CKT20400 TAG  123,376
CKT10914 TAG   51,177    CKT20414 TAG  111,579
CKT10929 TAG  126,734    CKT20500 TAG  128,133
CKT11030 TAG  117,009    CKT20514 TAG  107,821
CKT11113 TAG  112,869    CKT20618 TAG  125,791
CKT11127 TAG  102,393    CKT20702 TAG  130,499
CKT20009 TAG  142,026

Tinfaan (16 files)
LTF20210 TAG   81,557    LTF20802 TAG  120,856
LTF20302 TAG   79,287    LTF20824 TAG  112,592
LTF20330 TAG   83,173    LTF20907 TAG   99,085
LTF20427 TAG   93,000    LTF21018 TAG  103,031
LTF20518 TAG   67,563    LTF21116 TAG  124,369
LTF20601 TAG   91,859    LTF30020 TAG  119,667
LTF20705 TAG  115,502    LTF30121 TAG  122,662
LTF20720 TAG  108,327    LTF30218 TAG  112,754

Johnny (16 files)
HHC20408 TAG   27,537    HHC20930 TAG   91,674
HHC20503 TAG  108,963    HHC21013 TAG   95,724
HHC20513 TAG  145,320    HHC21108 TAG  123,220
HHC20519 TAG  142,874    HHC30008 TAG   90,986
HHC20610 TAG  109,595    HHC30116 TAG  103,856
HHC20624 TAG  109,195    HHC30216 TAG   94,360
HHC20721 TAG  113,419    HHC30311 TAG  117,899
HHC20808 TAG   89,101    HHC30414 TAG  105,328

Jenny (20 files)
LLY20810 TAG   49,414    LLY30113 TAG  108,249
LLY20822 TAG   62,064    LLY30130 TAG  111,824
LLY20909 TAG   20,464    LLY30213 TAG  109,727
LLY20914 TAG   91,461    LLY30315 TAG   94,778
LLY20928 TAG  108,323    LLY30326 TAG   91,557
LLY21101 TAG   79,233    LLY30422 TAG  103,201
LLY21108 TAG  102,046    LLY30520 TAG   86,685
LLY21129 TAG  116,200    LLY30616 TAG  103,543
LLY30011 TAG   92,309    LLY30725 TAG  104,530
LLY30022 TAG   95,716    LLY30809 TAG   97,213

Chunyat (22 files)
CCC11008 TAG   44,585    CCC20523 TAG  120,525
CCC11100 TAG   19,170    CCC20608 TAG  111,275
CCC11121 TAG   28,616    CCC20624 TAG   98,160
CCC20110 TAG   98,190    CCC20706 TAG  117,815
CCC20117 TAG   57,474    CCC20713 TAG  111,903
CCC20206 TAG   83,178    CCC20800 TAG  114,384
CCC20213 TAG  109,816    CCC20817 TAG   87,806
CCC20307 TAG  105,992    CCC20907 TAG  115,663
CCC20323 TAG   97,494    CCC20923 TAG  102,006
CCC20410 TAG  100,141    CCC21013 TAG  111,722
CCC20507 TAG   83,663    CCC21027 TAG  123,784

2. The background of the 8 Cantonese-speaking children

a) Bohuen and Gakei

Both children were brought up in monolingual Cantonese-speaking working-class families.

Bohuen's father was working in the warehouse of a mass transport company and her mother was a part-time piano teacher.
The child has a brother who is about two years younger than her. They lived with the child's grandmother and uncle. The child had already started attending a nursery school when data collection started. After school, she was taken care of by her parents and grandmother.

Gakei's father was a technician in an electronics company and her mother was a housewife. They lived with the child's grandmother. Gakei's parents were both born in Hong Kong. The child was not enrolled in a nursery at any time during the period of data collection. She was taken care of entirely by her mother.

b) Bernard, Tsuntsun and Tinfaan

All three are Cantonese-speaking children living in Hong Kong. Tsuntsun and Tinfaan were born in Hong Kong, while Bernard was born in Kent, United Kingdom and was brought back to Hong Kong at the age of 8 1/2 months.

Tsuntsun is the only son of the family. His father was a Census & Survey Officer working in the government and his mother was a secondary school teacher of Chinese and Religious Studies. Since his birth, he had been living in his maternal grandparents' house on weekdays and was taken care of by his grandmother. His parents visited him occasionally on weekday evenings and took him back home on Friday nights to stay over the weekend. They communicated in Cantonese. When Tsuntsun was 1 year 10 months old, his mother went to study for a year in the United Kingdom. He started to attend a nursery at the age of 2 years 1 month.

Bernard is the only son of the family. His father was a lecturer in the Division of Construction and Land Use of the Hong Kong Polytechnic. His mother was a lecturer in the English Language Teaching Unit of the Chinese University of Hong Kong. Bernard's mother brought him back from the United Kingdom at the age of 8 1/2 months. He was then taken care of by his maternal grandmother at her house until the age of about 1 year 1 month.
From that time to the age of 2 years 6 months, he was taken care of by a caretaker during the weekdays. He communicated in Cantonese, though his parents occasionally introduced some English terms to him. He started to attend nursery play-groups at the age of 2 years 6 months.

Tinfaan is the youngest child in the family. She has a sister who is four years older. Her father was an engineer working in the government and her mother was a piano teacher teaching at home. During the first one and a half years of her life, she was taken care of mostly by a Filipino helper while her mother worked as a school music teacher. After her mother stopped working at the school, Tinfaan was mostly taken care of by her mother, except when her mother had to give piano lessons or go out, when Tinfaan was looked after by the Filipino helper. She communicated in Cantonese except when speaking to the Filipino helper, with whom she used 'something English-like' (as described by her mother). She started to attend kindergarten at the age of 2 years 9 months.

c) Johnny, Jenny and Chunyat

All three children were born in Hong Kong and were brought up in monolingual Cantonese-speaking families. They had not started going to a nursery during the period of data collection.

Jenny is the youngest child in the family. She has an elder brother who is ten years older and an elder sister who is four years older. Jenny's father was a businessman and her mother was a housewife. The family employed a Filipino helper, who spoke some Cantonese and English to the children.

Johnny is the youngest child in the family. He has an elder sister who is seven years older. His father was an engineer and his mother was a typist. The family employed a Thai helper, who spoke Cantonese to the children.

Chunyat is the only son in the family. His father was a merchant and his mother taught English in a secondary school. They lived with the child's maternal grandparents.

3. Tags

Below is a summary list of the syntactic categories used in coding the corpus. The romanizations are based on the Cantonese romanization scheme of the Linguistic Society of Hong Kong (LSHK) (see Matthews and Yip 1994: 400-401).

     Category                                  e.g.
 1.  adj   = adjective                         hung4
 2.  advf  = focus adverb                      zung6, dou1, jau6, zoi3
 3.  advi  = adverb of intensity               hou3, gei2, gam13, zan1
 4.  advm  = adverb of manner                  maan6maan6dei2, ma4ma4dei2
 5.  advs  = sentential adverb                 bat1jyu4, gam2(joeng2), jat1cai4
 6.  asp   = aspectual marker                  zo2, zyu6, gan2, gwo3, hoi1
 7.  aux   = auxiliary / modal verb            jing1goi1, hang2, ho2ji5, wui, sai2
 8.  cl    = classifier                        go3, zek3, bun2, bui1, di1
 9.  com   = comparative morpheme              gwo3 (as in dai6 gwo3), di1 (as in hung4 di1)
10.  conj  = connective                        dan6hai6, tung4maai4, waak6ze2
11.  corr  = correlative                       jut6...jut6, jau6...jau6, gam2...gam3, jat1...jat1
12.  ctc   = clitic                            dak1, dou3
13.  det   = determiner                        nei1, go2, dai6
14.  dir   = directional verb                  lok6, soeng5, ceot1, jap6, lai4
15.  ex    = expressive utterance              baai1baai3, zou2san4
16.  gen   = genitive marker                   ge3
17.  ins   = emphatic inserted marker          gwai2 (as in hou3 gwai2 leng3)
18.  nn    = noun                              ping4gwo2, ba4ba1
19.  nnloc = locative noun phrase              soeng6mien6, leoi3mien6
20.  nnpr  = pronoun                           ngo5, nei3, keoi3
21.  nnpp  = proper name                       tin1faan4, zeon3zeon3
22.  neg   = negative morpheme                 m4, mai6, mou5
23.  prt   = post-verbal particle              faan1, sai3, can1, maai4, gwo3, ha2
24.  prep  = preposition                       tung4maai2, hai2, bei2
25.  q     = quantifier                        jat1, saam1, sap6, gei2, mui5
26.  rfl   = reflexive pronoun                 zi6gei2
27.  sfp   = sentence final particle           &la3, &ga1 &ma3, &ge3 &le1
28.  vd    = ditransitive verb                 bai2, bei2
29.  verg  = ergative verb                     dit3
30.  vf    = function verb                     hai6, jau5, hai2
31.  vi    = intransitive verb                 siu3
32.  vt    = transitive verb                   teoi1
33.  wh    = wh words                          mat1, mat1je5, dim2, dim2gaai2, dim2jeong2

4. Chinese characters

a) The corpus has two versions, a Chinese version and a romanized version, both for use in the DOS environment. The Chinese version requires Eten 3.5 or a later version.
As the data contain Cantonese characters which are not found in the standard GB or Big-5 character sets, we have created userfonts to represent these characters, which are in common use in Hong Kong but not in China or Taiwan. Anyone using the Chinese version of the corpus will need to copy the following files (which come with the corpus) to their Eten subdirectory:

usrfont.15m
usrfont.24m

b) The romanized version is derived from the Chinese tagged corpus by means of a conversion program based on a dictionary. Since a character may have different pronunciations (due to language variation or context), the romanized data files sometimes give more than one romanized form for a single character, separated by '^', a convention suggested by Brian MacWhinney. For example, the Cantonese morpheme for 'you' can have either an alveolar lateral initial or an alveolar nasal initial, so it is rendered as 'lei^nei' in the romanized data. The romanized corpus contains the categorial tags below each romanized utterance, but it does not contain English glosses. In time, we hope to seek resources to enable us to disambiguate the romanized forms and provide English glosses.

5. Acknowledgments

The creation of this corpus was made possible by a three-year grant to Thomas Hun-tak Lee (Chinese University of Hong Kong), Colleen H Wong (Hong Kong Polytechnic University), and Samuel Leung (University of Hong Kong) [RGC earmarked grant CUHK 2/91]. The project was supported by two studentships from the Hong Kong Polytechnic awarded to Patricia Man and Alice Cheung, and a studentship from the University of Hong Kong awarded to Kitty Szeto. In addition, funding for the later stages of the project was provided by a direct grant from the Faculty of Arts, Chinese University of Hong Kong, a grant from the Freemason's Fund for East Asian Studies, and research assistantships from the Hong Kong Polytechnic University. The support of these funding agencies is hereby acknowledged.
Further details are given in the following report, which should be cited if data from this corpus are used:

Lee, Thomas H.T., Colleen H Wong, Samuel Leung, Patricia Man, Alice Cheung, Kitty Szeto, and Cathy S P Wong, "The Development of Grammatical Competence in Cantonese-speaking Children", Report of RGC earmarked grant 1991-94.
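
Note on processing the romanized version: the '^' convention described in section 4(b) means that a single token may carry several alternative romanized forms. The following minimal Python sketch shows one way to expand such tokens. It assumes that tokens within an utterance are separated by whitespace, which this document does not state, and the function names `alternatives` and `ambiguous_tokens` are illustrative only.

```python
from typing import List


def alternatives(token: str) -> List[str]:
    """Split a romanized token such as 'lei^nei' into its alternative forms.

    A token without '^' yields a one-element list.
    """
    return [form for form in token.split("^") if form]


def ambiguous_tokens(utterance: str) -> List[str]:
    """Return the tokens of an utterance that carry more than one form.

    Assumption: tokens are separated by whitespace.
    """
    return [tok for tok in utterance.split() if len(alternatives(tok)) > 1]


if __name__ == "__main__":
    # 'lei^nei' is the example given in section 4(b).
    print(alternatives("lei^nei"))
```

Listing the ambiguous tokens in this way may help when the alternative forms are later disambiguated by hand.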