

Cultural elements, colloquialisms require local solutions, experts say

By OASIS HU in Hong Kong | China Daily | Updated: 2025-08-04 07:42
LU PING/CHINA DAILY

"Is there a wife in a wife cake?" Heyson He Lixi, an influencer promoting Cantonese, asked.

"There is no actual wife in a wife cake, as it is just a traditional Hong Kong pastry," answered Sensechat, the world's first Cantonese large language model, a ChatGPT-like product, in fluent Cantonese.

He raised a more challenging question, "What is sorrowful rice?"

"Sorrowful rice, a well-known delicacy in Hong Kong, derives its name from the martial arts masterpiece The Return of the Condor Heroes by Hong Kong writer Jin Yong. In the novel, the character Yang Guo experiences deep sorrow after losing his beloved, Xiaolongnyu," Sensechat responded.

He was impressed — it was the first artificial intelligence tool he had encountered that exhibited a profound understanding of Cantonese culture.

Born in 1998, He is fond of Cantonese and has been creating videos promoting the language since college. Over the years, he has amassed a fan base that surpasses 700,000 across various online platforms, including YouTube and popular Chinese mainland social media platforms like Bilibili, Xiaohongshu, and Douyin. Due to the recent AI craze, He started producing videos evaluating the tools' ability to understand Cantonese.

The results were underwhelming. Many AI large language models (LLMs) and AI software developed on the mainland fail to recognize Cantonese phonetically. Some Western-developed AI software can listen to Cantonese, but cannot speak it accurately. ChatGPT, for instance, often blends Cantonese with Mandarin. Suno, an AI large language model tool that specializes in generating songs, can pronounce Cantonese to a degree, but its primary focus remains music creation.

In July, the Sensetime Group, an AI developer based in Hong Kong, introduced Sensechat, a Cantonese version of its proprietary LLM, and announced that it would be available for free to Hong Kong users indefinitely.

Upon a friend's recommendation, He downloaded Sensechat.

"I felt 85 percent satisfied with Sensechat," he said. "The application still needs further refinement, but it is one of the few that can truly understand Cantonese."

The application emphasizes one of the unique traits of Cantonese — its colloquial nature.

Pronunciation of Cantonese involves extensive use of modal particles, which are often used at the end of sentences to indicate mood. These particles usually go unnoticed by most AI tools, but Sensechat captures them effectively.

In terms of written text, Sensechat can understand and reflect the nuances between the two forms of written Cantonese: a standardized form used in formal situations, similar to Mandarin, and a phonetic style for everyday use. This distinction, He said, is often overlooked by other large language models.

He recorded his interactions with Sensechat and shared them online, garnering over 150,000 views. "Cantonese speakers truly need such a tool," He said.

Data size matters

Training an LLM typically involves three stages, said Cao Jiannong, the chair professor in the Department of Computing at Hong Kong Polytechnic University.

The first stage requires pre-training using extensive data, followed by fine-tuning with high-quality data. In the third stage, humans are needed to align the output of the LLM with local culture, ethics, morals, laws, and other rules to restrict the risk of generating inaccurate, biased, or unlawful content.

Developing a Cantonese LLM faces difficulties in all three stages, Cao said.

While Hong Kong's internet infrastructure is relatively well-developed, there is a scarcity of Cantonese content available online. A major factor contributing to this scarcity is that while Cantonese is widely spoken in daily life, what its speakers write is mostly standard written Chinese rather than Cantonese as it is spoken.

Moreover, English has long served as the official language in Hong Kong. Consequently, much of the city's online information, including official archived documents in areas such as law, finance, politics, and medicine, is available mainly in English, Cao said.

LLMs rely heavily on abundant data for their training, said Francis Fong Po-kiu, honorary president of the Hong Kong Information Technology Federation, a local IT-related business association. Without data, there is simply no way to develop a language model, he said.

Literature scarcity

Cantonese web resources suffer not only from a shortage in quantity, but also a lack of quality, said Cao.

When it comes to written material, Hong Kong has not prioritized literature, resulting in a scarcity of quality Cantonese literary works, said Keith Li King-wah, chairman of the Hong Kong Wireless Technology Industry Association.

Most available Cantonese texts come from online forums and social media, and often contain low-quality and even offensive language, potentially leading AI models to produce crude content, Li said.

Collecting speech data presents another problem.

Although Cantonese videos such as movies and TV dramas are accessible online, they cannot be used as training data because of background noise, said Albert Lam Yun-sang, the chief technology officer and chief scientist at Fano Labs, a Hong Kong-based startup focusing on speech and language technologies.

Besides insufficient data, Cantonese's intricate linguistic characteristics are another obstacle in training an AI model.

The Economist magazine analyzed language learning time, and found that mastering Cantonese requires 88 weeks of study, placing it alongside Mandarin, Arabic, Japanese, and Korean in the top five most difficult languages to learn.

Lu Lewei, director of the Sensetime Research Institute, said that Cantonese is highly colloquial with numerous inflections. It has nine tones and even a slight variation in pronunciation can alter a word's meaning.
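The point about tones can be made concrete with a toy example. The sketch below is not drawn from Sensetime's system; it uses a hypothetical six-entry lexicon written in Jyutping, a common romanization that marks tone with a trailing digit from 1 to 6 (the traditional count of nine adds three "entering" tones on syllables that end in -p, -t, or -k). The characters and glosses are common textbook examples. It shows what happens when tone is ignored: six different words collapse into one string.

```python
from collections import defaultdict

# Hypothetical toy lexicon: the Jyutping syllable "si" in six tones.
# Characters and glosses are common textbook examples, not data from
# any system mentioned in the article.
LEXICON = {
    "si1": ("詩", "poem"),
    "si2": ("史", "history"),
    "si3": ("試", "to try"),
    "si4": ("時", "time"),
    "si5": ("市", "market"),
    "si6": ("事", "matter"),
}


def strip_tone(jyutping: str) -> str:
    """Drop the trailing tone digit, as a tone-insensitive recognizer would."""
    return jyutping.rstrip("123456")


def group_by_toneless(lexicon):
    """Group entries by their toneless syllable to expose the ambiguity."""
    groups = defaultdict(list)
    for syllable, (char, gloss) in lexicon.items():
        groups[strip_tone(syllable)].append((syllable, char, gloss))
    return dict(groups)


if __name__ == "__main__":
    for base, entries in group_by_toneless(LEXICON).items():
        print(f"{base}: {len(entries)} distinct words once tone is ignored")
        for syllable, char, gloss in entries:
            print(f"  {syllable}  {char}  {gloss}")
```

A speech model that cannot separate these tones has to guess among all six words from context alone, which is one reason tonal accuracy matters for both recognition and synthesis.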

The language also features a blend of Chinese and English and a mix of old and modern terms.

In language modeling, the simplicity of a language offers advantages. The more complex the language is, the harder it is for an AI model to learn it, Lam said.

Furthermore, underlying Cantonese is the local culture, which can be challenging for those tasked with aligning the output of large language models, Cao said.

Urgent need

Despite the difficulties involved in creating Cantonese AI models, demand for them is undeniable, said Fong from the Hong Kong Information Technology Federation.

The global Cantonese-speaking population is nearly 120 million, and 85.2 million of those are native Cantonese speakers.

In Hong Kong, 6.3 million residents, or 88.2 percent of the city's population, use Cantonese as their spoken language. In other cities within the Guangdong-Hong Kong-Macao Greater Bay Area, Cantonese is the predominant dialect, with 67 million residents in Guangdong province conversing in it.
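As a quick sanity check on these figures, the short calculation below derives two numbers the text implies but does not state: the share of Cantonese speakers who are native speakers, and the Hong Kong population base implied by the 88.2 percent figure. It treats "nearly 120 million" as exactly 120 million, which is an assumption, so the results are approximate.

```python
# Figures quoted in the article, in millions. "Nearly 120 million" is
# treated as 120.0 here, which is an approximation.
GLOBAL_SPEAKERS_M = 120.0
NATIVE_SPEAKERS_M = 85.2
HK_CANTONESE_SPEAKERS_M = 6.3
HK_CANTONESE_SHARE = 0.882


def native_share(native_m: float, total_m: float) -> float:
    """Fraction of all speakers who are native speakers."""
    return native_m / total_m


def implied_population(speakers_m: float, share: float) -> float:
    """Population base implied by a speaker count and its percentage share."""
    return speakers_m / share


if __name__ == "__main__":
    share = native_share(NATIVE_SPEAKERS_M, GLOBAL_SPEAKERS_M)
    base = implied_population(HK_CANTONESE_SPEAKERS_M, HK_CANTONESE_SHARE)
    print(f"Native speakers: about {share:.0%} of all Cantonese speakers")
    print(f"Implied Hong Kong population base: about {base:.2f} million")
```

On these inputs, native speakers make up about 71 percent of all Cantonese speakers, and the Hong Kong figures imply a population base of roughly 7.1 million.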

In the future, AI will be akin to today's computers and fundamentally a tool for the general public. Without Cantonese AI tools, Cantonese-only speakers may encounter significant inconvenience and marginalization in both the offline and online world, Cao said.

For a city, lack of AI expertise could result in decreased productivity in sectors such as education, healthcare, finance, and law. These limitations could impede the whole city's development, Cao added.

Fong said AI models from other countries or regions may struggle to grasp Cantonese culture accurately. This could lead to cultural or political misinterpretations, resulting in the spreading of incorrect messages.

Dependence on outside AI models could also put privacy and security at risk, Fong said.

Government officials, for instance, might face national security risks and local companies might leak data if they inadvertently disclose sensitive information to the models developed in foreign jurisdictions, he added.

Fong urged the Hong Kong Special Administrative Region government and local organizations to develop Cantonese LLMs.

In July, Sun Dong, Hong Kong's Secretary for Innovation, Technology, and Industry, announced that the SAR government is cooperating with local universities to develop a Hong Kong-based large language model.

A document co-pilot application for civil servants is now being used on a trial basis.

The model has already been implemented in Sun's department and the system will eventually become available to all Hong Kong residents, the secretary said.

The bureau said plans are underway to expand the pilot application to three other government bureaus, but it gave no indication when Hong Kong residents would gain access to it.

Fong said if it could be launched successfully, the government LLM would have many benefits.

It would be a positive step in resolving the issue of some Western AI models limiting their usage in Hong Kong. Also, implementing a localized AI model could safeguard privacy and provide more convenience to residents, Fong said.

Cao said it's unclear what specific features the government's AI model could offer and how it would distinguish itself from other similar products.

"I don't think the government has done enough research on what they want to do," Cao said.

Local startups

Local technology companies, meanwhile, are actively meeting the needs of the Cantonese-speaking market.

One startup, Votee AI, developed an open-source Cantonese LLM this year.

After years of operating in the local market, Votee AI has gathered substantial amounts of open-source Cantonese data along with primary data.

Taking a community-centered approach, they have also collaborated with local Cantonese linguists and AI researchers, including the team behind the online Cantonese dictionary "words.hk", to capture the nuances of Hong Kong speech.

Sensetime has also accumulated a vast reservoir of internal open-source data.

To build its dataset, the company has synthesized data using advanced technologies and bought supplementary information from external channels.

To combat the shortage of high-quality Cantonese data, Sensetime also collected audio Cantonese data from hundreds of its local employees.

Sensechat's clients include customer service providers, financial institutions, legal firms, healthcare companies, and others.

For Hong Kong residents, the company promises to provide the service for free indefinitely on both the web version and the mobile application.

A local tech industry insider, who chose to stay anonymous, said Sensechat should open-source its technology to allow more residents and organizations to access it freely, to benefit the city.

After trying the Sensechat platform, he said its understanding of some Hong Kong slang could be more precise. Nonetheless, "it should be recognized that Sensechat filled a void in the local market," he said.

Cultural roots

In addition to developing local AI models, existing mainstream language models should be encouraged to improve their Cantonese functions, said Li from the Hong Kong Wireless Technology Industry Association.

However, mainstream AI language models are primarily developed by commercial entities in the West. Without market demand, they may not be willing to enhance their products' Cantonese capabilities.

Li believes the Hong Kong SAR government and local organizations should take the lead in collecting Cantonese data, digitizing cultural content, and sharing these resources openly to enrich the Cantonese body of information.

Cantonese speakers can also actively use the language to engage with mainstream AI language models.

These actions can demonstrate to AI model developers that there is a market demand for Cantonese, while interaction with these models can also enhance their understanding of Cantonese culture.

The key to encouraging more people to use Cantonese lies in making Cantonese culture appealing, Li said.

Language is not just a communication tool; it encapsulates the cultural essence and identity of its speakers, he said.

The marginalized status of Cantonese in the digital sphere is a reflection of the decline of the cultural significance of the region.

In the 1970s and 1980s, Hong Kong, although just a city, was so culturally influential that Cantonese was a popular language around the world, Li said.

"At that time, the whole world watched Hong Kong movies and TVB (television shows), knew Jackie Chan and Bruce Lee, and sang Cantonese songs. However, in the present day, even many students in Hong Kong cannot speak Cantonese," he said.

"The focus of government policies should not only be on technology, but also on culture."

He, the influencer, said he learned Cantonese from his grandparents when he was a child, which later made him more proficient in the language than other school students. The confidence this gave him motivated him to become a Cantonese blogger.

However, as He grew older, Cantonese became so marginalized that even voice-operated devices and software in his home failed to understand Cantonese commands.

While He could communicate with these devices in Mandarin and English, his grandparents, who only speak Cantonese, struggled to keep pace.

He hopes that Cantonese LLMs will one day help his elderly grandparents manage their daily lives through voice-controlled apps capable of understanding Cantonese.
