Creating a Multilingual Corpus from Wikipedia Files

Luke Lindemann

February 19, 2021

Wikipedia currently has versions in over 300 different languages from about forty different language families and more than thirty different scripts. This makes it ideal for the collection of sample texts for multilingual corpora. I have collected and processed Wikipedia entries for every language version that contains 100+ articles. I limited each sample to 200,000 words. The fully cleaned text corpus (110 MB) can be downloaded here, and the raw extracted files (130 MB) can be downloaded here. Basic information about each language is in a chart at the end of this document. Below, I detail the process for creating a text corpus from Wikipedia.

Processing Wikipedia Files

The most recent text of any Wikipedia version can be downloaded as a Wikimedia dump file here. The complete list of languages and the Wikicodes for each language are here. The dump files extract to XML with Wikipedia markup code. In the past, I have used the WikiCorpus class from the Python Gensim library to process the texts. I believe this method works well for building a single massive English corpus, but I have been unable to use it adequately for my purpose: building a corpus of smaller sample texts from hundreds of dump files in different languages.

I typically only want to extract the first 200 or so articles from a file rather than process the entire file, which can take several hours for languages with millions of articles like Spanish and English. Processing all 300+ Wikipedia versions has taken me multiple days in the past. Furthermore, for many languages the processing fails to remove much of the markup code (especially text formatting), tables, and long irrelevant lists of topics or languages at the bottom of an article. As far as I am aware, the cleaning is done entirely through regular expressions, which are inadequate for handling recursively nested templates. Additionally, it deletes characters in some scripts, particularly Brahmi-derived Indic scripts like Devanagari.
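To illustrate why nesting is a problem for a single regular expression, here is a minimal sketch of a depth-counting alternative. It is not the code used by Gensim or by WikiTextExtract; the function name `strip_templates` is hypothetical, and it only handles `{{...}}` templates, not tables or links.

```python
def strip_templates(wikitext: str) -> str:
    """Remove {{...}} templates, including templates nested inside templates.

    Illustrative sketch only: a counter tracks how many templates are
    currently open, and text is kept only while that counter is zero.
    """
    kept = []
    depth = 0
    i = 0
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1          # entering a (possibly nested) template
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1          # leaving the innermost open template
            i += 2
        else:
            if depth == 0:      # only keep text outside every template
                kept.append(wikitext[i])
            i += 1
    return "".join(kept)
```

A non-greedy pattern such as `\{\{.*?\}\}` stops at the first closing pair, so on nested input it leaves the tail of the outer template behind. The counter above removes the whole outer template in one pass.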

I wrote a Python program, WikiTextExtract, which is available here, to extract text from a folder containing Wikipedia dump files. My priority is to capture long stretches of running text, so it is aggressive in filtering out markup code, lists, tables, and templates. You can set the maximum number of lines to read, which allows a portion of the file to be processed quickly, as well as a maximum number of either full articles or words to extract. It can also filter out short articles that fall below a minimum word threshold. Each article is listed separately, with a metadata line giving its title and the number of words in the article.
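The idea of reading only the start of a dump can be sketched as follows. This is not WikiTextExtract itself. It assumes the dump is a bz2-compressed XML file with `page`, `title`, and `text` elements, and the function name `iter_articles` and its parameters are hypothetical. It caps the number of pages rather than the number of lines, and it does not filter out redirects or non-article pages.

```python
import bz2
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(dump_path: str, max_articles: int) -> Iterator[Tuple[str, str]]:
    """Yield (title, wikitext) for the first `max_articles` pages of a dump.

    Assumption: `dump_path` points at a bz2-compressed MediaWiki XML export.
    """
    count = 0
    with bz2.open(dump_path, "rb") as handle:
        # iterparse streams the file, so nothing past the last page is read.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            title, text = "", ""
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text or ""
                elif name == "text":
                    text = child.text or ""
            yield title, text
            elem.clear()  # drop the finished page's text from memory
            count += 1
            if count >= max_articles:
                break  # stop early instead of decompressing the whole file
```

Because the generator stops after `max_articles` pages, the cost depends on how much is extracted rather than on the size of the Wikipedia version.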

For this corpus, I set a limit of 200,000 words for each language. For languages whose scripts do not separate words with spaces (Burmese, Chinese, Classical Chinese, Dzongkha, Gan, Japanese, Khmer, Lao, Mon, Shan, Thai, Tibetan, Yue), I instead took the first 100 articles. About half of the versions contain fewer than 200,000 words total. Each accepted article must be 100 words or longer, to avoid articles that consist of formulaic entries and lists. For a small number of languages which only contain short entries (Cheyenne, Cree, Inuktitut, Inupiak, Kongo, Greenlandic, Kashmiri, Lak, Madurese, Nauruan, Norfolk, Kirundi), I lowered the threshold to ten words and deleted repetitive entries by hand. I then scanned the resulting files by eye and deleted articles or portions of text in irrelevant languages, as well as remaining lists and headings. I often deleted the head article because it does not contain encyclopedia-style running text.
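The two settings described above, a minimum article length and a per-language word budget, can be sketched like this. The function name `select_articles` and its parameters are hypothetical. Counting words by splitting on whitespace, and stopping after the article that crosses the budget instead of truncating it, are both assumptions, not a description of what WikiTextExtract does.

```python
from typing import Iterable, List, Tuple


def select_articles(
    articles: Iterable[Tuple[str, str]],
    min_words: int = 100,
    max_words: int = 200_000,
) -> List[Tuple[str, int, str]]:
    """Keep articles of at least `min_words` words until `max_words` is reached.

    Returns (title, word_count, text) triples. Words are counted by
    whitespace splitting, which is an assumption and does not work for
    scripts that do not separate words with spaces.
    """
    kept = []
    total = 0
    for title, text in articles:
        n_words = len(text.split())
        if n_words < min_words:
            continue  # skip stubs, formulaic entries, and short lists
        kept.append((title, n_words, text))
        total += n_words
        if total >= max_words:
            break  # budget reached; the last article is kept whole
    return kept
```

For the languages with only short entries, the same function would be called with `min_words=10`.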

Cleaning the Texts

WikiTextExtract does not clean the text for processing. In other words, the resulting text leaves intact the uppercase letters, punctuation, indentation, etc. This is because I wanted it to be easy to read the extracted articles. For the cleaned texts, I used R to scan the text (ignoring the titles and word counts), making everything lowercase, and (for the Latin, Greek, Cyrillic, Arabic, and Hebrew scripts) deleting rare characters which occur with a frequency below 0.01%.
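The cleaning itself was done in R. The following Python sketch only illustrates the lowercasing and rare-character step. The function name `drop_rare_characters` is hypothetical, and measuring frequency as a share of all non-whitespace characters is an assumption about how the 0.01% threshold is computed.

```python
from collections import Counter


def drop_rare_characters(text: str, min_share: float = 0.0001) -> str:
    """Lowercase `text` and delete characters rarer than `min_share`.

    `min_share` is a fraction, so 0.0001 corresponds to 0.01%. Whitespace
    is always kept and is excluded from the frequency counts (assumption).
    """
    lowered = text.lower()
    counts = Counter(ch for ch in lowered if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return lowered
    frequent = {ch for ch, n in counts.items() if n / total >= min_share}
    return "".join(ch for ch in lowered if ch.isspace() or ch in frequent)
```

On a sample of this size, a character has to occur roughly once in every 10,000 characters to survive, which removes stray symbols while keeping the letters of the main orthography.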

I then restricted the text to characters that occur within the Unicode block of that language's script so that, for example, a Cyrillic text is cleaned of Latin and Arabic characters used in loanwords or quoted text. Six Wikipedia versions use multiple scripts (Cree, Tatar, Buginese, Pali, Inuktitut, and Kashmiri). For these languages, about half the articles are written in one script and half are written in the other, so I created two separate texts, restricting by one script and then the other.
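A restriction of this kind can be sketched with a character class built from a block's code point range. This is an illustration in Python, not the R code used for the corpus. The name `restrict_to_script` and the `SCRIPT_BLOCKS` table are hypothetical, the table lists only three example blocks, and deleting out-of-block characters outright (instead of replacing them with spaces) is an assumption.

```python
import re

# Code point ranges of a few Unicode blocks (illustrative subset only).
SCRIPT_BLOCKS = {
    "Greek": "\u0370-\u03FF",
    "Cyrillic": "\u0400-\u04FF",
    "Devanagari": "\u0900-\u097F",
}


def restrict_to_script(text: str, script: str) -> str:
    """Delete every character outside the given block, keeping whitespace."""
    outside = re.compile("[^" + SCRIPT_BLOCKS[script] + r"\s]")
    cleaned = outside.sub("", text)
    # Collapse the runs of spaces left behind by deleted words.
    return re.sub(r"[ \t]+", " ", cleaned).strip()
```

For a version written in two scripts, the same text would be passed through the function twice, once per script, to produce the two separate files.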

Deleting punctuation is trickier than using a [[:punct:]] regular expression, because many languages use punctuation symbols in their main orthographies. A noteworthy example is the apostrophe, and variant symbols, which are used phonemically in the Latin-based scripts of Acehnese, Aymara, Buginese, Central Bicolano, Chamorro, Cheyenne, Fula, Gorontalo, Guarani, Hawaiian, Igbo, Lojban, Madurese, Navajo, Malagasy, Maori, Nias, Quechua, Samoan, Tatar, Tetum, Uzbek, and Xhosa. There are also languages which use punctuation symbols in digraphs, including Breton and Catalan.

Update from January 24, 2024: The author of this blog post helpfully pointed out that the [[:punct:]] regular expression deletes characters (\u104c-\u104f) in the Burmese text. I have updated that text to fix the issue.
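One way to handle such exceptions is to strip punctuation by Unicode category while protecting a per-language set of characters. The sketch below is in Python, not R, and uses the Unicode general category "P" as a stand-in for [[:punct:]]; the two are not guaranteed to match the same characters. The function name `strip_punctuation` and the `keep` parameter are hypothetical.

```python
import unicodedata


def strip_punctuation(text: str, keep: str = "") -> str:
    """Delete punctuation characters except those listed in `keep`.

    A character counts as punctuation when its Unicode general category
    starts with "P" (assumption: this approximates [[:punct:]]).
    """
    return "".join(
        ch
        for ch in text
        if ch in keep or not unicodedata.category(ch).startswith("P")
    )


# Hypothetical per-language exceptions, following the cases described above.
HAWAIIAN_KEEP = "'"                          # apostrophe used as a letter
BURMESE_KEEP = "\u104c\u104d\u104e\u104f"    # the characters from the update
```

Calling `strip_punctuation(text, keep=HAWAIIAN_KEEP)` removes commas and periods but leaves the apostrophe in place, and the same pattern protects the Burmese characters mentioned in the update.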

For the other scripts, I did my best to research the punctuation and numeral characters for each script so that I could exclude them. I hope to have created a fairly simple and effective method for extracting text from Wikipedia pages.

The chart below gives basic information about each language.

Language WikiCode Family Subgroup Script Word Count
Abkhazian ab Caucasian Caucasian Cyrillic 58,100
Acehnese ace Malayo-Polynesian Other Malayo-Polynesian Latin 90,025
Adyghe ady Caucasian Caucasian Cyrillic 10,791
Afrikaans af Germanic Dutch Latin 198,349
Akan ak Niger-Congo Other Niger-Congo Latin 82,731
Albanian sq Albanian Albanian Latin 193,060
Alemannic als Germanic High German Latin 197,012
Amharic am Afro-Asiatic Semitic Ethiopic 193,388
Anglo Saxon ang Germanic Anglic Latin 109,182
Arabic ar Afro-Asiatic Semitic Arabic 195,629
Aragonese an Romance Iberian Latin 210,697
Aramaic arc Afro-Asiatic Semitic Syriac 9,236
Armenian hy Armenian Armenian Armenian 194,344
Aromanian roa_rup Romance Other Romance Latin 15,126
Assamese as Indic Eastern Indic Bengali 189,182
Asturian ast Romance Iberian Latin 206,485
Atikamekw atj Algonquian Algonquian Latin 34,454
Avar av Caucasian Caucasian Cyrillic 88,140
Awadhi awa Indic Central Indic Devanagari 84,403
Aymara ay Aymara Aymara Latin 23,822
Azerbaijani az Turkic Oghuz Latin 199,917
Balinese ban Malayo-Polynesian Other Malayo-Polynesian Latin 199,768
Bambara bm Mande Mande Latin 25,163
Banjar bjn Austronesian Austronesian Latin 199,761
Banyumasan map_bms Malayo-Polynesian Javanesic Latin 198,810
Bashkir ba Turkic Kipchak Cyrillic 194,138
Basque eu Vasconic Vasconic Latin 199,658
Bavarian bar Germanic High German Latin 197,795
Belarusian be Slavic East Slavic Cyrillic 192,440
Belarusian (Taraškievica) be_x_old Slavic East Slavic Cyrillic 189,806
Bengali bn Indic Eastern Indic Bengali 191,876
Bihari bh Indic Eastern Indic Devanagari 189,977
Bishnupriya Manipuri bpy Indic Eastern Indic Bengali 79,490
Bislama bi Creole English Creole Latin 3,145
Bosnian bs Slavic South Slavic Latin 195,997
Breton br Celtic Celtic Latin 206,119
Buginese (Latin) bug Malayo-Polynesian Other Malayo-Polynesian Latin 7,126
Buginese (Buginese) bug Malayo-Polynesian Other Malayo-Polynesian Buginese 2,158
Bulgarian bg Slavic South Slavic Cyrillic 194,249
Burmese my Tibeto-Burman Other Tibeto-Burman Myanmar 75,527
Buryat bxr Mongolic Mongolic Cyrillic 192,042
Cantonese zh_yue Tibeto-Burman Other Tibeto-Burman Chinese 31,888
Catalan ca Romance Gallo-Romance Latin 209,944
Cebuano ceb Malayo-Polynesian Philippine Latin 200,597
Central Bicolano bcl Malayo-Polynesian Philippine Latin 197,619
Chamorro ch Malayo-Polynesian Other Malayo-Polynesian Latin 6,547
Chechen ce Caucasian Caucasian Cyrillic 195,158
Cherokee chr Iroquoian Iroquoian Cherokee 11,835
Cheyenne chy Algonquian Algonquian Latin 318
Chichewa ny Niger-Congo Bantu Latin 99,634
Chinese zh Tibeto-Burman Other Tibeto-Burman Chinese 124,650
Chuvash cv Turkic Oghuz Cyrillic 188,153
Classical Chinese zh_classical Tibeto-Burman Other Tibeto-Burman Chinese 12,778
Cornish kw Celtic Celtic Latin 139,379
Corsican co Romance Italo-Dalmatian Latin 207,593
Cree (Canadian Syllabics) cr Algonquian Algonquian Canadian Syllabics 123
Cree (Latin) cr Algonquian Algonquian Latin 352
Crimean Tatar crh Turkic Kipchak Latin 54,339
Croatian hr Slavic South Slavic Latin 194,591
Czech cs Slavic West Slavic Latin 196,446
Danish da Germanic North Germanic Latin 196,664
Dinka din Nilotic Nilotic Latin 64,389
Divehi dv Indic Southern Indic Thaana 186,755
Doteli dty Indic Northern Indic Devanagari 187,576
Dutch nl Germanic Dutch Latin 197,198
Dutch Low Saxon nds_nl Germanic Low German Latin 199,215
Dzongkha dz Tibeto-Burman Other Tibeto-Burman Tibetan 24,147
Egyptian Arabic arz Afro-Asiatic Semitic Arabic 195,437
Emilian-Romagnol eml Romance Gallo-Italic Latin 162,961
English en Germanic Anglic Latin 199,564
Erzya myv Uralic Other Uralic Cyrillic 180,837
Esperanto eo Constructed Constructed Latin 197,285
Estonian et Uralic Finnic Latin 192,800
Ewe ee Niger-Congo Other Niger-Congo Latin 18,632
Extremaduran ext Romance Iberian Latin 200,577
Faroese fo Germanic North Germanic Latin 191,824
Fiji Hindi hif Indic Central Indic Latin 128,450
Fijian fj Malayo-Polynesian Other Malayo-Polynesian Latin 42,360
Finnish fi Uralic Finnic Latin 196,573
Franco-Provençal frp Romance Gallo-Romance Latin 58,163
French fr Romance Gallo-Romance Latin 233,025
Friulian fur Romance Other Romance Latin 198,007
Fula ff Niger-Congo Other Niger-Congo Latin 40,601
Gagauz gag Turkic Oghuz Latin 67,724
Galician gl Romance Iberian Latin 197,864
Gan gan Tibeto-Burman Other Tibeto-Burman Chinese 6,514
Georgian ka Kartvelian Kartvelian Georgian 190,379
German de Germanic High German Latin 197,197
Gilaki glk Iranian Iranian Arabic 182,832
Goan Konkani gom Indic Southern Indic Devanagari 197,428
Gorontalo gor Malayo-Polynesian Philippine Latin 53,579
Gothic got Germanic Other Germanic Gothic 9,352
Greek el Hellenic Hellenic Greek 192,045
Greenlandic kl Inuit Inuit Latin 18,203
Guarani gn Tupian Tupian Latin 198,012
Guianan Creole gcr Creole Portuguese Creole Latin 115,926
Gujarati gu Indic Western Indic Gujarati 195,421
Haitian ht Creole French Creole Latin 188,176
Hakka hak Tibeto-Burman Tibeto-Burman Latin Latin 84,588
Hausa ha Afro-Asiatic Other Afro-Asiatic Latin 200,918
Hawaiian haw Malayo-Polynesian Polynesian Latin 112,870
Hebrew he Afro-Asiatic Semitic Hebrew 194,947
Hill Mari mrj Uralic Mari Cyrillic 91,189
Hindi hi Indic Central Indic Devanagari 195,809
Hungarian hu Uralic Other Uralic Latin 198,457
Icelandic is Germanic North Germanic Latin 194,966
Ido io Constructed Constructed Latin 197,718
Igbo ig Niger-Congo Other Niger-Congo Latin 188,440
Ilokano ilo Malayo-Polynesian Philippine Latin 198,139
Inari Sami smn Uralic Sami Latin 34,021
Indonesian id Malayo-Polynesian Malayic Latin 201,308
Ingush inh Caucasian Caucasian Cyrillic 27,865
Interlingua ia Constructed Constructed Latin 195,759
Interlingue ie Constructed Constructed Latin 187,169
Inuktitut (Canadian Syllabics) iu Inuit Inuit Canadian Syllabics 1,923
Inuktitut (Latin) iu Inuit Inuit Latin 1,415
Inupiak ik Inuit Inuit Latin 692
Irish ga Celtic Celtic Latin 199,466
Italian it Romance Italo-Dalmatian Latin 204,004
Jamaican Patois jam Creole English Creole Latin 95,784
Japanese ja Japonic Japonic Japanese 74,348
Javanese jv Malayo-Polynesian Javanesic Latin 199,101
Kabardian Circassian kbd Caucasian Caucasian Cyrillic 91,720
Kabiye kbp Niger-Congo Other Niger-Congo Latin 206,498
Kabyle kab Afro-Asiatic Other Afro-Asiatic Latin 210,877
Kalmyk xal Mongolic Mongolic Cyrillic 23,437
Kannada kn Dravidian Dravidian Kannada 198,531
Kapampangan pam Malayo-Polynesian Philippine Latin 193,277
Karachay-Balkar krc Turkic Kipchak Cyrillic 145,979
Karakalpak kaa Turkic Kipchak Latin 121,182
Kashmiri (Arabic) ks Indic Other Indic Arabic 6,256
Kashmiri (Devanagari) ks Indic Other Indic Devanagari 3,641
Kashubian csb Slavic West Slavic Latin 120,564
Kazakh kk Turkic Kipchak Cyrillic 195,972
Khmer km Austroasiatic Austroasiatic Khmer 28,173
Kikuyu ki Niger-Congo Bantu Latin 10,016
Kinyarwanda rw Niger-Congo Bantu Latin 126,721
Kirghiz ky Turkic Kipchak Cyrillic 194,186
Kirundi rn Niger-Congo Bantu Latin 32,282
Komi kv Uralic Permic Cyrillic 139,004
Komi-Permyak koi Uralic Permic Cyrillic 82,138
Kongo kg Niger-Congo Bantu Latin 21,763
Korean ko Koreanic Koreanic Hangul 203,285
Kotava avk Constructed Constructed Latin 190,260
Kurdish ku Iranian Iranian Latin 201,217
Ladino lad Romance Iberian Latin 114,922
Ladin lld Romance Gallo-Romance Latin 114,922
Lak lbe Caucasian Caucasian Cyrillic 22,564
Lao lo Tai Tai Lao 21,063
Latgalian ltg Baltic Baltic Latin 41,482
Latin la Romance Other Romance Latin 193,514
Latvian lv Baltic Baltic Latin 193,369
Lezgian lez Caucasian Caucasian Cyrillic 194,005
Ligurian lij Romance Gallo-Italic Latin 214,470
Limburgish li Germanic Other Germanic Latin 201,569
Lingala ln Niger-Congo Bantu Latin 64,511
Lingua Franca Nova lfn Constructed Constructed Latin 196,299
Lithuanian lt Baltic Baltic Latin 194,671
Livvi-Karelian olo Uralic Other Uralic Latin 126,766
Lojban jbo Constructed Constructed Latin 59,390
Lombard lmo Romance Gallo-Italic Latin 208,682
Low Saxon nds Germanic Low German Latin 199,049
Lower Sorbian dsb Slavic West Slavic Latin 131,759
Luganda lg Niger-Congo Bantu Latin 204,897
Luxembourgish lb Germanic High German Latin 200,807
Macedonian mk Slavic South Slavic Cyrillic 199,719
Madurese mad Malayo-Polynesian Other Malayo-Polynesian Latin 46,468
Maithili mai Indic Eastern Indic Devanagari 186,573
Malagasy mg Malayo-Polynesian Other Malayo-Polynesian Latin 186,592
Malay ms Malayo-Polynesian Malayic Latin 198,741
Malayalam ml Dravidian Dravidian Malayalam 78,073
Maltese mt Afro-Asiatic Semitic Latin 241,726
Manx gv Celtic Celtic Latin 208,179
Maori mi Malayo-Polynesian Polynesian Latin 88,263
Marathi mr Indic Southern Indic Devanagari 188,409
Mazandarani mzn Iranian Iranian Arabic 198,174
Meadow Mari mhr Uralic Mari Cyrillic 199,570
Min Dong cdo Tibeto-Burman Tibeto-Burman Latin Latin 101,336
Min Nan zh_min_nan Tibeto-Burman Tibeto-Burman Latin Latin 309,129
Minangkabau min Malayo-Polynesian Malayic Latin 158,742
Mingrelian xmf Kartvelian Kartvelian Georgian 187,555
Mirandese mwl Romance Iberian Latin 203,845
Moksha mdf Moksha Moksha Cyrillic 34,186
Mon mnw Austroasiatic Austroasiatic Myanmar 42,496
Mongolian mn Mongolic Mongolic Cyrillic 193,476
Moroccan Arabic ary Afro-Asiatic Semitic Arabic 143,786
N'Ko nqo Mande Mande N'Ko 196,488
Nahuatl nah Uto-Aztecan Uto-Aztecan Latin 62,805
Nauruan na Malayo-Polynesian Other Malayo-Polynesian Latin 9,774
Navajo nv Athabaskan Athabaskan Latin 28,125
Neapolitan nap Romance Italo-Dalmatian Latin 185,831
Nepali ne Indic Northern Indic Devanagari 179,096
Newar new Tibeto-Burman Other Tibeto-Burman Devanagari 122,917
Nias nia Malayo-Polynesian Other Malayo-Polynesian Latin 52,144
Norfolk pih Creole English Creole Latin 21,202
Norman nrm Romance Gallo-Romance Latin 165,642
North Frisian frr Germanic Anglic Latin 191,321
Northern Sami se Uralic Sami Latin 100,856
Northern Sotho nso Niger-Congo Bantu Latin 25,626
Norwegian (Bokmål) no Germanic North Germanic Latin 196,357
Norwegian (Nynorsk) nn Germanic North Germanic Latin 193,537
Novial nov Constructed Constructed Latin 38,318
Occitan oc Romance Gallo-Romance Latin 208,924
Old Church Slavonic cu Slavic South Slavic Cyrillic 10,108
Oriya or Indic Eastern Indic Odia 184,928
Oromo om Afro-Asiatic Other Afro-Asiatic Latin 233,367
Ossetian os Iranian Iranian Cyrillic 194,122
Palatinate German pfl Germanic High German Latin 197,124
Pali (Latin) pi Indic Other Indic Latin 5,971
Pali (Devanagari) pi Indic Other Indic Devanagari 1,157
Pangasinan pag Malayo-Polynesian Philippine Latin 31,387
Papiamentu pap Creole Portuguese Creole Latin 195,235
Pashto ps Iranian Iranian Arabic 199,283
Pennsylvania German pdc Germanic High German Latin 37,682
Persian fa Iranian Iranian Arabic 201,909
Picard pcd Romance Gallo-Romance Latin 102,478
Piedmontese pms Romance Gallo-Italic Latin 218,355
Polish pl Slavic West Slavic Latin 195,119
Pontic pnt Hellenic Hellenic Greek 20,894
Portuguese pt Romance Iberian Latin 198,672
Punjabi pa Indic Northwestern Indic Gurmukhi 195,713
Quechua qu Quechua Quechua Latin 40,329
Ripuarian ksh Germanic High German Latin 198,243
Romani rmy Indic Western Indic Latin 15,487
Romanian ro Romance Other Romance Latin 210,279
Romansh rm Romance Other Romance Latin 208,626
Russian ru Slavic East Slavic Cyrillic 200,538
Rusyn rue Slavic West Slavic Cyrillic 187,752
Saraiki skr Indic Northwestern Indic Arabic 197,266
Sakha sah Turkic Siberian Cyrillic 191,813
Sakizaya szy Austronesian Austronesian Latin 194,641
Samoan sm Malayo-Polynesian Polynesian Latin 83,345
Samogitian bat_smg Baltic Baltic Latin 105,660
Sango sg Niger-Congo Other Niger-Congo Latin 3,746
Sanskrit sa Indic Other Indic Devanagari 177,498
Santali sat Austroasiatic Austroasiatic Ol Chiki 187,788
Sardinian sc Romance Other Romance Latin 204,144
Saterland Frisian stq Germanic Anglic Latin 186,897
Scots sco Germanic Anglic Latin 199,515
Scottish Gaelic gd Celtic Celtic Latin 206,651
Serbian sr Slavic South Slavic Cyrillic 190,415
Serbo-Croatian sh Slavic South Slavic Latin 192,522
Sesotho st Niger-Congo Bantu Latin 55,756
Shan shn Tai Tai Myanmar 31,048
Shona sn Niger-Congo Bantu Latin 76,706
Sicilian scn Romance Italo-Dalmatian Latin 204,500
Silesian szl Slavic West Slavic Latin 132,556
Simple English simple Germanic Anglic Latin 198,481
Sindhi sd Indic Northwestern Indic Arabic 184,506
Sinhalese si Indic Southern Indic Sinhala 188,511
Slovak sk Slavic West Slavic Latin 195,932
Slovenian sl Slavic South Slavic Latin 161,193
Somali so Afro-Asiatic Other Afro-Asiatic Latin 194,843
Sorani ckb Iranian Iranian Arabic 198,405
South Azerbaijani azb Turkic Oghuz Arabic 199,293
Spanish es Romance Iberian Latin 202,983
Sranan srn Creole English Creole Latin 12,032
Sundanese su Malayo-Polynesian Other Malayo-Polynesian Latin 103,181
Swahili sw Niger-Congo Bantu Latin 195,587
Swati ss Niger-Congo Bantu Latin 23,888
Swedish sv Germanic North Germanic Latin 198,498
Tagalog tl Malayo-Polynesian Philippine Latin 196,218
Tahitian ty Malayo-Polynesian Polynesian Latin 3,858
Tajik tg Iranian Iranian Cyrillic 190,743
Tamil ta Dravidian Dravidian Tamil 192,511
Tarantino roa_tara Romance Italo-Dalmatian Latin 205,377
Tatar (Latin) tt Turkic Kipchak Latin 60,991
Tatar (Cyrillic) tt Turkic Kipchak Cyrillic 128,935
Telugu te Dravidian Dravidian Telugu 193,017
Tetum tet Malayo-Polynesian Other Malayo-Polynesian Latin 108,003
Thai th Tai Tai Thai 38,480
Tibetan bo Tibeto-Burman Other Tibeto-Burman Tibetan 428,495
Tigrinya ti Afro-Asiatic Semitic Ethiopic 21,081
Tok Pisin tpi Creole English Creole Latin 16,046
Tongan to Malayo-Polynesian Polynesian Latin 36,884
Tsonga ts Niger-Congo Bantu Latin 88,346
Tswana tn Niger-Congo Bantu Latin 203,475
Tulu tcy Dravidian Dravidian Kannada 168,376
Tumbuka tum Niger-Congo Bantu Latin 9,480
Turkish tr Turkic Oghuz Latin 202,601
Turkmen tk Turkic Oghuz Latin 202,877
Tuvan tyv Turkic Siberian Cyrillic 208,827
Twi tw Niger-Congo Bantu Latin 35,061
Udmurt udm Uralic Permic Cyrillic 75,154
Ukrainian uk Slavic East Slavic Cyrillic 193,133
Upper Sorbian hsb Slavic West Slavic Latin 190,832
Urdu ur Indic Central Indic Arabic 195,713
Uyghur ug Turkic Karluk Arabic 164,208
Uzbek uz Turkic Karluk Latin 197,038
Venda ve Niger-Congo Bantu Latin 17,566
Venetian vec Romance Italo-Dalmatian Latin 200,507
Vepsian vep Uralic Finnic Latin 179,658
Vietnamese vi Austroasiatic Austroasiatic Latin 200,641
Volapük vo Constructed Constructed Latin 163,340
Võro fiu_vro Uralic Finnic Latin 122,791
Walloon wa Romance Gallo-Romance Latin 197,108
Waray-Waray war Malayo-Polynesian Philippine Latin 200,277
Welsh cy Celtic Celtic Latin 204,242
West Flemish vls Germanic Dutch Latin 200,520
West Frisian fy Germanic Anglic Latin 194,755
Western Armenian hyw Armenian Armenian Armenian 196,090
Western Punjabi pnb Indic Northwestern Indic Arabic 196,375
Wolof wo Niger-Congo Other Niger-Congo Latin 203,888
Wu wuu Tibeto-Burman Other Tibeto-Burman Chinese 6,303
Xhosa xh Niger-Congo Bantu Latin 130,727
Yiddish yi Germanic High German Hebrew 199,605
Yoruba yo Niger-Congo Other Niger-Congo Latin 153,063
Zamboanga Chavacano cbk_zam Creole Spanish Creole Latin 195,097
Zazaki diq Iranian Iranian Latin 198,835
Zeelandic zea Germanic Dutch Latin 204,440
Zhuang za Tai Tai Latin 4,509
Zulu zu Niger-Congo Bantu Latin 102,758