Luke Lindemann
February 19, 2021
Wikipedia currently has versions in over 300 different languages from about forty different language families and more than thirty different scripts. This makes it ideal for the collection of sample texts for multilingual corpora. I have collected and processed Wikipedia entries for every language version that contains 100+ articles. I limited each sample to 200,000 words. The fully cleaned text corpus (110 MB) can be downloaded here, and the raw extracted files (130 MB) can be downloaded here. Basic information about each language is in a chart at the end of this document. Below, I detail the process for creating a text corpus from Wikipedia.
The most recent text of any Wikipedia version can be downloaded as a Wikimedia dump file here. The complete list of languages and the Wikicodes for each language are here. The dump files extract to XML containing Wikipedia markup code. In the past, I have used the WikiCorpus function from the Python Gensim module to process the texts. I believe this method works well for building a single massive English corpus, but I have been unable to use it adequately for my purpose: building a corpus of smaller sample texts from hundreds of dump files in different languages.
I typically want to extract only the first 200 or so articles from a file rather than process the entire file, which can take several hours for languages with millions of articles, like Spanish and English. Processing all 300+ Wikipedia versions has taken me multiple days in the past. Furthermore, for many languages the processing fails to remove much of the markup code (especially text formatting), tables, and long irrelevant lists of topics or languages at the bottom of an article. As far as I am aware, the cleaning is done entirely through regular expressions, which is inadequate for handling recursively nested templates. Additionally, it deletes characters in some scripts, particularly Brahmi-derived Indic scripts like Devanagari.
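To illustrate why nesting is the sticking point, here is a minimal sketch of a depth-counting pass that removes `{{...}}` templates, including templates inside templates. The function name `strip_templates` is hypothetical and this is not the code used in WikiTextExtract; it only shows the general idea of tracking nesting depth instead of matching with a single regular expression.

```python
def strip_templates(wikitext: str) -> str:
    """Remove {{...}} templates, including templates nested inside templates."""
    out = []
    depth = 0  # how many unclosed "{{" we are currently inside
    i = 0
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            # Only keep characters that sit outside every template.
            if depth == 0:
                out.append(wikitext[i])
            i += 1
    return "".join(out)


cleaned = strip_templates("A {{cite|{{date|2020}}|x}} B")
```

Here the inner `{{date|2020}}` is closed before the outer template, so everything between the first `{{` and the last `}}` is dropped and only the text outside remains.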
I wrote a Python program, WikiTextExtract, which is available here, to extract text from a folder containing Wikipedia dump files. My priority is to capture long stretches of running text, so it is aggressive in filtering out markup code, lists, tables, and templates. You can set the maximum number of lines to read, which allows a portion of the file to be processed quickly, as well as a maximum number of either full articles or words to extract. It can also filter out short articles that fall below a minimum word threshold. Each article is preceded by a metadata line that gives its title and the number of words in the article.
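The idea of reading only the start of a dump can be sketched with the standard library's streaming XML parser. This is an illustration under assumptions, not the WikiTextExtract source: the names `open_dump`, `local_name`, and `iter_pages` are hypothetical, it limits by page count rather than by line count, and it ignores details of real dumps such as redirects and non-article namespaces.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def open_dump(path):
    """Open a .bz2 dump as a byte stream without decompressing it to disk."""
    return bz2.open(path, "rb")


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on each tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream, max_pages):
    """Yield (title, wikitext) for the first max_pages <page> elements."""
    if max_pages <= 0:
        return
    title, text, count = "", "", 0
    for _, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            elem.clear()  # free the memory held by the finished page
            count += 1
            if count >= max_pages:
                return  # stop reading; the rest of the file is never parsed


# Tiny in-memory stand-in for a dump file, so the sketch runs on its own.
sample = io.BytesIO(
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>A</title><revision><text>alpha</text></revision></page>"
    b"<page><title>B</title><revision><text>beta</text></revision></page>"
    b"<page><title>C</title><revision><text>gamma</text></revision></page>"
    b"</mediawiki>"
)
pages = list(iter_pages(sample, max_pages=2))
```

With a real dump, `iter_pages(open_dump(path), 200)` would stop after 200 pages, which is what keeps the run short even for the largest Wikipedia versions.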
For this corpus, I set a limit of 200,000 words for each language. For languages whose scripts do not separate words with spaces (Burmese, Chinese, Classical Chinese, Dzongkha, Gan, Japanese, Khmer, Lao, Mon, Shan, Thai, Tibetan, and Yue), I took the first 100 articles instead. About half of the versions contain fewer than 200,000 words in total. Each accepted article must be 100 words or longer, to avoid articles that consist of formulaic entries and lists. For a small number of languages that contain only short entries (Cheyenne, Cree, Inuktitut, Inupiak, Kongo, Greenlandic, Kashmiri, Lak, Madurese, Nauruan, Norfolk, and Kirundi), I lowered the threshold to ten words and deleted repetitive entries by hand. I then scanned the resulting files by eye and deleted articles or portions of text in irrelevant languages, as well as remaining lists and headings. I often deleted the head article as well, because it does not contain encyclopedia-style running text.
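The selection rule described above (a minimum article length plus a per-language word budget) can be sketched as follows. The function name `select_articles` is hypothetical, words are counted by splitting on whitespace, and stopping after the article that crosses the budget, rather than truncating it, is an assumption.

```python
def select_articles(articles, min_words=100, max_total_words=200_000):
    """Keep articles of at least min_words until the word budget is reached.

    `articles` is any iterable of (title, text) pairs. Returns the kept
    (title, word_count, text) triples and the total number of words kept.
    """
    selected = []
    total = 0
    for title, text in articles:
        n_words = len(text.split())  # whitespace-delimited word count
        if n_words < min_words:
            continue  # skip stubs, formulaic entries, and short lists
        selected.append((title, n_words, text))
        total += n_words
        if total >= max_total_words:
            break  # budget reached; later articles are never examined
    return selected, total


# Small thresholds so the behaviour is easy to follow.
toy = [("a", "w " * 5), ("b", "w " * 2), ("c", "w " * 6), ("d", "w " * 9)]
kept, total_words = select_articles(toy, min_words=3, max_total_words=10)
```

In the toy run, article "b" is too short, and "d" is never reached because "a" and "c" together already exceed the budget.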
WikiTextExtract does not clean the text for processing: the extracted text keeps its uppercase letters, punctuation, indentation, and so on, because I wanted the extracted articles to be easy to read. For the cleaned texts, I used R to scan the text (ignoring the titles and word counts), convert everything to lowercase, and (for the Latin, Greek, Cyrillic, Arabic, and Hebrew scripts) delete rare characters that occur with a frequency below 0.01%.
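The cleaning itself was done in R. Purely as an illustration of the lowercasing and rare-character step, here is a Python sketch. The name `drop_rare_chars` is hypothetical, and computing the frequency over non-whitespace characters only is an assumption about how the 0.01% threshold is applied.

```python
from collections import Counter


def drop_rare_chars(text, min_share=0.0001):
    """Lowercase text and delete characters rarer than min_share (0.01%)."""
    text = text.lower()
    # Count every non-whitespace character in the lowercased text.
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return text
    keep = {ch for ch, n in counts.items() if n / total >= min_share}
    # Whitespace is always kept so that word boundaries survive.
    return "".join(ch for ch in text if ch.isspace() or ch in keep)


# One stray character among 20,000 letters falls below the 0.01% threshold.
cleaned = drop_rare_chars("Ab " * 10000 + "ž")
```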
I then restricted the text to characters that occur within the Unicode block of that language's script so that, for example, a Cyrillic text is cleaned of Latin and Arabic characters used in loanwords or quoted text. Six Wikipedia versions use multiple scripts (Cree, Tatar, Buginese, Pali, Inuktitut, and Kashmiri). For these languages, about half the articles are written in one script and half are written in the other, so I created two separate texts, restricting by one script and then the other.
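A minimal Python sketch of restricting text to one script's Unicode block is given below. The names `SCRIPT_RANGES` and `restrict_to_script` are hypothetical, only the main block of each script is listed (supplementary and extended blocks are left out), and keeping whitespace is an assumption. Note that a block also contains that script's own digits and punctuation, which still need separate handling.

```python
import re

# Main Unicode block for a few scripts (extended blocks omitted for brevity).
SCRIPT_RANGES = {
    "Cyrillic": "\u0400-\u04FF",
    "Arabic": "\u0600-\u06FF",
    "Devanagari": "\u0900-\u097F",
}


def restrict_to_script(text, script):
    """Delete every character outside the script's block, keeping whitespace."""
    outside = re.compile(f"[^{SCRIPT_RANGES[script]}\\s]")
    return outside.sub("", text)


# The Latin-script word is removed; the spaces around it remain.
cleaned = restrict_to_script("это email адрес", "Cyrillic")
```

For a version written in two scripts, the same text can be passed through `restrict_to_script` once per script to produce the two separate outputs.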
Deleting punctuation is trickier than using a [[:punct:]] regular expression, because many languages use punctuation symbols in their main orthographies. A noteworthy example is the apostrophe, along with its variant symbols, which is used phonemically in the Latin-based scripts of Acehnese, Aymara, Buginese, Central Bicolano, Chamorro, Cheyenne, Fula, Gorontalo, Guarani, Hawaiian, Igbo, Lojban, Madurese, Navajo, Malagasy, Maori, Nias, Quechua, Samoan, Tatar, Tetum, Uzbek, and Xhosa. There are also languages which use punctuation symbols in digraphs, including Breton and Catalan.
Update from January 24, 2024: The author of this blog post helpfully pointed out that the [[:punct:]] regular expression deletes characters (\u104c-\u104f) in the Burmese text. I have updated that text to fix the issue.
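One way to express "delete punctuation, except the symbols this orthography needs" is a per-language keep set. The sketch below is in Python rather than R, the name `strip_punctuation` and the two keep sets are hypothetical, and it relies on Unicode general categories (those starting with "P") as a stand-in for [[:punct:]].

```python
import unicodedata

# Hypothetical keep sets: characters that count as punctuation in Unicode
# but belong to the orthography of the language being cleaned.
KEEP_APOSTROPHES = frozenset({"'", "\u2019"})
KEEP_BURMESE = frozenset({"\u104c", "\u104d", "\u104e", "\u104f"})


def strip_punctuation(text, keep=frozenset()):
    """Delete punctuation characters except those listed in `keep`."""
    return "".join(
        ch
        for ch in text
        if ch in keep or not unicodedata.category(ch).startswith("P")
    )


# The apostrophe survives; the comma and exclamation mark do not.
uzbek = strip_punctuation("o'zbek, tili!", keep=KEEP_APOSTROPHES)
# U+104F is kept, while the section mark U+104A is removed.
burmese = strip_punctuation("\u1000\u104f\u104a", keep=KEEP_BURMESE)
```

Keeping the exceptions in named sets makes it easy to review, language by language, which symbols are being protected.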
For the other scripts, I did my best to research the punctuation and numeral characters for each script so that I could exclude them. I hope to have created a fairly simple and effective method for extracting text from Wikipedia pages.
The chart below gives basic information about each language.
Language | WikiCode | Family | Subgroup | Script | Word Count |
---|---|---|---|---|---|
Abkhazian | ab | Caucasian | Caucasian | Cyrillic | 58,100 |
Acehnese | ace | Malayo-Polynesian | Other Malayo-Polynesian | Latin | 90,025 |
Adyghe | ady | Caucasian | Caucasian | Cyrillic | 10,791 |
Afrikaans | af | Germanic | Dutch | Latin | 198,349 |
Akan | ak | Niger-Congo | Other Niger-Congo | Latin | 82,731 |
Albanian | sq | Albanian | Albanian | Latin | 193,060 |
Alemannic | als | Germanic | High German | Latin | 197,012 |
Amharic | am | Afro-Asiatic | Semitic | Ethiopic | 193,388 |
Anglo Saxon | ang | Germanic | Anglic | Latin | 109,182 |
Arabic | ar | Afro-Asiatic | Semitic | Arabic | 195,629 |
Aragonese | an | Romance | Iberian | Latin | 210,697 |
Aramaic | arc | Afro-Asiatic | Semitic | Syriac | 9,236 |
Armenian | hy | Armenian | Armenian | Armenian | 194,344 |
Aromanian | roa_rup | Romance | Other Romance | Latin | 15,126 |
Assamese | as | Indic | Eastern Indic | Bengali | 189,182 |
Asturian | ast | Romance | Iberian | Latin | 206,485 |
Atikamekw | atj | Algonquian | Algonquian | Latin | 34,454 |
Avar | av | Caucasian | Caucasian | Cyrillic | 88,140 |
Awadhi | awa | Indic | Central Indic | Devanagari | 84,403 |
Aymara | ay | Aymara | Aymara | Latin | 23,822 |
Azerbaijani | az | Turkic | Oghuz | Latin | 199,917 |
Balinese | ban | Malayo-Polynesian | Other Malayo-Polynesian | Latin | 199,768 |
Bambara | bm | Mande | Mande | Latin | 25,163 |
Banjar | bjn | Austronesian | Austronesian | Latin | 199,761 |
Banyumasan | map_bms | Malayo-Polynesian | Javanesic | Latin | 198,810 |
Bashkir | ba | Turkic | Kipchak | Cyrillic | 194,138 |
Basque | eu | Vasconic | Vasconic | Latin | 199,658 |
Bavarian | bar | Germanic | High German | Latin | 197,795 |
Belarusian | be | Slavic | East Slavic | Cyrillic | 192,440 |
Belarusian (Taraškievica) | be_x_old | Slavic | East Slavic | Cyrillic | 189,806 |
Bengali | bn | Indic | Eastern Indic | Bengali | 191,876 |
Bihari | bh | Indic | Eastern Indic | Devanagari | 189,977 |
Bishnupriya Manipuri | bpy | Indic | Eastern Indic | Bengali | 79,490 |
Bislama | bi | Creole | English Creole | Latin | 3,145 |
Bosnian | bs | Slavic | South Slavic | Latin | 195,997 |
Breton | br | Celtic | Celtic | Latin | 206,119 |
Buginese (Latin) | bug | Malayo-Polynesian | Other Malayo-Polynesian | Latin | 7,126 |
Buginese (Buginese) | bug | Malayo-Polynesian | Other Malayo-Polynesian | Buginese | 2,158 |
Bulgarian | bg | Slavic | South Slavic | Cyrillic | 194,249 |
Burmese | my | Tibeto-Burman | Other Tibeto-Burman | Myanmar | 75,527 |
Buryat | bxr | Mongolic | Mongolic | Cyrillic | 192,042 |
Cantonese | zh_yue | Tibeto-Burman | Other Tibeto-Burman | Chinese | 31,888 |
Catalan | ca | Romance | Gallo-Romance | Latin | 209,944 |
Cebuano | ceb | Malayo-Polynesian | Philippine | Latin | 200,597 |
Central Bicolano | bcl | Malayo-Polynesian | Philippine | Latin | 197,619 |
Chamorro | ch | Malayo-Polynesian | Other Malayo-Polynesian | Latin | 6,547 |
Chechen | ce | Caucasian | Caucasian | Cyrillic | 195,158 |
Cherokee | chr | Iroquoian | Iroquoian | Cherokee | 11,835 |
Cheyenne | chy | Algonquian | Algonquian | Latin | 318 |
Chichewa | ny | Niger-Congo | Bantu | Latin | 99,634 |
Chinese | zh | Tibeto-Burman | Other Tibeto-Burman | Chinese | 124,650 |
Chuvash | cv | Turkic | Oghuz | Cyrillic | 188,153 |
Classical Chinese | zh_classical | Tibeto-Burman | Other Tibeto-Burman | Chinese | 12,778 |
Cornish | kw | Celtic | Celtic | Latin | 139,379 |
Corsican | co | Romance | Italo-Dalmatian | Latin | 207,593 |
Cree (Canadian Syllabics) | cr | Algonquian | Algonquian | Canadian Syllabics | 123 |
Cree (Latin) | cr | Algonquian | Algonquian | Latin | 352 |
Crimean Tatar | crh | Turkic | Kipchak | Latin | 54,339 |
Croatian | hr | Slavic | South Slavic | Latin | 194,591 |
Czech | cs | Slavic | West Slavic | Latin | 196,446 |
Danish | da | Germanic | North Germanic | Latin | 196,664 |
Dinka | din | Nilotic | Nilotic | Latin | 64,389 |
Divehi | dv | Indic | Southern Indic | Thaana | 186,755 |
Doteli | dty | Indic | Northern Indic | Devanagari | 187,576 |
Dutch | nl | Germanic | Dutch | Latin | 197,198 |
Dutch Low Saxon | nds_nl | Germanic | Low German | Latin | 199,215 |
Dzongkha | dz | Tibeto-Burman | Other Tibeto-Burman | Tibetan | 24,147 |
Egyptian Arabic | arz | Afro-Asiatic | Semitic | Arabic | 195,437 |
Emilian-Romagnol | eml | Romance | Gallo-Italic | Latin | 162,961 |
English | en | Germanic | Anglic | Latin | 199,564 |
Erzya | myv | Uralic | Other Uralic | Cyrillic | 180,837 |
Esperanto | eo | Constructed | Constructed | Latin | 197,285 |
Estonian | et | Uralic | Finnic | Latin | 192,800 |
Ewe | ee | Niger-Congo | Other Niger-Congo | Latin | 18,632 |
Extremaduran | ext | Romance | Iberian | Latin | 200,577 |
Faroese | fo | Germanic | North Germanic | Latin | 191,824 |
Fiji Hindi | hif | Indic | Central Indic | Latin | 128,450 |
Fijian | fj | Malayo-Polynesian | Other Malayo-Polynesian | Latin | 42,360 |
Finnish | fi | Uralic | Finnic | Latin | 196,573 |
Franco-Provençal | frp | Romance | Gallo-Romance | Latin | 58,163 |
French | fr | Romance | Gallo-Romance | Latin | 233,025 |
Friulian | fur | Romance | Other Romance | Latin | 198,007 |
Fula | ff | Niger-Congo | Other Niger-Congo | Latin | 40,601 |
Gagauz | gag | Turkic | Oghuz | Latin | 67,724 |
Galician | gl | Romance | Iberian | Latin | 197,864 |
Gan | gan | Tibeto-Burman | Other Tibeto-Burman | Chinese | 6,514 |
Georgian | ka | Kartvelian | Kartvelian | Georgian | 190,379 |
German | de | Germanic | High German | Latin | 197,197 |
Gilaki | glk | Iranian | Iranian | Arabic | 182,832 |
Goan Konkani | gom | Indic | Southern Indic | Devanagari | 197,428 |
Gorontalo | gor | Malayo-Polynesian | Philippine | Latin | 53,579 |
Gothic | got | Germanic | Other Germanic | Gothic | 9,352 |
Greek | el | Hellenic | Hellenic | Greek | 192,045 |
Greenlandic | kl | Inuit | Inuit | Latin | 18,203 |
Guarani | gn | Tupian | Tupian | Latin | 198,012 |
Guianan Creole | gcr | Creole | French Creole | Latin | 115,926 |
Gujarati | gu | Indic | Western Indic | Gujarati | 195,421 |
Haitian | ht | Creole | French Creole | Latin | 188,176 |
Hakka | hak | Tibeto-Burman | Tibeto-Burman Latin | Latin | 84,588 |
Hausa | ha | Afro-Asiatic | Other Afro-Asiatic | Latin | 200,918 |
Hawaiian | haw | Malayo-Polynesian | Polynesian | Latin | 112,870 |
Hebrew | he | Afro-Asiatic | Semitic | Hebrew | 194,947 |
Hill Mari | mrj | Uralic | Mari | Cyrillic | 91,189 |
Hindi | hi | Indic | Central Indic | Devanagari | 195,809 |
Hungarian | hu | Uralic | Other Uralic | Latin | 198,457 |
Icelandic | is | Germanic | North Germanic | Latin | 194,966 |
Ido | io | Constructed | Constructed | Latin | 197,718 |
Igbo | ig | Niger-Congo | Other Niger-Congo | Latin | 188,440 |
Ilokano | ilo | Malayo-Polynesian | Philippine | Latin | 198,139 |
Inari Sami | smn | Uralic | Sami | Latin | 34,021 |
Indonesian | id | Malayo-Polynesian | Malayic | Latin | 201,308 |
Ingush | inh | Caucasian | Caucasian | Cyrillic | 27,865 |
Interlingua | ia | Constructed | Constructed | Latin | 195,759 |
Interlingue | ie | Constructed | Constructed | Latin | 187,169 |
Inuktitut (Canadian Syllabics) | iu | Inuit | Inuit | Canadian Syllabics | 1,923 |
Inuktitut (Latin) | iu | Inuit | Inuit | Latin | 1,415 |
Inupiak | ik | Inuit | Inuit | Latin | 692 |
Irish | ga | Celtic | Celtic | Latin | 199,466 |
Italian | it | Romance | Italo-Dalmatian | Latin | 204,004 |
Jamaican Patois | jam | Creole | English Creole | Latin | 95,784 |
Japanese | ja | Japonic | Japonic | Japanese | 74,348 |
Javanese | jv | Malayo-Polynesian | Javanesic | Latin | 199,101 |
Kabardian Circassian | kbd | Caucasian | Caucasian | Cyrillic | 91,720 |
Kabiye | kbp | Niger-Congo | Other Niger-Congo | Latin | 206,498 |
Kabyle | kab | Afro-Asiatic | Other Afro-Asiatic | Latin | 210,877 |
Kalmyk | xal | Mongolic | Mongolic | Cyrillic | 23,437 |
Kannada | kn | Dravidian | Dravidian | Kannada | 198,531 |
Kapampangan | pam | Malayo-Polynesian | Philippine | Latin | 193,277 |
Karachay-Balkar | krc | Turkic | Kipchak | Cyrillic | 145,979 |
Karakalpak | kaa | Turkic | Kipchak | Latin | 121,182 |
Kashmiri (Arabic) | ks | Indic | Other Indic | Arabic | 6,256 |
Kashmiri (Devanagari) | ks | Indic | Other Indic | Devanagari | 3,641 |
Kashubian | csb | Slavic | West Slavic | Latin | 120,564 |
Kazakh | kk | Turkic | Kipchak | Cyrillic | 195,972 |
Khmer | km | Austroasiatic | Austroasiatic | Khmer | 28,173 |
Kikuyu | ki | Niger-Congo | Bantu | Latin | 10,016 |
Kinyarwanda | rw | Niger-Congo | Bantu | Latin | 126,721 |
Kirghiz | ky | Turkic | Kipchak | Cyrillic | 194,186 |
Kirundi | rn | Niger-Congo | Bantu | Latin | 32,282 |
Komi | kv | Uralic | Permic | Cyrillic | 139,004 |
Komi-Permyak | koi | Uralic | Permic | Cyrillic | 82,138 |
Kongo | kg | Niger-Congo | Bantu | Latin | 21,763 |
Korean | ko | Koreanic | Koreanic | Hangul | 203,285 |
Kotava | avk | Constructed | Constructed | Latin | 190,260 |
Kurdish | ku | Iranian | Iranian | Latin | 201,217 |
Ladino | lad | Romance | Iberian | Latin | 114,922 |
Ladin | lld | Romance | Gallo-Romance | Latin | 114,922 |
Lak | lbe | Caucasian | Caucasian | Cyrillic | 22,564 |
Lao | lo | Tai | Tai | Lao | 21,063 |
Latgalian | ltg | Baltic | Baltic | Latin | 41,482 |
Latin | la | Romance | Other Romance | Latin | 193,514 |
Latvian | lv | Baltic | Baltic | Latin | 193,369 |
Lezgian | lez | Caucasian | Caucasian | Cyrillic | 194,005 |
Ligurian | lij | Romance | Gallo-Italic | Latin | 214,470 |
Limburgish | li | Germanic | Other Germanic | Latin | 201,569 |
Lingala | ln | Niger-Congo | Bantu | Latin | 64,511 |
Lingua Franca Nova | lfn | Constructed | Constructed | Latin | 196,299 |
Lithuanian | lt | Baltic | Baltic | Latin | 194,671 |
Livvi-Karelian | olo | Uralic | Other Uralic | Latin | 126,766 |
Lojban | jbo | Constructed | Constructed | Latin | 59,390 |
Lombard | lmo | Romance | Gallo-Italic | Latin | 208,682 |
Low Saxon | nds | Germanic | Low German | Latin | 199,049 |
Lower Sorbian | dsb | Slavic | West Slavic | Latin | 131,759 |
Luganda | lg | Niger-Congo | Bantu | Latin | 204,897 |
Luxembourgish | lb | Germanic | High German | Latin | 200,807 |
Macedonian | mk | Slavic | South Slavic | Cyrillic | 199,719 |
Madurese | mad | Malayo-Polynesian | Other Malayo-Polynesian | Latin | 46,468 |
Maithili | mai | Indic | Eastern Indic | Devanagari | 186,573 |
Malagasy | mg | Malayo-Polynesian | Other Malayo-Polynesian | Latin | 186,592 |
Malay | ms | Malayo-Polynesian | Malayic | Latin | 198,741 |
Malayalam | ml | Dravidian | Dravidian | Malayalam | 78,073 |
Maltese | mt | Afro-Asiatic | Semitic | Latin | 241,726 |
Manx | gv | Celtic | Celtic | Latin | 208,179 |
Maori | mi | Malayo-Polynesian | Polynesian | Latin | 88,263 |
Marathi | mr | Indic | Southern Indic | Devanagari | 188,409 |
Mazandarani | mzn | Iranian | Iranian | Arabic | 198,174 |
Meadow Mari | mhr | Uralic | Mari | Cyrillic | 199,570 |
Min Dong | cdo | Tibeto-Burman | Tibeto-Burman Latin | Latin | 101,336 |
Min Nan | zh_min_nan | Tibeto-Burman | Tibeto-Burman Latin | Latin | 309,129 |
Minangkabau | min | Malayo-Polynesian | Malayic | Latin | 158,742 |
Mingrelian | xmf | Kartvelian | Kartvelian | Georgian | 187,555 |
Mirandese | mwl | Romance | Iberian | Latin | 203,845 |
Moksha | mdf | Moksha | Moksha | Cyrillic | 34,186 |
Mon | mnw | Austroasiatic | Austroasiatic | Myanmar | 42,496 |
Mongolian | mn | Mongolic | Mongolic | Cyrillic | 193,476 |
Moroccan Arabic | ary | Afro-Asiatic | Semitic | Arabic | 143,786 |
N'Ko | nqo | Mande | Mande | N'Ko | 196,488 |
Nahuatl | nah | Uto-Aztecan | Uto-Aztecan | Latin | 62,805 |
Nauruan | na | Malayo-Polynesian | Other Malayo-Polynesian | Latin | 9,774 |
Navajo | nv | Athabaskan | Athabaskan | Latin | 28,125 |
Neapolitan | nap | Romance | Italo-Dalmatian | Latin | 185,831 |
Nepali | ne | Indic | Northern Indic | Devanagari | 179,096 |
Newar | new | Tibeto-Burman | Other Tibeto-Burman | Devanagari | 122,917 |
Nias | nia | Malayo-Polynesian | Other Malayo-Polynesian | Latin | 52,144 |
Norfolk | pih | Creole | English Creole | Latin | 21,202 |
Norman | nrm | Romance | Gallo-Romance | Latin | 165,642 |
North Frisian | frr | Germanic | Anglic | Latin | 191,321 |
Northern Sami | se | Uralic | Sami | Latin | 100,856 |
Northern Sotho | nso | Niger-Congo | Bantu | Latin | 25,626 |
Norwegian (Bokmål) | no | Germanic | North Germanic | Latin | 196,357 |
Norwegian (Nynorsk) | nn | Germanic | North Germanic | Latin | 193,537 |
Novial | nov | Constructed | Constructed | Latin | 38,318 |
Occitan | oc | Romance | Gallo-Romance | Latin | 208,924 |
Old Church Slavonic | cu | Slavic | South Slavic | Cyrillic | 10,108 |
Oriya | or | Indic | Eastern Indic | Odia | 184,928 |
Oromo | om | Afro-Asiatic | Other Afro-Asiatic | Latin | 233,367 |
Ossetian | os | Iranian | Iranian | Cyrillic | 194,122 |
Palatinate German | pfl | Germanic | High German | Latin | 197,124 |
Pali (Latin) | pi | Indic | Other Indic | Latin | 5,971 |
Pali (Devanagari) | pi | Indic | Other Indic | Devanagari | 1,157 |
Pangasinan | pag | Malayo-Polynesian | Philippine | Latin | 31,387 |
Papiamentu | pap | Creole | Portuguese Creole | Latin | 195,235 |
Pashto | ps | Iranian | Iranian | Arabic | 199,283 |
Pennsylvania German | pdc | Germanic | High German | Latin | 37,682 |
Persian | fa | Iranian | Iranian | Arabic | 201,909 |
Picard | pcd | Romance | Gallo-Romance | Latin | 102,478 |
Piedmontese | pms | Romance | Gallo-Italic | Latin | 218,355 |
Polish | pl | Slavic | West Slavic | Latin | 195,119 |
Pontic | pnt | Hellenic | Hellenic | Greek | 20,894 |
Portuguese | pt | Romance | Iberian | Latin | 198,672 |
Punjabi | pa | Indic | Northwestern Indic | Gurmukhi | 195,713 |
Quechua | qu | Quechua | Quechua | Latin | 40,329 |
Ripuarian | ksh | Germanic | High German | Latin | 198,243 |
Romani | rmy | Indic | Western Indic | Latin | 15,487 |
Romanian | ro | Romance | Other Romance | Latin | 210,279 |
Romansh | rm | Romance | Other Romance | Latin | 208,626 |
Russian | ru | Slavic | East Slavic | Cyrillic | 200,538 |
Rusyn | rue | Slavic | West Slavic | Cyrillic | 187,752 |
Saraiki | skr | Indic | Northwestern Indic | Arabic | 197,266 |
Sakha | sah | Turkic | Siberian | Cyrillic | 191,813 |
Sakizaya | szy | Austronesian | Austronesian | Latin | 194,641 |
Samoan | sm | Malayo-Polynesian | Polynesian | Latin | 83,345 |
Samogitian | bat_smg | Baltic | Baltic | Latin | 105,660 |
Sango | sg | Niger-Congo | Other Niger-Congo | Latin | 3,746 |
Sanskrit | sa | Indic | Other Indic | Devanagari | 177,498 |
Santali | sat | Austroasiatic | Austroasiatic | Ol Chiki | 187,788 |
Sardinian | sc | Romance | Other Romance | Latin | 204,144 |
Saterland Frisian | stq | Germanic | Anglic | Latin | 186,897 |
Scots | sco | Germanic | Anglic | Latin | 199,515 |
Scottish Gaelic | gd | Celtic | Celtic | Latin | 206,651 |
Serbian | sr | Slavic | South Slavic | Cyrillic | 190,415 |
Serbo-Croatian | sh | Slavic | South Slavic | Latin | 192,522 |
Sesotho | st | Niger-Congo | Bantu | Latin | 55,756 |
Shan | shn | Tai | Tai | Myanmar | 31,048 |
Shona | sn | Niger-Congo | Bantu | Latin | 76,706 |
Sicilian | scn | Romance | Italo-Dalmatian | Latin | 204,500 |
Silesian | szl | Slavic | West Slavic | Latin | 132,556 |
Simple English | simple | Germanic | Anglic | Latin | 198,481 |
Sindhi | sd | Indic | Northwestern Indic | Arabic | 184,506 |
Sinhalese | si | Indic | Southern Indic | Sinhala | 188,511 |
Slovak | sk | Slavic | West Slavic | Latin | 195,932 |
Slovenian | sl | Slavic | South Slavic | Latin | 161,193 |
Somali | so | Afro-Asiatic | Other Afro-Asiatic | Latin | 194,843 |
Sorani | ckb | Iranian | Iranian | Arabic | 198,405 |
South Azerbaijani | azb | Turkic | Oghuz | Arabic | 199,293 |
Spanish | es | Romance | Iberian | Latin | 202,983 |
Sranan | srn | Creole | English Creole | Latin | 12,032 |
Sundanese | su | Malayo-Polynesian | Other Malayo-Polynesian | Latin | 103,181 |
Swahili | sw | Niger-Congo | Bantu | Latin | 195,587 |
Swati | ss | Niger-Congo | Bantu | Latin | 23,888 |
Swedish | sv | Germanic | North Germanic | Latin | 198,498 |
Tagalog | tl | Malayo-Polynesian | Philippine | Latin | 196,218 |
Tahitian | ty | Malayo-Polynesian | Polynesian | Latin | 3,858 |
Tajik | tg | Iranian | Iranian | Cyrillic | 190,743 |
Tamil | ta | Dravidian | Dravidian | Tamil | 192,511 |
Tarantino | roa_tara | Romance | Italo-Dalmatian | Latin | 205,377 |
Tatar (Latin) | tt | Turkic | Kipchak | Latin | 60,991 |
Tatar (Cyrillic) | tt | Turkic | Kipchak | Cyrillic | 128,935 |
Telugu | te | Dravidian | Dravidian | Telugu | 193,017 |
Tetum | tet | Malayo-Polynesian | Other Malayo-Polynesian | Latin | 108,003 |
Thai | th | Tai | Tai | Thai | 38,480 |
Tibetan | bo | Tibeto-Burman | Other Tibeto-Burman | Tibetan | 428,495 |
Tigrinya | ti | Afro-Asiatic | Semitic | Ethiopic | 21,081 |
Tok Pisin | tpi | Creole | English Creole | Latin | 16,046 |
Tongan | to | Malayo-Polynesian | Polynesian | Latin | 36,884 |
Tsonga | ts | Niger-Congo | Bantu | Latin | 88,346 |
Tswana | tn | Niger-Congo | Bantu | Latin | 203,475 |
Tulu | tcy | Dravidian | Dravidian | Kannada | 168,376 |
Tumbuka | tum | Niger-Congo | Bantu | Latin | 9,480 |
Turkish | tr | Turkic | Oghuz | Latin | 202,601 |
Turkmen | tk | Turkic | Oghuz | Latin | 202,877 |
Tuvan | tyv | Turkic | Siberian | Cyrillic | 208,827 |
Twi | tw | Niger-Congo | Bantu | Latin | 35,061 |
Udmurt | udm | Uralic | Permic | Cyrillic | 75,154 |
Ukrainian | uk | Slavic | East Slavic | Cyrillic | 193,133 |
Upper Sorbian | hsb | Slavic | West Slavic | Latin | 190,832 |
Urdu | ur | Indic | Central Indic | Arabic | 195,713 |
Uyghur | ug | Turkic | Karluk | Arabic | 164,208 |
Uzbek | uz | Turkic | Karluk | Latin | 197,038 |
Venda | ve | Niger-Congo | Bantu | Latin | 17,566 |
Venetian | vec | Romance | Italo-Dalmatian | Latin | 200,507 |
Vepsian | vep | Uralic | Finnic | Latin | 179,658 |
Vietnamese | vi | Austroasiatic | Austroasiatic | Latin | 200,641 |
Volapük | vo | Constructed | Constructed | Latin | 163,340 |
Võro | fiu_vro | Uralic | Finnic | Latin | 122,791 |
Walloon | wa | Romance | Gallo-Romance | Latin | 197,108 |
Waray-Waray | war | Malayo-Polynesian | Philippine | Latin | 200,277 |
Welsh | cy | Celtic | Celtic | Latin | 204,242 |
West Flemish | vls | Germanic | Dutch | Latin | 200,520 |
West Frisian | fy | Germanic | Anglic | Latin | 194,755 |
Western Armenian | hyw | Armenian | Armenian | Armenian | 196,090 |
Western Punjabi | pnb | Indic | Northwestern Indic | Arabic | 196,375 |
Wolof | wo | Niger-Congo | Other Niger-Congo | Latin | 203,888 |
Wu | wuu | Tibeto-Burman | Other Tibeto-Burman | Chinese | 6,303 |
Xhosa | xh | Niger-Congo | Bantu | Latin | 130,727 |
Yiddish | yi | Germanic | High German | Hebrew | 199,605 |
Yoruba | yo | Niger-Congo | Other Niger-Congo | Latin | 153,063 |
Zamboanga Chavacano | cbk_zam | Creole | Spanish Creole | Latin | 195,097 |
Zazaki | diq | Iranian | Iranian | Latin | 198,835 |
Zeelandic | zea | Germanic | Dutch | Latin | 204,440 |
Zhuang | za | Tai | Tai | Latin | 4,509 |
Zulu | zu | Niger-Congo | Bantu | Latin | 102,758 |