The need for comprehensive and accessible Turkish wordlists has been a recurring challenge in linguistic and computational research. Existing publicly available datasets often fall short in scale, diversity, or usability. Motivated by this gap, I constructed a dataset derived from Turkish Wikipedia articles, which serves as a linguistic resource.
The dataset was created by processing and extracting textual content from Turkish Wikipedia, resulting in a comprehensive collection of 2,510,327 words. The dataset is organized in a CSV file format, ensuring ease of use for various applications, including natural language processing (NLP), linguistic analysis, and machine learning.
The lack of satisfactory publicly available Turkish wordlists prompted the exploration of alternative strategies, culminating in this contribution. Despite exhaustive searches using keywords such as “Turkish wordlist,” “corpus,” and “dataset,” no existing resources were identified that met the required criteria. This dataset thus represents a significant step toward filling this gap, offering researchers and practitioners a ready-to-use resource for Turkish language studies.
Detailed Screening
Turkish dataset. Tur
Their word counts are very low. This also points to a lack of understanding and creates a barrier, at least for beginners.
How it is made
Specs
A Turkish word list of 2,510,327 words derived from Wikipedia.
It is a UTF-8 encoded CSV file with a header row.
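A minimal sketch of reading such a file with Python's standard csv module. The header name "word" and the in-memory sample are assumptions for illustration, since the actual column name is not stated here. When no column is given, the function falls back to the first column.

```python
import csv
import io

def load_words(source, column=None):
    """Read the word column of a CSV file object that has a header row."""
    reader = csv.DictReader(source)
    # Use the first column when no column name is given.
    field = column or reader.fieldnames[0]
    # A set removes duplicates and gives fast membership tests.
    return {row[field] for row in reader if row[field]}

# Tiny in-memory stand-in for the real file; "word" is an assumed header.
sample = io.StringIO("word\nkitap\nağaç\nkitap\n")
words = load_words(sample)
print(len(words))
```

For the real file, pass a handle opened with open(path, encoding="utf-8", newline="") instead of the in-memory sample.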
The words were obtained by processing the text of approximately 500 thousand articles on Wikipedia. Example:
If an English word appears on a Turkish page and contains no English-only characters (such as q, w, or x, which are not in the Turkish alphabet), it is also included in the list.
The list contains only letters of the Turkish alphabet and quotation characters. Older spellings that use circumflexed characters, as in kâtip, were replaced.
Word Format:
All words are lowercase. The maximum word length is 30 characters. Words contain no characters other than Turkish letters and distinction marks.
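The format rules above can be sketched as a small normalizer and validator. This is an illustration, not the script used to build the dataset. It assumes that the "distinction mark" is the ASCII apostrophe and that circumflexed letters map to their plain counterparts (â to a, î to i, û to u). The names normalize and is_valid_word are hypothetical.

```python
import re

# The 29 lowercase letters of the Turkish alphabet.
TURKISH_LETTERS = "abcçdefgğhıijklmnoöprsştuüvyz"
# Assumption: the apostrophe is the only allowed non-letter character.
WORD_RE = re.compile(rf"[{TURKISH_LETTERS}']{{1,30}}")
# Assumption: circumflexed letters are replaced by their plain forms.
CIRCUMFLEX = str.maketrans("âîû", "aiu")

def turkish_lower(text):
    # str.lower() maps "I" to "i"; Turkish needs "I" -> "ı" and "İ" -> "i".
    return text.replace("I", "ı").replace("İ", "i").lower()

def normalize(token):
    """Lowercase with Turkish casing rules, then drop circumflexes."""
    return turkish_lower(token).translate(CIRCUMFLEX)

def is_valid_word(word):
    """True if the word is 1-30 characters of Turkish letters or apostrophes."""
    return WORD_RE.fullmatch(word) is not None

print(normalize("Kâtip"), is_valid_word(normalize("Kâtip")))
print(is_valid_word("internet"), is_valid_word("windows"))
```

Under these assumptions an English word such as "internet" passes the check, while "windows" is rejected because w is not a Turkish letter.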
Where can it be used
Could be used for
Calculating Turkish score.
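One possible reading of a "Turkish score" is the fraction of a text's tokens that appear in the wordlist. The text does not define the score, so the definition, the tokenizer, and the function name turkish_score below are all assumptions made for illustration.

```python
import re

# Tokens are runs of Latin or Turkish lowercase letters and apostrophes.
TOKEN_RE = re.compile(r"[a-zçğıöşü']+")

def turkish_lower(text):
    # str.lower() maps "I" to "i"; Turkish needs "I" -> "ı" and "İ" -> "i".
    return text.replace("I", "ı").replace("İ", "i").lower()

def turkish_score(text, wordlist):
    """Return the share of tokens found in wordlist (0.0 for empty text)."""
    tokens = [t.strip("'") for t in TOKEN_RE.findall(turkish_lower(text))]
    tokens = [t for t in tokens if t]  # drop tokens that were only quotes
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in wordlist)
    return hits / len(tokens)

# Toy wordlist standing in for the full dataset.
toy_wordlist = {"bu", "bir", "kitap"}
print(turkish_score("Bu bir kitap", toy_wordlist))
print(turkish_score("This is a book", toy_wordlist))
```

With the full list loaded into a set, each lookup is a constant-time membership test, so scoring long texts stays cheap.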
How it is meaningful in the age of ChatGPT
Usage Examples
A Sample Case
Poems of Atilla Ilhan.
Future Directions
Expanding it.
Resources:
https://github.com/agmmnn/turkish-nlp-resources?tab=readme-ov-file