Open collaboration thrives on shared language, but the world’s developers are not all fluent in English. A new dataset from GitHub maps how READMEs, issues, and pull requests are written across dozens of languages, offering researchers a rare lens into multilingual software development.
The GitHub Multilingual Repositories Dataset surfaces public repositories where natural-language content isn’t limited to English, tracking patterns in READMEs, issues, and pull requests across more than 40 million projects. Rather than exposing raw code or comments, it provides metadata and language classifications for the first 150 characters of key text sources in each repository. The dataset arrives alongside a broader push to make multilingual developer data more accessible, aligning with commitments outlined in Microsoft’s 2025 European Digital Commitments.
Behind the dataset: language signals in developer workflows
The dataset avoids dumping raw repository content, instead focusing on metadata that reveals language usage in critical collaboration points. For each public repository, it includes:
- Language classifications for READMEs, the most-commented issue, and the most-commented pull request, using the first 150 characters of each text source. Only texts longer than 20 characters are included to filter out trivial entries.
- Classifier outputs from three language detection tools—fastText, gcld3, and lingua-py—each with a confidence score. Only classifications scoring above 0.5 are retained, ensuring a baseline of reliability.
- Repository metadata, such as creation date, disk usage, star and fork counts, primary programming language, SPDX license, issue and pull request counts, and the snapshot date.
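The inclusion rules in the list above can be sketched in a few lines of Python. This is only an illustration of the thresholds the article states, not GitHub's actual pipeline: the function names are hypothetical, and it assumes the 20-character minimum applies to the full text before it is cut down to 150 characters.

```python
from typing import Optional

SAMPLE_CHARS = 150    # from the article: the first 150 characters are classified
MIN_TEXT_CHARS = 20   # from the article: only texts longer than 20 characters
MIN_CONFIDENCE = 0.5  # from the article: only scores above 0.5 are retained


def sample_for_classification(text: str) -> Optional[str]:
    """Return the snippet a classifier would see, or None if the text is too short.

    Assumption: the length check runs on the full text, before truncation.
    """
    if len(text) <= MIN_TEXT_CHARS:
        return None
    return text[:SAMPLE_CHARS]


def is_retained(confidence: float) -> bool:
    """Keep a classification only if its score is strictly above the cutoff."""
    return confidence > MIN_CONFIDENCE
```

Under these assumptions, a 20-character README is skipped, a 21-character one is classified whole, and anything longer is cut to its first 150 characters.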
The dataset deliberately preserves the outputs of all three classifiers instead of collapsing them into a single label. This design choice acknowledges that different tools perform unevenly across languages, especially those with fewer digital resources. For researchers, this transparency allows tuning precision and recall to fit specific needs—whether targeting high-precision subsets like Greek or broader exploratory studies of Romance languages.
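Because all three classifier outputs are kept, a researcher can decide how much agreement to require. The sketch below shows one possible voting rule. The record layout (a dictionary keyed by classifier name) and the function name are hypothetical, and the language codes are assumed to be normalized to a single scheme beforehand, since the three tools report labels in different formats.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

# (language code, confidence) as reported by one classifier; layout is hypothetical
Prediction = Tuple[str, float]


def consensus_language(
    predictions: Dict[str, Prediction],
    min_votes: int = 2,
    min_confidence: float = 0.5,
) -> Optional[str]:
    """Return the language at least `min_votes` classifiers agree on, else None."""
    votes = Counter(
        lang for lang, conf in predictions.values() if conf > min_confidence
    )
    if not votes:
        return None
    lang, count = votes.most_common(1)[0]
    return lang if count >= min_votes else None


# Illustrative values only, not taken from the dataset.
greek_readme = {"fasttext": ("el", 0.97), "gcld3": ("el", 0.88), "lingua": ("el", 0.91)}
mixed_readme = {"fasttext": ("pt", 0.72), "gcld3": ("es", 0.61), "lingua": ("pt", 0.66)}

strict = consensus_language(greek_readme, min_votes=3)        # "el"
majority = consensus_language(mixed_readme, min_votes=2)      # "pt"
no_agreement = consensus_language(mixed_readme, min_votes=3)  # None
```

Requiring all three votes favors precision, while a two-of-three majority keeps more repositories at the cost of more mistakes.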
What developers and researchers can do with the data
The dataset is engineered for projects that are difficult to tackle using generic web text sources. Its applications span discovery, research, tooling, and advocacy:
- Repository discovery: Filter repositories likely to contain documentation, discussions, or collaboration in specific languages. This is valuable for teams building localized developer tools or expanding product support.
- Community studies: Analyze how non-English developer communities structure issues, pull requests, and READMEs. Patterns in language use can reveal cultural and technical preferences unique to language groups.
- AI evaluation sets: Create benchmarks for multilingual AI coding assistants, documentation generators, or code review tools. Because the dataset holds metadata rather than raw text, it acts as an index to the real-world issues and pull requests that can supply context-rich, domain-specific examples.
- Language inclusion advocacy: Use data-backed insights to justify expanding language coverage in new developer tools and AI features. Metrics on underrepresented languages can inform product roadmaps and community outreach strategies.
- Language representation measurement: Quantify how European and other underrepresented languages appear in open source, helping track progress toward more inclusive software ecosystems.
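Two of the uses above, repository discovery and representation measurement, reduce to simple filtering and counting once each repository has a language label. The following sketch assumes a hypothetical record shape with `name`, `stars`, and `readme_language` fields. The dataset's real column names may differ, and the sample rows are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, TypedDict


class RepoRecord(TypedDict):
    # Hypothetical field names; check the dataset's schema before relying on them.
    name: str
    stars: int
    readme_language: Optional[str]


def find_repositories(
    records: List[RepoRecord], language: str, min_stars: int = 0
) -> List[str]:
    """List repositories whose README was labelled with the given language."""
    return [
        r["name"]
        for r in records
        if r["readme_language"] == language and r["stars"] >= min_stars
    ]


def language_share(records: List[RepoRecord]) -> Dict[str, float]:
    """Fraction of labelled repositories per README language."""
    counts = Counter(
        r["readme_language"] for r in records if r["readme_language"] is not None
    )
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()} if total else {}


# Invented sample rows, not taken from the dataset.
records: List[RepoRecord] = [
    {"name": "org/alpha", "stars": 120, "readme_language": "pl"},
    {"name": "org/beta", "stars": 3, "readme_language": "pl"},
    {"name": "org/gamma", "stars": 40, "readme_language": "el"},
    {"name": "org/delta", "stars": 15, "readme_language": "tr"},
    {"name": "org/epsilon", "stars": 8, "readme_language": None},
]

popular_polish = find_repositories(records, "pl", min_stars=10)  # ["org/alpha"]
shares = language_share(records)  # {"pl": 0.5, "el": 0.25, "tr": 0.25}
```

Repositories without a retained label are left out of the denominator here. Whether to count them is a design choice that changes the reported shares.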
Important limitations: language detection in code contexts
Identifying languages in software repositories is notoriously tricky. Short or noisy text, such as README badges, embedded code, or mixed-language comments, can skew classifier results. A 150-character sample may not capture the full linguistic context of a repository. Moreover, classifier performance varies widely across languages, particularly those with limited digital corpora.
This dataset is not intended as a definitive benchmark for language identification. Instead, it serves as a transparent discovery tool. Users can review confidence scores, classifier sources, and raw classifications to determine the appropriate balance between precision and recall for their use case. The dataset also avoids inferring sensitive attributes about contributors or communities, as it operates at the repository level, not the individual level.
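One way to choose a balance between precision and recall is to hand-label a small sample and see how both change as the confidence cutoff moves. The sketch below assumes such a labelled sample exists. The tuple layout, function name, and values are all hypothetical.

```python
from typing import List, Tuple

# (predicted language, confidence, hand-labelled true language); layout is hypothetical
Labelled = Tuple[str, float, str]


def precision_recall(
    sample: List[Labelled], target: str, threshold: float
) -> Tuple[float, float]:
    """Precision and recall for `target` when keeping scores above `threshold`."""
    accepted = [
        truth for pred, conf, truth in sample if pred == target and conf > threshold
    ]
    true_positives = sum(1 for truth in accepted if truth == target)
    actual = sum(1 for _, _, truth in sample if truth == target)
    precision = true_positives / len(accepted) if accepted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Invented sample: four texts are truly Portuguese, one is Galician.
sample: List[Labelled] = [
    ("pt", 0.95, "pt"),
    ("pt", 0.80, "pt"),
    ("pt", 0.60, "gl"),
    ("pt", 0.55, "pt"),
    ("es", 0.90, "pt"),
]

loose = precision_recall(sample, "pt", threshold=0.5)   # (0.75, 0.75)
strict = precision_recall(sample, "pt", threshold=0.7)  # (1.0, 0.5)
```

In this toy sample, raising the cutoff from 0.5 to 0.7 removes the one false positive but also drops a correct Portuguese label, which is the trade-off users are asked to weigh.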
Why multilingual developer data matters for AI
Many European languages remain underrepresented in the datasets used to train and evaluate AI systems. This imbalance risks leaving developers who rely on those languages at a disadvantage, with tools that work inconsistently or poorly for their workflows. Open, domain-specific datasets like this one help close that gap by focusing on the unique language patterns of software collaboration.
READMEs, issues, and pull requests contain the vocabulary of real-world development: installation instructions, bug reports, feature requests, review comments, and community norms. Capturing this context enables AI systems to better understand how developers actually communicate and build software. By making multilingual developer-content signals easier to find and analyze, this dataset helps researchers, open source contributors, and model builders create more inclusive tools for developers across Europe and beyond.
The push for open multilingual data reflects a core principle: AI tools built for developers must reflect the languages, communities, and workflows those developers actually use. This dataset is a step toward making that principle a reality.
Looking ahead, GitHub and partners will discuss the dataset and broader themes of open data for multilingual AI at the Open Innovation Dialogue Hub in Strasbourg on June 16, co-organized by the Microsoft Open Innovation Center.
AI summary
The 80-million-row multilingual dataset released by GitHub lets researchers and developers explore the coding ecosystem in languages other than English. How is it used, and what does it contain?