CIIL conference highlights 15 newly developed datasets and AI applications for Indian languages

The release of 15 newly developed datasets by the Linguistic Data Consortium for Indian Languages (LDC-IL) was the key highlight of the conference on artificial intelligence (AI) at the Central Institute of Indian Languages (CIIL) here on Thursday.

The datasets, released by Shailendra Mohan, Director of the CIIL, and other dignitaries, mark a significant milestone in LDC-IL’s contributions to linguistic research and technology development, according to CIIL.

The datasets are: Mother Tongue Parallel Text Corpus of India (147 mother tongues), Gold Standard Rajasthani Raw Text Corpus, Gold Standard Chhattisgarhi Raw Text Corpus Vol. II, Gold Standard Kashmiri Raw Text Corpus Vol. II, Gold Standard Maithili Raw Text Corpus Vol. II, Gold Standard Telugu Raw Text Corpus Vol. II, Maithili Raw Speech Corpus Vol. II, Dogri Sentence Aligned Speech Corpus, Maithili Sentence Aligned Speech Corpus (Tirhuta Script), Manipuri Sentence Aligned Speech Corpus (Bengali Script), Manipuri Sentence Aligned Speech Corpus (Meetei Mayek), Punjabi Sentence Aligned Speech Corpus, Telugu Sentence Aligned Speech Corpus, Assamese Text-to-Speech Corpus, and Maithili Text-to-Speech Corpus.

In addition, LDC-IL launched several AI applications designed to serve Indian languages. The applications were introduced by Narayan Choudhary.

These applications, now available for public use at medha.ciil.org, include Anuvadika (Machine Translator), Lipyantara (Transliterator), Lipidha (Optical Character Recognizer), Anulekhika (Automatic Speech Recognition for Indian Languages), Anuvachika (Text-to-Speech for Indian Languages), and Dhvani Parivartka (Media Converter).

Published – March 20, 2025 08:22 pm IST