The AI Data Open Source Program is an academic support program launched by Datatang for non-commercial organizations such as universities and academic institutions. It aims to support AI academic research worldwide. Datatang will continue to provide AI training datasets to academic researchers to help them overcome data-related obstacles, make full use of their expertise, advance technology, and promote social development.
You are welcome to join Datatang's Open Source Program and help build the intelligent era.
Intended users of the Open Source Program: the data is open-sourced for non-commercial organizations such as universities and academic institutions.
Statement of the Open Source Program: the open source data and its derivatives (including but not limited to derived data and models) may not be used commercially in any form.
Citation statement: when publishing research results obtained using all or part of the open source data, the publication must credit 'Datatang AI Dataset' and cite the source: https://www.datatang.com [citation reference]
Datatang reserves the right of final interpretation for all open source projects.
Open source dataset aidatatang_1505zh
1,505 Hours of Mandarin Speech Data
Format: 16 kHz, 16-bit, mono channel, WAV.
Recording environment: quiet indoor, with some background noise that does not affect speech recognition.
Recording content: 300,000 colloquial sentences.
Speakers: 2,999 males, 3,301 females (6,300 in total).
Age distribution: 1,481 speakers are under 20 years old; 4,412 speakers are 21-30 years old; 244 speakers are 31-40 years old; 163 speakers are over 40 years old.
Regions: speakers come from 34 provincial-level administrative regions, including Guangdong, Fujian, Shandong, Jiangsu, Beijing, Hunan, Jiangxi, Hong Kong, Macao, etc.
Language: accented Mandarin.
Accuracy: sentence accuracy is at least 98%.
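The format listed above (16 kHz, 16-bit, mono WAV) can be checked on a downloaded file before it goes into a training pipeline. The following is a minimal sketch using Python's standard `wave` module; the function name `matches_expected_format` and the constants are illustrative and are not part of any Datatang tooling.

```python
import wave

# Expected format, taken from the dataset description: 16 kHz, 16-bit, mono.
EXPECTED_RATE_HZ = 16000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample
EXPECTED_CHANNELS = 1


def matches_expected_format(path):
    """Return True if the WAV file at `path` is 16 kHz, 16-bit, mono PCM.

    Hypothetical helper for illustration only.
    """
    with wave.open(path, "rb") as wav_file:
        return (
            wav_file.getframerate() == EXPECTED_RATE_HZ
            and wav_file.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav_file.getnchannels() == EXPECTED_CHANNELS
        )
```

Calling `matches_expected_format("some_utterance.wav")` (a placeholder path) returns `False` for files that would need resampling or down-mixing first.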
aidatatang_200zh (note: aidatatang_200zh is part of aidatatang_1505zh)
*CER (Character Error Rate) refers to the character-level recognition error rate.
*SER (Sentence Error Rate) refers to the sentence-level recognition error rate.
*GMM-HMM refers to the Gaussian mixture model-hidden Markov model.
*TDNN (Time-Delay Neural Network) refers to the time-delay neural network model.
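*Chain model refers to the "chain" acoustic model, a term commonly used in the Kaldi toolkit for neural network models trained with the lattice-free MMI objective.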
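For readers unfamiliar with the two metrics, the sketch below computes CER and SER under their conventional definitions: CER is the total character-level edit distance divided by the number of reference characters, and SER is the fraction of sentences that contain at least one error. The exact scoring procedure used by Datatang is not stated here, so this is an assumption, and the function names and example sentences are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(references, hypotheses):
    """Total character edits divided by total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return errors / total_chars


def sentence_error_rate(references, hypotheses):
    """Fraction of sentences whose hypothesis differs from the reference."""
    wrong = sum(1 for r, h in zip(references, hypotheses) if r != h)
    return wrong / len(references)


# Made-up example: the second hypothesis drops one character.
refs = ["今天天气很好", "我喜欢听音乐"]
hyps = ["今天天气很好", "我喜欢音乐"]
cer = character_error_rate(refs, hyps)  # 1 edit over 12 characters
ser = sentence_error_rate(refs, hyps)   # 1 of 2 sentences is wrong
```

Because Mandarin transcripts are scored per character, no word segmentation is needed here; the sketch assumes the transcripts contain no whitespace or punctuation.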
Training methods based on the aidatatang_200zh dataset. [CLICK HERE]
The open source data [1,505 Hours of Mandarin Speech Data] can be obtained in the following ways:
[ Data sample ]
The sample of 1,505 Hours of Mandarin Speech Data contains 424 sentences from 7 speakers (5 males and 2 females), 10 sentences for each speaker. The samples are all taken from the real data product and show part of the complete open source dataset.
Sample download
Selected [ 200 Hours of Mandarin Speech Data ]
[200 Hours of Mandarin Speech Data] is part of [1,505 Hours of Mandarin Speech Data]. Please complete the Data Application form and sign the CC license agreement online (Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International). Datatang will contact you and verify your information within 3 working days of receiving it. Once it is confirmed that the data will not be used for commercial purposes, the download link will be sent to you by e-mail.
Application for the selected dataset
Full dataset [1,505 Hours of Mandarin Speech Data]
Please complete the Data Application form. Datatang will contact you within 3 working days of receiving it and verify your information to ensure that the data will not be used for commercial purposes. Datatang will then sign the "Data Use License Agreement-Datatang-Mandarin Speech Data" with you and provide the data offline.
Application for the full dataset