The AI Data Open Source Program is an academic support program launched by Datatang for non-commercial organizations such as universities and academic institutions. It aims to support AI academic research worldwide. Datatang will continue to provide AI training datasets to academic researchers to help them overcome data-related obstacles, make full use of their expertise, advance technology, and promote social development.

Join Datatang's Open Source Program in building an intelligent era.

Intended users of the Open Source Program: the open source data is provided to non-commercial organizations such as universities and academic institutions.

Statement of the Open Source Program: The open source data and its derivatives (including but not limited to derivative data and models) may not be used commercially in any form.

Citation statement: When publishing research results obtained using all or part of the open source data, the results must credit ‘Datatang AI Dataset’ and cite the source: [citation reference]

Datatang reserves the final right to interpret all open source projects.

Open source data set aidatatang_1505zh

1505 Hours of Mandarin Speech Data

Data presentation
[1505 Hours of Mandarin Speech Data] With 1505 valid hours, this dataset is part of Datatang's Chinese Speech Dataset. The recordings cover 300,000 colloquial sentences spoken by 6408 speakers from different regions of China. The data was transcribed and annotated by professional phonetic proofreading personnel under strict quality checks, and its sentence accuracy exceeds 98%, which is the highest standard of sentence accuracy in the industry.
Data product details
Data format

16 kHz, 16-bit, mono channel, WAV format.

Recording environment

Quiet indoor settings, with some background noise that does not affect speech recognition.

Recording content

300,000 colloquial sentences.

Speakers

6408 people; 2999 males and 3301 females.

1481 speakers are under 20 years old; 4412 speakers are 21-30 years old; 244 speakers are 31-40 years old; 163 speakers are over 40 years old.

Speakers come from 34 provincial administrative regions, including Guangdong, Fujian, Shandong, Jiangsu, Beijing, Hunan, Jiangxi, Hong Kong, Macao, etc.

Device

Android : iOS = 9:1

Language

Mandarin with accents.

Application scenarios

Speech recognition, machine translation, voiceprint recognition.

Accuracy rate

Sentence accuracy is not less than 98%.
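As a rough illustration of the data format listed above, the sketch below checks whether a WAV file is 16 kHz, 16-bit, and mono using Python's standard wave module. The function names and the idea of validating files after download are assumptions for illustration; they are not part of any Datatang tooling.

```python
import wave

# Expected format taken from the "Data format" entry: 16 kHz, 16-bit, mono.
EXPECTED = {"sample_rate": 16000, "sample_width_bytes": 2, "channels": 1}


def wav_format(path):
    """Read sample rate, sample width (in bytes), and channel count from a WAV header."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
        }


def matches_expected(path):
    """Return True if the file has the format stated on this page."""
    return wav_format(path) == EXPECTED
```

Calling matches_expected on each downloaded file is one simple way to catch resampled or stereo files before training.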

Use effect

aidatatang_200zh (note: aidatatang_200zh is part of aidatatang_1505zh)

12.22% 43.11% 7.14% 31.19% 5.59% 26.06%


7.35% 35.98% 3.14% 23.05%


*CER (Character Error Rate) refers to the character recognition error rate.

*SER (Sentence Error Rate) refers to the sentence recognition error rate.

*GMM-HMM refers to the Gaussian mixture model-hidden Markov model.

*TDNN (Time-Delay Neural Network) refers to the time-delay neural network model.

*Chain model refers to a chain acoustic model, that is, a neural network trained with a sequence-level objective such as lattice-free MMI, as in Kaldi's chain recipes.
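To make the two metrics concrete, here is a minimal sketch of how CER and SER are commonly computed. It assumes the standard definitions (character-level edit distance divided by reference length for CER, and the fraction of sentences with any error for SER) and applies no text normalization. The page does not state the exact scoring procedure used for the figures above, so this is illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Character error rate: total character edits / total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)


def ser(refs, hyps):
    """Sentence error rate: share of sentences that differ from the reference."""
    wrong = sum(1 for r, h in zip(refs, hyps) if r != h)
    return wrong / len(refs)
```

For example, with references ["今天天气很好", "我喜欢音乐"] and hypotheses ["今天天汽很好", "我喜欢音乐"], one of 11 reference characters is wrong and one of two sentences contains an error, so CER is 1/11 and SER is 1/2.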

Training methods

The training methods are based on the aidatatang_200zh dataset (note: aidatatang_200zh is part of aidatatang_1505zh).


The open source data [1505 Hours of Mandarin Speech Data] can be obtained in the following ways:

[ Data sample ]

The sample of 1505 Hours of Mandarin Speech Data contains 424 sentences from 7 speakers (5 males and 2 females), 10 sentences for each speaker. The samples are all taken from the real data product and represent a small portion of the complete open source dataset.

Sample download

Selected dataset [200 Hours of Mandarin Speech Data]

[200 Hours of Mandarin Speech Data] is part of [1505 Hours of Mandarin Speech Data]. Please complete the Data Application form and sign the Creative Commons agreement online (Attribution-NonCommercial-NoDerivatives 4.0 International, CC BY-NC-ND 4.0). Datatang will contact you and verify your information within 3 working days of receiving it. Please ensure that the data will not be used for commercial purposes; the download link will then be sent to you by e-mail.

Application for the selected dataset

Full dataset [1505 Hours of Mandarin Speech Data]

Please complete the Data Application form. Datatang will contact you within 3 working days of receiving it and verify your information to ensure that the data will not be used for commercial purposes. Datatang will then sign the "Data Use License Agreement-Datatang-Mandarin Speech Data" with you and provide the data offline.

Application for the full dataset

Cooperative institutions
