Common Voice dataset released!
Mozilla have just released their transcribed, public domain Common Voice dataset (read: spoken corpus). And with the English portion alone (yes, it’s multilingual) weighing in at a hefty 22 GB, it’s nothing to be sniffed at.
Full details of the release can be found here, but for those of you too busy/lazy/cool to read through them all, I’ve provided a brief summary below.
What is it?
The Common Voice database is a collection of transcribed, public domain, multilingual speech recordings submitted by users from all over the world.
Most of the recordings are in English, but according to Mozilla’s recent post, 22 languages are now represented overall. Interestingly, at the time of writing, data from only 18 of those is available for download. Presumably the other four languages either have too few recordings, or too few of their current recordings have been verified.
How big is it?
Big. Here’s a summary of the different languages recorded in terms of size.
Language | Number of voices | Size |
---|---|---|
English | 33,541 | 22 GB |
German | 2,249 | 4 GB |
French | 1,697 | 2 GB |
Kabyle | 382 | 2 GB |
Catalan | 1,639 | 2 GB |
Chinese (Taiwan) | 695 | 800 MB |
Welsh | 365 | 622 MB |
Italian | 313 | 556 MB |
Tatar | 117 | 555 MB |
Dutch | 373 | 382 MB |
Breton | 82 | 201 MB |
Esperanto | 53 | 184 MB |
Turkish | 203 | 182 MB |
Kyrgyz | 63 | 152 MB |
Hakha Chin | 253 | 124 MB |
Slovenian | 18 | 98 MB |
Chuvash | 33 | 77 MB |
Irish | 30 | 48 MB |
Total | | 33.98 GB |
(Yes, I had to look up Kabyle and Chuvash too.)
Each language is available for download separately. So if you’re itching to train your Slovenian voice assistant, you don’t need to download the whole 34 GB.
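If you want a rough idea of how much you'd actually be pulling down for a handful of languages, a minimal sketch like the one below does the sum. The sizes are copied from a few rows of the table above, the names `SIZES_MB` and `download_size_mb` are made up for illustration, and I'm assuming 1 GB = 1000 MB; nothing here talks to Mozilla's servers.

```python
# Sizes in MB, copied from a few rows of the table above.
# Assumption: 1 GB is treated as 1000 MB, and the GB figures are already rounded.
SIZES_MB = {
    "English": 22_000,
    "German": 4_000,
    "Welsh": 622,
    "Slovenian": 98,
    "Irish": 48,
}


def download_size_mb(languages):
    """Return the combined download size in MB for the given languages."""
    return sum(SIZES_MB[lang] for lang in languages)


print(download_size_mb(["Slovenian"]))       # 98
print(download_size_mb(["Welsh", "Irish"]))  # 670
```

So the Slovenian voice assistant crowd are looking at under 100 MB, not tens of gigabytes.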
What can I do with it?
Whatever you want! It’s public domain, so you’re free to use it as you please: for research, building an app, or samples in your next mixtape. It’s up to you. The only caveat here is that by downloading any part of the dataset, you agree not to try to identify individual speakers.
Closing remarks
As someone who’s contributed to Common Voice (in a very modest way), both by reading and verifying sentences, it’s particularly gratifying to see a release like this. What’s more, the dataset is continuing to grow, so if this sounds like your kettle of fish,[1] you can contribute via the Common Voice website.
I haven’t yet had the chance to download it and mess around, but I can’t wait. After I’ve poked around its innards, I’ll be reporting back here for sure.
---
[1] In my case, I’m not sure what an appropriate vegetarian equivalent to this phrase would be: Can of chickpeas? Brand of tofu? Packet of seitan?