Common Voice dataset released!

Mozilla have just released their transcribed, public domain Common Voice dataset (read: spoken corpus). And with the English portion alone (yes, it’s multilingual) weighing in at a hefty 22 GB, it’s nothing to be sniffed at.

Full details of the release can be found here, but for those of you too busy/lazy/cool to read through them all, I’ve provided a brief summary below.

What is it?

The Common Voice database is a collection of transcribed, public domain, multilingual speech recordings submitted by users from all over the world.

Most of the recordings are in English, but according to Mozilla’s recent post, 22 languages are now represented overall. Interestingly, at the time of writing, data from only 18 of those is available for download. Presumably the other four languages either have too few recordings, or too few of their current recordings have been verified.

How big is it?

Big. Here’s a summary of the different languages recorded in terms of size.

| Language | Number of voices | Size |
| --- | --- | --- |
| English | 33,541 | 22 GB |
| German | 2,249 | 4 GB |
| French | 1,697 | 2 GB |
| Kabyle | 382 | 2 GB |
| Catalan | 1,639 | 2 GB |
| Chinese (Taiwan) | 695 | 800 MB |
| Welsh | 365 | 622 MB |
| Italian | 313 | 556 MB |
| Tatar | 117 | 555 MB |
| Dutch | 373 | 382 MB |
| Breton | 82 | 201 MB |
| Esperanto | 53 | 184 MB |
| Turkish | 203 | 182 MB |
| Kyrgyz | 63 | 152 MB |
| Hakha Chin | 253 | 124 MB |
| Slovenian | 18 | 98 MB |
| Chuvash | 33 | 77 MB |
| Irish | 30 | 48 MB |
| Total | | 33.98 GB |

(Yes, I had to look up Kabyle and Chuvash too.)

Each language is available for download separately. So if you’re itching to train your Slovenian voice assistant, you don’t need to download the whole 34 GB.
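If you want to work out how much disk space a particular mix of languages will need, here’s a minimal Python sketch. It only does arithmetic on the sizes listed above; the `SIZES_MB` dictionary and the `download_size_gb` helper are my own names for illustration, and I’m assuming 1 GB = 1000 MB. Bear in mind the larger sizes are rounded to the nearest GB, so treat the results as approximate.

```python
# Sizes as listed in the table, converted to MB (assuming 1 GB = 1000 MB).
# The GB entries are rounded in the source, so totals are approximate.
SIZES_MB = {
    "English": 22_000,
    "German": 4_000,
    "French": 2_000,
    "Kabyle": 2_000,
    "Catalan": 2_000,
    "Chinese (Taiwan)": 800,
    "Welsh": 622,
    "Italian": 556,
    "Tatar": 555,
    "Dutch": 382,
    "Breton": 201,
    "Esperanto": 184,
    "Turkish": 182,
    "Kyrgyz": 152,
    "Hakha Chin": 124,
    "Slovenian": 98,
    "Chuvash": 77,
    "Irish": 48,
}


def download_size_gb(languages):
    """Return the approximate combined download size, in GB, for the given languages."""
    return sum(SIZES_MB[name] for name in languages) / 1000


if __name__ == "__main__":
    print(f"Slovenian only: {download_size_gb(['Slovenian']):.3f} GB")
    print(f"Everything:     {download_size_gb(SIZES_MB):.2f} GB")
```

Run as-is, the rounded per-language figures add up to roughly 36 GB, a touch more than the total quoted above, presumably because the individual sizes are rounded.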

What can I do with it?

Whatever you want! It’s public domain, so you’re free to use it as you please: for research, for building an app, or as samples in your next mixtape. It’s up to you. The only caveat here is that by downloading any part of the dataset, you agree not to try to identify individual speakers.

Closing remarks

As someone who’s contributed to Common Voice (in a very modest way), both by reading and verifying sentences, it’s particularly gratifying to see a release like this. What’s more, the dataset is continuing to grow, so if this sounds like your kettle of fish [1], you can contribute via the Common Voice website.

I haven’t yet had the chance to download it and mess around, but I can’t wait. After I’ve poked around its innards, I’ll be reporting back here for sure.

  1. In my case, I’m not sure what an appropriate vegetarian equivalent to this phrase would be: Can of chickpeas? Brand of tofu? Packet of seitan?