Common Voice dataset released!

Mozilla have just released their transcribed, public domain Common Voice dataset (read: spoken corpus). And with the English portion alone (yes, it’s multilingual) weighing in at a hefty 22 GB, it’s nothing to be sniffed at.

Full details of the release can be found here, but for those of you too busy/lazy/cool to read through them all, I’ve provided a brief summary below.

What is it?

The Common Voice database is a collection of transcribed, public domain, multilingual speech recordings submitted by users from all over the world.

Most of the recordings are in English, but according to Mozilla’s recent post, 22 languages are now represented overall. Interestingly, at the time of writing, data for only 18 of those is available for download. Presumably the other four either have too few recordings, or too few of their recordings have been verified.

How big is it?

Big. Here’s a breakdown by language, with the number of voices and the download size for each.

Language	Number of voices	Size
English	33,541	22 GB
German	2,249	4 GB
French	1,697	2 GB
Kabyle	382	2 GB
Catalan	1,639	2 GB
Chinese (Taiwan)	695	800 MB
Welsh	365	622 MB
Italian	313	556 MB
Tatar	117	555 MB
Dutch	373	382 MB
Breton	82	201 MB
Esperanto	53	184 MB
Turkish	203	182 MB
Kyrgyz	63	152 MB
Hakha Chin	253	124 MB
Slovenian	18	98 MB
Chuvash	33	77 MB
Irish	30	48 MB
	Total	35.98 GB

(Yes, I had to look up Kabyle and Chuvash too.)

Each language is available for download separately. So if you’re itching to train your Slovenian voice assistant, you don’t need to download the whole 36 GB.
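Since the per-language sizes in the table are rounded, any total worked out from them is only approximate. For anyone who wants to check the arithmetic or see how much of the download a given language accounts for, here is a minimal sketch in Python. The figures are copied by hand from the table, decimal units (1 GB = 1000 MB) are an assumption on my part, and the names `SIZES_MB`, `total_gb` and `share` are made up for illustration.

```python
# Download sizes copied by hand from the table above, in MB.
# Assumption: decimal units, i.e. 1 GB = 1000 MB.
SIZES_MB = {
    "English": 22_000,
    "German": 4_000,
    "French": 2_000,
    "Kabyle": 2_000,
    "Catalan": 2_000,
    "Chinese (Taiwan)": 800,
    "Welsh": 622,
    "Italian": 556,
    "Tatar": 555,
    "Dutch": 382,
    "Breton": 201,
    "Esperanto": 184,
    "Turkish": 182,
    "Kyrgyz": 152,
    "Hakha Chin": 124,
    "Slovenian": 98,
    "Chuvash": 77,
    "Irish": 48,
}


def total_gb(sizes_mb):
    """Return the combined size of all languages in GB."""
    return sum(sizes_mb.values()) / 1000


def share(sizes_mb, language):
    """Return the fraction of the combined size taken up by one language."""
    return sizes_mb[language] / sum(sizes_mb.values())


if __name__ == "__main__":
    print(f"Total: {total_gb(SIZES_MB):.2f} GB")
    # List languages from largest to smallest share of the download.
    for language in sorted(SIZES_MB, key=SIZES_MB.get, reverse=True):
        print(f"{language}: {share(SIZES_MB, language):.1%}")
```

Run as a script, it prints the combined size followed by each language’s share, largest first. Going by the rounded figures, English alone makes up well over half of the data.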

What can I do with it?

Whatever you want! It’s public domain, so you’re free to use it as you please: for research, for building an app, or as samples in your next mixtape. It’s up to you. The only caveat is that by downloading any part of the dataset, you agree not to try to identify individual speakers.

Closing remarks

As someone who’s contributed to Common Voice (in a very modest way), both by reading and by verifying sentences, I find it particularly gratifying to see a release like this. What’s more, the dataset is still growing, so if this sounds like your kettle of fish,¹ you can contribute via the Common Voice website.

I haven’t yet had the chance to download it and mess around, but I can’t wait. After I’ve poked around in its innards, I’ll be reporting back here for sure.

¹ In my case, I’m not sure what an appropriate vegetarian equivalent to this phrase would be: Can of chickpeas? Brand of tofu? Packet of seitan?

FumbLing

A blog about linguistics
