The ethics of a vocal edit

At the recent Adobe Max conference, there was a piece of tech presented in the Sneak Peeks section called Project Voco – essentially, Photoshop for voice.

Now, we’ve been able to edit images, replace people, remove people, add objects and move (computer vision) mountains for a long time, but there’s something spooky about this demo. I think it’s that, now we have the internet, we’re all much more aware of an artefact’s origin, yet we still mostly trust what we hear people say. Software like this means that, going forward, we might question everything we hear people say.


I got myself in trouble recently with some friends on Facebook when I questioned the context of a photo which showed some hunters beside a dead animal. ‘How terrible of these people to shoot a poor defenceless animal’ was the statement above the image. However, we know nothing about the context of that photo. (I am in no way pro-gun or pro-hunting.) That animal *may* have been killing other (protected) animals in the area, or *may* have been killing people in their beds at night; it *may* have been unfortunately targeted by hunters; or the people in the photograph *may* never have been near a hunt in their lives and are instead victims of image manipulation. We don’t know. That photo was judged by people individually, based on a set of assumptions in their heads at the time they saw the image.

When I looked at the comments in that post, those who totally abhorred the image were less technically minded – those who were erring on the side of caution were more technically minded people, probably more aware of the freedom of manipulation that’s possible with software these days.

Essentially, a side-effect of image manipulation tools is that they can bring questionable context to a situation or even remove the context completely. Are we now going to have to question everything we hear as well?

What rings loudly for me in this Adobe demo is that we’ve learnt to question unbelievable images because we’re mostly aware that images are manipulated – in fashion, in advertising and in generally dicking about on Facebook – and that’s been highlighted by the media over the years. However, we tend to trust our ears and believe what we hear people say, because audio manipulation on this level has for a long time only been possible for those with expensive hardware and a dedication to it as an art form.

Software such as this puts vocal manipulation in the hands of the masses. Unfortunately, amongst the masses there are going to be people with questionable ethics, or maybe just without the foresight to appreciate the consequences of a voice manipulation ‘joke’ that they play out online.

Judge for yourself…

Watch the video…

What’s hidden in the demo is that it currently requires about 20 minutes of speech from someone to make this work. But it means that instead of a photo of Trump with text on top saying ‘He said terrible things’, we will have actual audio of him ‘saying terrible things’, and there will be fewer people questioning the authenticity of that content.

I don’t think that Adobe are helping themselves or this cause with the choice of content for the demo in that video. Removing someone’s umming and erring from a wedding speech would have been more positive, but at least this has got some of us talking about it. Maybe that was the plan?

Technically, this is pretty incredible software and should be celebrated, but once you factor in the human race, what this will ultimately be used for is, unfortunately, unlikely to be positive.