Audio data as video data representation

Well, I kinda understand that it’s possible to learn it slowly, especially with the background I have (I remember I programmed some ActionScript for Flash websites and interactive things back in the day).

But at the same time, being a music producer, film/theater composer, techno live act, ambient act, mix engineer, pro photographer, clip maker, visual artist… is already too much for me : ) Doing a lot but not being REALLY good in one field is a big problem; it leads to lack of time, mess, stress etc… So I’m afraid adding another thing to this list could basically kill me : ) Brain explosion : ) And I understand that I don’t need a lot; I just want to experiment with AI for some upcoming projects, and getting good results is my goal. I don’t want to master this… So a working tool with a simple interface would suit me better here…


I really understand that if one starts to learn something more deeply, discoveries will come for sure, and it’s fascinating… So…

The modern approach to programming is: you find and adapt, and only write code yourself where there are missing pieces and large holes.

Start with as clear a vision as you can create. Be aware though, that this is always incomplete, so be flexible, and unfixed. Then flexibly try to break that vision into doable pieces.

If a piece (or the whole thing) you are looking for exists in complete software, use that and fit it in with the pieces around it.

If not, see if you can break it into pieces and see what you can find that’s close and can be adapted. Open-source code is very helpful for this. Or use parts of existing software stuck back together.

Next down the list is to use powerful programming languages with well developed libraries, and an active community who shares software. Max / Pure Data is one. Python is another. The Wolfram tools are nice. There are others. You don’t have to stick to one programming environment, and can use pieces that fit.

Don’t be concerned if what you are creating doesn’t work in “real-time”. That can be streamlined, if you absolutely must have that.

Check back with your vision as you go forward and learn. Almost certainly you will see new things and make new visions, and adapt the visions you have.

So start with the hardest part, the clear vision — What do you want, EXACTLY ?



This is a pretty good article on using style GANs on audio:


I’ve found that learning even some simple scripting has helped me in all of those areas (well, non-professional versions of most), in terms of being able to automate a lot of the repetitive work.


:100:, and as I said, I learned a bit of coding back in the dayz…

And now, when I use TouchDesigner (a modular audiovisual environment based on „nodes“) and need to use some coding inside it, I usually ask someone to help; but when I read what I receive, I more or less understand what it means and how it’s organized.

Actually I’ve just ordered Monome Norns, so maybe a good idea would be to try to create something simple for it.

It’s a little bit different way of thinking; as Jukka said, you have to know EXACTLY what you want first. With music or other „free“ arts, often you just don’t search for anything, or are completely disoriented, surfing the unconscious.

I would formulate it this way:

  1. I want a basic tool to convert audio into a specific format (an mpeg or jpeg file) to let visual scripts „understand“ and process it, and afterwards I want to be able to convert it back to sound, probably with the same tool. Not in real time at this point.

  2. As many mentioned here, I need to try a Fourier transform to do so, so I need to search for a ready-made tool for that, try to convert something concrete like a voice or an acoustic instrument, and try processing it.

  3. After this is done :white_check_mark: I have to convert the visual (probably very abstract) results into audio again. Some said additive synthesis can help. I really don’t understand how at this point : ) anyways…

  4. Get the results, analyse what happened, and adjust the workflow to get something more useful and of better quality. Use it creatively in upcoming projects (film music, sound design, etc.).

The idea at this point is to use those exact scripts, which came from the understanding that they can do really high-quality neural-network-based morphing. It’s interesting to see what they could bring to audio. But after that I would go further, skip this video-to-audio step, and just feed audio directly to the network and see what can be done.

Thanks… will learn!

Just some thoughts regarding 1., 2. and 3.

I’ve mentioned it before, but I’ll elaborate a little more: your visualization should be as lossless as possible. From the few times I toyed around with the Fourier transform – and the inverse Fourier transform for converting back – I found that a derived spectrum can be really quite fragile (?), and if you process it even a tiny bit in the wrong place, it can start to sound really… glitchy.
Lots of pre-ringing and mp3-like artefacts, or videocall-automatic-noise-reduction type of stuff – in varying amounts, depending on how far you push it, of course.

So I don’t know, stay cautious when using things like jpg and such; it might not make your source completely inaudible, but it probably won’t turn out as clear as before. And depending on what the AI does to your visuals, the resulting audio might turn out completely out of whack haha, but that might not be a bad thing of course.
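This fragility is easy to measure. A small experiment (plain NumPy, a single FFT frame; the setup is mine, just for illustration): quantize a spectrum’s magnitude to 8 bits – roughly what storing it as an 8-bit grayscale image does, even before JPEG’s lossy compression – then invert with the original, untouched phase and see how much error the round trip already picked up:

```python
import numpy as np

sr = 8000
t = np.arange(2048) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 1200 * t)

spec = np.fft.rfft(x)
mag, phase = np.abs(spec), np.angle(spec)

# Simulate storing the magnitude as an 8-bit grayscale image (256 levels).
peak = mag.max()
mag_8bit = np.round(mag / peak * 255) / 255 * peak

y = np.fft.irfft(mag_8bit * np.exp(1j * phase), n=len(x))

# Relative RMS error of the round trip -- nonzero even with perfect phase.
err = np.sqrt(np.mean((x - y) ** 2)) / np.sqrt(np.mean(x ** 2))
```

And this is the friendly case: the phase is kept perfectly and nothing is JPEG-compressed. A real image codec plus an AI repainting the picture will push the error much further.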


Maybe not exactly what you’re looking for, but you can do some stuff with Processing. I made some software with it to create live visuals that react to incoming audio (mainly just volume and waveform). You can watch a playlist of the results here: Zaagstof No Worries - YouTube, and you can see my code on GitHub: GitHub - JanBurp/AudioReactiveVideo: Processing code for creating visuals that react to audio


From my personal experience, you set yourself up for failure when you try to follow this approach. Especially when you involve an FT forward-and-back transformation, you will lose so much detail that the result will be nothing but garbage.

IMHO, the idea of using neural networks trained on visuals to process audio is a stupid idea to begin with. If you want to process audio with neural networks, use networks trained on audio.

Nevertheless, if you still want to go that route, forget about processing a visualization of the spectrum; process a graphical representation of the audio data as it is (i.e. a sliding waveform display). This involves much less quality loss when converting forward and back than any spectrum visualization, and it can be done with almost no effort at all, even in real time.
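One way to read that suggestion, sketched in NumPy (the helper names are made up): pack the raw samples straight into a 16-bit grayscale “image”, so the round trip loses nothing beyond ordinary 16-bit quantization:

```python
import numpy as np

def audio_to_image(x, width=256):
    """Map float samples in [-1, 1] to rows of 16-bit pixels (a waveform 'image')."""
    q = np.round((np.clip(x, -1.0, 1.0) + 1.0) / 2.0 * 65535).astype(np.uint16)
    pad = (-len(q)) % width          # zero-pad so the data fills whole rows
    q = np.pad(q, (0, pad))
    return q.reshape(-1, width)

def image_to_audio(img, length=None):
    """Inverse mapping: pixel values back to float samples."""
    x = img.astype(np.float64).ravel() / 65535 * 2.0 - 1.0
    return x if length is None else x[:length]
```

The storage format has to be lossless (PNG, not JPEG) for this to hold, and any neural processing of such an image translates directly into sample-level noise, which may be exactly the effect you want.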


Additive synthesis is the reverse of a Fourier transform. A Fourier transform converts an audio representation (the measure of air pressure over time that we call sound) into a spectral representation of the same sound, also over time. Different ways to represent the same thing.

In DSP lingo they call this going from the time domain to the frequency domain, if you ever read this somewhere.

Additive synthesis goes in the opposite direction. It takes the spectral representation and creates a regular audio representation that can make a speaker vibrate, that you can hear.

Additive synthesis works by adding all the various intensities of different frequency sine waves at a particular instant together to create a more complex audio wave at that instant, and then by stringing separate different instances together to create audio over time.

If you know about wavetable synthesis, you can think of each wave in a wavetable as packed with different intensities of sine waves, each wave being a particular instant. If you have a very long wavetable that you run through fast enough, it is somewhat like running through an audio “sample”.

Wavetables in practice also generally have the ability to be repitched, playing the wave instances faster or slower, but running through the table at the same rate. Or you can change speed, playing the wave instances at the same speed, but running through the table faster. This then also starts to resemble some aspects of granular synthesis.

[ I wrote a very simple Python program, ~50 lines of code, that takes a simple collection of numbers that represents a combination at different intensities of various different frequency sine waves and adding them all together, normalizing that sum, and putting it in one position of a wavetable. ]
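That program isn’t posted here, but a sketch in the same spirit (plain NumPy; the function name and harmonic list are my own invention) might look like this: sum harmonically related sine waves at given intensities, normalize, and you have one frame of a wavetable:

```python
import numpy as np

def wavetable_frame(intensities, size=2048):
    """Sum harmonic sine waves at the given intensities into one
    normalized single-cycle wavetable frame (additive synthesis)."""
    n = np.arange(size)
    frame = np.zeros(size)
    for h, amp in enumerate(intensities, start=1):   # h = harmonic number
        frame += amp * np.sin(2 * np.pi * h * n / size)
    peak = np.max(np.abs(frame))
    return frame / peak if peak > 0 else frame

# e.g. a saw-like frame: harmonic h at intensity 1/h
saw_ish = wavetable_frame([1.0 / h for h in range(1, 16)])
```

Stringing many such frames together and sweeping through them over time gives the wavetable-as-spectral-snapshots view described above.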

The brain and our sense organs also convert audio pressure waves, into other formats, one for neural transmission and others to represent what was heard in the neural system of the brain. We can learn how to do some of that in different ways, some of that is just built-in.

A good example of hardware that does additive synthesis is the Rossum Panharmonium. Thread It actually includes both sides, so can go audio to spectral, and then back from spectral to audio. It can add up to 33 oscillators, and allows you to play with the data in between.

More posts to follow.


From your description, may I suggest you take a look at some very new, free, open-source software from Sensel. It’s called Spectral Shiatsu, and it runs under Max for (Ableton) Live. If you own a Sensel Morph, it allows you to “massage” spectral data with your fingers. It creates an image of a sound sample that you manipulate. See this article for details:

You might, as an experiment, start with the image it creates, and do other sorts of processing, perhaps in the Max language. I think, and someone please correct me, that you might choose to do your neural net processing outside of Max For Live.

Even conventional image processing like doing dimensional distortions would be fun.

BTW: This is by no means the only software that does these sorts of things.

After some observation / experimentation / thought, refine your vision. It’s a living document.


Thanks for so many ideas :pray:

I guess it will keep me busy for a while; even if it won’t work for this visual-scripts plan, I learned A LOT just by creating this topic.


Some ideas don’t work, or don’t sound good – until they do.

Seeing the invisible fatal flaw before you start is impossible if it is really invisible. You’ve got to go in a ways to find them.

Other times we’re not looking or paying attention, best not to do that, though that can create the happy accident inadvertently.



Google’s team Magenta has done some wonderful work that parallels this topic…

Specifically the NSynth Super …. “NSynth uses a deep neural network to learn the characteristics of sounds, and create entirely new sounds based on these characteristics” … It’s an open source project that is fascinating in its own regard, but the research behind how it works is even more interesting (and extremely well documented).

Gateways to the full rabbit hole linked below.

NSynth: Neural Audio Synthesis

WaveNet: A generative model for raw audio | DeepMind


Super inspiring!

Actually, I really hope to live till we see hardware synths of the future, based on neural networks, deep morphing of sound “genes”, total freedom with space and time modulation etc.

Those experiments are just first steps…


Not at all what you are looking for, but take a look at Silhouette, a new video-to-audio product being shown at Superbooth 21.

Link to my post in the Superbooth 21 thread:


@Tajnost, are you still interested in this topic ?

I found an interesting video where a group of researchers made a neural network system, using visuals of audio spectrums, and trained it to predict the settings on the u-he Diva soft synth, to create an equivalent sound. This video goes through the details of this system, it’s a little technical, but mostly understandable, if you have some background with neural network systems.

The video runs nearly 20 minutes.

This would be a neat product, for instance if you could grab a sample of a sound you’d like to make, and this system would generate a patch for your synth. It would be great if it was capable of creating patches for a group of your favorite synths, and all be that easy to do. It also seems to me something really useful for manufacturers to generate sound libraries.

ADDED : Seems to me this sort of thing would potentially be a way around some sorts of copyright restrictions.

It’s a different way to think about a “sampler”.


Just wondering if you found a solution, and what the outcome sounds like.

Not sure if there is any “AI & Audio” related thread here (beside the AI and photoart thread). Elektronauts search says “Term is to short” if you type in “AI” … haha.

I was thinking about how it would work to teach an AI about audio, especially music. To understand complex audio signals, I guess you’d have to teach the AI all the single elements of an audio signal (track): what a bass drum sounds like, what a hi-hat sounds like, what a 303 sounds like… but also about dynamics, volume, etc. It would be interesting to discuss that further.

Regarding your original questions, I’m not sure there is any lossless solution to transform a digital audio signal into a 1:1 video representation. I think the least lossy representation could be a video stream of the waveform (like you see in an audio editor). But that video would need a frame rate of 22,100 frames per second instead of 25, and a resolution of 65536 × 1 pixels (for 16-bit audio) instead of 1920 × 1080. I think everything else would be an interpretation of the data, which leads to some kind of data loss.


There’s a new open-source project called Riffusion that uses AI to generate images of audio spectrograms, which can then be converted to sound. They use the Stable Diffusion software for this, and have an interesting thing where they do image-to-image translation and create a transition from one set of sounds played over time into another.

If you only listen to one sample from their web-site find the church bells to electronic beats below the picture of the windmills, half way down on this page :
