language models without multimodal audio support, without the slop about gravitational waves:
think of it this way. you have someone sitting in the producer/engineer chair with a comprehensive knowledge of everything that has been written about music theory, production, songwriting, etc., but the speakers are off, and they can only make changes based on what they see on the metering. up until the new dense Gemma models, and earlier Gemini models, audio as a native input has largely been experimental. so when we talk about models without audio tokenization training, like Gemma's new dense model, we are left with vision and language tokenization. if something can be conveyed in an image, that's the best the model will get: say, a mel spectrogram of a short sample. it has to be short because most production APIs rescale images down to 1500px-ish for bandwidth, so you lose spectral information that a full-resolution TIFF could display.
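to put rough numbers on the rescaling loss, here's a minimal sketch. everything in it is an assumption for illustration: the function name `spectrogram_resolution`, the resize rule (shrink so the longer side fits a 1500px cap, never upscale), and a plain linear-frequency STFT grid instead of a mel axis to keep the arithmetic simple. real APIs may resize differently.

```python
def spectrogram_resolution(duration_s, sample_rate, n_fft, hop_length, max_side_px=1500):
    """Estimate what one pixel covers before and after an API-style downscale.

    Hypothetical helper: assumes one image column per STFT frame and one
    image row per linear-frequency bin before any resizing happens.
    """
    # Native STFT grid: one column per hop, one row per frequency bin.
    frames = 1 + int(duration_s * sample_rate) // hop_length
    bins = n_fft // 2 + 1

    # Native resolution of a single row / column.
    native_hz_per_row = sample_rate / n_fft
    native_ms_per_col = 1000 * hop_length / sample_rate

    # Assumed resize rule: fit the longer side into max_side_px, never upscale.
    scale = min(1.0, max_side_px / max(frames, bins))
    width_px = max(1, round(frames * scale))
    height_px = max(1, round(bins * scale))

    return {
        "scale": scale,
        "native_hz_per_row": native_hz_per_row,
        "native_ms_per_col": native_ms_per_col,
        # After resizing, each pixel has to stand in for several native cells.
        "resized_hz_per_row": native_hz_per_row * bins / height_px,
        "resized_ms_per_col": native_ms_per_col * frames / width_px,
    }


if __name__ == "__main__":
    # Compare a short clip with a long one under the same pixel cap.
    for seconds in (10, 60):
        info = spectrogram_resolution(seconds, sample_rate=16000, n_fft=512, hop_length=160)
        print(seconds, "s ->", info)
```

under these assumptions a short clip fits inside the cap and keeps its native resolution, while a long clip gets squeezed on both axes. that's the reason to feed the model short samples.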
the models don't have a vague, thunderous understanding of music like detecting gravitational waves. they have the understanding a world-class engineer would have of a song with the metering on and the speakers off. "it looks like there are vocals here, and they're smooth" because it can see lines in the 2-5 kHz range. if it were post-trained on audio itself, that would close the gap, i.e. models that you can pass audio into natively have no problem understanding music. it's a tokenization issue with the older non-audio models, but describing it the way the OP did is ridiculous dilettante wordslop
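for a sense of where that 2-5 kHz band actually sits in a mel spectrogram image, here's a small sketch. assumptions: the common HTK-style mel formula (one of several mel conventions), a mel axis running from 0 Hz at the bottom edge to `fmax_hz` at the top, and hypothetical helper names `hz_to_mel` and `band_rows`.

```python
import math


def hz_to_mel(freq_hz):
    """HTK-style mel scale: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def band_rows(low_hz, high_hz, image_height_px, fmax_hz):
    """Pixel rows (counted up from the bottom edge) that a frequency band spans.

    Assumes the image's vertical axis is evenly spaced in mels from 0 Hz to fmax_hz.
    """
    top_mel = hz_to_mel(fmax_hz)
    low_row = int(image_height_px * hz_to_mel(low_hz) / top_mel)
    high_row = int(image_height_px * hz_to_mel(high_hz) / top_mel)
    return low_row, high_row


if __name__ == "__main__":
    # Where the 2-5 kHz band lands in a 512px-tall image that tops out at 22.05 kHz.
    low_row, high_row = band_rows(2000, 5000, image_height_px=512, fmax_hz=22050)
    print("rows", low_row, "to", high_row, "of 512")
```

the band is a horizontal strip of the image, so "seeing lines" there is literally a vision task on a slice of pixels, and the strip gets thinner every time the image is downscaled.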
so they kind of have synesthesia from our POV