AES: The AVAR Conference Report Continued
I promised additional discussion of my experience at the AES AVAR (Audio for Virtual and Augmented Reality) conference a couple of weeks ago. I’m sorry for the delay in getting back to the topic; I was in Colorado and distracted with other things. But the world of VR is a booming marketplace, and I wanted to make sure to get back to it before moving on to other things.
Despite the fact that Virtual Reality has been around for many years, the market is only now poised to explode. The estimates that I’ve read talk about VR becoming a $120 billion market within the next three to five years. There are over 4,000 production companies working on VR productions, and technology companies are scrambling to develop the tools and procedures needed to create compelling content. The early experiments that I’ve seen and heard have been pretty limited. Would I put my iPhone in a Google Cardboard viewer or strap a video screen to my head on a regular basis? No. My interest is whether the development of audio and music for VR or Augmented Reality has merit. I want to know how to capture or create audio for VR, how to post it, and how to distribute it. And I want to see whether the music industry will have a piece of this market or whether it will be limited to games and other non-music experiences.
The AVAR conference was full of technologists and academics showing off their wares. They discussed the challenges of convincing people that a virtual sound experience can be as real as real life. This means personalizing the delivery of audio channels to my ears using measured HRTFs (head-related transfer functions) for my own ears. And it means having the ability to track the movement and location of my head. These things are nontrivial and require much more than powerful processors and clever algorithms. How many people have their own HRTF stored on their computer or smartphone? How would you even go about getting your head and ears measured?
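For readers curious what “personalized delivery” looks like in practice, here is a minimal sketch of static binaural rendering: a mono source convolved with a measured pair of head-related impulse responses (HRIRs) for a single direction. The file names are hypothetical placeholders; a real VR renderer would also need head tracking and HRIRs for every direction.

```python
# Minimal sketch of static binaural rendering: convolve a mono source
# with a measured pair of head-related impulse responses (HRIRs) for one
# direction. File names below are hypothetical placeholders.
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

fs, mono = wavfile.read("source_mono.wav")       # hypothetical mono source
_, hrir_l = wavfile.read("hrir_L_az30_el0.wav")  # hypothetical measured HRIRs
_, hrir_r = wavfile.read("hrir_R_az30_el0.wav")

mono = mono.astype(np.float64)

# Convolution with each ear's impulse response imposes that direction's
# interaural time difference, level difference, and spectral cues.
left = fftconvolve(mono, hrir_l.astype(np.float64))
right = fftconvolve(mono, hrir_r.astype(np.float64))

binaural = np.stack([left, right], axis=-1)
binaural /= np.max(np.abs(binaural))             # normalize to avoid clipping
wavfile.write("binaural_out.wav", fs, (binaural * 32767).astype(np.int16))
```

Note that this renders exactly one fixed direction for one set of ears; the hard parts the conference wrestled with are getting *your* HRIRs rather than a generic set, and updating the rendering as your head moves.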
What about the aesthetics of music production via VR? Does it make sense to place a soundfield microphone or quad binaural microphone (yes, they exist) in the midst of a performing ensemble and then deliver it via headphones with motion tracking to VR listeners? The people at Sony Music showed off a VR music video at a recent event and limited the audio to the standard released stereo master. They disconnected the visual experience from the soundtrack, which diminishes the whole thing. My Blu-ray productions don’t try to be VR experiences, but they do have compelling video (some in 3D) and 5.1 surround sound. The most common complaint or comment I get from customers is the disconnect between the video and the audio. They see a drum kit on the left-hand side of the stage and hear the drums on the right side of the 5.1 surround mix. They ask if I can’t fix it or make the music mixes track the visuals. No, I can’t. In my productions, the music comes first. The video is a bonus. If you feel the video and audio are at odds with each other, switch the video off!
The presenters at the AES AVAR event didn’t address the problems of getting VR video to “sync” with music. In fact, one guy actually panned the lead singer of a live music performance around the 3D space to “lock” the vocal track to the video. Never mind that the whole mix was destroyed. We are in the early days of VR and music. It reminds me of the dawn of 5.1 surround mixing all those years ago. There are no templates to follow.
My fear is that the excitement of the new technology will relegate the power of a well-recorded music track to a supporting role behind a 360 video. Music should maintain its place as a premiere experience. The VR guys have got to figure this out. And from what I saw and heard at AVAR, they are a long way from doing so.
The audio portion of AVAR has a long way to go if it is ever going to work satisfactorily. This is especially true if the sound is to come through headphones. Now the fatal conceptual flaws in the current audio technology will come face to face with the realities of human hearing, and the required advance will likely be a long way off. I’ve been experimenting with constructing and understanding sound fields and how they are perceived for over 40 years.

Here is what the required specification will be. The sound must change from one track to another as you move your head, and the tracks will have to cover both the horizontal and vertical planes. The incremental angle of incidence between adjacent tracks must be as fine as or finer than the resolving power of the human ears and brain for detecting angle of incidence. And the change will have to be as fast as or faster than the human brain can detect. That means that from the turn of one’s head in any plane or direction, detecting and switching will have to be complete within two to five microseconds. Fail this and it won’t work: the audio will always lag the video and will be disconnected from it. The direction of sound will not keep up or coincide with the direction of what is seen. Junky AVAR will be even worse than the junky two- and five-channel formats audiophiles are affixed to. There won’t be any faking it this time. I predict it will be a long time before “THEY” figure it out. As you know, I don’t think they are any too sharp.
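To put rough numbers on that specification, here is a small illustration of my own (not something presented at AVAR) using the Woodworth spherical-head approximation for interaural time difference (ITD). It estimates how much the ITD shifts when the head turns by a single degree, which is the kind of change a tracking renderer has to keep up with; the head radius and speed of sound are standard approximate values.

```python
# Rough arithmetic on the head-tracking requirement, using the Woodworth
# spherical-head approximation for interaural time difference (ITD):
#     ITD(theta) = (a / c) * (theta + sin(theta))
# where a is the head radius, c the speed of sound, theta the azimuth.
import math

a = 0.0875  # approximate average head radius in meters
c = 343.0   # speed of sound in m/s

def itd(theta_rad):
    return (a / c) * (theta_rad + math.sin(theta_rad))

# ITD shift produced by a 1-degree head turn away from a frontal source:
delta = itd(math.radians(1.0)) - itd(0.0)
print(f"ITD shift for a 1-degree turn: {delta * 1e6:.1f} microseconds")
# Prints roughly 8.9 microseconds, on the order of the smallest interaural
# timing changes listeners can detect, so the tracker must resolve
# single-degree head movements and update the cues almost instantly.
```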
Mark Wrote: “The people at Sony Music showed off a VR music video at a recent event and limited the audio to the standard released stereo master. They disconnected the visual experience from the soundtrack — which diminishes the whole thing”.
Mark then wrote: “…They ask if I can’t fix it or make the music mixes track the visuals. No, I can’t. In my productions, the music comes first. The video is a bonus. If you feel the video and audio are at odds with each other, switch the video off!”
You countered your own argument.
It is possible to present video and audio where there is a spatial mismatch between the two without one format distracting from the other; the BBC do this all the time with their Live Lounge videos: https://www.youtube.com/channel/UC-FQUIVQ-bZiefzBiQAa8Fw .
Just a little attention to detail could make the difference between a video that distracts from the audio and one that enhances it.
Dave, not quite. For me the music comes first and the video second. For the VR people, the visual is the primary component of the experience. I’m not sure anyone has figured it out. I’ll take a look at the BBC Live Lounge videos.
The proof that the static model of sound and hearing doesn’t work is the abject and immediate failure of binaural recordings played through headphones. Binaural meets all three criteria of the static model: time of arrival, loudness of arrival, and HRTF. Yet it fails anyway. This is because, being fixed to your head, the recording moves with your head. The sound field is the equivalent of two scalars, not vectors.
It’s not a matter of which is more important, sound or sight. They work together. Because sound is heard in 360 degrees in all planes while the field of sight is far more limited, sound can be relied on to identify the direction it is coming from. With the slightest turn of the head, your brain, shaped by eons of evolution, compares the change in time of arrival between your ears with the change in the position of your head. It is no accident that the organ which senses head position is directly adjacent to the organ which hears. Your immediate instinct is to look in the direction a sound comes from. Whether you are predator or prey, this ability to use sight and sound together is one of the most important tools for survival.
The reflected sound whose direction you usually cannot identify is far from worthless. By associating it and comparing it with the first-arriving sound, it tells you a lot about the nature of your surroundings.
I don’t agree that binaural is an “abject and immediate” failure... in fact, it’s very dramatic and immediately immersive. No, it doesn’t track, but it brings the sound outside of your head.
Were it true, binaural would be the ideal recording method. It hears what you would hear and puts that sound right where you’d hear it. Why isn’t it more popular? Why hasn’t it replaced stereophonic sound, an altogether different two-channel technology? Because as soon as you turn your head even slightly, your brain immediately comes to the only possible conclusion it can: that the source of sound is inside your head or, at best, right outside your ear. This is because when the sound turns with your head, it is the equivalent of two scalar fields, not a vector field. It has no directional properties. How disappointing it must have been to those who first thought it up. They knew it didn’t work, they knew that rotating your head made its failure immediately obvious, but they never asked why it failed, let alone learned from the answer.
There are whole catalogs of binaural recordings available — I made one for the Pasadena Symphony Orchestra years ago. The problem isn’t with the technology, it’s with the production methodologies of the major record labels. You’re right that the sound doesn’t track when you move your head (except with systems like the Smyth Realizer), but there are other qualities that make it improbable as a commercial format.
You can watch Choueiri’s video and listen through headphones and convince yourself there is a bee buzzing around your head behind it. But close your eyes and the effect is lost: it’s buzzing inside your head or just outside your ears. You’d have to have a sufficient number of binaural recordings so that the minimum angle, vertically and horizontally, between one and the next is finer than you can detect aurally. And the system would have to sense the movement of your head and switch to another pair of tracks within the time you can detect the difference of arrival times at each ear, which is between 2 and 5 microseconds. This is what it would take to make a large number of scalars seem like a vector field. It is far beyond the capability of current technology.
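To give a sense of the scale that track-switching scheme implies, here is a back-of-the-envelope count of the pre-rendered binaural pairs it would require. The angular steps are illustrative assumptions based on approximate minimum-audible-angle figures (roughly 1 degree in azimuth for frontal sources, a few degrees in elevation); real resolution varies with direction and frequency.

```python
# Back-of-the-envelope count of pre-rendered binaural pairs for a
# track-switching scheme. The angular steps below are assumed,
# approximate minimum audible angles; real values vary with direction
# and frequency.
import math

azimuth_step = 1.0    # degrees, roughly the horizontal minimum audible angle
elevation_step = 4.0  # degrees, an assumed vertical resolution

n_azimuth = math.ceil(360 / azimuth_step)      # full circle of azimuths
n_elevation = math.ceil(180 / elevation_step)  # pole-to-pole elevations

print(f"Binaural pairs required: {n_azimuth * n_elevation}")
# 360 * 45 = 16,200 pairs, each of which would have to be ready to play
# within microseconds of a detected head movement.
```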