Concerning YouTube & Loudness

I’m being a bit stupid throughout this whole thing. If you actually just want to know what I plan to do for YouTube, skip to the end. This whole thing has me thinking I’m an idiot. But, being dumb aside, I feel like my target specs make sense for the system and I plan on testing and implementing it in the course of putting up my own company’s programming.
Frontmatter: None of this is important, you should skip it.
So, uh. Yeah. YouTube’s “loudness normalization” is half implemented at best. I am in the middle of preparing a series that my buddy and I are going to self-distro through YouTube–more on that another time–so I figured I would take a moment to revisit the common wisdom as it concerns audio specs there. In the last couple of years, I have always read the -14 LUfs stuff and thought, “that seems a bit hot,” and then just never thought much of it beyond that (there’s actually a section on my loudness reporting sheet dedicated to it though, maybe a redesign is warranted). Few clients have ever actually requested YT spec specifically, and almost all material generally destined for playback on the “web” just gets the aggregated -18 LUfs with a peak of -1.2 or -1.1 dBfs from me. These “web” deployments could be a self-contained player on a client’s website, playback on more than one video platform where the editors don’t want to bother with the individual specs for each, or some amalgamation of these things. Yada, yada.
Well, I hadn’t properly seen any official specs before, and I didn’t particularly care, but I thought, “surely YT or Google has published best practices somewhere, no?” No. No, they haven’t, at least not in a way that is accessible through normal web searches and such. The only bits of information on the street about actual specs for YT never seem to link to any actual authority; they have no connection to any actual material that YouTube or Google has published. At best they just link to other blogs or plugin developers pushing some loudness meter on you. Semi-non-sequitur: for generally understanding normalization, Mastering the Mix has a pretty good visualization of how all this (should) work and why you should care, but the table in the article is vague and…everything demands a range of 9 LU? Huh? Literally, nothing I measured had that large of a range. I guess a proper film mix probably does. But whatever, the chart still doesn’t make sense on that bit.
Okay, Here’s the Actual Stuff

By no means is this an exhaustive look at content, nor a particularly tidy one. I literally just pulled a few recent videos from channels I watch regularly, and I have always noticed a somewhat glaring disparity in level with particular channels. Before we get into it, if you’re not sure about some of these columns, here’s what everything is:
- Runtime: Self-explanatory, I hope
- BN: Is this uploaded by a regular (cable) broadcast network?
- Long: Long-term program measurement, or integrated measurement in LKfs (LUfs–potato, potato) over the entire duration.
- Mmax: The level of the loudest momentary measurement (a “moment” here is set to an average reading over 10 seconds) anywhere within the program.
- Range: Again, this is pretty self-explanatory. How far is the highest bit from the lowest bit, excluding outliers in the top 5% and bottom 10%.
- TP: True peak. The highest peak in the program.
- YT “CL”: YouTube calibrated level, from the “stats for nerds.” This is what YT’s system figures the integrated loudness of the program is off by compared to their target.
- Comp: The integrated reading I took (“long”) adjusted per YT’s figure. The bottom is the average.
- NormFact: Normalization factor. Also from YT stats. This is where the dumb stuff starts, we’ll get to that and I’ll explain in a bit.
- ADJLVL: The adjusted level that YT’s system output for playback. Again, this is where the masterclass in stupid stuff begins.
- I’m using Waves Loudness Meter Plus and just running with the A/85 All (no anchor) preset.
- Also, here’s the spreadsheet if you want it.
All of this tells us a couple of things.
- The conventional broadcasters are just uploading their broadcasts as mixed for TV, which makes sense–why add another rev cycle? Unsurprising. But weirdly the TP for both SNL and Colbert is beyond the standard A/85 -2.0. And I checked, it’s not the web bumpers, it’s within the programming where those peaks hit.
- The digital-only or YT-centric channels seem to just kind of do whatever. I wonder if that is consistent across their catalogs? I’d wager not. Beside the point.
- No gain is being applied, programs are only being turned down.
Now the Stupid Stuff
So, at the end of the chart, we have the columns that show YT’s evaluation of the programming level as compared to their target. Taking the integrated loudness of each program and adjusting it by its respective YT adjustment number shows that, yes, -14 LUfs is pretty much the target for YT internally. Here’s where we get stupid. The NormFact thing reflects how YT is playing back the audio. So the 0.85 on the Veritasium video is just saying, “we’re turning this down by 15%.” What I didn’t expect to find here is that nothing is turned up.
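For a rough sense of what a factor like 0.85 means in decibels, here is a minimal sketch. It assumes the normalization factor is a linear amplitude multiplier, which is a guess and not something YouTube documents; the function names are made up for illustration.

```python
import math


def factor_to_db(factor: float) -> float:
    """Convert a linear amplitude factor to a gain in dB.

    ASSUMPTION: NormFact is a linear amplitude multiplier (0.85 = 85%).
    """
    return 20.0 * math.log10(factor)


def db_to_factor(gain_db: float) -> float:
    """Inverse of factor_to_db: gain in dB back to a linear factor."""
    return 10.0 ** (gain_db / 20.0)


if __name__ == "__main__":
    # The 0.85 figure is the Veritasium NormFact quoted above.
    print(round(factor_to_db(0.85), 2))
```

If that assumption holds, "turning this down by 15%" works out to a cut of only about 1.4 dB, which is a lot less dramatic than the percentage makes it sound.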
And, just like me, you may be thinking, “well if the peaks are too high, they won’t apply gain,” after all, that is how the system Spotify has in place works. You can look at Spotify’s loudness documentation (someone at YouTube/Google, oi, take notes!). But what about our old buddy Joe Scott? The TP there leaves 6 whole decibels of headroom to work with. So the system could have brought it up by 5, playing back at -17.5 with a TP of -1.0. To a lesser degree, this is true of most of the other programs I evaluated. And clearly, momentary maximums aren’t stopping the system from adding gain; Veritasium is well beyond -14 there. Here are the stats from Joe Scott vs Veritasium:

It occurs to me now that, given that the “normalized” value is a percentage, implying a maximum of 100/100, the system is probably not meant to apply gain by design. But this is kind of the whole “half” measure I mentioned that I don’t understand. Normalization is there to provide a consistent volume experience across a wide array of content. That’s the way it works everywhere else that employs loudness normalization, as far as I know. But here the possibility (and the reality, in my own experience) exists that you may watch a series of videos that are very quiet, say at -24 or even less in the case of amateur or non-professional content, slowly inch your playback device’s volume up over the course of that playback, and then be absolutely blown away by a video coming in at -14. So, like, what’s the point of it if it does not prevent that? Am I just failing to understand the fundamental purpose of a playback normalization system? Seriously, if I am just missing something, please TELL ME. This just seems stupid to me because, frankly, as it exists, turning off the entire normalization system would have no appreciable effect on playback.
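To make the difference concrete, here is a small sketch comparing a down-only normalizer (what YouTube appears to do) with a peak-aware one that is also allowed to add gain (the Spotify-style behavior described earlier). Both functions are hypothetical models, not anyone's real implementation. The -14 target and -1.0 ceiling are taken from the discussion above, and the Joe Scott numbers (-22.5 integrated, -6.0 TP) are back-calculated from the "up by 5 to -17.5 with a TP of -1.0" example.

```python
def down_only_gain(integrated: float, target: float = -14.0) -> float:
    """Gain in dB for a normalizer that only ever turns things down."""
    return min(0.0, target - integrated)


def peak_limited_gain(
    integrated: float,
    true_peak: float,
    target: float = -14.0,
    ceiling: float = -1.0,
) -> float:
    """Gain in dB for a normalizer that may also boost quiet programs.

    Positive gain is capped so the true peak never exceeds `ceiling`.
    """
    wanted = target - integrated
    if wanted <= 0.0:
        return wanted  # loud program: attenuate exactly as above
    headroom = max(0.0, ceiling - true_peak)
    return min(wanted, headroom)


if __name__ == "__main__":
    integrated, true_peak = -22.5, -6.0  # back-calculated Joe Scott figures
    for name, gain in (
        ("down-only", down_only_gain(integrated)),
        ("peak-limited", peak_limited_gain(integrated, true_peak)),
    ):
        print(f"{name}: gain {gain:+.1f} dB, plays back at {integrated + gain:.1f}")
```

With those inputs the down-only model leaves the program at -22.5, while the peak-limited model lifts it by 5 dB to -17.5. Both models treat a program hotter than the target identically, so the only thing the down-only design gives up is the boost for quiet material.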
If you want a visual, here are all the various waveforms compared:
 Quite the range of levels, no?
My Spec (Plan)
Considering all of this, I wanted to decide what my targets were going to be so that all the content I put on the platform is consistent, but also so that it plays well in a mix of other media from the site. It’s super simple. I want to compete with the “loud” videos without being too far up in the case that quieter programming is on before mine. I also want to protect against any kind of distortion via compression or transcoding and the like. Finally, I want to preserve dynamic range as much as possible. Not feature-length film range, but enough to have loud-ish dramatic events and soft-but-still-intelligible intimate whispers. But often range is simply a factor of the type of or subject of programming. Anyway.
Generally, I will just stick to dialogue anchored A/85. I will also read A/85 All on the master, but just as a safety–the dialogue anchor is the guiding measurement. Then for the YouTube mix, I am literally just going to add a secondary gain stage to bring the integrated loudness up to -16 ± 2 LUfs and limit true peak to -1.2 dBfs for safety. I will probably keep this gain amount a constant since that will keep the dialogue gain consistent across programming, though the overall loudness between programs beyond just the dialogue may make the A/85 All integrated readings vary a bit. Aiming for -16 should allow for some wiggle room, though, and YouTube shouldn’t push down my level unless it gets up to that -14 ish mark.
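As a quick way to sanity-check that plan on paper, here is a minimal sketch of the arithmetic. The function name, the example master readings (-24.0 integrated, -2.0 TP), and the 8 dB constant gain are all hypothetical; only the -16 ± 2 window, the -1.2 ceiling, and the roughly -14 turn-down point come from the spec above. It also ignores the fact that limiting the peaks will shave a little off the integrated reading.

```python
def check_youtube_mix(
    integrated: float,
    true_peak: float,
    gain_db: float,
    target: float = -16.0,
    tolerance: float = 2.0,
    tp_ceiling: float = -1.2,
    yt_target: float = -14.0,
) -> dict:
    """Predict where a constant gain stage lands a master against the spec."""
    new_integrated = integrated + gain_db
    new_peak = true_peak + gain_db
    return {
        "integrated": new_integrated,
        # Is the result inside the -16 +/- 2 window?
        "in_spec": abs(new_integrated - target) <= tolerance,
        # How much the true-peak limiter has to pull back, in dB.
        "limiter_reduction_db": max(0.0, new_peak - tp_ceiling),
        # How much YouTube would be expected to turn the program down, in dB.
        "yt_turn_down_db": max(0.0, new_integrated - yt_target),
    }


if __name__ == "__main__":
    # Hypothetical master readings and gain, for illustration only.
    print(check_youtube_mix(integrated=-24.0, true_peak=-2.0, gain_db=8.0))
```

The useful number in a check like this is the limiter reduction: if a constant gain asks the limiter for many dB on every program, the peaks are taking the hit and the dynamic range the spec is trying to preserve goes with them.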
That’s pretty much it. I plan on doing some tests before officially settling things. And, of course, the spec is just a guide for me, but I figured I would share. I will update once I’ve tested things out a bit.
Till next time,
C.



