This is something that is usually handled by the app receiving the input from the microphone (Google Meet, Teams, etc.). The app breaks the audio into its frequency components, keeps the ones that fall within the human voice range, and rejects everything else. This goes by names like voice isolation, and it has been enabled by default in all major meeting apps for a while now.
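To make the frequency-band idea concrete, here is a minimal sketch that keeps only a rough speech band and attenuates everything else. The 300-3400 Hz edges (the classic telephony band) and the filter choice are illustrative assumptions on my part; real meeting apps use far more sophisticated, often ML-based, suppression.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def bandpass_voice(samples: np.ndarray, sample_rate: int,
                       low_hz: float = 300.0, high_hz: float = 3400.0) -> np.ndarray:
        """Butterworth band-pass around typical speech frequencies (illustrative)."""
        sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=sample_rate, output="sos")
        return sosfiltfilt(sos, samples)

    # One second of synthetic audio: a speech-band tone plus 60 Hz mains hum.
    fs = 16_000
    t = np.arange(fs) / fs
    audio = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 60 * t)
    filtered = bandpass_voice(audio, fs)  # the 60 Hz hum is strongly attenuated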
I'm surprised to hear that it doesn't seem to work for you when the audio is generated by a different browser; that shouldn't make a difference.
Assuming OP is correct, your last sentence implies this isn't the solution being used.
Additionally, many (citation needed) YouTube videos have people talking in them; this method wouldn't help with that.
Isolating vocals in general is significantly more difficult than just relying on frequency range. Any instrument I can think of can generate notes that fall squarely within the common range of the human voice (see: https://www.dbamfordmusic.com/frequency-range-of-instruments...)
I was trying to informally describe the use of Fourier transforms to achieve the isolation. Success will vary with the situation; more recent systems also use ML, which gives more uniform results for this particular use case.
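As a toy illustration of that Fourier-transform framing: transform a block of audio to the frequency domain, zero out the bins outside a rough speech band, and transform back. The 300-3400 Hz band is an assumption for illustration; production systems work on short overlapping windows, use smoother masks, and increasingly rely on ML.

    import numpy as np

    def fft_voice_mask(samples: np.ndarray, sample_rate: int,
                       low_hz: float = 300.0, high_hz: float = 3400.0) -> np.ndarray:
        """Zero out frequency bins outside a rough speech band (illustrative)."""
        spectrum = np.fft.rfft(samples)
        freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
        mask = (freqs >= low_hz) & (freqs <= high_hz)  # keep only speech-band bins
        return np.fft.irfft(spectrum * mask, n=len(samples))

    # Speech-band tone mixed with a high-pitched 8 kHz tone.
    fs = 16_000
    t = np.arange(fs) / fs
    mixed = np.sin(2 * np.pi * 800 * t) + np.sin(2 * np.pi * 8000 * t)
    cleaned = fft_voice_mask(mixed, fs)  # the 8 kHz component is removed, 800 Hz remains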
The initial question may be specific, to a certain degree, to the way one particular browser handles things, but the comment was also trying to communicate that this can go beyond the browser and can actually be handled by the application. The microphone itself can also participate at some level if it features noise suppression or other enhancements.
The surprise about things being different when using a separate browser comes from assuming that any audio reaching the microphone should be processed the same way if Fourier transforms (or machine learning, where applicable) are used, so the audio source shouldn't matter.