This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
The spec says "applied to nodes of a cue that is not part of a region, except that the initial containing block is region." However, "initial containing block" is also the concept used to define vw/vh: http://dev.w3.org/csswg/css-values/#viewport-relative-lengths Taken together, vw/vh lengths would resolve against the region's box instead of the video viewport, which would make the text size shrink inside regions.
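To illustrate the arithmetic behind this report, here is a minimal sketch. The pixel dimensions, the 5vh font size, and the helper name resolve_vh are assumptions chosen for illustration; they are not taken from the spec text quoted above.

```python
def resolve_vh(value_vh: float, icb_height_px: float) -> float:
    """Resolve a length given in vh against an initial containing block height."""
    # 1vh is 1% of the height of the initial containing block.
    return value_vh * icb_height_px / 100.0


# Assumed values, for illustration only.
VIDEO_HEIGHT_PX = 360.0   # hypothetical video viewport height
REGION_HEIGHT_PX = 54.0   # hypothetical height of a region's box
FONT_SIZE_VH = 5.0        # hypothetical cue font size in vh

# Initial containing block is the video viewport.
font_in_viewport = resolve_vh(FONT_SIZE_VH, VIDEO_HEIGHT_PX)

# Initial containing block is the region, as the quoted wording implies.
font_in_region = resolve_vh(FONT_SIZE_VH, REGION_HEIGHT_PX)

print(f"viewport as ICB: {font_in_viewport:.1f}px")
print(f"region as ICB:   {font_in_region:.1f}px")
```

Under these assumed numbers, the same 5vh font size resolves to a much smaller pixel value once the region is treated as the initial containing block, which is the shrinking effect described above.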
Right, good catch. The spec is overloading the meaning of "initial containing block". That wording should only be used for the video viewport, but here it is also used for the region's box. The definition of a region is indeed not meant to change the video viewport; the region just needs its own background box.
Fixed in https://github.com/w3c/webvtt/pull/33 (more work is needed in this area, but one bug at a time...)