Abstract: Open-vocabulary semantic segmentation seeks to label each pixel in an image with arbitrary text descriptions. Vision-language foundation models, especially CLIP, have recently emerged as ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results