The W2VV++ model BoW variant integrated to VIRET and SOMHunter systems has proven its effectiveness in the previous Video Browser Showdown competition in 2020. As a next experimental interactive search prototype to benchmark, we consider a simple system relying on the more complex BERT variant of the W2VV++ model, accepting a rich text input.
The input can be provided by keyboard or by speech processed by a third-party cloud service. The motivation for the more complex BERT variant is its good performance for rich text descriptions that can be provided for known-item search tasks.
At the same time, users will be instructed to specify as rich text description about the searched scene as possible.