Fig. 12

From: Lifelogging caption generation via fourth-person vision in a human–robot symbiotic environment

Top-down attention maps of KMeans and Ensemble. The yellow rectangular regions have large attention weights, and the group of regions in each column conditions the decoder when it generates the corresponding word. It can be observed that our proposed method, KMeans, discriminates between instances, in contrast to Ensemble. Best viewed in color.