Fig. 3
From: Lifelogging caption generation via fourth-person vision in a human–robot symbiotic environment
Baseline model UpDown [20]. The block Attn denotes the attention module. \(v_t\) and \(p_t\) denote the attended feature and the vocabulary logits at time step \(t\), respectively. \(h_{t-1}\) denotes the previous states of the RNN decoder.
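The caption names the attention block but does not give its form. As a rough illustration only, the sketch below assumes an additive soft attention of the kind commonly used in captioning decoders: each image-region feature is scored against the previous decoder state \(h_{t-1}\), the scores are normalized with a softmax, and the attended feature \(v_t\) is the weighted sum of the region features. The function names (`attend`, `vecmat`), the weight matrices `W_v`, `W_h`, `w_a`, and all dimensions are hypothetical and are not taken from the article.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def vecmat(vec, mat):
    """Return vec @ mat, where mat is a list of len(vec) rows."""
    return [
        sum(vec[i] * mat[i][j] for i in range(len(vec)))
        for j in range(len(mat[0]))
    ]


def attend(features, h_prev, W_v, W_h, w_a):
    """Assumed additive attention (illustrative, not the article's exact module).

    features: k region features, each of length d
    h_prev:   previous decoder state h_{t-1}, length m
    W_v:      d x a projection, W_h: m x a projection, w_a: length-a scoring vector
    Returns the attended feature v_t (length d) and the attention weights (length k).
    """
    h_proj = vecmat(h_prev, W_h)
    scores = []
    for region in features:
        r_proj = vecmat(region, W_v)
        hidden = [math.tanh(r + h) for r, h in zip(r_proj, h_proj)]
        scores.append(sum(w * x for w, x in zip(w_a, hidden)))
    alpha = softmax(scores)
    d = len(features[0])
    v_t = [
        sum(alpha[i] * features[i][j] for i in range(len(features)))
        for j in range(d)
    ]
    return v_t, alpha


# Toy dimensions chosen only for the demonstration.
random.seed(0)
k, d, m, a = 5, 8, 6, 4
rand_matrix = lambda rows, cols: [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]
features = rand_matrix(k, d)
h_prev = [random.gauss(0, 1) for _ in range(m)]
W_v, W_h = rand_matrix(d, a), rand_matrix(m, a)
w_a = [random.gauss(0, 1) for _ in range(a)]

v_t, alpha = attend(features, h_prev, W_v, W_h, w_a)
```

In a full decoder, \(v_t\) would then be fed with the decoder state into a further layer that produces the vocabulary logits \(p_t\); that step is omitted here because the caption does not describe it.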