Table 1 Ablation study of image captioning performance on our dataset

From: Lifelogging caption generation via fourth-person vision in a human–robot symbiotic environment

| Input perspective | Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L | METEOR | CIDEr-D | SPICE |
|---|---|---|---|---|---|---|---|---|---|
| First | UpDown [20] | 51.20 | 33.47 | 20.41 | 11.25 | 38.85 | 17.45 | 21.44 | 12.19 |
| Second | UpDown [20] | 60.86 | 43.24 | 31.12 | 21.19 | 45.60 | 19.46 | 16.94 | 12.08 |
| Third | UpDown [20] | 42.80 | 26.56 | 16.17 | 9.70 | 31.34 | 13.73 | 6.79 | 6.28 |
| Second, Third | Ensemble | 59.14 | 41.97 | 30.45 | 21.06 | 44.09 | 19.13 | 15.18 | 11.40 |
| Second, Third | KMeans | 62.31 | 45.34 | 33.16 | 22.91 | 46.22 | 20.19 | 17.76 | 12.21 |
| First, Third | Ensemble | 59.06 | 42.78 | 30.47 | 20.28 | 45.16 | 20.33 | 27.71 | 14.37 |
| First, Third | KMeans | 60.83 | 44.71 | 32.03 | 21.48 | 46.27 | 21.16 | 30.10 | 15.02 |
| First, Second | Ensemble | 62.08 | 45.37 | 32.82 | 22.47 | 47.67 | 21.68 | 30.03 | 15.04 |
| First, Second | KMeans | 62.43 | 45.78 | 32.90 | 22.19 | 47.61 | 21.87 | 30.76 | 15.24 |
| First, Second, Third | Ensemble | 63.12 | 46.37 | 34.08 | 23.71 | 47.92 | 21.72 | 29.52 | 14.99 |
| First, Second, Third | KMeans | *65.09* | *48.93* | *36.02* | *24.78* | *49.13* | *22.79* | *33.41* | *15.72* |

Highest values are in italic.
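
The footnote's claim can be checked directly against the numbers in Table 1. The following is a minimal sketch, not part of the original article: the names `METRICS`, `ROWS`, and `best_per_metric` are illustrative, and joining perspective labels with "+" is a convention chosen here for convenience. The scores are transcribed from the table as shown.

```python
# Metric columns, in the order they appear in Table 1.
METRICS = ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4",
           "ROUGE-L", "METEOR", "CIDEr-D", "SPICE"]

# (input perspectives, method) -> scores, transcribed from Table 1.
# The "+"-joined keys are an illustrative convention, not the paper's notation.
ROWS = {
    ("First", "UpDown"):                [51.20, 33.47, 20.41, 11.25, 38.85, 17.45, 21.44, 12.19],
    ("Second", "UpDown"):               [60.86, 43.24, 31.12, 21.19, 45.60, 19.46, 16.94, 12.08],
    ("Third", "UpDown"):                [42.80, 26.56, 16.17,  9.70, 31.34, 13.73,  6.79,  6.28],
    ("Second+Third", "Ensemble"):       [59.14, 41.97, 30.45, 21.06, 44.09, 19.13, 15.18, 11.40],
    ("Second+Third", "KMeans"):         [62.31, 45.34, 33.16, 22.91, 46.22, 20.19, 17.76, 12.21],
    ("First+Third", "Ensemble"):        [59.06, 42.78, 30.47, 20.28, 45.16, 20.33, 27.71, 14.37],
    ("First+Third", "KMeans"):          [60.83, 44.71, 32.03, 21.48, 46.27, 21.16, 30.10, 15.02],
    ("First+Second", "Ensemble"):       [62.08, 45.37, 32.82, 22.47, 47.67, 21.68, 30.03, 15.04],
    ("First+Second", "KMeans"):         [62.43, 45.78, 32.90, 22.19, 47.61, 21.87, 30.76, 15.24],
    ("First+Second+Third", "Ensemble"): [63.12, 46.37, 34.08, 23.71, 47.92, 21.72, 29.52, 14.99],
    ("First+Second+Third", "KMeans"):   [65.09, 48.93, 36.02, 24.78, 49.13, 22.79, 33.41, 15.72],
}


def best_per_metric(rows):
    """Return {metric: (config, score)} for the highest score in each column."""
    best = {}
    for i, metric in enumerate(METRICS):
        # Pick the configuration whose i-th score is largest.
        config = max(rows, key=lambda k: rows[k][i])
        best[metric] = (config, rows[config][i])
    return best


if __name__ == "__main__":
    for metric, (config, score) in best_per_metric(ROWS).items():
        print(f"{metric:8s} {score:6.2f}  {config[0]} / {config[1]}")
```

Run on these values, every metric's maximum falls in the three-perspective KMeans row, which agrees with the footnote.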