NLP/Paper Review

[SentenceSimilarity] SentSim: Crosslingual Semantic Evaluation of Machine Translation (Review)

joannekim0420 2021. 8. 18. 09:25

FOCUS

  1. Using multilingual BERT removes the need for a reference sentence.
  2. Sentence semantic similarity linearly combines sentence embeddings and word embeddings → captures both word-level and compositional semantics.

METHODS

  • WMD (Word Mover's Distance)

→ The distance between semantically similar words of document A and document B

(= computing the semantic distance between two text documents by aligning semantically similar words and capturing the word traveling flow between them, using the vectorial relationship between their word embeddings)
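The alignment idea above can be sketched with the *relaxed* WMD lower bound, where each word simply moves its mass to its nearest neighbor in the other document (the full WMD solves an optimal-transport problem; the toy 2-D embeddings here are hypothetical, for illustration only):

```python
import numpy as np

def relaxed_wmd(doc_a, doc_b, embeddings):
    """Relaxed Word Mover's Distance: each word in doc_a moves all of its
    mass to its nearest word in doc_b (a lower bound on the true WMD)."""
    dists = []
    for w_a in doc_a:
        # distance from w_a to its closest word in doc_b
        d = min(np.linalg.norm(embeddings[w_a] - embeddings[w_b]) for w_b in doc_b)
        dists.append(d)
    # uniform word weights, as in the bag-of-words formulation
    return float(np.mean(dists))

# toy 2-D embeddings (hypothetical)
emb = {
    "obama": np.array([1.0, 0.0]),
    "president": np.array([0.9, 0.1]),
    "speaks": np.array([0.0, 1.0]),
    "talks": np.array([0.1, 0.9]),
}

d = relaxed_wmd(["obama", "speaks"], ["president", "talks"], emb)
```

A small distance here means the two documents use semantically close words even though no word overlaps.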

 

  • BERTScore

Computes the semantic similarity between a reference sentence and a machine-generated sentence.

 

  • SSS (Semantic Sentence Similarity)

Computes sentence similarity as the cosine distance between two vectors that each summarize a sentence.
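With the sentence vectors in hand (in practice produced by a semantically fine-tuned encoder such as Sentence-BERT), the score is just cosine similarity; a minimal sketch:

```python
import numpy as np

def sss(sent_emb_a, sent_emb_b):
    """Semantic Sentence Similarity: cosine similarity of two sentence vectors."""
    a, b = np.asarray(sent_emb_a, dtype=float), np.asarray(sent_emb_b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```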

 

  • SENTSIM 

(A: sentence-level metric, B: token-level metric)

To extend semantic similarity to the token level, SentSim combines the cosine similarity of semantically fine-tuned sentence embeddings with contextual word embeddings.
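Conceptually, SentSim fuses the sentence-level score (SSS) with the token-level score (BERTScore). A hypothetical sketch with a simple weighted average; the paper's exact combination scheme may differ:

```python
def sentsim(sss_score, bertscore_f1, alpha=0.5):
    """Illustrative combination of a sentence-level score (SSS) and a
    token-level score (BERTScore F1); weighting is an assumption here."""
    return alpha * sss_score + (1 - alpha) * bertscore_f1
```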

 

DATASET

  • Multi-30k (2018) - English-German / English-French image description dataset (2,000 sentence tuples each)
  • WMT17 - German, Chinese, Latvian, Czech, Finnish, Turkish, Russian (to-English) / Russian, Chinese (from-English) (560 sentence tuples) → main experimental data
  • WMT20 - Sinhala, Nepali, Estonian (to-English) / German, Chinese, Romanian, Russian (from-English) (1,000 sentence tuples) → crosslingual evaluation

 

WMT17 Example

Compared against the reference sentence, BERTScore gives higher scores to sentences that negate the meaning of the original.

SSS gives high scores to MT1 and MT2 and low scores to MT3 and MT4.

→ Combining BERTScore and SSS = SentSim

 

RESULT

  • Pearson correlation with human scores for Multi-30k with RoBERTa-Base in the SRC-MT (Source - Machine Translation) and MT-REF (Machine Translation - Reference) settings.

   For the latter, German-to-German and French-to-French are evaluated as monolingual tasks.

  • Pearson correlation with human scores for WMT-17 with RoBERTa-Base in the SRC-MT (Source - Machine Translation) setting.

→ The gap between the Multi-30k and WMT-17 results comes from sentence length: Multi-30k sentences are 12-14 words, while WMT17 sentences are longer, so performance is lower with BERTScore (which relies on word alignment) and higher with WMD (which considers the whole sentence).

 

  • Pearson correlation with human scores for the WMT-20 dataset with RoBERTa-Base in the SRC-MT setting.

  • Examples from various datasets, including comparisons among BERTScore, SSS, and SentSim (SSS + BERTScore).

 

 

Paper: https://aclanthology.org/2021.naacl-main.252.pdf

GitHub: https://github.com/Rain9876/Unsupervised-crosslingual-Compound-Method-For-MT