Explanations
  positive appraisal and understanding
Negative Logits
  Token                          Value
  Agreed                         0.51
  disagreed                      0.50
  好吧 ("okay")                   0.49
  disagree                       0.46
  的态度 ("'s attitude")          0.45
  calmed                         0.45
  नाराजगी ("displeasure")         0.44
  ok                             0.44
  Alright                        0.44
  agree                          0.44
Positive Logits
  Token                          Value
  fascinating                    0.82
  интересный ("interesting")     0.66
  clever                         0.64
  ingenious                      0.61
  interesting                    0.60
  Interesting                    0.59
  Interesting                    0.59
  Nowadays                       0.57
  интерес ("interest")           0.55
  marvelous                      0.54
Activations Density: 0.005%