INDEX
Explanations
expressions of commendation or praise
Negative Logits
etra     -0.58
into     -0.55
Levin    -0.53
Slo      -0.52
ne       -0.52
nage     -0.51
b        -0.51
зму      -0.51
__))     -0.50
costi    -0.50
Positive Logits
praise          1.87
praises         1.74
praising        1.72
praised         1.71
praise          1.70
applaud         1.54
Praise          1.53
commend         1.48
commendation    1.43
Praise          1.39
Activations Density 0.163%