INDEX
Explanations
expressions of opinions or reactions
expressions of feelings related to satisfaction or dissatisfaction
Negative Logits
cheat    -0.70
dated    -0.67
amen     -0.65
allow    -0.62
Ranked   -0.60
allows   -0.60
ramid    -0.59
hang     -0.59
oret     -0.58
igham    -0.58
Positive Logits
aback    0.91
ragon    0.71
dy       0.69
hearing  0.69
by       0.68
seeing   0.65
Hearing  0.65
ienced   0.64
about    0.63
citiz    0.63
Activations Density: 0.150%