INDEX
Explanations
phrases indicating surprise or contradiction
the word "actually" in various contexts
Negative Logits
lain      -0.73
illed     -0.69
fu        -0.69
bye       -0.68
wich      -0.68
cit       -0.67
oute      -0.66
heid      -0.63
legged    -0.63
ado       -0.63
Positive Logits
comprom     0.88
meant       0.83
bothering   0.81
olkien      0.80
REALLY      0.77
metic       0.73
bother      0.72
quite       0.71
intended    0.68
actually    0.67
Activations Density 0.024%