INDEX
Explanations
statements that reveal something surprising or unexpected
the phrase "turns out"
Negative Logits
cious        -0.81
oided        -0.80
avorite      -0.77
resents      -0.77
erve         -0.76
erved        -0.75
shaw         -0.71
ettlement    -0.70
cius         -0.69
heed         -0.69
Positive Logits
âĶĢ          0.77
wards        0.76
GOODMAN      0.74
ctors        0.72
lier         0.69
Meet         0.67
¯¯¯¯¯¯¯¯     0.67
skirts       0.65
ymes         0.64
Ñĭ           0.64
Activations Density: 0.021%