Explanations
statements expressing honesty or frankness
Negative Logits
etting     -0.69
Blades     -0.68
arthy      -0.66
ammy       -0.65
tailed     -0.65
ied        -0.64
Landing    -0.64
tein       -0.60
Klu        -0.59
lav        -0.59
Positive Logits
speaking   1.04
龍喚士     0.85
zers       0.84
speaking   0.75
honestly   0.73
ometry     0.67
admit      0.67
tho        0.66
doubted    0.65
ashamed    0.64
Activations Density 0.032%