Explanations
references to falsehoods or deception
instances of the word "lie" in various contexts
Negative Logits
Token      Logit
obs        -0.75
alg        -0.74
iles       -0.71
ugal       -0.71
ittens     -0.70
liking     -0.70
ourning    -0.68
aud        -0.68
asted      -0.68
ilation    -0.67
Positive Logits
Token      Logit
lie        1.15
Lie        1.02
lie        0.93
Lie        0.88
uten       0.86
lies       0.83
utenant    0.83
lied       0.81
theoret    0.81
ously      0.80
Activations Density: 0.008%
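
The density figure is usually read as the fraction of token positions on which the feature fires. The sketch below assumes that reading, with "fires" taken to mean an activation strictly above zero. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and do not come from the page itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: a position counts as active when its value is strictly
    greater than `threshold` (0.0 by default).
    """
    activations = list(activations)
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 100,000 token positions, 8 of which activate the feature.
acts = [1.5] * 8 + [0.0] * 99_992
density = activation_density(acts)
print(f"{density:.3%}")
```

Under this reading, a density of 0.008% corresponds to roughly 8 active positions per 100,000 tokens, so the feature is very sparse.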