Explanations
statements related to truthfulness
occurrences of the word "truth."
Negative Logits
Token        Logit
uled         -0.86
wana         -0.77
capacity     -0.75
avy          -0.73
Rivals       -0.70
jin          -0.67
urations     -0.67
joining      -0.66
ATA          -0.66
oyal         -0.65
Positive Logits
Token        Logit
fulness      1.24
fully        1.06
psons        0.92
lessly       0.80
seeker       0.79
truth        0.78
ulent        0.78
serum        0.76
iness        0.76
telling      0.75
Activations Density: 0.014%