INDEX
Explanations
statements emphasizing the importance of truth
references to the concept of "truth."
Negative Logits
uled        -0.86
wana        -0.81
alian       -0.75
avy         -0.71
avia        -0.69
akings      -0.69
joining     -0.68
unes        -0.67
orks        -0.66
urations    -0.65
Positive Logits
fulness      1.20
fully        1.01
truth        0.92
truth        0.84
Truth        0.82
lessly       0.81
srfAttach    0.79
seeker       0.79
psons        0.79
iness        0.78
Activations Density 0.017%
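
As a rough illustration of how a density figure like the one above can be read, the sketch below assumes that activation density means the fraction of sampled tokens on which the feature fires above zero. That definition, the function name `activation_density`, and the sample data are all assumptions made for illustration; the dashboard itself does not state how the value was computed.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of sampled tokens with a
    strictly positive feature activation.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 17 active tokens out of 100,000 sampled tokens.
sample = [0.0] * 99_983 + [1.5] * 17
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.017% corresponds to the feature firing on roughly 17 of every 100,000 tokens, which fits a sparse feature tied to a narrow concept such as "truth".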