Explanations
words related to truth or honesty
references to the concept of truth
Negative Logits
Token       Logit
elig        -0.68
Pharm       -0.64
Spice       -0.61
fin         -0.61
Turk        -0.57
disadvant   -0.57
Sto         -0.57
Peninsula   -0.57
Krish       -0.57
uled        -0.56
Positive Logits
Token       Logit
fulness     1.61
fully       1.28
iness       1.10
telling     1.06
serum       1.00
about       0.97
ful         0.96
ulence      0.94
seekers     0.90
orial       0.88
Activations Density 0.041%