INDEX
Explanations
phrases related to truth or accuracy
statements about truth or verification
Negative Logits
ercise    -0.80
isphere   -0.75
igan      -0.74
reau      -0.66
cloth     -0.65
Habit     -0.63
wal       -0.62
boy       -0.58
aban      -0.58
lease     -0.56
Positive Logits
true      3.66
true      2.82
True      2.22
TRUE      2.15
True      2.11
false     1.85
false     1.77
truth     1.60
untrue    1.47
False     1.44
Activations Density: 0.027%