Explanations
statements or references related to claims of truthfulness
Negative Logits
drowned        -0.15
983            -0.14
Bench          -0.14
éļı            -0.14
adox           -0.14
ournals        -0.13
antal          -0.13
æ½             -0.13
unsuccessful   -0.13
iction         -0.13
Positive Logits
exposing       0.20
expose         0.18
exposure       0.18
truth          0.18
exposures      0.17
exposes        0.17
Expose         0.17
wakeup         0.17
readers        0.16
Truth          0.16
Activations Density 0.504%