INDEX
Explanations
phrases and concepts related to truth and transparency
Negative Logits
Token    Logit
ully     -0.18
sg       -0.17
alu      -0.16
sm       -0.16
otel     -0.16
AYS      -0.15
å¿Ĺ      -0.15
BÃł      -0.15
ati      -0.15
381      -0.14
POSITIVE LOGITS
Token    Logit
truth    0.22
Truth    0.21
truth    0.21
Truth    0.20
ãĥ³ãĤº   0.18
Expose   0.18
ÏģοÏį    0.17
truths   0.17
verdad   0.17
freeing  0.16
Activations Density: 0.078%