Explanations
phrases related to distinguishing truth from misinformation
Negative Logits
/trunk     -0.19
retro      -0.14
rych       -0.14
tech       -0.14
estone     -0.14
.basic     -0.14
lat        -0.14
.modelo    -0.14
aise       -0.13
Retro      -0.13
Positive Logits
ukan       0.16
baise      0.16
Lies       0.16
_GU        0.16
wind       0.15
íķµ        0.15
.Focus     0.14
noise      0.14
noise      0.14
oref       0.14
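Lists like the two above are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix and keeping the most negative and most positive entries. The sketch below illustrates that ranking step, assuming the logit effect for each token is the dot product of a decoder vector with the corresponding column of W_U. The function name `top_logits` and its parameters are hypothetical and not taken from any particular tool.

```python
def top_logits(decoder_dir, unembed, vocab, k=10):
    """Hypothetical sketch: rank tokens by a feature's direct logit effect.

    decoder_dir: list[float], length d_model (assumed feature decoder vector)
    unembed: d_model rows, each a list[float] of length len(vocab) (assumed W_U)
    vocab: list[str], token index -> token string
    Returns (most_negative, most_positive) lists of (token, logit) pairs.
    """
    n_vocab = len(vocab)
    # logit[t] = sum_i decoder_dir[i] * W_U[i][t]
    logits = [
        sum(decoder_dir[i] * unembed[i][t] for i in range(len(decoder_dir)))
        for t in range(n_vocab)
    ]
    # Sort token indices from most negative to most positive effect.
    order = sorted(range(n_vocab), key=lambda t: logits[t])
    negative = [(vocab[t], logits[t]) for t in order[:k]]
    # Take the k largest and list them in descending order.
    positive = [(vocab[t], logits[t]) for t in reversed(order[n_vocab - k:])]
    return negative, positive


neg, pos = top_logits(
    decoder_dir=[1.0, 0.5],
    unembed=[[0.1, -0.2, 0.3], [0.2, 0.0, -0.4]],
    vocab=["a", "b", "c"],
    k=1,
)
print(neg, pos)
```

With this toy input, token "b" has the most negative effect (about -0.2) and token "a" the most positive (about 0.2).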
Activation Density: 0.163%
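Activation density is usually read as the fraction of sampled token positions on which the feature fires; at 0.163%, that is roughly 163 of every 100,000 tokens. The minimal sketch below assumes "fires" means an activation strictly above a threshold of 0.0; the function name and the `threshold` parameter are illustrative assumptions, not a documented definition.

```python
def activation_density(activations, threshold=0.0):
    """Hypothetical sketch: fraction of positions where activation > threshold."""
    if not activations:
        return 0.0  # avoid division by zero on an empty sample
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


print(activation_density([0.0, 0.0, 0.5, 0.0, 1.2]))
```

Here 2 of the 5 positions are active, so the density is 0.4 (40%).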