INDEX
Explanations
phrases that emphasize the presence of a "fact" or assert statements about reality
Negative Logits
ryn       -0.16
nte       -0.16
ensis     -0.15
ILLISE    -0.15
ould      -0.14
еÑĢеÑĩ    -0.14
nek       -0.13
ILON      -0.13
thus      -0.13
룬        -0.13
Positive Logits
fact      0.21
itious    0.20
uality    0.18
ually     0.16
arding    0.15
zik       0.14
fact      0.14
annel     0.13
umas      0.13
dehy      0.13
Activations Density 0.021%