INDEX
Explanations
references to fact-checking and political claims
Negative Logits
mund      -0.16
hạ        -0.15
fur       -0.15
nod       -0.15
ichel     -0.14
ø         -0.14
borg      -0.14
Coin      -0.14
от        -0.14
пу        -0.14
Positive Logits
ersonic   0.16
asa       0.15
มาร       0.15
tamp      0.15
anners    0.15
ENCY      0.15
.hxx      0.15
ustum     0.15
aucoup    0.14
gang      0.14
Activations Density 0.007%