INDEX
Explanations
questions and statements related to knowledge or understanding
Negative Logits
sinon        -0.15
able         -0.15
oblin        -0.14
coincidence  -0.14
Feld         -0.14
Guill        -0.14
ible         -0.14
denn         -0.14
conduct      -0.14
inch         -0.14
Positive Logits
talking      0.20
Talking      0.19
signing      0.19
Signing      0.17
fos          0.16
wert         0.16
Hell         0.15
talk         0.15
-talk        0.15
Talk         0.14
Activations Density: 0.033%