INDEX
Explanations
negative or contradictory assertions in relation to personal knowledge or capability
Negative Logits
anim      -0.17
oke       -0.16
ing       -0.15
ëĵł       -0.15
McCart    -0.14
ะ         -0.14
urs       -0.14
sel       -0.14
eu        -0.14
ubre      -0.14
Positive Logits
lify      0.18
forth     0.17
theless   0.17
å¤ķ       0.16
iesen     0.14
kaç       0.14
rega      0.14
eny       0.14
hoff      0.14
thing     0.14
Activations Density: 0.016%