Explanations
negations or expressions of denial
Negative Logits
Token    Logit
Ïģθ      -0.16
hen      -0.16
uit      -0.15
inski    -0.15
es       -0.14
hop      -0.14
hone     -0.14
ois      -0.13
oader    -0.13
ein      -0.13
Positive Logits
Token          Logit
necessarily    0.25
ches           0.22
ori            0.21
quite          0.20
amp            0.18
anymore        0.18
quot           0.18
ched           0.17
CHED           0.17
rica           0.16
Activations Density 0.186%