Explanations
negations and questions related to personal circumstances or opinions
Negative Logits
Token        Logit
whose        -0.17
unday        -0.16
who          -0.16
’Ãł          -0.16
,            -0.15
äs           -0.15
sire         -0.14
Roe          -0.14
leftright    -0.14
inand        -0.14
Positive Logits
Token        Logit
sav          0.19
pou          0.17
aur          0.15
compt        0.15
eskort       0.15
cle          0.15
mange        0.15
conn         0.15
ROID         0.15
gles         0.15
Activations Density 0.014%