INDEX
Explanations
mentions of choices and comparisons
conditional or comparative phrases
Negative Logits
Token         Logit
ires          -0.66
ETS           -0.65
EMP           -0.62
Ident         -0.60
successfully  -0.59
mitter        -0.59
istors        -0.58
onday         -0.58
efer          -0.58
pter          -0.57
Positive Logits
Token         Logit
acle          1.24
chard         1.20
acles         1.18
nam           1.13
ifice         1.09
Else          1.06
acular        1.03
chid          1.02
ific          0.99
nery          0.98
Activations Density: 0.195%