Explanations
words indicating contrast or contradiction
Negative Logits
xca        -0.15
bilm       -0.15
zin        -0.15
abar       -0.14
grab       -0.14
iswa       -0.14
.jackson   -0.14
Niet       -0.13
Barton     -0.13
.sky       -0.13
Positive Logits
DBC        0.16
åħĦå¼Ł     0.15
åĽ         0.15
daq        0.15
erw        0.15
SOS        0.14
oce        0.14
Kraft      0.14
orks       0.14
ekler      0.13
Activations Density 0.028%