INDEX
Explanations
references to laws or prohibitions
Negative Logits
ocks      -0.16
alet      -0.16
andler    -0.15
yk        -0.15
aid       -0.15
ickey     -0.14
chos      -0.14
.ua       -0.14
ÏĢον      -0.14
cul       -0.13
Positive Logits
ishment   0.18
adoo      0.17
semble    0.15
زد        0.14
hatt      0.14
ala       0.14
veal      0.14
ioneer    0.14
DEM       0.14
itore     0.14
Activations Density 0.026%