Explanations
words related to revealing information or secrets
Negative Logits
Token        Logit
Nadu         -0.71
pity         -0.65
breath       -0.65
obsc         -0.64
OPS          -0.64
isu          -0.63
istically    -0.62
opsy         -0.60
stature      -0.60
isks         -0.60
Positive Logits
Token        Logit
llers        1.41
ller         1.30
rence        1.25
lling        1.20
ª            1.16
lez          1.11
aled         1.08
rent         1.05
lled         1.02
led          1.02
Activation Density: 0.041%