INDEX
Explanations
the word "allow"
Negative Logits
Token                 Logit
y                     -0.75
’                     -0.68
er                    -0.68
allow                 -0.64
allow                 -0.62
allowed               -0.59
en                    -0.57
®                     -0.57
(no visible token)    -0.57
芒                    -0.55
Positive Logits
Token                 Logit
Efq                   1.29
pleaſure              1.20
himſelf               1.18
myſelf                1.14
Jefus                 1.12
Eſ                    1.10
itſelf                1.05
Anſ                   1.05
extAlignment          1.05
Diſ                   1.04
Activations Density: 4.128%