Explanations
references to physical actions and their consequences
Negative Logits
ollen     -0.15
allo      -0.14
ADV       -0.13
syn       -0.13
Sug       -0.13
orb       -0.13
Modern    -0.13
заб       -0.13
fal       -0.13
illa      -0.13
Positive Logits
_CID      0.16
Ñĩе       0.15
éĩ        0.15
usch      0.15
üf        0.14
iy        0.14
ÙIJÙĩ      0.13
kop       0.13
uggy      0.13
czy       0.13
Activations Density 0.054%