INDEX
Explanations
specific names and titles of people or literary works
Negative Logits
spy         -0.15
dish        -0.14
kip         -0.14
elite       -0.14
Assignment  -0.14
Spy         -0.14
ucer        -0.14
仪          -0.14
hil         -0.14
dep         -0.14
Positive Logits
rac         0.23
Rac         0.20
rac         0.19
Quad        0.17
opard       0.17
scaff       0.17
Editor      0.16
aut         0.16
ese         0.16
editor      0.16
Activations Density: 0.015%