INDEX
Explanations
capitalized names and terms related to authority or prominence
Negative Logits
eman     -0.17
Twist    -0.17
olest    -0.16
ighton   -0.16
ext      -0.15
yg       -0.15
emann    -0.15
er       -0.15
zelf     -0.15
ksi      -0.14
Positive Logits
orca     0.23
enger    0.21
llll     0.19
engers   0.18
iance    0.18
tid      0.18
tach     0.17
ender    0.17
ure      0.17
acks     0.17
Activations Density: 0.005%