INDEX
Explanations
sections of text labeled as "Categories."
Negative Logits
aid      -0.16
Rubin    -0.14
Jung     -0.14
ollen    -0.14
sie      -0.14
moth     -0.14
Trev     -0.14
odom     -0.13
Shops    -0.13
ette     -0.13
Positive Logits
rong         0.17
åĽ           0.16
GOODMAN      0.16
.foundation  0.16
má           0.15
apeut        0.15
gien         0.14
deme         0.14
ynam         0.13
ooled        0.13
Activations Density: 0.004%