Explanations
references to individuals with specific backgrounds or professions
Negative Logits
usra     -0.15
vant     -0.15
rende    -0.15
anker    -0.15
anky     -0.14
ender    -0.14
mpp      -0.14
anki     -0.14
andex    -0.14
usr      -0.13
Positive Logits
byname     0.15
Integral   0.15
alive      0.14
Roch       0.14
hausen     0.14
bench      0.14
charg      0.14
ritz       0.14
른         0.14
202        0.13
Activations Density 0.019%