INDEX
Explanations
descriptors of character traits and social status
Negative Logits
rof      -0.16
ered     -0.16
æīį      -0.15
zon      -0.14
ering    -0.14
omor     -0.14
nty      -0.14
cara     -0.14
oner     -0.14
cope     -0.13
Positive Logits
Äįet     0.17
olest    0.16
aren     0.14
_ble     0.14
_recall  0.14
.pp      0.13
견       0.13
lingen   0.13
uebas    0.13
_locals  0.13
Activations Density: 0.009%