Explanations
references to feelings of shame and associated concepts
Negative Logits
esser      -0.16
ledo       -0.15
zos        -0.14
iets       -0.14
Fritz      -0.14
.          -0.14
erno       -0.14
vertime    -0.13
icut       -0.13
ockets     -0.13
Positive Logits
lessly     0.21
fully      0.21
ishly      0.16
ously      0.16
addock     0.15
ulen       0.15
.cx        0.15
LOCKS      0.15
broken     0.15
Ñģлов      0.15
Activations Density 0.012%