INDEX
Explanations
instances of self-reflection and commentary on societal standards
Negative Logits
Token     Logit
apon      -0.16
rames     -0.15
assin     -0.14
ehir      -0.14
chein     -0.14
oose      -0.14
ungi      -0.14
çe        -0.14
omba      -0.14
needle    -0.14
Positive Logits
Token     Logit
would     0.33
would     0.31
Would     0.28
Would     0.28
wouldn    0.27
würde     0.22
serait    0.20
seria     0.19
wäre      0.19
Wouldn    0.18
Activations Density 0.125%
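
The density figure can be read as the share of tokens on which the feature fires. The sketch below assumes that reading (the page itself does not define the term) and assumes "fires" means an activation strictly above zero. The function name, the threshold parameter, and the toy data are all hypothetical and are not taken from the dashboard.

```python
from typing import Sequence


def activation_density_percent(
    activations: Sequence[float], threshold: float = 0.0
) -> float:
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: density = (tokens with activation > threshold) / (all tokens).
    """
    if not activations:
        raise ValueError("need at least one token activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Toy data, not dashboard data: one active token out of 800.
toy_activations = [0.0] * 799 + [2.5]
density = activation_density_percent(toy_activations)
print(f"{density:.3f}%")
```

Under this reading, a density of 0.125% corresponds to the feature firing on roughly one token in every 800.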