INDEX
Explanations
phrases that highlight moral or ethical dilemmas related to harm and responsibility
Negative Logits
dera        -0.20
anki        -0.16
enson       -0.15
ifornia     -0.14
520         -0.14
curtains    -0.14
cke         -0.13
arsing      -0.13
ocket       -0.13
aud         -0.13
Positive Logits
nor            0.23
anymore        0.20
ANY            0.16
nor            0.15
anybody        0.14
licht          0.14
idia           0.14
newPosition    0.14
νοÏį           0.14
any            0.14
Activations Density: 0.213%