INDEX
Explanations
phrases and words related to moral judgment and ethical considerations
Negative Logits
Token      Logit
хи         -0.18
od         -0.17
elegance   -0.15
Od         -0.15
itaire     -0.14
ourt       -0.14
.omg       -0.14
core       -0.13
redi       -0.13
rig        -0.13
Positive Logits
Token      Logit
because    0.57
because    0.52
porque     0.48
Because    0.48
Because    0.46
потому     0.40
因为       0.39
karena     0.39
omdat      0.38
perché     0.37
Activations Density 0.217%