Explanations
instances of assessing moral judgments or character evaluations
Negative Logits
enco      -0.16
hwnd      -0.15
inea      -0.15
zim       -0.15
earch     -0.14
osaur     -0.14
UNUSED    -0.14
somehow   -0.14
Gn        -0.14
Liked     -0.13
Positive Logits
except    0.56
except    0.48
Except    0.41
Except    0.40
apart     0.39
кроме     0.36
aside     0.34
_except   0.34
除了      0.32
except    0.32
Activations Density 0.228%