Explanations
expressions or mentions related to moral principles or beliefs
mentions of abstract principles or moral standards
Negative Logits
geon          -0.77
brain         -0.74
--------------------------------------------------------  -0.72
rontal        -0.71
Runner        -0.69
sie           -0.68
interrupted   -0.68
waves         -0.67
slow          -0.67
wards         -0.66
Positive Logits
Values        0.92
values        0.90
values        0.75
ideals        0.74
Values        0.73
embodied      0.73
proposition   0.72
zsche         0.71
value         0.69
propositions  0.67
Activations Density: 0.013%