INDEX
Explanations
mentions of conscience or moral principles
references to moral awareness or ethical considerations
Negative Logits
eri       -0.84
olds      -0.70
ORPG      -0.68
ramid     -0.67
enty      -0.67
ishers    -0.65
Extended  -0.65
vati      -0.65
atern     -0.65
anas      -0.65
Positive Logits
conscience    0.92
fulness       0.87
ful           0.69
OME           0.67
disposition   0.66
compass       0.66
less          0.65
ngth          0.65
FUL           0.65
disobedience  0.64
Activations Density: 0.018%