Explanations
phrases related to moral principles or ethical standards
Negative Logits
Token           Logit
ittle           -0.68
geon            -0.67
sites           -0.67
sie             -0.65
fac             -0.65
availability    -0.65
azar            -0.65
ilant           -0.65
arf             -0.64
Raid            -0.63
Positive Logits
Token           Logit
ideals          0.95
principles      0.90
embodied        0.87
beliefs         0.83
Values          0.82
values          0.82
creed           0.79
diversity       0.78
enshr           0.78
tenets          0.76
Activations Density: 0.049%