INDEX
Explanations
mentions of human values
references to core values and principles
Negative Logits
Token      Logit
女         -0.72
igans      -0.69
geon       -0.69
jen        -0.68
sie        -0.67
ready      -0.67
nih        -0.66
\/\/       -0.66
DERR       -0.65
fac        -0.64
Positive Logits
Token      Logit
iblings    0.80
ideals     0.76
values     0.72
tolerance  0.71
principles 0.70
Values     0.69
cape       0.69
beliefs    0.67
embodied   0.67
Advocate   0.66
Activations Density: 0.027%