Explanations
discussions about values and the inconsistency in human behavior
Negative Logits
νό       -0.17
dech     -0.16
Biz      -0.16
itag     -0.15
ltk      -0.15
ibold    -0.15
ITTE     -0.14
archae   -0.14
apiro    -0.14
тра      -0.14
Positive Logits
Charity         0.17
GPI             0.16
util            0.16
interventions   0.15
Prison          0.15
elen            0.15
charity         0.15
EA              0.14
ocale           0.14
Slate           0.14
Activations Density 0.023%