INDEX
Explanations
terms related to evaluations and assessments of potential outcomes
Negative Logits
ior           -0.18
vable         -0.17
max           -0.15
lem           -0.14
bable         -0.14
309           -0.14
LEM           -0.13
rior          -0.13
wu            -0.13
unthinkable   -0.13
Positive Logits
grounds       0.24
responsible   0.23
Responsible   0.18
helpful       0.18
instrumental  0.17
beneficial    0.17
grounds       0.17
determinant   0.17
sufficient    0.17
gets          0.16
Activations Density: 0.252%