Explanations
instances where a decision is being discussed
references to decisions made in various contexts
Negative Logits
icas      -0.72
vae       -0.71
ingers    -0.68
icum      -0.66
havoc     -0.66
tert      -0.65
ubric     -0.64
aband     -0.64
Offense   -0.63
uction    -0.63
Positive Logits
makers    1.04
maker     0.87
makers    0.81
maker     0.81
making    0.79
jar       0.79
ACTIONS   0.75
taken     0.72
decision  0.72
to        0.72
Activations Density: 0.048%