INDEX
Explanations
phrases or words related to justifying actions or decisions
terms related to justifying actions or decisions
Negative Logits
OGR      -0.70
semble   -0.69
ngth     -0.68
chn      -0.67
INFO     -0.66
Sym      -0.66
ocry     -0.65
clue     -0.64
nurs     -0.63
ovych    -0.63
Positive Logits
inaction      1.02
why           0.97
spending      0.92
cance         0.86
banning       0.86
abandoning    0.83
justifying    0.82
sacrificing   0.81
postp         0.80
imposing      0.80
Activations Density 0.046%