Explanations
phrases or words indicating a false justification or reason for a particular action or belief
terms related to justifications or excuses for actions
Negative Logits
Token      Logit
elong      -0.74
Life       -0.70
itivity    -0.70
Die        -0.69
evolve     -0.68
igr        -0.63
AMI        -0.63
average    -0.62
Surv       -0.62
life       -0.62
Positive Logits
Token        Logit
pretext      3.84
guise        2.09
spurious     1.38
bogus        1.36
provocation  1.29
disguise     1.24
phony        1.24
dubious      1.22
pretended    1.18
euphem       1.18
Activations Density 0.027%