Explanations
situations where a justification or excuse is given for certain actions
terms related to deceptive justifications or rationalizations
Negative Logits
Token      Logit
omer       -0.76
omers      -0.68
irc        -0.67
omb        -0.63
devices    -0.62
hani       -0.60
itter      -0.59
apsed      -0.59
eder       -0.59
ECT        -0.59
Positive Logits
Token      Logit
pretext    1.24
ãĥ¼ãĥĨãĤ£  0.87
milo       0.86
ual        0.85
excuse     0.83
guise      0.81
accuser    0.81
atis       0.79
ress       0.78
Tanz       0.78
Activations Density 0.016%