INDEX
Explanations
phrases related to controversial statements or actions made by public figures
instances of names and titles associated with accusations or negative labels
Negative Logits
icion      -0.80
enture     -0.69
itely      -0.68
regate     -0.68
ovember    -0.63
pta        -0.62
few        -0.61
ptoms      -0.61
éĥ         -0.61
gomery     -0.61
Positive Logits
unacceptable     1.08
unreliable       0.99
irresponsible    0.99
unfit            0.98
unsu             0.97
obsolete         0.97
unworthy         0.95
unethical        0.95
illegitimate     0.92
"'               0.91
Activations Density 0.141%
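As a rough illustration of the density figure, the sketch below assumes that activation density means the fraction of tokens on which the feature fires (activation above zero). That definition is an assumption, as the page does not state it. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of tokens on which the feature fires.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: 10,000 tokens, 14 of which activate the feature.
sample = [0.0] * 9986 + [1.5] * 14
density = activation_density(sample)
print(f"{density:.3%}")
```

A density this low would mean the feature is sparse: it fires on only a handful of tokens per thousand.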