INDEX
Explanations
negative adjectives and phrases related to criticism and bias
terminology related to falsehoods and negative attributes in reports or statements
Negative Logits
cade        -0.84
ynthesis    -0.83
lear        -0.80
Waves       -0.78
onds        -0.78
uese        -0.77
yles        -0.77
eatures     -0.77
ESE         -0.76
lights      -0.75
Positive Logits
disrespectful        1.43
unethical            1.43
immoral              1.41
irresponsible        1.41
wasteful             1.35
prejud               1.34
hypocritical         1.34
counterproductive    1.33
sexist               1.31
racist               1.30
Activations Density: 0.269%
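
As a rough illustration of how figures like the ones above could be produced, here is a minimal Python sketch. It rests on assumptions not stated in this page: that activation density means the fraction of sampled tokens on which the feature fires above a threshold, and that the logit tables are the highest and lowest entries of a per-token logit-effect mapping computed elsewhere. The function names, the threshold, and the toy data are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Fraction of tokens whose activation exceeds `threshold` (assumed definition)."""
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


def top_and_bottom_logits(
    logit_effects: Dict[str, float], k: int = 10
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most positive and k most negative (token, value) pairs."""
    ranked = sorted(logit_effects.items(), key=lambda kv: kv[1], reverse=True)
    positive = ranked[:k]
    # Re-sort the tail so the most negative token comes first.
    negative = sorted(ranked[-k:], key=lambda kv: kv[1])
    return positive, negative


# Toy data (hypothetical): 269 active tokens out of 100,000 sampled.
toy_activations = [0.0] * 99_731 + [0.8] * 269
density = activation_density(toy_activations)

# A few values copied from the tables above, used only as sample input.
toy_effects = {
    "disrespectful": 1.43,
    "unethical": 1.43,
    "racist": 1.30,
    "cade": -0.84,
    "ynthesis": -0.83,
    "lights": -0.75,
}
positive, negative = top_and_bottom_logits(toy_effects, k=2)

print(f"density: {density:.3%}")
print("positive:", positive)
print("negative:", negative)
```

The sketch says nothing about how the logit effects themselves are obtained; it only shows the ranking step and one plausible reading of the density statistic.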