INDEX
Explanations
negative actions or behaviors related to attacks on reputation, characterized by words like "smear," "slander," "defamation," and "distortions."
terms associated with attacks on reputation and character
Negative Logits
jo           -0.78
iration      -0.74
Wond         -0.74
Ele          -0.74
autom        -0.73
hook         -0.72
Happiness    -0.72
HT           -0.72
aw           -0.69
Zen          -0.68
Positive Logits
smear            3.21
slander          1.77
libel            1.73
defamation       1.72
disinformation   1.62
misinformation   1.61
distort          1.60
wedge            1.55
distortion       1.53
distortions      1.51
Activations Density 0.051%
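
As a rough illustration of what a density figure like this could mean, the sketch below assumes that density is the fraction of sampled tokens on which the feature has a nonzero activation. That definition, the function name activation_density, and the sample data are all assumptions for illustration and are not taken from the dashboard itself.

```python
def activation_density(activations):
    """Return the fraction of tokens with a positive activation.

    Assumption: "density" means active tokens / total tokens sampled.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > 0)
    return active / len(activations)


# Hypothetical sample: 100,000 tokens, 51 of which activate the feature.
sample = [0.0] * 99_949 + [1.0] * 51
density = activation_density(sample)
print(f"{density:.3%}")
```

Under that assumption, a density of 0.051% corresponds to about 51 active tokens per 100,000 sampled.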