INDEX
Explanations
phrases related to negative or harmful actions or characteristics
terms related to negative or harmful actions and sentiments
Negative Logits
inet      -0.94
ais       -0.85
inances   -0.81
rozen     -0.80
ered      -0.79
orius     -0.79
liner     -0.78
alist     -0.78
lique     -0.77
arb       -0.75
Positive Logits
nasty      0.87
smear      0.82
terday     0.79
surprises  0.75
poisons    0.73
poison     0.72
soever     0.70
mud        0.69
hello      0.69
slander    0.66
Activations Density: 0.032%