INDEX
Explanations
negative sentiments or criticism towards others
expressions of disdain or critique towards individuals or groups
Negative Logits
emale          -0.83
ieth           -0.81
cially         -0.74
winner         -0.69
ahon           -0.66
iverse         -0.66
urally         -0.65
detrim         -0.64
Impact         -0.64
rimination     -0.63
Positive Logits
concoct         0.99
indul           0.94
instinctively   0.93
obsessed        0.89
indulge         0.89
resorted        0.86
impuls          0.86
urge            0.86
craving         0.84
wandered        0.84
Activations Density: 0.538%