INDEX
Explanations
mentions of negative incidents in society, such as harassment, violence, and tragedies
references to violence and conflict
Negative Logits
Firstly     -0.84
\)          -0.75
Firstly     -0.73
Appearance  -0.70
Very        -0.69
Primary     -0.69
,''         -0.69
Material    -0.67
.}          -0.67
operation   -0.67
Positive Logits
tsun        0.78
cannibal    0.76
poisoned    0.75
coughing    0.73
inexpl      0.72
Kardashian  0.71
disgr       0.69
botched     0.69
grizz       0.68
assass      0.68
Activations Density 1.039%