INDEX
Explanations
negative or offensive language
expressions related to societal criticism and personal discontent
Negative Logits
displayText    -0.90
UMP            -0.74
raft           -0.64
Immediately    -0.62
announced      -0.62
Payments       -0.62
reopened       -0.61
IPM            -0.61
LET            -0.61
raltar         -0.60
Positive Logits
shitty         0.99
fucking        0.92
goddamn        0.90
shit           0.89
fuckin         0.85
sociop         0.85
fucked         0.83
patriarchy     0.81
kinda          0.81
misogyn        0.80
Activations Density: 1.178%