Explanations
expressions of frustration and criticism towards political figures or situations
Profanity after articles or pronouns
swear words and insults
Negative Logits
Token         Logit
"),           -0.61
]='\          -0.59
,))           -0.56
_             -0.55
Искәрмәләр    -0.55
]-'           -0.54
tph           -0.52
!="")         -0.51
uxxxx         -0.50
*/}           -0.50
Positive Logits
Token         Logit
fuck          2.20
shit          2.09
fuck          1.98
fucking       1.95
damn          1.91
Fuck          1.91
damned        1.88
Fuck          1.84
FUCK          1.82
fucked        1.79
Activations Density 0.330%