Explanations
phrases where someone is being labeled or called a negative term
instances of negative labels and accusations directed at individuals
Negative Logits
intent      -0.71
Instruct    -0.71
imates      -0.67
adjoining   -0.65
ANS         -0.63
aday        -0.63
appl        -0.63
grounds     -0.62
Edit        -0.62
MN          -0.62
Positive Logits
hoax        1.00
"           0.91
liar        0.88
"'          0.82
nuisance    0.80
miracle     0.78
versive     0.77
typo        0.77
fraud       0.76
'           0.75
Activations Density 0.141%