INDEX
Explanations
references to controversies or conflicts
expressions of concern or questions about societal issues
Negative Logits
TAG         -0.70
,''         -0.69
bole        -0.68
agree       -0.66
\)          -0.64
®           -0.64
tion        -0.63
+)          -0.63
Spoiler     -0.63
================================  -0.62
Positive Logits
sushi       0.75
Okin        0.73
phalt       0.73
dentist     0.73
Golf        0.70
Chilean     0.66
eyebrow     0.66
Hamb        0.66
Tos         0.65
eyel        0.64
Activations Density: 1.410%