INDEX
Explanations
phrases related to social or political issues, laws, and regulations
references to political ideologies and gender-related beliefs
Negative Logits
wcs          -0.50
ensional     -0.48
minist       -0.48
odcast       -0.47
QUEST        -0.46
:=           -0.46
aback        -0.45
ciplinary    -0.44
ourning      -0.43
ptoms        -0.43
Positive Logits
)."          0.92
").          0.90
).[          0.86
]."          0.83
%).          0.81
?).          0.81
)).          0.80
).           0.79
!).          0.78
.).          0.77
Activations Density: 4.054%