INDEX
Explanations
instances related to social media posts and public controversies
Negative Logits
.LayoutStyle  -0.15
robber        -0.15
Ä©            -0.15
avigator      -0.14
akh           -0.14
rna           -0.14
éļĨ           -0.14
stabil        -0.14
Py            -0.14
destabil      -0.14
Positive Logits
/tos       0.15
uard       0.15
911        0.14
unchecked  0.14
vio        0.14
fried      0.14
/bower     0.13
ym         0.13
DK         0.12
/dist      0.12
Activations Density 0.166%