Explanations
The neuron flags words and phrases marking controversies or scandals (e.g. “offensive,” “resurfaced,” “controversy,” “#MeToo”).
Negative Logits
ани         -0.07
deliver     -0.07
.mu         -0.06
IBC         -0.06
IMER        -0.06
ESS         -0.06
енд         -0.06
EE          -0.06
spanking    -0.06
-legged     -0.06
Positive Logits
Xml         0.07
,url        0.07
edited      0.06
topic       0.06
.aut        0.06
_href       0.06
共同        0.06
skin        0.06
ria         0.06
dera        0.06
Activations Density: 0.008%
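
The page does not say how the density figure is computed. One common convention, assumed here, is the fraction of token positions on which the neuron's activation is above zero. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and not taken from the source.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the neuron fires.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Synthetic data: the neuron fires on 1 of 12,500 token positions.
sample = [0.0] * 12_499 + [3.2]
print(f"{activation_density(sample):.3%}")  # prints 0.008%
```

Under this reading, a density of 0.008% means the neuron is active on roughly 8 of every 100,000 tokens, which would fit a feature that responds to a narrow set of words.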