Explanations
The neuron activates on words that signal formal rulings or approvals (e.g. “approved,” “accepted,” “rejected,” “decided”).
Negative Logits
Token       Logit
guna        -0.06
_scale      -0.06
Erdoğan     -0.06
cánh        -0.06
creds       -0.06
pancakes    -0.06
Drinks      -0.06
Jiří        -0.06
худож       -0.06
rodin       -0.06
Positive Logits
Token            Logit
convention       0.07
GENERAL          0.07
》(              0.06
داخلی            0.06
department       0.06
labeled          0.06
.StylePriority   0.06
professional     0.06
usable           0.06
<IEnumerable     0.06
Activations Density: 0.025%
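As a rough illustration of how a density figure like the one above could be computed, the sketch below assumes that activation density means the fraction of token positions on which the neuron has a nonzero activation. That definition, the function name `activation_density`, and the sample data are all assumptions made for illustration; none of them come from this page.

```python
from typing import Sequence


def activation_density(activations: Sequence[float]) -> float:
    """Return the fraction of token positions with a nonzero activation.

    Assumed definition: density = (nonzero positions) / (all positions).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    nonzero = sum(1 for a in activations if a != 0.0)
    return nonzero / len(activations)


# Hypothetical sample: one nonzero activation among 4,000 token positions.
sample = [0.0] * 3999 + [1.7]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.025%
```

Under this assumed definition, a density of 0.025% corresponds to the neuron firing on roughly 1 in every 4,000 tokens.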