Explanations
The neuron fires on words denoting close familial roles (e.g. “mother,” “daughter”).
Negative Logits
ARE  -0.07
_head  -0.07
graduate  -0.06
scor  -0.06
Johnson  -0.06
Sandwich  -0.06
الدول  -0.06
hè  -0.06
_cum  -0.06
apples  -0.06
Positive Logits
Kunst  0.07
Familie  0.06
얼  0.06
%(  0.06
оказ  0.06
ा  0.06
eyi  0.06
KD  0.06
.onError  0.06
-Pro  0.06
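Logit lists like the two above are usually read as the tokens whose logits the neuron most raises (positive) or lowers (negative). The following is a minimal sketch of how such a ranking could be produced, assuming each score is the neuron's output weight vector projected through the unembedding matrix; this page does not confirm that convention. The functions `logit_effects` and `top_and_bottom`, the token names, and all weights are hypothetical toy values, not data from this neuron.

```python
# Hypothetical sketch: rank vocabulary tokens by a neuron's direct effect on logits.
# Assumption: effect[t] = sum over d of w_out[d] * W_U[d][t], i.e. the neuron's
# output weights projected through the unembedding matrix. Toy numbers only.

def logit_effects(w_out, W_U):
    """Return one score per vocabulary token: the dot product of w_out with column t of W_U."""
    d_model = len(w_out)
    vocab_size = len(W_U[0])
    return [sum(w_out[d] * W_U[d][t] for d in range(d_model)) for t in range(vocab_size)]


def top_and_bottom(effects, tokens, k):
    """Return the k most positive and the k most negative (token, score) pairs."""
    ranked = sorted(zip(tokens, effects), key=lambda pair: pair[1])  # ascending by score
    negative = ranked[:k]          # lowest scores first
    positive = ranked[::-1][:k]    # highest scores first
    return positive, negative


# Hypothetical toy model: d_model = 2, vocabulary of 4 tokens.
tokens = ["tok_a", "tok_b", "tok_c", "tok_d"]
w_out = [0.5, -0.2]
W_U = [
    [0.10, -0.10, 0.12, -0.08],
    [-0.05, 0.10, -0.04, 0.06],
]

positive, negative = top_and_bottom(logit_effects(w_out, W_U), tokens, k=2)
for name, score in positive:
    print(f"{name}  {score:.3f}")
for name, score in negative:
    print(f"{name}  {score:.3f}")
```

In a real model the same projection would run over the full vocabulary with a tensor library; the pure-Python loops here only keep the example self-contained.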
Activation Density 0.013%
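To make the density figure concrete: 0.013% is 1.3 × 10^-4, or about 13 active positions per 100,000 tokens. The sketch below assumes "density" means the percentage of token positions where the neuron's activation exceeds a threshold (here 0). The page does not state the exact rule, so `activation_density` and its `threshold` parameter are hypothetical.

```python
# Hypothetical sketch: activation density as the percentage of token positions
# whose activation exceeds a threshold (assumed to be 0 here).

def activation_density(activations, threshold=0.0):
    """Return the percentage of positions whose activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Toy data: 13 active positions out of 100,000 tokens.
acts = [0.0] * 100_000
for i in range(13):
    acts[i * 7000] = 1.0  # distinct positions 0, 7000, ..., 84000

print(f"{activation_density(acts):.3f}%")  # prints 0.013%
```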