Explanations
Human behavior
This neuron activates on the prompt’s “behavior” indicator, i.e. the token introducing the specific behavior to evaluate.
Negative Logits
Seconds
-0.07
↵
-0.07
('.')↵
-0.07
care
-0.06
bilingual
-0.06
puty
-0.06
"}";↵
-0.06
.ud
-0.06
uard
-0.06
On
-0.06
Positive Logits
avantaj
0.07
abilidade
0.07
("/{
0.06
/theme
0.06
деятельности
0.06
libr
0.06
نسخ
0.06
โรงแรม
0.06
Hệ
0.06
_mail
0.06
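The page does not say how the logit lists above are produced. A common approach, assumed here, is to project the neuron's output direction through the model's unembedding matrix and rank vocabulary tokens by the resulting score. The sketch below illustrates that idea with toy numbers; `top_logit_effects`, `direction`, and `unembed` are hypothetical names, and none of the values come from the actual model.

```python
def top_logit_effects(direction, unembed, vocab, k=3):
    """Return the k most negative and k most positive (token, score) pairs.

    direction: the neuron's output direction (hypothetical, list of floats).
    unembed:   one row of weights per vocabulary token, same width as direction.
    """
    # Score each token by the dot product of its unembedding row with the direction.
    scores = [
        (tok, sum(d * w for d, w in zip(direction, row)))
        for tok, row in zip(vocab, unembed)
    ]
    ranked = sorted(scores, key=lambda pair: pair[1])
    return ranked[:k], ranked[::-1][:k]


# Toy values chosen only for illustration; not taken from the model.
vocab = ["care", "On", "libr", "_mail"]
unembed = [[-1.0, 0.0], [-0.5, 0.5], [1.0, 0.0], [0.5, 0.5]]
direction = [0.06, 0.0]

negative, positive = top_logit_effects(direction, unembed, vocab, k=2)
print("negative:", negative)
print("positive:", positive)
```

Scores this small (around ±0.06 to ±0.07 in the lists above) suggest the neuron has only a weak direct effect on any single output token, which is consistent with the mixed, unrelated tokens on both lists.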
Activations Density 0.002%
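The density figure is not defined on the page. One plausible reading, assumed here, is the fraction of sampled token positions on which the neuron's activation is above zero. A minimal sketch under that assumption, using made-up activations and a hypothetical `activation_density` helper:

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds threshold."""
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data: 2 active positions out of 100,000 sampled tokens.
acts = [0.0] * 99_998 + [1.3, 0.7]
density = activation_density(acts)
print(f"{density:.3%}")
```

Under this reading, a density of 0.002% means the neuron fires on roughly 2 of every 100,000 tokens, which fits an explanation tied to one specific token in the prompt.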