Explanations
This neuron activates on mentions of adults (i.e. the token “adult”/“adults”).
Negative Logits
아이        -0.07
mites       -0.07
Stop        -0.06
addafi      -0.06
IDS         -0.06
tup         -0.06
porter      -0.06
Casa        -0.06
arter       -0.06
Wis         -0.06
Positive Logits
disclosing  0.07
戏          0.07
проп        0.07
همیشه       0.06
.script     0.06
setq        0.06
のが        0.06
津          0.06
theme       0.06
��제        0.06
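Logit lists like the ones above are commonly produced by projecting a neuron's output weights through the model's unembedding matrix and ranking tokens by the resulting score. The sketch below illustrates that idea on toy data. It is an assumption about how such numbers might be derived, not a description of how this page computed them; `top_logit_effects`, `neuron_out`, `unembed`, and `vocab` are hypothetical names.

```python
def top_logit_effects(neuron_out, unembed, vocab, k=2):
    """Return the k most negative and k most positive (token, effect) pairs.

    neuron_out: the neuron's output weight vector (hypothetical toy values).
    unembed:    one unembedding vector per vocabulary token.
    The effect of the neuron on a token is taken to be their dot product.
    """
    effects = []
    for token, column in zip(vocab, unembed):
        effect = sum(w * u for w, u in zip(neuron_out, column))
        effects.append((token, effect))
    ranked = sorted(effects, key=lambda pair: pair[1])  # ascending by effect
    negative = ranked[:k]          # most negative first
    positive = ranked[::-1][:k]    # most positive first
    return negative, positive


# Toy example: a 2-dimensional residual stream and a 4-token vocabulary.
toy_vocab = ["a", "b", "c", "d"]
toy_unembed = [[0.1, 0.0], [-0.2, 0.1], [0.0, 0.3], [0.3, -0.2]]
toy_neuron_out = [1.0, -0.5]

negative, positive = top_logit_effects(toy_neuron_out, toy_unembed, toy_vocab)
print("negative:", [token for token, _ in negative])
print("positive:", [token for token, _ in positive])
```

All effects listed on this page are small (at most 0.07 in magnitude) and the tokens share no obvious theme, so under this reading the neuron's direct effect on output logits is weak.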
Activations Density 0.012%
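One common reading of activation density is the fraction of sampled tokens on which the neuron's activation exceeds zero. The sketch below computes it under that assumption; the exact definition and threshold used for the figure above are not stated on this page, and `activation_density` is a hypothetical helper.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation is strictly above the threshold.

    Assumes density means "share of sampled tokens on which the neuron fires";
    the threshold of 0.0 is an illustrative choice, not taken from the source.
    """
    if not activations:
        return 0.0
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


# Toy sample: the neuron fires on 1 of 10,000 tokens.
sample = [0.0] * 9999 + [1.3]
print(f"{activation_density(sample):.3%}")
```

Under this definition, a density of 0.012% would correspond to the neuron firing on roughly 12 of every 100,000 tokens.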