Explanations
code/programming
This neuron activates on the “y” string, specifically on the word “concerning” and on the gerund or verb tokens that follow it and describe harmful or malicious actions.
Negative Logits
FIT         -0.06
_filt       -0.06
027         -0.06
isEnabled   -0.06
prt         -0.06
唯一        -0.06
숨          -0.06
passes      -0.06
senses      -0.06
ceipt       -0.06
Positive Logits
�               0.07
978             0.06
simultaneously  0.06
адміністра      0.06
bunların        0.06
Residential     0.06
Alg             0.06
########.       0.06
stirring        0.06
ського          0.06
Activations Density 0.006%