INDEX
Explanations
The neuron fires on alignment-related keywords (e.g. "aligned", "alignment") in code.
Negative Logits
检测  -0.07
_tel  -0.07
,new  -0.07
tcb  -0.06
Null  -0.06
(blank)  -0.06
Happiness  -0.06
interceptor  -0.06
peace  -0.06
폴  -0.06
Positive Logits
із  0.06
یمی  0.06
grâce  0.06
Adam  0.06
پر  0.06
柏  0.06
أبي  0.06
illions  0.06
抗  0.06
ابت  0.06
Activations Density: 0.001%