Explanations
The neuron activates on the model’s self-referential, first-person statements (e.g., “I’m,” “my,” “I’m your…”).
Negative Logits
.geom       -0.07
chai        -0.06
GMO         -0.06
atıcı       -0.06
strcpy      -0.06
conclusive  -0.06
Dj          -0.06
_minimum    -0.06
Projectile  -0.06
Delayed     -0.06
Positive Logits
feder        0.08
weg          0.07
SenderId     0.06
strerror     0.06
тот          0.06
clinically   0.06
.distance    0.06
redesigned   0.06
greater      0.06
Кам          0.06
Activation Density: 0.011%