INDEX
Explanations
AI-generated content
The neuron activates on phrases describing a language model's ability to generate human-like text, or text indistinguishable from human writing.
Negative Logits
ADVISED    -0.06
mužů       -0.06
Wes        -0.06
Tus        -0.06
modules    -0.06
false      -0.06
ted        -0.06
mapa       -0.06
/Button    -0.06
dày        -0.06
Positive Logits
확실         0.07
giveaways  0.07
undeniable 0.06
े,         0.06
']).       0.06
OCUMENT    0.06
.SERVER    0.06
鮮          0.06
/admin     0.06
strtok     0.06
Activations Density 0.032%
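
To give a feel for what a density of 0.032% means, here is a minimal sketch. It assumes density is the fraction of sampled token positions on which the neuron's activation is above zero; the page does not state the definition. The function name `activation_density` and the sample data are hypothetical, chosen only to reproduce the figure above.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the neuron fires.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 token positions, 32 of which fire.
sample = [0.0] * 99_968 + [1.5] * 32
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.032%
```

Under this reading, the neuron fires on roughly 1 in every 3,125 tokens, which is consistent with a feature tied to a narrow topic.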