INDEX
Explanations
This neuron activates on words and phrases that signal explanation, reasoning, or justification (e.g., logic, decision, reason, analysis).
Language expressing evaluation or judgment (opinionated or editorial statements).
Language that critiques or questions decisions and actions, especially terms concerning logic, mistakes, errors, and controversial choices.
Negative Logits
Token            Logit
Pool             -0.07
sex              -0.07
recio            -0.07
众               -0.06
composition      -0.06
Responsibility   -0.06
é                -0.06
.memory          -0.06
shop             -0.06
tubes            -0.06
Positive Logits
Token            Logit
徒歩             0.07
halted           0.07
øns              0.07
theoret          0.06
kvin             0.06
�                0.06
,"               0.06
Inspector        0.06
runoff           0.06
CONSTANT         0.06
Activations Density 0.084%
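
For readers unfamiliar with the density figure, the sketch below shows one plausible way such a number could be computed: the fraction of tokens on which the neuron's activation exceeds a threshold. This is an assumption about the metric, not something the page states. The function name `activation_density`, the zero threshold, and the sample data are all hypothetical and chosen only so the arithmetic lands on 0.084%.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumed definition: the page does not say how its density is computed.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 100,000 tokens, of which 84 activate the neuron.
sample = [0.0] * 99_916 + [1.5] * 84
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.084% would mean the neuron fires on roughly 84 of every 100,000 tokens, which fits a sparse feature that responds only to specific explanation or justification wording.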