INDEX
Explanations
cause and effect
The neuron fires on phrases that describe problems, risks, or negative outcomes, i.e. mentions of associated impacts, consequences, or severity in a research-style context.
Negative Logits
bekan             -0.07
bew               -0.07
Interior          -0.06
benchmark         -0.06
/documentation    -0.06
:x                -0.06
Janet             -0.06
Required          -0.06
кв                -0.06
ropical           -0.06
Positive Logits
...");↵↵          0.07
ridor             0.07
ुए                0.07
Increased         0.07
[])↵↵             0.07
();↵              0.06
PING              0.06
Friend            0.06
ediyor            0.06
alıyor            0.06
Activations Density 0.080%
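
The density figure above can be read as the share of token positions on which the neuron is active. The sketch below is one way to compute such a figure, and it rests on assumptions: the dashboard does not state its exact definition, so "active" is taken here to mean an activation strictly above a threshold of 0. The function name `activation_density` and the sample data are hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of positions with a non-zero
    (above-threshold) activation. The dashboard may define it differently.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 token positions, 8 of which are active.
sample = [0.0] * 9992 + [0.3, 1.2, 0.7, 0.1, 2.4, 0.9, 0.5, 1.1]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.080%
```

A density this low would mean the neuron is silent on the vast majority of tokens, which fits a neuron tied to a narrow kind of phrasing.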