Explanations
instability
The neuron flags words expressing instability or unpredictability (e.g. “chaotic,” “instability,” “unstable”).
Negative Logits
doctoral      -0.07
word          -0.07
gor           -0.07
pow           -0.07
医疗          -0.06
-highlight    -0.06
parole        -0.06
Howe          -0.06
прос          -0.06
Numeric       -0.06
Positive Logits
instability   0.11
unstable      0.10
destabil      0.09
št            0.08
ACCESS        0.08
insecurity    0.07
inst          0.07
unreliable    0.07
BAL           0.07
annon         0.07
Activations Density 0.007%