Explanations
This neuron mainly detects phrases related to negative consequences or issues.
Words related to problems or challenges.
Negative Logits
classy       -0.77
excel        -0.76
gifted       -0.74
cultured     -0.73
sublime      -0.71
proudly      -0.71
supreme      -0.70
orally       -0.70
eleg         -0.70
fictional    -0.70
Positive Logits
ruption       1.25
urrence       1.16
activation    1.15
issions       1.12
aution        1.12
downs         1.12
rification    1.09
illation      1.07
amping        1.05
gradation     1.04
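The two logit lists above show the tokens this neuron pushes down and up the most. The page does not say how the lists are assembled, so the following is only a minimal sketch under an assumption: that each token has a single scalar logit effect and that the lists are the k lowest and k highest entries. The dictionary `logit_effects` and the helper `top_logits` are hypothetical names; the numbers are copied from the lists above.

```python
# Hypothetical per-token logit effects; values copied from the lists above.
logit_effects = {
    "classy": -0.77,
    "excel": -0.76,
    "gifted": -0.74,
    "ruption": 1.25,
    "urrence": 1.16,
    "activation": 1.15,
}


def top_logits(effects, k):
    """Return (k most negative, k most positive) as lists of (token, value)."""
    # Sort ascending by value, so the most negative effects come first.
    ranked = sorted(effects.items(), key=lambda kv: kv[1])
    negative = ranked[:k]
    # Reverse the ranking so the most positive effects come first.
    positive = ranked[::-1][:k]
    return negative, positive


negative, positive = top_logits(logit_effects, k=2)
```

With `k=2`, `negative` holds the entries for "classy" and "excel", and `positive` holds those for "ruption" and "urrence".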
Activations Density: 0.395%
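The page does not define activation density. A common reading, assumed here, is the fraction of sampled tokens on which the neuron's activation is above zero, so 0.395% would mean roughly 4 active tokens per 1,000. The sketch below illustrates that reading only; `activation_density`, its `threshold` parameter, and the sample values are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold`.

    Assumes "density" means the share of active tokens; the page
    itself does not state the definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Made-up sample: 2 active tokens out of 8.
sample = [0.0, 0.0, 1.3, 0.0, 0.0, 0.0, 0.4, 0.0]
density = activation_density(sample)
print(f"{density:.3%}")
```

A density this low is consistent with a sparse neuron that stays silent on most tokens and fires only in a narrow set of contexts.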