Explanations
The neuron appears to respond mainly to words related to negation.
The word "not" and its variations in context.
Negative Logits
Token        Logit
stakes       -0.71
Circuit      -0.71
itor         -0.70
Spotlight    -0.69
Tycoon       -0.67
Pros         -0.67
Contrast     -0.65
Comparison   -0.65
Expansion    -0.65
Handbook     -0.64
Positive Logits
Token          Logit
icably         1.39
epad           1.20
icable         1.15
necessarily    1.10
hin            1.04
orious         0.95
ched           0.92
withstanding   0.89
yet            0.85
ifications     0.83
Activations Density: 0.171%
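
An activation density figure of this kind is usually read as the fraction of sampled tokens on which the neuron fires. The page does not state how the value was computed, so the sketch below is only one plausible reading: it assumes "fires" means an activation strictly above a threshold of zero. The function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and not taken from the source.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: a token counts as "active" when its activation is
    strictly greater than the threshold (zero by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 1000 tokens, only 2 of which activate the neuron.
toy_activations = [0.0] * 998 + [0.8, 2.3]
density = activation_density(toy_activations)
print(f"Activations Density: {density:.3%}")
```

Under this reading, a density of 0.171% would mean the neuron is active on roughly 17 of every 10,000 sampled tokens, i.e. it is sparse.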