Explanations
The neuron activates on the word “types,” signaling list or categorization cues in the text.
Negative Logits
of         -0.06
сю         -0.06
=b         -0.06
نامه       -0.06
_energy    -0.06
app        -0.06
งส         -0.06
-the       -0.06
їй         -0.06
tha        -0.06
Positive Logits
type       0.11
kinds      0.11
kind       0.11
types      0.10
sorts      0.10
tipo       0.08
Types      0.08
-types     0.08
Tipo       0.08
sort       0.07
Activations Density 0.036%
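
To make the density figure concrete, here is a minimal sketch of how such a number could be computed. It assumes that "density" means the fraction of sampled tokens on which the neuron's activation is above zero; the page does not state the definition, and the function name, the threshold parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: density is counted per token over a sampled corpus.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 tokens, 36 of which activate the neuron.
sample = [0.0] * 99_964 + [1.5] * 36
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.036% corresponds to roughly 36 active tokens per 100,000, which fits a neuron that fires on a narrow set of words such as "types".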