Explanations
response
The neuron activates on occurrences of the words “response” or “reply.”
Negative Logits
Token       Logit
15          -0.08
lit         -0.07
75          -0.07
275         -0.07
14          -0.07
11          -0.07
50          -0.07
Meat        -0.07
beaten      -0.07
Hall        -0.07
Positive Logits
Token       Logit
response    0.13
Response    0.10
responsive  0.10
responses   0.10
response    0.10
-response   0.10
responsive  0.09
Responses   0.09
Response    0.09
респ        0.09
Activations Density 0.083%
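
As a rough illustration of what a density figure like this means, the sketch below assumes that density is the fraction of tokens on which the activation is above zero. That definition, the function name activation_density, and the toy data are all assumptions for illustration; none of them comes from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the neuron fires.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical toy data: the neuron fires on 1 token out of 1200.
toy = [0.0] * 1199 + [2.5]
density = activation_density(toy)
print(f"{density:.3%}")  # prints 0.083%
```

Under that assumption, a density of 0.083% corresponds to roughly one firing token in every 1,200.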