Explanations
The neuron primarily detects occurrences of the word “problem.”
Negative Logits
engr          -0.07
Kir           -0.06
ent           -0.06
ecs           -0.06
contingent    -0.06
Evans         -0.06
Express       -0.06
courtesy      -0.06
.ke           -0.06
drives        -0.06

Positive Logits
problem        0.17
problems       0.15
Problem        0.14
problem        0.12
Problems       0.12
Problem        0.11
problems       0.10
mma            0.09
迷             0.08
troub          0.08
Activations Density 0.049%
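
For readers unfamiliar with the density figure, the sketch below shows one way such a number could be computed. It assumes density means the fraction of tokens on which the neuron's activation exceeds a threshold; that definition, the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold`.

    Assumed definition for illustration only; the dashboard's
    actual computation is not stated on this page.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical activations for 10,000 tokens: 5 fire, the rest are silent.
sample = [0.0] * 9995 + [1.2, 0.4, 2.1, 0.9, 0.3]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density well below 1% means the neuron is silent on almost all tokens, which fits a feature that responds mainly to a single word and its variants.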