INDEX
Explanations
phrases related to controversy or conflict, particularly around online harassment
Neuron Alignment
Index   Value   % of L₁
2034    +0.23   0.7%
382     +0.15   0.5%
1535    +0.15   0.5%
Correlated Neurons
Index   P. Corr.   Cos Sim.
382     +0.23      0.10
1535    +0.15      0.07
1200    +0.15      0.07
Negative Logits
hairc       -1.37
fuf         -1.36
scrat       -1.30
increa      -1.28
sappi       -1.27
guarante    -1.27
chrysler    -1.25
emphat      -1.24
unve        -1.24
maneu       -1.23
Positive Logits
Instead         0.85
They            0.84
Specifically    0.75
***!            0.74
Firstly         0.71
They            0.70
He              0.70
After           0.69
])):            0.69
Instead         0.69
Activations Density: 0.596%