INDEX
Explanations
threatening language and accusations in a confrontational context
Neuron Alignment
Index  Value  % of L₁
674    +0.11  0.3%
604    +0.10  0.3%
946    +0.10  0.3%
Correlated Neurons
Index  P. Corr.  Cos Sim.
972    +0.11     0.04
946    +0.10     0.04
1478   +0.10     0.04
Negative Logits
?...          -1.24
fuf           -1.22
emphat        -1.19
reluct        -1.16
fta           -1.16
inev          -1.14
strick        -1.14
gend          -1.12
guarante      -1.11
fte           -1.11
Positive Logits
***!          0.60
avril         0.59
testify       0.56
YOUR          0.54
YOU           0.54
oredCriteria  0.53
your          0.53
please        0.52
your          0.52
you           0.52
Activations Density: 0.505%