INDEX
Explanations
phrases related to accusations or claims of wrongdoing
Neuron Alignment
Index  Value  % of L₁
50     +0.24  1.3%
1464   +0.10  0.5%
313    +0.09  0.5%
Correlated Neurons
Index  P. Corr.  Cos Sim.
1895   +0.24     0.02
1464   +0.10     0.02
663    +0.09     0.02
Negative Logits
Token         Logit
<bos>         -2.94
띄            -0.67
/***          -0.63
//<           -0.58
displayquote  -0.58
HasKey        -0.58
tw            -0.58
//{           -0.57
win           -0.57
              -0.57
Positive Logits
Token    Logit
Juf      1.76
Khart    1.67
Minang   1.66
thut     1.60
fta      1.50
bandung  1.50
jaya     1.48
accla    1.48
aen      1.48
increa   1.46
Activations Density 0.045%