INDEX
Explanations
phrases indicating trust or belief
Neuron Alignment
Index   Value   % of L₁
50      +0.18   0.7%
1473    +0.10   0.4%
605     +0.09   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1473    +0.18      0.05
446     +0.10      0.04
1029    +0.09      0.04
Negative Logits
<bos>     -1.71
          -0.90
ⓧ         -0.86
<?        -0.84
/***      -0.79
<?        -0.79
Fuckin    -0.75
FTFY      -0.74
“…”       -0.72
/*        -0.71
Positive Logits
Minang    0.93
bandung   0.89
thuy      0.82
baya      0.79
marea     0.79
Désolé    0.77
jaya      0.76
ados      0.75
bayern    0.75
embodi    0.74
Activations Density: 0.258%