INDEX
Explanations
phrases related to questioning or reporting suspicious or unethical behavior
Neuron Alignment
Index   Value   % of L₁
1967    +0.34   1.3%
1842    +0.32   1.2%
468     +0.16   0.6%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1967    +0.34      0.11
16      +0.32      0.12
1842    +0.16      0.09
Negative Logits
<bos>            -1.41
GEBURTSDATUM     -0.69
Personensuche    -0.68
дописавши        -0.62
مشين             -0.61
autorytatywna    -0.58
rungsseite       -0.57
hoeddwyd         -0.56
expandindo       -0.54
NameInMap        -0.54
Positive Logits
fordable    1.03
Juf         0.99
Haci        0.97
loto        0.96
Darío       0.94
lele        0.93
Áng         0.93
santiago    0.93
pican       0.91
hcm         0.90
Activations Density: 1.803%