INDEX
Explanations
isolate phrases related to historical events and individuals, particularly focusing on deception or corruption
Neuron Alignment
Index   Value   % of L₁
604     +0.09   0.2%
198     +0.08   0.2%
1129    +0.08   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1129    +0.09      0.06
693     +0.08      0.04
649     +0.08      0.05
Negative Logits
inconce     -0.88
reluct      -0.81
unspeak     -0.80
snoopy      -0.79
disagre     -0.74
excru       -0.73
indescri    -0.71
horrend     -0.70
suspic      -0.68
sophistic   -0.68
Positive Logits
fasi           0.72
merely         0.69
rilass         0.66
pronti         0.66
interessanti   0.65
soggior        0.65
scelte         0.64
sabato         0.64
vanta          0.64
frasi          0.63
Activations Density 0.497%