INDEX
Explanations
terms related to debates, viewpoints, and claims
Neuron Alignment
Index  Value  % of L₁
198    +0.16  0.5%
872    +0.13  0.4%
791    +0.09  0.3%
Correlated Neurons
Index  P. Corr.  Cos Sim.
198    +0.16     0.07
791    +0.13     0.05
963    +0.09     0.05
Negative Logits
murano      -0.94
tupperware  -0.93
affor       -0.92
indestru    -0.91
oreo        -0.91
cushi       -0.90
snoopy      -0.90
strick      -0.89
nutella     -0.88
eiffel      -0.86
Positive Logits
Argumento    0.68
claims       0.64
argument     0.63
often        0.62
arguments    0.60
argument     0.60
claimed      0.60
claim        0.60
discussions  0.60
injus        0.59
Activations Density 0.660%