INDEX
Explanations
mentions of political or controversial topics
Neuron Alignment
Index   Value   % of L₁
198     +0.10   0.3%
50      +0.10   0.3%
1763    +0.09   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1472    +0.10      0.04
16      +0.10      0.05
369     +0.09      0.04
Negative Logits
juges         -0.66
mistak        -0.61
vœux          -0.58
zijde         -0.58
modalités     -0.56
exemplaires   -0.53
prochaines    -0.53
évaluations   -0.50
thinkable     -0.50
relenting     -0.50
Positive Logits
palab      0.74
fordable   0.72
felipe     0.69
palio      0.66
ñora       0.66
doman      0.66
hcm        0.65
laci       0.65
romero     0.64
juf        0.64
Activations Density 0.240%