INDEX
Explanations
traits or behaviors that might be considered negative or off-putting in social situations
Neuron Alignment
Index   Value   % of L₁
876     +0.09   0.2%
271     +0.07   0.2%
378     +0.07   0.2%
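The "% of L₁" column is not defined on this page. A minimal sketch of one plausible reading, assuming it reports a single entry's absolute value as a share of the L₁ norm (sum of absolute values) of the feature's weight vector. The function name `l1_share` and the toy vector are hypothetical and are not taken from the dashboard.

```python
def l1_share(weights, index):
    """Return |weights[index]| as a fraction of the vector's L1 norm.

    Assumption: "% of L1" means this ratio; the dashboard does not say so.
    """
    l1_norm = sum(abs(w) for w in weights)
    if l1_norm == 0:
        raise ValueError("L1 norm is zero; share is undefined")
    return abs(weights[index]) / l1_norm


# Hypothetical toy vector, far shorter than a real feature direction.
toy_weights = [0.09, -0.07, 0.07, 0.27]
print(f"{l1_share(toy_weights, 0):.1%}")
```

Under this reading, a value of +0.09 at 0.2% would imply an L₁ norm of roughly 45 for the full vector, which is consistent with the two +0.07 rows also rounding to 0.2%.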
Correlated Neurons
Index   P. Corr.   Cos Sim.
378     +0.09      0.04
208     +0.07      0.06
438     +0.07      0.06
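For reference, a short sketch of how the two column statistics differ, assuming "P. Corr." denotes Pearson correlation and "Cos Sim." denotes cosine similarity. Which pair of vectors the dashboard compares is not stated here, so the inputs below are hypothetical toy data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation, computed as the cosine similarity of the
    mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_similarity(centered_a, centered_b)


# Hypothetical toy vectors: all-positive entries give a high cosine
# similarity even though the two series move in opposite directions.
u = [1.0, 2.0, 3.0]
v = [3.0, 2.0, 1.0]
print(cosine_similarity(u, v), pearson_correlation(u, v))
```

Because Pearson correlation subtracts the means first, the two numbers can disagree noticeably, so the columns need not track each other.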
Negative Logits
congrès     -0.68
Occidente   -0.67
Când        -0.67
Ră          -0.65
Ibidem      -0.65
villaggio   -0.64
Rumania     -0.62
dénon       -0.62
Alcalde     -0.62
Rois        -0.61
Positive Logits
bother        0.66
anything      0.66
anymore       0.63
necessarily   0.63
anything      0.59
<bos>         0.59
any           0.59
bothered      0.58
worry         0.56
affect        0.55
Activations Density 0.443%