Explanations
warning signs of potential harm or violence, especially in the context of social isolation and manipulation
Neuron Alignment
Index    Value    % of L₁
468      +0.10    0.3%
1314     +0.09    0.3%
513      +0.07    0.2%
Correlated Neurons
Index    P. Corr.    Cos Sim.
468      +0.10       0.04
867      +0.09       0.05
836      +0.07       0.02
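The two columns above report different similarity measures, which is why their values need not agree. A minimal sketch of how such numbers could be computed, assuming both are taken over paired activation vectors for the feature and a neuron (the dashboard does not state this, and the function names and sample activations below are hypothetical):

```python
import math

def pearson_corr(xs, ys):
    """Pearson correlation: covariance of mean-centered values, scaled to [-1, 1]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def cosine_sim(xs, ys):
    """Cosine similarity: dot product of the raw (uncentered) vectors over their norms."""
    dot = sum(x * y for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum(x * x for x in xs))
    norm_y = math.sqrt(sum(y * y for y in ys))
    return dot / (norm_x * norm_y)

# Hypothetical activations, not taken from the dashboard:
# the neuron fires exactly like the feature, plus a constant offset of 1.
feature_acts = [0.0, 0.0, 1.0, 0.0, 2.0]
neuron_acts = [1.0, 1.0, 2.0, 1.0, 3.0]

p_corr = pearson_corr(feature_acts, neuron_acts)
cos_sim = cosine_sim(feature_acts, neuron_acts)
```

Pearson correlation subtracts each vector's mean first, so a constant offset leaves it unchanged, while cosine similarity works on the raw vectors and is lowered by the same offset.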
Negative Logits
libéral           -0.55
erenc             -0.54
ladiator          -0.54
souverain         -0.52
décret            -0.51
initComponents    -0.51
Portail           -0.50
dimenti           -0.49
Tembelea          -0.49
klere             -0.48
Positive Logits
suspicious    0.87
suspicion     0.72
suspect       0.72
suspicions    0.68
suspected     0.64
sospe         0.63
arouse        0.62
unusual       0.61
anomalous     0.58
alerted       0.57
Activations Density 0.672%