Explanations
phrases containing derogatory language and offensive remarks
Neuron Alignment
Index   Value   % of L₁
198     +0.12   0.3%
538     +0.11   0.3%
1601    +0.10   0.3%
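The page does not define "% of L₁". One plausible reading, offered here only as an assumption, is the share of the feature direction's L₁ norm carried by a single neuron. The sketch below illustrates that reading on a toy vector; `l1_share` and `toy_direction` are hypothetical names and values, not taken from the dashboard.

```python
def l1_share(direction, top_k=3):
    """Return (index, value, share of L1 norm) for the top_k largest-magnitude entries.

    Assumption: "% of L1" means |v_i| / sum_j |v_j| for the feature direction v.
    """
    l1 = sum(abs(v) for v in direction)
    if l1 == 0:
        return []
    # Rank neuron indices by the magnitude of their entry in the direction.
    ranked = sorted(range(len(direction)),
                    key=lambda i: abs(direction[i]),
                    reverse=True)
    return [(i, direction[i], abs(direction[i]) / l1) for i in ranked[:top_k]]


# Toy direction (hypothetical values chosen so the L1 norm is exactly 1.0).
toy_direction = [0.5, -0.25, 0.125, 0.125]
rows = l1_share(toy_direction, top_k=2)
for index, value, share in rows:
    print(f"{index}\t{value:+.2f}\t{share:.1%}")
```

Under this reading, a share of 0.3% for each of the top neurons would mean that the feature is spread thinly across many neurons rather than aligned with any single one.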
Correlated Neurons
Index   P. Corr.   Cos Sim.
1601    +0.12      0.06
538     +0.11      0.05
1525    +0.10      0.04
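The two columns above measure different things. Assuming "P. Corr." is the Pearson correlation between the feature's activations and a neuron's activations over the same tokens, and "Cos Sim." is the cosine similarity between two direction vectors, a minimal pure-Python sketch of both quantities might look like this. The function names and toy inputs are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length activation series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # A constant series has no defined correlation; report 0.0 in that case.
    return cov / (std_x * std_y) if std_x and std_y else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy inputs (hypothetical): perfectly correlated series, orthogonal vectors.
corr = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
sim = cosine([1.0, 0.0], [0.0, 1.0])
```

Pearson correlation subtracts the mean before comparing, while cosine similarity does not, so the two can differ noticeably for the same pair.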
Negative Logits
vogli      -0.74
rispond    -0.72
dimenti    -0.69
desideri   -0.67
trovo      -0.67
trovi      -0.64
credere    -0.64
auguri     -0.63
vedi       -0.63
voleva     -0.63
Positive Logits
insults      0.67
insult       0.60
insulting    0.59
remarks      0.58
hurled       0.56
remark       0.53
verbally     0.53
derogatory   0.51
uttered      0.51
comments     0.50
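Logit lists like the two above are commonly produced by projecting a feature direction onto each token's output embedding and sorting the resulting scores. Whether this page computes them exactly that way is an assumption. The sketch below uses a hypothetical two-dimensional direction and a four-token toy vocabulary with made-up weights, purely to show the ranking step.

```python
def top_logit_tokens(direction, unembed_rows, vocab, k=3):
    """Return (most negative, most positive) token/score pairs.

    Assumption: a token's score is the dot product of the feature
    direction with that token's unembedding row.
    """
    scores = [sum(d * w for d, w in zip(direction, row)) for row in unembed_rows]
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    negative = ranked[:k]          # lowest scores first
    positive = ranked[::-1][:k]    # highest scores first
    return negative, positive


# Toy data (hypothetical weights, not values from the dashboard).
vocab = ["insults", "remarks", "vogli", "trovo"]
unembed_rows = [[0.5, 0.0], [0.25, 1.0], [-0.75, 0.0], [-0.5, 2.0]]
direction = [1.0, 0.0]

negative, positive = top_logit_tokens(direction, unembed_rows, vocab, k=2)
```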
Activations Density 0.435%
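Activation density is usually read as the fraction of sampled tokens on which the feature fires at all. Assuming that definition and a firing threshold of zero (neither is stated on the page), it can be sketched as follows; `activation_density` and the toy activations are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds the threshold."""
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Toy sample (hypothetical): the feature fires on 5 of 1000 tokens.
toy_activations = [0.0] * 995 + [1.2, 0.4, 3.1, 0.9, 2.2]
density = activation_density(toy_activations)
print(f"{density:.3%}")
```

A density well under 1%, such as the 0.435% reported above, indicates a sparse feature that is silent on the vast majority of tokens.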