INDEX
Explanations
Comparisons of how different individuals treat others, focusing on terms like "comrades" and "servants."
Neuron Alignment
Index   Value   % of L₁
198     +0.08   0.2%
1984    +0.07   0.2%
394     +0.07   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1986    +0.08      0.03
1129    +0.07      0.05
1261    +0.07      0.02
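
The two similarity columns above can be illustrated with a minimal sketch. It assumes that "P. Corr." is a Pearson correlation between two activation series and that "Cos Sim." is a cosine similarity between two vectors; the dashboard does not state this, and the function names and sample inputs below are hypothetical.

```python
import math

def pearson_corr(x, y):
    """Pearson correlation between two equal-length sequences (assumed meaning of "P. Corr.")."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and standard deviations, all without the 1/n factor (it cancels).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors (assumed meaning of "Cos Sim.")."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical inputs, not taken from the dashboard.
feature_acts = [0.0, 0.5, 1.0, 0.0]
neuron_acts = [0.1, 0.6, 1.1, 0.1]
print(pearson_corr(feature_acts, neuron_acts))
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))
```

Pearson correlation subtracts each series' mean before comparing, while cosine similarity does not, so the two columns can differ noticeably for the same pair, as the rows above show.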
Negative Logits
Minang        -0.66
disagre       -0.66
tolerably     -0.62
nobly         -0.62
vainly        -0.61
impra         -0.61
Putih         -0.61
profuse       -0.61
imperfectly   -0.61
unspeak       -0.60
Positive Logits
prostitu       0.67
divertimento   0.66
tutt           0.64
religione      0.63
teolog         0.62
meras          0.59
palio          0.59
rilass         0.58
fatte          0.58
parteci        0.58
Activations Density: 0.374%