INDEX
Explanations
adjectives and phrases related to negative or dangerous situations
Neuron Alignment
Index   Value   % of L₁
674     +0.21   0.7%
1967    +0.14   0.5%
1870    +0.11   0.4%
Correlated Neurons
Index   P. Corr.   Cos Sim.
198     +0.21      0.07
74      +0.14      0.04
678     +0.11      0.05
Negative Logits
increa     -3.57
disagre    -3.54
reluct     -3.54
impra      -3.48
affor      -3.46
unspeak    -3.33
inev       -3.32
depic      -3.26
gaily      -3.25
strick     -3.23
Positive Logits
<bos>            1.63
items            0.86
activities       0.85
settings         0.84
issues           0.84
situations       0.83
projects         0.83
options          0.82
elements         0.82
circumstances    0.82
Activations Density: 0.472%