INDEX
Explanations
words that convey strong positive emotions or highlight exceptional qualities
Neuron Alignment
Index    Value    % of L₁
118      +0.14    0.8%
56       +0.11    0.6%
148      +0.10    0.6%
Correlated Neurons
Index    P. Corr.    Cos Sim.
118      +0.14       0.07
148      +0.11       0.02
119      +0.10       0.03
Negative Logits
agra     -1.65
ainer    -1.64
ves      -1.57
ters     -1.57
heses    -1.56
tico     -1.55
tern     -1.53
Åij      -1.53
uelle    -1.53
ár       -1.53
Positive Logits
µ          1.90
¨          1.82
ľĵ         1.78
®          1.73
amounts    1.72
Ī          1.72
Ł          1.70
č↵         1.68
↵          1.68
↵          1.68
Activations Density: 0.481%