INDEX
Explanations
phrases related to admiration, respect, sympathy, message content, and judgment
Neuron Alignment
Index   Value   % of L₁
1013    +0.08   0.2%
588     +0.08   0.2%
1627    +0.07   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
284     +0.08      0.07
2030    +0.08      0.04
588     +0.07      0.06
Negative Logits
maksi     -0.96
territo   -0.95
uhr       -0.95
ivi       -0.94
embra     -0.93
accla     -0.91
dises     -0.89
saar      -0.88
sena      -0.88
endom     -0.87
Positive Logits
TextAppearance     0.52
hasPermission      0.51
nodig              0.51
ffilm              0.50
Personensuche      0.49
styleType          0.48
beginnetje         0.48
بيها               0.48
LayoutConstraint   0.46
woordig            0.46
Activations Density 0.336%