INDEX
Explanations
the presence of common nouns or specific subjects related to a topic
Head Attr Weights
0:0.08
1:0.09
2:0.06
3:0.09
4:0.07
5:0.09
6:0.09
7:0.08
8:0.07
9:0.08
10:0.08
11:0.07
Negative Logits
��          -2.20
icter       -2.16
aughs       -2.12
�           -2.01
�           -2.00
architect   -2.00
Symphony    -1.97
architects  -1.96
™           -1.92
Champion    -1.90
Positive Logits
"}          2.71
\-          2.23
seiz        2.22
Reviewed    2.10
umen        2.06
hold        2.06
"}],"       2.03
recharge    1.99
"},"        1.98
ources      1.98
Activations Density 0.000%