INDEX
Explanations
phrases indicating communication or information exchange
Head Attr Weights
0:0.09
1:0.02
2:0.06
3:0.05
4:0.04
5:0.04
6:0.26
7:0.04
8:0.06
9:0.23
10:0.03
11:0.03
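As a minimal sketch of how these weights might be read, the snippet below ranks the heads by weight and reports how much of the listed total the top two account for. It assumes each entry is a head index paired with its attribution weight; the helper name `top_heads` is hypothetical and not part of any dashboard API.

```python
# Head attribution weights copied from the listing above (head index -> weight).
head_weights = {0: 0.09, 1: 0.02, 2: 0.06, 3: 0.05, 4: 0.04, 5: 0.04,
                6: 0.26, 7: 0.04, 8: 0.06, 9: 0.23, 10: 0.03, 11: 0.03}


def top_heads(weights, k=2):
    """Return the k (head, weight) pairs with the largest weight (hypothetical helper)."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


top = top_heads(head_weights)
# Share of the listed total carried by the top-k heads.
share = sum(w for _, w in top) / sum(head_weights.values())
print(top, round(share, 2))
```

Under that reading, heads 6 and 9 together carry roughly half of the listed weight.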
Negative Logits
Metall   -3.65
robber   -3.56
cyan     -3.48
Nurs     -3.47
Barn     -3.45
Barnes   -3.43
chees    -3.43
Tes      -3.41
Sed      -3.38
Cena     -3.37
Positive Logits
FP    9.99
FP    8.95
fp    6.95
FK    3.96
NF    3.89
Flo   3.83
FI    3.76
POV   3.65
TP    3.64
TP    3.61
Activations Density 0.001%