INDEX
Explanations
phrases and questions that express curiosity or seek information
Head Attr Weights
0:0.01
1:0.01
2:0.08
3:0.18
4:0.08
5:0.03
6:0.07
7:0.06
8:0.05
9:0.05
10:0.17
11:0.16
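
The weights above can be ranked to see which heads dominate. This is a minimal sketch, assuming each `head:weight` line is a per-head attribution weight for this feature; `top_heads` is a hypothetical helper written for illustration and is not part of any dashboard tooling.

```python
# Head attribution weights copied from the listing above (head index -> weight).
head_weights = {
    0: 0.01, 1: 0.01, 2: 0.08, 3: 0.18, 4: 0.08, 5: 0.03,
    6: 0.07, 7: 0.06, 8: 0.05, 9: 0.05, 10: 0.17, 11: 0.16,
}


def top_heads(weights, k=3):
    """Return the k (head, weight) pairs with the largest weights (hypothetical helper)."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


top = top_heads(head_weights)

# Fraction of the listed weight carried by the top-k heads.
# The listed values are rounded, so the total need not be exactly 1.
top_share = sum(w for _, w in top) / sum(head_weights.values())

print("top heads:", [h for h, _ in top])
print("share of listed weight: %.2f" % top_share)
```

With the values listed, heads 3, 10, and 11 rank highest and together carry a little over half of the listed weight.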
Negative Logits
onomy       -1.60
estead      -1.49
hift        -1.47
pherd       -1.42
ş           -1.39
incial      -1.38
resident    -1.38
cult        -1.35
gd          -1.34
ploy        -1.34
Positive Logits
cracked         1.39
curing          1.32
!).             1.28
)."             1.28
Sponsor         1.27
Please          1.25
expired         1.25
guessed         1.25
Schwarzenegger  1.23
Comedy          1.21
Activations Density 0.001%