INDEX
Explanations
phrases encouraging interaction with online platforms and subscriptions
Head Attr Weights
0:0.05
1:0.02
2:0.09
3:0.15
4:0.12
5:0.03
6:0.04
7:0.12
8:0.11
9:0.06
10:0.06
11:0.10
Negative Logits
describ:-1.43
perce:-1.41
eway:-1.33
quer:-1.24
tram:-1.23
unin:-1.23
gorilla:-1.20
ability:-1.20
tending:-1.19
icans:-1.19
Positive Logits
hetti:1.45
Alternatively:1.42
mods:1.38
�:1.31
etsy:1.31
odcast:1.30
Mini:1.28
attRot:1.27
Enjoy:1.27
Original:1.27
Activations Density 0.023%