INDEX
Explanations
words related to social interactions and engagement
Head Attr Weights
0:0.01
1:0.01
2:0.09
3:0.07
4:0.17
5:0.03
6:0.08
7:0.31
8:0.03
9:0.03
10:0.06
11:0.04
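
The weights above can be ranked programmatically. The following is a minimal sketch, assuming each line is a `head:weight` pair exactly as listed; the names `RAW` and `parse_head_weights` are hypothetical and not part of the source page.

```python
# Head attribution weights copied verbatim from the listing above ("head:weight").
RAW = """0:0.01
1:0.01
2:0.09
3:0.07
4:0.17
5:0.03
6:0.08
7:0.31
8:0.03
9:0.03
10:0.06
11:0.04"""


def parse_head_weights(text):
    """Parse 'head:weight' lines into a {head_index: weight} dict (hypothetical helper)."""
    weights = {}
    for line in text.splitlines():
        head, weight = line.split(":")
        weights[int(head)] = float(weight)
    return weights


weights = parse_head_weights(RAW)

# Sort heads from largest to smallest attribution weight.
ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
top_head, top_weight = ranked[0]

# Round the sum to two decimals, since the listed values are themselves rounded.
total = round(sum(weights.values()), 2)

print(f"top head: {top_head} (weight {top_weight})")
print(f"sum of listed weights: {total}")
```

With the values as listed, head 7 carries the largest weight (0.31), followed by head 4 (0.17). The listed weights sum to 0.93 rather than 1.00, which may simply reflect rounding in the display.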
Negative Logits
agos      -2.09
rompt     -1.91
ascript   -1.89
ebted     -1.81
iseum     -1.79
uliffe    -1.74
ancial    -1.73
umar      -1.70
entanyl   -1.67
aunder    -1.61
Positive Logits
horizont  1.82
thick     1.58
circles   1.58
Lego      1.55
pics      1.54
pole      1.51
hither    1.50
LINE      1.50
thicker   1.49
Moe       1.47
Activations Density 0.001%