Explanations
phrases related to human relationships and social dynamics
Head Attr Weights
0:0.02
1:0.01
2:0.18
3:0.29
4:0.06
5:0.05
6:0.05
7:0.06
8:0.04
9:0.05
10:0.07
11:0.08
Negative Logits
EVER       -1.68
"]=>       -1.49
'.         -1.42
sha        -1.41
date       -1.37
badge      -1.36
letters    -1.35
');        -1.35
Ever       -1.34
»          -1.33
Positive Logits
aeper      1.65
etheless   1.64
�          1.58
unmist     1.53
trak       1.51
indis      1.49
adle       1.48
challeng   1.47
alore      1.47
anke       1.46
Activations Density 0.004%