INDEX
Explanations
phrases indicating motivation or influence behind actions
Head Attr Weights
0:0.02
1:0.01
2:0.07
3:0.06
4:0.13
5:0.04
6:0.03
7:0.34
8:0.05
9:0.03
10:0.07
11:0.10
Negative Logits
jud       -1.53
Redditor  -1.52
fing      -1.49
pants     -1.45
�         -1.45
onyms     -1.41
ummer     -1.41
�         -1.41
nick      -1.39
dden      -1.39
Positive Logits
andise         1.64
conver         1.50
developments   1.50
obs            1.48
wedge          1.48
dwar           1.42
Wer            1.42
funnel         1.40
consolidation  1.38
movements      1.36
Activations Density 0.001%