Explanations
words related to criticism or negative descriptions
Head Attr Weights
0:0.08
1:0.03
2:0.31
3:0.09
4:0.15
5:0.04
6:0.02
7:0.02
8:0.05
9:0.06
10:0.05
11:0.02
Negative Logits
uddin          -1.38
yip            -1.34
john           -1.22
terness        -1.21
ollah          -1.19
htaking        -1.13
lik            -1.13
underestimate  -1.12
envelope       -1.12
agine          -1.11
Positive Logits
GES     1.31
mach    1.27
ombat   1.26
ゼウス  1.25
Tire    1.23
pter    1.17
>>>>    1.16
cised   1.15
aband   1.15
将      1.13
Activations Density: 0.003%