INDEX
Explanations
negations and expressions of personal opinions or feelings
Negative Logits
erval     -0.17
oun       -0.15
ouz       -0.15
eniable   -0.14
adden     -0.14
osaur     -0.14
ously     -0.14
Král      -0.13
uin       -0.13
anel      -0.13
Positive Logits
who        0.31
who        0.28
beg        0.23
Who        0.22
Who        0.21
shr        0.20
Meh        0.20
shrugged   0.19
qui        0.19
thems      0.18
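The two logit lists rank vocabulary tokens by how strongly the feature pushes their logits down or up. Below is a minimal sketch of that kind of ranking. It assumes (this page does not say so) that the effect is the feature's decoder direction projected through the model's unembedding matrix. The function `top_logit_effects`, the names `decoder_dir` and `W_U`, and the toy weights and tokens are all hypothetical.

```python
import numpy as np

def top_logit_effects(decoder_dir, W_U, vocab, k=10):
    """Return the k most negative and k most positive direct logit effects.

    Assumption (hypothetical): effect = decoder_dir @ W_U, where decoder_dir
    has shape (d_model,) and W_U has shape (d_model, vocab_size).
    """
    effects = decoder_dir @ W_U   # one score per vocabulary token
    order = np.argsort(effects)   # ascending: most negative first
    negative = [(vocab[i], float(effects[i])) for i in order[:k]]
    positive = [(vocab[i], float(effects[i])) for i in order[::-1][:k]]
    return negative, positive

# Toy example with made-up weights and tokens (not the values listed above).
vocab = ["tok_a", "tok_b", "tok_c", "tok_d"]
W_U = np.array([[0.3, 0.2, -0.1, -0.2],
                [0.0, 0.1, -0.05, 0.0]])
decoder_dir = np.array([1.0, 0.1])
negative, positive = top_logit_effects(decoder_dir, W_U, vocab, k=2)
```

Sorting once with `argsort` and reading the order from both ends produces both lists in a single pass over the vocabulary.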
Activation Density: 0.221%
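The density figure is the share of sampled tokens on which the feature is active. Here is a minimal sketch. It assumes activations over a token sample are available as an array and that "active" means strictly above a threshold of 0. The dashboard's actual threshold and sample are not stated here, and `activation_density` is a hypothetical helper.

```python
import numpy as np

def activation_density(activations, threshold=0.0):
    """Percentage of sampled token positions where the feature is active.

    Assumption (hypothetical): "active" means activation > threshold.
    """
    activations = np.asarray(activations, dtype=float)
    active = np.count_nonzero(activations > threshold)
    return 100.0 * active / activations.size

# Toy sample: the feature fires on 2 of 1,000 token positions.
acts = np.zeros(1000)
acts[[10, 500]] = 1.5
density = activation_density(acts)  # 0.2 (percent)
```

By the same arithmetic, a density of 0.221% means the feature fires on roughly 2 of every 1,000 sampled tokens.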