INDEX
Explanations
negations and expressions of doubt or uncertainty
Negative Logits
arp          -0.15
amo          -0.15
withdraw     -0.14
outu         -0.14
ories        -0.14
ues          -0.14
inar         -0.14
hatt         -0.14
acid         -0.13
ary          -0.13
Positive Logits
necessarily   0.28
matter        0.22
matter        0.20
mattered      0.19
ever          0.17
ecessarily    0.17
matters       0.16
Matter        0.16
compares      0.16
ylland        0.16
Activations Density: 0.125%