Explanations
words indicating contribution or involvement in various contexts
Negative Logits
Token    Logit
arp      -0.17
laces    -0.17
lace     -0.15
ifier    -0.15
ify      -0.15
oton     -0.15
arm      -0.15
lac      -0.15
el       -0.14
undy     -0.14
Positive Logits
Token    Logit
towards  0.23
toward   0.23
utory    0.20
Towards  0.19
Towards  0.18
uting    0.18
factors  0.17
icut     0.17
Tow      0.17
ally     0.16
Activations Density: 0.024%