INDEX
Explanations
phrases indicating conditional scenarios or potential outcomes
Negative Logits
cio      -0.15
rix      -0.15
shared   -0.15
elt      -0.15
ide      -0.14
rx       -0.14
rik      -0.14
eldon    -0.14
bach     -0.13
infl     -0.13
Positive Logits
odyn       0.17
lesc       0.16
available  0.15
steller    0.15
_easy      0.15
offer      0.15
available  0.15
mpar       0.15
ToDevice   0.14
offers     0.14
Activations Density 0.013%