Explanations
the word "when" followed by a numerical value
Negative Logits
enegger  -0.74
kaya     -0.71
gur      -0.69
zzi      -0.68
yi       -0.66
feature  -0.65
edly     -0.62
hid      -0.61
hire     -0.61
chens    -0.61
Positive Logits
soever    1.04
exactly   0.95
abouts    0.79
ce        0.79
irlf      0.78
they      0.73
someone   0.71
puberty   0.65
we        0.65
faced     0.65
Activations Density 0.078%