INDEX
Explanations
phrases indicating future actions or intentions
Negative Logits
never      -0.23
never      -0.22
NEVER      -0.19
nunca      -0.18
Never      -0.17
никогда    -0.16
already    -0.15
ral        -0.15
already    -0.15
           -0.15
Positive Logits
need       0.24
hell       0.24
town       0.21
bed        0.21
need       0.21
Hell       0.21
be         0.21
iams       0.20
jail       0.20
hell       0.19
Activations Density 0.037%