INDEX
Explanations
phrases related to directional movement or transition
instances of the word "into" and related context
Negative Logits
Token       Logit
icing       -0.64
due         -0.64
opinion     -0.62
sessions    -0.61
parties     -0.61
DD          -0.61
eval        -0.60
chat        -0.60
breakout    -0.60
Warm        -0.59
Positive Logits
Token       Logit
into        3.70
onto        1.01
Into        1.00
inside      0.97
hiba        0.82
lda         0.82
ever        0.82
indu        0.81
INTO        0.80
INT         0.79
Activations Density: 0.009%