INDEX
Explanations
instructions or commands to follow
instances of the word "follow."
Negative Logits
Token        Logit
pite         -0.82
inese        -0.78
Newsletter   -0.71
ãĥĨãĤ£       -0.71
cci          -0.69
risome       -0.68
Scotia       -0.68
inished      -0.66
ILCS         -0.66
urrection    -0.64
Positive Logits
Token        Logit
directions   0.86
closely      0.80
follow       0.75
suit         0.75
ansen        0.74
itored       0.73
blindly      0.71
Follow       0.71
behav        0.69
obedient     0.68
Activations Density: 0.036%