INDEX
Explanations
Tokens that are part of user instructions or explicit task/request prompts (i.e., directive phrases asking the model to do something).
Negative Logits
to  0.53
of  0.52
auf  0.48
ă  0.48
från  0.48
ão  0.46
een  0.46
على  0.45
with  0.43
của  0.43
Positive Logits
6  0.44
ის  0.43
Dave  0.43
7  0.41
ございます  0.41
5  0.40
티  0.39
ке  0.39
정에  0.38
ג  0.38
Activations Density: 12.205%