INDEX
Explanations
commands or requests for information
commands or prompts for sharing information
Negative Logits
Token     Logit
rane      -0.75
ILCS      -0.74
urdue     -0.74
cdn       -0.73
imates    -0.71
hered     -0.66
rior      -0.65
erala     -0.65
Antiqu    -0.63
Dub       -0.61
Positive Logits
Token     Logit
tale      1.46
ingly     1.18
tell      0.93
us        0.77
ariat     0.76
tell      0.75
him       0.74
roy       0.72
me        0.72
Tell      0.71
Activations Density 0.034%