INDEX
Explanations
instances of dialogue and spoken interactions
Negative Logits
Token         Logit
icrous        -0.15
λαν           -0.14
udd           -0.14
WXYZ          -0.14
933           -0.14
arda          -0.14
slee          -0.13
anner         -0.13
ãģµ           -0.13
xde           -0.13
Positive Logits
Token         Logit
instruction   0.45
instruct      0.44
instructions  0.40
warning       0.38
admon         0.37
warnings      0.36
instruction   0.35
advice        0.35
instructed    0.35
instr         0.33
Activations Density: 0.764%