INDEX
Explanations
can you perform action
Polite second-person questions that ask the assistant to perform an action.
NEGATIVE LOGITS
ل    0.54
ডি    0.50
栥    0.50
م    0.50
ر    0.49
ないと    0.47
有两个    0.47
двох    0.46
لی    0.46
ர    0.46
POSITIVE LOGITS
?),    0.52
...?    0.50
…?    0.49
ker    0.46
एखा    0.44
подума    0.43
?"    0.42
uuuu    0.42
gladly    0.42
unwittingly    0.41
Activations Density: 0.075%
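
The density figure can be read as the share of token positions on which the feature fires. The sketch below illustrates that reading only; it assumes density means "fraction of tokens with activation above a threshold", and the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical rather than taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature fires.
    """
    if not activations:
        return 0.0
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


# Hypothetical sample: the feature fires on 3 of 4000 token positions.
sample = [0.0] * 3997 + [1.2, 0.4, 2.7]
density = activation_density(sample)
print(f"{density:.3%}")  # prints "0.075%"
```

Under this reading, a density of 0.075% would mean the feature is active on roughly 3 of every 4000 tokens, which is consistent with a sparse, narrowly scoped feature.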