Explanations
asking questions to elicit information
Negative Logits
obeys       0.76
изменение   0.75
geschrieben 0.73
czeniu      0.73
zerstört    0.72
を変更      0.71
写的        0.71
Nachricht   0.71
និយាយ       0.70
написано    0.70
Positive Logits
elicit      1.36
solicit     1.33
probing     1.32
probe       1.31
probes      1.27
elic        1.26
soliciting  1.26
sparking    1.24
prompting   1.24
gauge       1.20
Activations Density 0.596%