Explanations
expressions of uncertainty or lack of knowledge
Negative Logits
simply      0.77
bothered    0.74
merely      0.70
scratched   0.64
只需要      0.63  (Chinese: "only need")
transcends  0.63
bothers     0.62
Majority    0.62
atie        0.61
лись        0.60  (Russian reflexive past-tense verb ending)
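Positive Logits
creo             0.80  (Spanish: "I believe")
నేను             0.79  (Telugu: "I")
particularmente  0.75  (Spanish/Portuguese: "particularly")
我认为           0.75  (Chinese: "I think")
我不             0.75  (Chinese: "I don't")
Tôi              0.74  (Vietnamese: "I")
behaupt          0.74  (German: stem of "behaupten", "to claim")
вижу             0.73  (Russian: "I see")
знаю             0.71  (Russian: "I know")
હું              0.70  (Gujarati: "I")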
Activations Density: 0.071%
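
For readers unfamiliar with the density figure, the sketch below shows one common way such a number could be computed: the fraction of token positions on which the feature's activation exceeds a threshold. This definition is an assumption, since the page does not say how its density is calculated. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions where the feature is active.

    Assumed definition (not confirmed by the source): a position counts as
    active when its activation is strictly greater than `threshold`.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 7 of which activate the feature.
sample = [0.0] * 9993 + [1.2, 0.4, 3.1, 0.9, 2.2, 0.1, 5.0]
density = activation_density(sample)

# Format as a percentage, matching the style of the figure shown above.
print(f"{density:.3%}")
```

Under this reading, a density of 0.071% would mean the feature fires on roughly 7 of every 10,000 tokens in the sampled text.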