Explanations
references to personal pronouns and expressions of intent or desire
Negative Logits
Token       Logit
desiring    -0.73
Жела        -0.71
desired     -0.69
wishing     -0.67
wished      -0.63
Desired     -0.63
Wishing     -0.62
wish        -0.61
Wishing     -0.60
desired     -0.59
Positive Logits
Token       Logit
want        1.06
wan         0.70
wants       0.65
quiero      0.57
WAN         0.56
voulez      0.52
Wan         0.51
veut        0.50
quieren     0.49
Want        0.49
Activations Density: 0.277%