INDEX
Explanations
phrases indicating confusion or uncertainty
references to a lack of knowledge or uncertainty
Negative Logits
iki       -0.74
inka      -0.71
Reviewed  -0.68
ouri      -0.67
inos      -0.66
Pers      -0.64
istant    -0.63
ansk      -0.63
conn      -0.62
visor     -0.60
Positive Logits
whatsoever  0.98
how         0.97
why         0.95
whats       0.82
what        0.80
whence      0.80
ledged      0.79
why         0.77
squat       0.75
WHY         0.74
Activations Density: 0.036%