INDEX
Explanations
instructions for performing actions or tasks
references to the reader's actions and experiences
Negative Logits
ooters        -0.63
aughs         -0.63
abound        -0.62
recy          -0.62
acular        -0.62
perks         -0.61
unthinkable   -0.61
WAYS          -0.61
reperto       -0.60
eties         -0.59
Positive Logits
intended      0.93
selected      0.91
umbn          0.91
originally    0.89
chosen        0.86
requested     0.84
assigned      0.82
desired       0.80
supposed      0.78
photographed  0.76
Activations Density: 0.221%