INDEX
Explanations
instances where the text advises skipping or jumping over certain content
instructions or suggestions to skip sections of text
Negative Logits
amen        -0.77
lee         -0.73
orc         -0.72
lie         -0.69
lisher      -0.68
oran        -0.68
crim        -0.67
Reviewer    -0.67
rador       -0.67
eer         -0.65
Positive Logits
altogether  0.87
ahead       0.77
breakfast   0.75
overboard   0.73
vacations   0.69
puberty     0.69
ichi        0.68
detection   0.67
bothering   0.67
meals       0.67
Activations Density: 0.039%