Explanations
references to past actions or positions
the word "previously" and its variations in context
Negative Logits
eer             -0.69
ribution        -0.69
letico          -0.66
ocracy          -0.65
pling           -0.65
ging            -0.64
Redditor        -0.64
roller          -0.64
Incre           -0.64
lua             -0.62
Positive Logits
unsus           1.00
unpublished     0.88
unsuccessfully  0.82
undisclosed     0.81
disclosed       0.81
incarcerated    0.79
experimented    0.77
held            0.76
teased          0.76
belonged        0.76
Activations Density 0.040%