INDEX
Explanations
phrases related to information dissemination or updates
phrases indicating familiarity or awareness about ongoing topics or discussions
Negative Logits
Dialogue         -0.65
grades           -0.61
prolong          -0.59
sacrific         -0.58
openness         -0.57
preserves        -0.56
stunts           -0.56
differentiation  -0.55
downgrade        -0.55
stabilization    -0.55
Positive Logits
guessed     1.14
noticed     1.13
familiar    1.12
heard       1.06
know        0.95
watched     0.93
know        0.92
acquainted  0.91
remember    0.88
already     0.88
Activations Density: 0.249%