INDEX
Explanations
phrases indicating contrast or contradiction
phrases indicating temporal contexts or factual assertions
Negative Logits
aspiration      -0.60
holdings        -0.58
unfocusedRange  -0.57
ucker           -0.56
Telegram        -0.56
qt              -0.56
subreddit       -0.55
legram          -0.55
Ludwig          -0.55
pherd           -0.55
Positive Logits
ean             0.71
irlf            0.69
udeb            0.69
fil             0.67
ealous          0.64
})              0.63
©¶æ             0.62
azard           0.62
ķ               0.61
Spoiler         0.59
Activations Density 0.352%