INDEX
Explanations
phrases indicating certainty or strong belief
conversational phrases expressing opinions or reactions
Negative Logits
İĭ          -0.76
ership      -0.73
Presence    -0.70
é¾įå¥ij士   -0.70
Contents    -0.69
assembly    -0.65
-+-+        -0.63
Rated       -0.63
Assist      -0.62
ãĤ¢ãĥ«      -0.62
Positive Logits
forgot      0.95
sounds      0.81
sounded     0.80
kinda       0.79
sucks       0.78
kidding     0.78
spelled     0.75
fooled      0.75
tricked     0.71
got         0.71
Activations Density 0.223%
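
The density figure can be read as the share of token positions on which the feature fires. The sketch below illustrates that reading, assuming "fires" means an activation strictly above zero; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: a feature counts as active when its activation is strictly
    greater than `threshold` (0.0 by default).
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Synthetic data, made up for illustration: 100,000 token positions,
# of which 223 have a nonzero activation.
sample = [0.0] * 99_777 + [1.5] * 223
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.223% would mean the feature is active on roughly 2 of every 1,000 tokens in the sampled text.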