INDEX
Explanations
phrases indicating contrast or contradiction
phrases that contain negative statements or contradictions
Negative Logits
ebted        -0.64
icka         -0.61
arius        -0.61
ierce        -0.59
watches      -0.57
ukong        -0.56
miah         -0.55
rupt         -0.55
itialized    -0.54
sense        -0.54
Positive Logits
irrelevant   0.95
©¶æ          0.92
outwe        0.89
unlikely     0.86
hardly       0.83
beside       0.82
merely       0.80
livion       0.77
peanuts      0.76
moot         0.75
Activations Density: 0.131%