INDEX
Explanations
phrases related to opinions, beliefs, claims, and speculations
Negative Logits
ciating     -0.85
ients       -0.77
tesy        -0.70
ĸļ          -0.67
ften        -0.63
viron       -0.62
ibles       -0.62
rals        -0.60
Contents    -0.60
Himself     -0.59
Positive Logits
parallels       0.77
errone          0.76
incorrectly     0.72
similarities    0.65
negatively      0.63
resemb          0.63
doom            0.63
why             0.62
whether         0.61
aloud           0.61
Activations Density: 0.247%