Explanations
words related to correct and incorrect judgments or evaluations
phrases or statements that reflect disbelief or discomfort
Negative Logits
Token        Logit
hement       -0.88
abouts       -0.86
nesday       -0.79
stad         -0.78
earch        -0.77
ortun        -0.72
isons        -0.72
herself      -0.71
intended     -0.70
osponsors    -0.70
Positive Logits
Token                                                               Logit
ONSORED                                                             0.87
Seriously                                                           0.80
Unless                                                              0.80
Liter                                                               0.77
WHERE                                                               0.77
HAHAHAHA                                                            0.75
Okay                                                                0.75
================================================================    0.73
Anyway                                                              0.73
Domin                                                               0.73
Activations Density: 0.396%
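The page does not define the density figure. Assuming it is the fraction of sampled tokens on which the feature's activation is above zero, it could be computed roughly as below. The function name, the threshold parameter, and the sample data are all hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the
    feature fires at all; the source page does not state this.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Made-up sample: 1000 tokens, 4 of which activate the feature.
sample = [0.0] * 996 + [1.2, 0.4, 3.1, 0.9]
print(f"{activation_density(sample):.3%}")  # prints 0.400%
```

Under this reading, a density of 0.396% would mean the feature fires on roughly 4 of every 1000 tokens.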