INDEX
Explanations
negative statements or contradictions
negative phrases and rejections of concepts or narratives
Negative Logits
Token        Logit
itas         -0.76
kees         -0.74
Fs           -0.73
unders       -0.68
units        -0.67
})           -0.67
ãĤ¼ãĤ¦ãĤ¹    -0.66
WAY          -0.66
çļ           -0.65
anders       -0.65
Positive Logits
Token        Logit
necessarily  1.29
merely       1.08
mere         0.93
withstanding 0.91
meant        0.84
flashy       0.83
uncommon     0.79
solely       0.79
rocket       0.79
kidding      0.77
Activations Density 0.170%