INDEX
Explanations
phrases related to beliefs, descriptions, and assumptions about certain topics or entities
phrases that express common beliefs or perceptions
Negative Logits
aleb            -0.62
Donkey          -0.60
disg            -0.60
Hungry          -0.60
Bravo           -0.58
Kl              -0.58
alos            -0.57
VG              -0.57
Competition     -0.56
Sierra          -0.56
Positive Logits
isSpecialOrderable    0.78
ت                     0.75
pegged                0.71
س                     0.70
hack                  0.68
inet                  0.68
ypes                  0.67
idable                0.66
à¨                    0.66
derog                 0.66
Activations Density 0.159%