Explanations
phrases related to physical descriptions or conditions, especially ones related to clothing or appearance
Negative Logits
Reviewer    -0.75
Hasan       -0.64
rers        -0.62
vation      -0.61
indirectly  -0.59
ーテ        -0.58
heads       -0.58
Hamm        -0.56
Scotia      -0.55
exit        -0.54
Positive Logits
poke        1.42
iege        1.33
erker       1.24
pect        1.13
earchers    1.01
peak        0.95
aved        0.95
oin         0.94
erk         0.93
hirt        0.90
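The two logit lists rank vocabulary tokens by how strongly this feature pushes their output logits up (positive) or down (negative). A common way to build such lists, assumed here and not confirmed by this page, is to multiply the feature's decoder direction by the model's unembedding matrix and keep the k highest and k lowest scores. The pure-Python sketch below does exactly that; the function names `logit_effects` and `top_and_bottom` and the toy weights are hypothetical.

```python
from typing import List, Sequence, Tuple


def logit_effects(decoder_dir: Sequence[float],
                  w_unembed: Sequence[Sequence[float]]) -> List[float]:
    """Score every vocabulary entry by decoder_dir @ w_unembed.

    Assumption: decoder_dir is the feature's decoder vector (length d_model)
    and w_unembed is the model's unembedding matrix (d_model x vocab_size).
    """
    if len(decoder_dir) != len(w_unembed):
        raise ValueError("decoder_dir length must equal the rows of w_unembed")
    vocab_size = len(w_unembed[0])
    return [
        sum(decoder_dir[i] * w_unembed[i][t] for i in range(len(decoder_dir)))
        for t in range(vocab_size)
    ]


def top_and_bottom(tokens: Sequence[str], scores: Sequence[float],
                   k: int) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most positive and k most negative (token, score) pairs."""
    ranked = sorted(zip(tokens, scores), key=lambda pair: pair[1])
    bottom = ranked[:k]                # most negative first
    top = list(reversed(ranked[-k:]))  # most positive first
    return top, bottom


# Toy example with hypothetical weights (d_model = 2, vocab of 4 tokens).
tokens = ["shirt", "poke", "exit", "Scotia"]
decoder_dir = [1.0, -0.5]
w_unembed = [
    [0.75, 1.25, -0.25, 0.5],
    [0.0, 0.5, 0.75, 1.5],
]
scores = logit_effects(decoder_dir, w_unembed)
top, bottom = top_and_bottom(tokens, scores, k=2)
print("positive:", top)
print("negative:", bottom)
```

With real weights, `decoder_dir` would be this feature's row of the autoencoder decoder and `w_unembed` the language model's unembedding; whether the dashboard also applies a final layer norm or other scaling is not stated here.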
Activations Density 0.033%
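Activation density is the share of tokens on which the feature fires. At 0.033%, that is roughly 33 activating tokens per 100,000. A minimal sketch of the calculation, assuming "fires" means an activation strictly above zero (the dashboard's actual threshold is not given on this page); `activation_density` is a hypothetical helper name:

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation exceeds threshold (assumed 0.0)."""
    total = 0
    active = 0
    for a in activations:
        total += 1
        if a > threshold:
            active += 1
    if total == 0:
        raise ValueError("need at least one activation")
    return 100.0 * active / total


# Toy example: 3 active tokens out of 10,000.
acts = [0.0] * 9997 + [1.2, 0.4, 2.0]
density = activation_density(acts)
print(f"Activation density: {density:.3f}%")  # prints Activation density: 0.030%
```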