Explanations
phrases related to punishments or consequences
Negative Logits
nces               -0.77
unaccompanied      -0.74
xual               -0.73
IUM                -0.69
ItemImage          -0.67
iances             -0.63
ibles              -0.62
Dian               -0.62
owan               -0.61
confidentiality    -0.61
Positive Logits
forehead            0.87
shoulder            0.87
cheek               0.86
heels               0.76
toe                 0.71
crown               0.71
neck                0.71
shoulders           0.69
os                  0.68
chest               0.68
Activations Density: 0.120%