INDEX
Explanations
positive adjectives
expressions of positive evaluations or praise
Negative Logits
eters      -0.81
ople       -0.80
iper       -0.73
eter       -0.73
istan      -0.72
pper       -0.71
hyde       -0.70
hip        -0.70
hod        -0.70
Pavilion   -0.69
Positive Logits
enough       1.35
enough       1.09
reads        1.02
luck         1.00
sword        0.96
Enough       0.92
intentions   0.91
Samar        0.91
ol           0.88
luck         0.86
Activations Density 0.064%