INDEX
Explanations
positive descriptive adjectives
terms associated with niceness or positive attributes
Negative Logits
WIND         -0.96
AUT          -0.82
ENG          -0.74
Ultra        -0.73
Printed      -0.70
produced     -0.69
âĵĺ          -0.68
GC           -0.67
Mandatory    -0.66
Continued    -0.66
Positive Logits
nic          1.46
eties        1.38
uity         0.95
uously       0.94
atural       0.94
esse         0.93
otin         0.91
eteenth      0.89
ciating      0.89
ety          0.88
Activations Density: 0.005%