INDEX
Explanations
negative or problematic descriptors in various contexts
Negative Logits
RITE        -0.17
owel        -0.16
çĽ          -0.15
ikut        -0.15
à¸ŀà¸Ń      -0.14
rani        -0.14
ê¹ĮìļĶ      -0.14
ottle       -0.14
somewhat    -0.14
rance       -0.13
Positive Logits
nor         0.40
anymore     0.29
nor         0.28
Nor         0.24
Nor         0.23
neither     0.20
anywhere    0.20
NOR         0.19
sondern     0.19
ani         0.18
Activations Density: 0.484%