INDEX
Explanations
phrases related to evaluations and judgments about societal norms and personal growth
Negative Logits
iversit       -0.18
urgeon        -0.18
ÏĢη           -0.17
енка          -0.15
weakest       -0.14
kker          -0.14
aviest        -0.14
/extensions   -0.14
oler          -0.14
ardless       -0.13
Positive Logits
simply        0.17
ken           0.15
gone          0.14
Simply        0.14
ues           0.14
Mim           0.14
ç             0.14
δÏħ           0.14
auer          0.14
entes         0.13
Activations Density 0.307%
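
For readers unfamiliar with the density figure, the following is a minimal sketch of how such a number is commonly computed, assuming it means the fraction of token positions on which the feature's activation is above zero. The function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of tokens on which the feature fires.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical toy data: 1000 token positions, 3 of which fire.
toy = [0.0] * 997 + [0.8, 1.2, 0.05]
print(f"{activation_density(toy):.3%}")
```

Under this reading, a density of 0.307% would correspond to the feature firing on roughly 3 of every 1000 tokens in the sampled text.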