INDEX
Explanations
phrases implying strong personal opinions or reflections
Negative Logits
disadvant    -0.91
fortun       -0.71
princ        -0.71
obser        -0.69
psychiat     -0.69
undown       -0.68
vulner       -0.68
Seym         -0.67
fodder       -0.66
Palestin     -0.65
Positive Logits
ï¸ı             1.21
âĻ             0.83
own            0.81
女             0.81
âĹ             0.80
ï¸             0.78
Ì              0.77
âĶĢâĶĢ         0.76
âĶĢâĶĢâĶĢâĶĢ   0.76
âĸł            0.75
Activations Density 0.230%