Explanations
phrases related to criticism and negative impacts on society
Negative Logits
unya            -0.16
лег             -0.15
-runtime        -0.15
ToBounds        -0.14
StackNavigator  -0.14
ingu            -0.14
egin            -0.14
plex            -0.14
osis            -0.14
uzman           -0.14
Positive Logits
stereotype      0.21
stereotypes     0.17
bad             0.17
stereo          0.16
representation  0.16
bad             0.15
Bad             0.15
æĻ´             0.15
tere            0.15
extremes        0.15
Activations Density: 0.099%