Explanations
phrases indicating critiques or negative assessments of systems and organizations
Negative Logits
Marker    -0.17
رد    -0.16
sh    -0.16
gons    -0.15
Burnett    -0.15
ot    -0.14
wel    -0.14
AUTH    -0.14
ertos    -0.14
DÄĽ    -0.14
Positive Logits
undo    0.20
eday    0.17
344    0.15
164    0.15
arse    0.15
jsonp    0.15
å±Ģ    0.14
karak    0.14
/modal    0.14
zek    0.14
Activations Density: 0.103%