INDEX
Explanations
mentions of criminal activities or controversial events in news articles
Negative Logits
Wander        -0.66
Horizons      -0.66
Leilan        -0.66
Paran         -0.65
Prol          -0.62
Seller        -0.61
Vert          -0.59
Remix         -0.59
Tsukuyomi     -0.59
Annotations   -0.58
Positive Logits
rates         0.69
looph         0.64
categ         0.60
âĢİ           0.59
adolesc       0.57
ĵĺ            0.57
efficients    0.57
acting        0.56
Õ             0.55
urged         0.55
Activations Density: 0.085%