INDEX
Explanations
language related to controversial or sensitive topics, as well as terms related to legal and ethical issues
topics related to whistleblowing, legal issues, and social controversies
Negative Logits
utterstock  -0.55
ozy         -0.54
ipel        -0.53
asma        -0.53
inis        -0.51
bilt        -0.51
amiya       -0.50
ramid       -0.49
outube      -0.48
ibaba       -0.48
Positive Logits
exists      0.68
existed     0.66
might       0.64
should      0.64
cannot      0.63
hadn        0.62
could       0.62
lacks       0.62
ought       0.59
lacked      0.57
Activations Density: 1.222%