Explanations
phrases related to controversial or potentially harmful topics, such as racially motivated attacks, compromised identity keys, health concerns linked to pesticides, and claims of false statements
references to social and legal issues, particularly those involving crime, politics, and public sentiment
Negative Logits
Darling    -0.52
Jr         -0.52
concess    -0.49
mbuds      -0.47
Kop        -0.47
overe      -0.46
retty      -0.46
sit        -0.46
educ       -0.46
advert     -0.46
Positive Logits
)?             0.94
¶              0.84
Belfast        0.79
):             0.76
constitutes    0.75
violates       0.75
.--            0.75
.–             0.72
Copyright      0.72
"?             0.72
Activations Density: 1.375%