INDEX
Explanations
phrases related to allegations or accusations of wrongdoing
references to claims or statements of wrongdoing or misconduct
Negative Logits
ger       -0.68
xual      -0.67
ilation   -0.66
atu       -0.66
bern      -0.65
ament     -0.65
liv       -0.65
ature     -0.64
focus     -0.63
heses     -0.63
Positive Logits
violated       0.76
misrepresent   0.75
allegedly      0.73
metic          0.73
infringing     0.72
contradict     0.72
Buyable        0.71
æ©             0.71
originated     0.71
infring        0.71
Activations Density 0.006%