INDEX
Explanations
references to political affiliations or actions related to defiance against authority
Negative Logits
CWE                 -0.79
gainera             -0.71
Wicidata            -0.71
cherchés            -0.67
disambiguazione     -0.66
Италијани           -0.63
EndGlobalSection    -0.63
كويكب               -0.62
Vidite              -0.59
oprot               -0.54
Positive Logits
refusal             2.05
refused             1.90
refus               1.89
refuse              1.85
refusing            1.79
refuses             1.72
rejection           1.69
reject              1.63
rejecting           1.62
refuser             1.61
Activations Density
0.746%