INDEX
Explanations
information related to negative consequences or dangers
themes related to moral ambiguity and human nature
Negative Logits
mentioning         -0.59
Pwr                -0.56
thats              -0.56
noting             -0.56
tho                -0.55
resear             -0.54
testament          -0.54
realise            -0.53
acknowledgement    -0.53
noted              -0.52
Positive Logits
impunity           0.80
lessly             0.77
unlawfully         0.74
uously             0.69
arbitrarily        0.68
ocr                0.67
unfairly           0.65
inappropriately    0.64
itarian            0.63
versive            0.62
Activations Density 0.698%
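
As a rough illustration of what the density figure may represent, the sketch below assumes that activation density is the fraction of token positions on which the feature's activation is above zero. The function name `activation_density`, the `threshold` parameter, and the sample activations are hypothetical and are not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature fires.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 1,000 token positions, 7 of which activate the feature.
sample = [0.0] * 993 + [1.2, 0.4, 3.1, 0.9, 2.2, 0.1, 5.0]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.698% would mean the feature is active on roughly 7 of every 1,000 tokens in whatever dataset the dashboard sampled.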