INDEX
Explanations
harsh or critical language and statements in the text
Negative Logits
ually       -0.81
duct        -0.74
iven        -0.73
elta        -0.72
ovember     -0.71
phis        -0.70
xxxxxxxx    -0.69
plex        -0.68
agara       -0.67
eer         -0.67
Positive Logits
harshly     0.88
harsh       0.87
hars        0.86
harsher     0.83
reception   0.77
criticism   0.74
parting     0.73
opinions    0.71
criticisms  0.71
rhetoric    0.70
Activations Density: 10.351%