INDEX
Explanations
words related to medical conditions, especially the adjective "sick" with varying intensities
references to the word "sick" in various contexts
Negative Logits
Token         Logit
unlaw         -0.78
rul           -0.71
compr         -0.69
merce         -0.67
unden         -0.65
sanctioned    -0.65
Unch          -0.65
ETHOD         -0.63
principals    -0.61
NPR           -0.61
Positive Logits
Token         Logit
ening         1.35
ened          1.26
bay           1.16
nesses        0.98
er            0.93
ly            0.92
ness          0.90
le            0.88
ert           0.87
igan          0.87
Activations Density: 0.021%