INDEX
Explanations
specific named entities related to politics and medical terms
references to identities and geographic locations associated with societal structures
Negative Logits
eering     -0.74
aku        -0.67
stakes     -0.65
ing        -0.65
atility    -0.64
Kov        -0.62
Antar      -0.62
dogs       -0.62
enthal     -0.62
agre       -0.61
Positive Logits
itate       0.89
illin       0.81
forth       0.77
alyst       0.76
sei         0.76
itation     0.75
itated      0.75
onymous     0.74
herent      0.74
esville     0.73
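The two logit lists read as the tokens whose output logits this feature most suppresses and most boosts. A common way feature dashboards derive such lists, assumed here rather than confirmed by this page, is to take the dot product of the feature's decoder direction with each token's unembedding vector and keep the k lowest and k highest scores. The sketch below uses pure Python; `logit_effects`, `top_and_bottom`, and the three-token toy vocabulary are hypothetical and only illustrate the ranking step (it ignores any layer-norm folding a real pipeline might apply).

```python
from typing import Dict, List, Tuple

def logit_effects(decoder_dir: List[float],
                  unembed: Dict[str, List[float]]) -> Dict[str, float]:
    """Hypothetical helper: project a feature's decoder direction onto
    each token's unembedding vector (one dot product per token)."""
    return {tok: sum(d * u for d, u in zip(decoder_dir, vec))
            for tok, vec in unembed.items()}

def top_and_bottom(effects: Dict[str, float], k: int
                   ) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return (positive logits, negative logits), each of length k.
    Positive list is sorted most positive first; negative list most negative first."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])
    negative = ranked[:k]
    positive = list(reversed(ranked[-k:]))
    return positive, negative

# Toy example: a 3-dimensional residual stream and a 3-token vocabulary (made up).
decoder_dir = [1.0, 0.0, -1.0]
unembed = {
    "itate":  [0.9, 0.1, 0.0],
    "eering": [0.0, 0.2, 0.74],
    "forth":  [0.5, 0.0, -0.27],
}
effects = logit_effects(decoder_dir, unembed)
positive, negative = top_and_bottom(effects, k=1)
print(positive, negative)
```

With real weights the vocabulary has tens of thousands of entries and k is 10, matching the length of each list above.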
Activation Density: 0.051%
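Activation density is usually read as the percentage of sampled tokens on which the feature fires; 0.051% works out to roughly 5 tokens in every 10,000. The sketch below assumes "fires" means an activation strictly above zero; the function name `activation_density` and its `threshold` parameter are hypothetical, and the page does not state the sample size or threshold actually used.

```python
from typing import Sequence

def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Hypothetical helper: percentage of tokens whose activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)

# Toy sample: 10,000 tokens, of which 5 have a nonzero activation.
sample = [0.0] * 9995 + [1.2] * 5
print(f"{activation_density(sample):.3f}%")
```

A density this low marks the feature as sparse: it is silent on almost every token and active only in narrow contexts.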