Explanations
words related to attacks or negative events, particularly those involving physical harm
terms related to 'ter' and written works or documents
Negative Logits
Token      Logit
overs      -0.69
OWN        -0.65
upid       -0.63
ushi       -0.62
ooth       -0.62
raction    -0.61
EGA        -0.61
icably     -0.61
Shiv       -0.60
UNE        -0.58
Positive Logits
Token      Logit
mic        1.00
pher       0.93
ping       0.91
borgh      0.89
mes        0.82
ciating    0.76
pers       0.76
gins       0.76
thing      0.75
rior       0.75
Activations Density: 0.100%