Explanations
phrases related to destruction or harmful actions
Negative Logits
iras      -0.18
esta      -0.17
.truth    -0.15
atal      -0.15
oom       -0.14
trời      -0.14
cot       -0.14
turb      -0.14
esar      -0.14
elier     -0.14
Positive Logits
urgeon        0.16
essel         0.16
pta           0.15
věd           0.15
ienne         0.15
FileVersion   0.15
Lambert       0.14
zent          0.14
spb           0.14
RegexOptions  0.13
Activations Density 0.026%
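
The density figure can be read as the share of token positions on which the feature fires, so 0.026% would correspond to roughly 26 activations per 100,000 tokens. The sketch below illustrates that reading under stated assumptions: the function name `activation_density`, the zero threshold, and the toy data are all hypothetical and are not taken from the dashboard or from any particular library.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of positions with a non-zero
    (here: above-threshold) feature activation.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 100,000 token positions, 26 of which activate.
toy_activations = [0.0] * 99_974 + [1.5] * 26
density = activation_density(toy_activations)
print(f"{density:.3%}")
```

A feature this sparse fires on very few tokens, so the top-logit lists above are best read alongside the activating examples rather than on their own.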