INDEX
Explanations
suspicious activities in various scenarios
Negative Logits
elsen       -0.83
ffen        -0.73
arium       -0.69
ĸļ          -0.68
ophon       -0.68
á           -0.68
taught      -0.66
agos        -0.66
apologise   -0.66
bourg       -0.66
Positive Logits
ly          1.05
Activity    1.04
activity    1.00
Intent      0.88
motives     0.86
intent      0.85
behaviour   0.83
behavior    0.81
behaviours  0.78
icious      0.78
Activations Density 0.045%
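
For readers unfamiliar with the density figure, here is a minimal sketch of one common reading. It assumes "activations density" means the fraction of token positions on which the feature's activation is above zero; the dashboard itself does not state its definition. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`.

    Assumption: density is counted per token position; this definition is
    not confirmed by the dashboard.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 9 active positions out of 20,000 tokens.
sample = [0.0] * 19991 + [0.5] * 9
print(f"{activation_density(sample):.3%}")
```

Under this reading, a density of 0.045% would correspond to the feature firing on roughly 4 or 5 of every 10,000 tokens.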