INDEX
Explanations
conversational exchanges and dialogues
Negative Logits
impactful    -0.75
tbh          -0.73
incentiv     -0.68
tasked       -0.66
idk          -0.66
emojis       -0.66
bestie       -0.66
relatable    -0.65
curated      -0.65
Idk          -0.65
Positive Logits
faßt         0.75
muß          0.73
lousy        0.59
müßte        0.55
Schluß       0.54
biß          0.52
wuß          0.50
quelquefois  0.49
mußte        0.49
daß          0.47
Activations Density 0.996%