Mäx
maxmynter.bsky.social
Numbers, data, computer, math, ML, and sociocultural Meta-commentary.

In previous lives: sociologist, physicist.

I shitpost as @maxmynter on the other app, but here I want to experiment with being more serious.

maxmynter.com
Is the concentration of money and power in AI (or generally tech) a problem? Yes.

But these datasets democratize development.
And we also don’t hate on conveyor belt workers in the auto industry just because the big cronies pocket the profits.

The librarian is the wrong target.
November 30, 2024 at 12:06 AM
I demand deletion of this dataset as you have not obtained the consent of the post's author, triple notarized in the presence of their legal guardian, a state lawyer, Ayn Rand, and god. Plus a declaration in lieu of oath that they will not revoke this consent.

I am shook to the core about your audacity
November 29, 2024 at 10:42 AM
I mean things change once you talk about commercial use. But the collection as such is fine if you comply with GDPR stuff about PII in the EU.
November 29, 2024 at 9:35 AM
The problem here is that PII means everything that can identify a person. So that would include the posts themselves if I can search for them on Bsky to find the author.

(Not my personal opinion, but that's the law if you go by the letter, so it’s legally risky to use for research.)
November 28, 2024 at 9:21 PM
Generally, the complexity of concepts a model can represent scales with depth, and the mass of knowledge with width.

So it’s possible the smaller model is a bit worse (and they just advertise it bc it has the biggest margin).

Another reason could be reproducibility for research.

But idk, tbh.
November 28, 2024 at 9:18 PM
Distillation probably.

They use the outputs of a big model as targets for a smaller one. That way you can make it behave the same way but with fewer parameters and thus lower inference cost.
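To make that concrete, here is a minimal sketch of the idea in plain Python. It is not any particular lab's recipe: the function names, the temperature of 2.0, and the toy logits are all assumptions for illustration. The teacher's softened output distribution serves as the target, and the loss is the student's cross-entropy against it.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; higher temperature gives softer targets."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's distribution against the teacher's soft targets."""
    p = softmax(teacher_logits, temperature)  # teacher output, used as the target
    q = softmax(student_logits, temperature)  # student prediction
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Hypothetical logits for a single input over three classes.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 4.0, 1.0]    # prefers a different class

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(loss_close < loss_far)
```

Training the small model means adjusting its parameters to push this loss down, so its outputs drift toward the big model's.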
November 28, 2024 at 9:02 PM
Personally, I think people should stfu about data that they published to the world to be scraped.

But under GDPR, everything that can be used to identify a person counts as PII. That includes pseudonyms and even the post itself if you can search for it to identify the author...
November 28, 2024 at 8:58 PM
I don’t think that consent makes the data worse.

There are other use-cases in the direction of computational sociology where consent is trickier.

Yet I do believe in the right to privacy. But I find it peculiar to blame someone for compiling posts that were made publicly for the world to see.
November 28, 2024 at 8:38 PM
I think I mainly just disagree. But it's fine. I'll just take your ad hominems.

I don't think that the Librarian did anything bad. I also don't think that is what many here took issue with.

They hate what they call "tech bros" and found someone they could torch.
November 28, 2024 at 7:49 PM
Yeah, it's exhausting! Unlike you, I am here under my real name.

I don't know why you gatekeep "real" research.

Have I complied with law and ethics board in my research? Yes.

Am I happy to see grassroots / citizen science initiatives? Also yes.

Is the behaviour of the mob legitimate? Fuck no.
November 28, 2024 at 7:32 PM
A librarian receiving death threats for doing librarian things from people with all the progressive icons in their bio. I still have to reconcile that with my worldview somehow.
November 28, 2024 at 7:12 PM
You know, it's hard to study radicalisation on the internet with only opt in data because racist trolls - believe it or not - usually don't consent.

I'm not accusing you of being one, just describing the research I did.

You can still stop with your condescending smugness.
November 28, 2024 at 7:11 PM
Thank you for playing, professor.
November 28, 2024 at 7:05 PM
Also, to cite Article 89: "Processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes, shall be subject to appropriate safeguards"

Therefore, it is in principle okay to collect this data for the above-mentioned purposes.
November 28, 2024 at 7:04 PM
I was a computational social scientist and worked with datasets like these more than five years ago.

I unequivocally think they are a net good to society and important for research. It is important that these datasets are made available easily to other researchers. Yes that includes AI research.
November 28, 2024 at 7:00 PM
The librarian was "just" harassed. The 2nd user (AlpinDale), who posted a link to a 2M-entry HF dataset, was banned.
November 28, 2024 at 6:57 PM
GDPR explicitly makes exceptions for datasets compiled for research purposes in Article 89.
November 28, 2024 at 6:51 PM
These data are important sources for (computational) social science. This scraping has been standard practice for more than a decade.

E.g. research on trajectories of online radicalisation and online polarisation via posts during the first Trump presidency.

These datasets are important!
November 28, 2024 at 6:47 PM
I thought Bluesky was a nicer place, but what I experienced here today almost makes me believe in horseshoe theory.
November 28, 2024 at 6:44 PM
Especially since that is effectively what I did five years ago and what has been standard across computational social science for more than a decade.

I have never before encountered this amount of hate from people I did not outright classify as right wing trolls.
November 28, 2024 at 6:41 PM
I mean I like that the EU has mechanisms for data protection in place. And it's definitely a conversation worth having if big companies and VCs earn a lot from tech that is built on uncompensated labor.

Despite that, I'm shocked at how the mob behaved toward that librarian.
November 28, 2024 at 6:40 PM