Working on MèngZǐ, the Federated Search Engine Alternative

Clocks [They/Them]@lemmy.ml · 3 months ago

Working on MèngZǐ, the Federated Search Engine Alternative

Gayhitler@lemmy.ml · 3 months ago

It’s 孟子 for anyone wondering

theneverfox@pawb.social · 3 months ago

I love the idea

I’m starting to look at Airflow for my own project. Not sure if you’ve heard of it or projects like it, but it seems like a great foundation for a scraper. I’m still evaluating options for that, but so far it’s my pick.

Hit me up if you get stuck or make a breakthrough. I’ve got a pretty good handle on ActivityPub and the Lemmy API, and your project would add a lot to mine.

Clocks [They/Them]@lemmy.ml · 3 months ago

What is Airflow? I was considering using spider.

utopiah@lemmy.ml · 3 months ago

any advice or suggestion, please do give

I haven’t built one so I can’t help as much… but I’d be honest from the start by comparing it, head-on, with alternatives (if I understood correctly), e.g. https://github.com/searxng/searxng, and simultaneously, because it’s federated, make it interoperable with them.

Boomkop3@reddthat.com · 3 months ago

Looking through their docs I can’t even find the word “federate”. From what I can tell it just refers to the part where they aggregate results.

Scott M. Stolz@loves.tech · 3 months ago

This sounds interesting. I would love to hear about how it could integrate with other platforms.

Clocks [They/Them]@lemmy.ml · 3 months ago

What platforms do you have in mind?

Nutomic@lemmy.ml · 3 months ago

This sounds like a very interesting idea. I agree that YaCy doesn’t work; when I checked it out years ago it was a completely bloated mess. Not sure how viable your idea is, because I’m not familiar with webrings, and not sure how the federation will work. Anyway, the main challenge for this project will be to actually give useful search results, both early on when there are very few crawlers, and also later once spammers try to abuse it.

Clocks [They/Them]@lemmy.ml · 3 months ago

What will abuse look like in your mind?

francisco_1844@discuss.online · edited 3 months ago

Some of the ways abuse can happen:

Crawling false data / misinformation on a topic
Putting info on search as part of a scam / spam campaign
Putting false news about events that are happening, or have not happened at all
Putting false information about a business competitor
Putting fake reviews about a product

Just a few that I can think of… Existing websites have these issues too, but what is different is how existing sites decide relevance and how often their algorithms weed out the bad content. In my opinion a distributed search engine will have a harder time combating those, and other potential forms of abuse, because there is less control over what gets scanned and there is an open policy on who can join the distributed scanning.

Clocks [They/Them]@lemmy.ml · 3 months ago

The first one I feel is a legitimate issue that I should brainstorm, but it is tricky to compute.

The rest seem to be something moderation may help with. But not directly solvable.

francisco_1844@discuss.online · 3 months ago

The rest seem to be something moderation may help

Who will moderate? If it is a distributed system and moderation is also distributed, bad actors can automate upvotes, or whatever means we use for moderation, to keep their bad content up.

Nutomic@lemmy.ml · 3 months ago

Mainly SEO spam with text copied from other sites and lots of ads/referral links to make the owner a profit. But after thinking about it more, those would be rather easy to filter based on ad code in the HTML.
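
A rough sketch of what such a filter could look like in Python, using only the standard library. Everything here is an assumption for illustration: the two ad hosts are just well-known examples, the `rel="sponsored"` check and the `max_signals` threshold are arbitrary, and a real crawler would need a much longer, maintained list.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Illustrative only: a real filter would use a maintained blocklist.
AD_HOSTS = {"googlesyndication.com", "doubleclick.net"}


class AdSignalCounter(HTMLParser):
    """Counts ad-related scripts/iframes and sponsored links in a page."""

    def __init__(self):
        super().__init__()
        self.ad_embeds = 0
        self.sponsored_links = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "iframe"):
            # Attribute values can be None, hence the `or ""` fallbacks.
            host = urlparse(attrs.get("src") or "").hostname or ""
            if any(host == h or host.endswith("." + h) for h in AD_HOSTS):
                self.ad_embeds += 1
        elif tag == "a":
            rel = (attrs.get("rel") or "").lower().split()
            if "sponsored" in rel:
                self.sponsored_links += 1


def looks_like_ad_spam(html, max_signals=3):
    """Hypothetical heuristic: flag pages with more than max_signals ad signals."""
    counter = AdSignalCounter()
    counter.feed(html)
    return counter.ad_embeds + counter.sponsored_links > max_signals
```

This only catches the lazy cases; spammers who self-host their ad code or hide referral links behind redirects would get through.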

A much bigger challenge will be the ranking of search results. When you search for a term and there are 100 pages in the index that contain it, which of these pages should be shown first? Google developed PageRank when they started out, so that might be a good starting point for further research.
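
To give an idea of the basic approach, here is a minimal PageRank sketch in Python using plain power iteration. It is a toy, not what Google runs: the `links` dictionary format, the damping factor of 0.85 and the fixed iteration count are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank.

    links: dict mapping a page to the list of pages it links to.
    Returns a dict mapping each page to its rank (ranks sum to 1).
    """
    # Collect every page, including ones that only appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


example = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(example)
best_first = sorted(ranks, key=ranks.get, reverse=True)
```

In this example `c` ends up on top because three pages link to it, and `d` ends up last because nothing links to it. In a federated setting the hard part is that link farms can inflate these scores, so the raw rank would probably need to be combined with some notion of trust.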

dohpaz42@lemmy.world · 3 months ago

An API that developers could use to integrate search in their projects would be nice. And that would also allow developers to create an app ecosystem.

This sounds very interesting, and I can’t wait to see what comes of it.

waldek@lemmy.86thumbs.net · 3 months ago

Did you ever stumble upon YaCy? https://github.com/yacy The website seems to be down, but the general direction of the project might be up your alley.

Christian@lemmy.ml · edited 3 months ago

I remember trying YaCy over ten years ago and I really wanted to like it, but it was functionally useless. The results I was getting would be unrelated to my query almost every time. I checked back periodically over the years but that never seemed to improve much.

It’s the first thing I thought of too though.

waldek@lemmy.86thumbs.net · 3 months ago

Honestly, I was the same. I tried to run it multiple times but the results were never good enough to make the switch. Just wanted to point out its existence to OP.

Clocks [They/Them]@lemmy.ml · 3 months ago

Do check my original blog post, it explains the issues with YaCy to some extent.

LinuxReviews also has a good article on it. https://linuxreviews.org/YaCy

rcbrk@lemmy.ml · edited 3 months ago

Really interesting proposal! To a degree the structure of Lemmy/Mbin/etc may be quite close to the categorising and moderating aspect, and might be a good place to start collecting URLs to crawl.

Each community could be considered analogous to a (rather chaotic) webring. When an instance doesn’t meet your moderation expectation, defederate; if a MengZi user wants to see search results from different defederated segments, use a MengZi instance that federates with both, or just have both plugged into a searx instance.

The categorising side of MengZi could be (from an ActivityPub perspective) like a very cut-down version of Lemmy: each webring/category being a community, each website being a post, comments disabled or limited/filtered to hashtags.

A webring could be a specific sort of category/community, where the page at a submitted website’s URL must contain specific metadata defining its membership in that ring, or it is automoderated and removed. Such a category could automoderate the URL and title to be the default page defined by its membership metadata. Existing webring HTML element standards could suffice.

A website could be crossposted to other categories, including to other instances, even to/from lemmy or other compatible activitypub sites. If a (cross)posted post is not a url returning the correct mime type for a category then it can be automoderated and deleted; same for other arbitrary criteria a category could define.
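
The MIME type rule could be as simple as the sketch below (Python). The function name, the `"keep"`/`"remove"` return values and the idea of each category carrying a set of allowed types are all hypothetical; the input is just the `Content-Type` header the crawler got back for the posted URL.

```python
def automoderate(content_type_header, allowed_types):
    """Return "keep" if the response's media type is allowed in the category.

    content_type_header: raw Content-Type value, e.g. "text/html; charset=utf-8".
    allowed_types: set of lowercase media types the category accepts (assumed).
    """
    # Drop parameters such as charset and normalise case before comparing.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return "keep" if media_type in allowed_types else "remove"


html_only = {"text/html", "application/xhtml+xml"}
decision = automoderate("text/html; charset=utf-8", html_only)
```

Other arbitrary criteria could plug into the same place as extra checks that each return keep or remove.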

A website/post on MengZi could be accompanied by relevant crawling metadata, even full search database data available via the api for sharing to other MengZi instances to save duplication of crawling effort while distributing the database.

marcie (she/her)@lemmy.ml · 3 months ago

Webring integration would be very cool; so many queer people still use them.

Clocks [They/Them]@lemmy.ml · 3 months ago

The idea revolves around webrings, but I feel like you’re implying something different?

Avid Amoeba@lemmy.ca · 3 months ago

Keep us updated as you get more working.

Clocks [They/Them]@lemmy.ml · 3 months ago

🫡

Packet [none/use name]@hexbear.net · 3 months ago

Except Fascists. This software is made by socialists for socialist ideals.

Cinema

Clocks [They/Them]@lemmy.ml · 3 months ago

What does “cinema” mean?

NudeNewt@lemm.ee · 3 months ago

When used as a descriptor it’s meant to describe the person, place, or thing as positive in some form or fashion. Sometimes it’s used in a mocking manner; it’s an extremely contextual term. It’s typically used in a satirical manner, but in this case I believe they’re saying they agree, or maybe they just think it’s funny. It’s hard to tell, honestly.

It’s similar to the word “based”, very contextual, sometimes good sometimes bad.

Packet [none/use name]@hexbear.net · 3 months ago

I use it in the sense of “cinematic”, as in a positive way.


Clocks / MèngZǐ · GitLab