[seqfan] Re: Broken link hunt

Sun Aug 2 22:59:16 CEST 2020

Jean-Paul,
I agree that it is a very good idea (IIRC it had already been suggested on
this list...(?))
to "save" the page on web.archive.org whenever contributors add a
non-persistent link to the OEIS.
(But I think it is not necessary to put the archive.org link in addition to
the original: that can easily be done in an automatic manner when a link
breaks.)
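
A minimal sketch in Python of what "saving" a page at contribution time
could look like. It assumes the Wayback Machine's "Save Page Now" address
https://web.archive.org/save/<URL>; the function names are made up for
illustration, and I have not checked what request limits archive.org applies.

```python
import urllib.request

# Assumed address prefixes of the Wayback Machine (not taken from OEIS tooling).
SAVE_PREFIX = "https://web.archive.org/save/"
VIEW_PREFIX = "https://web.archive.org/web/"


def save_url(url: str) -> str:
    """Build the address that asks the archive to capture `url`."""
    return SAVE_PREFIX + url


def snapshot_url(url: str, stamp: str) -> str:
    """Build the address of an archived copy near the date `stamp` (YYYYMMDD)."""
    return f"{VIEW_PREFIX}{stamp}/{url}"


def archive_now(url: str, timeout: float = 60.0) -> int:
    """Request a capture of `url` and return the HTTP status code.

    Needs network access; it is only defined here, not called.
    """
    with urllib.request.urlopen(save_url(url), timeout=timeout) as resp:
        return resp.status
```

Only archive_now() touches the network; the archived link itself can be
built later with snapshot_url(), when the original breaks.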

I don't think web.archive.org would object to this "contribution" through
OEIS (or even consider it a "huge number" of links); on the
contrary,
I think they would prefer getting such "interesting" links as opposed to
all those they harvest by crawling recursively over the whole internet.
Also consider projects that are much larger than OEIS, such as Wikipedia,
which nowadays has the policy of adding a web.archive.org link
for *all* of its "external links" (cf.
https://en.wikipedia.org/wiki/Wikipedia:Link_rot#Automatic_archiving).

Michel:
it is not necessary to know the A-numbers: one can search for the link, even
if it is not displayed as plain text when you look at the published
sequence. Also, note that there are often many pages on the same server,
and it is more efficient to replace all of them at once,
e.g., http://planetmath.org/encyclopedia/
can be globally replaced by
https://web.archive.org/web/20140101/http://planetmath.org/encyclopedia/
to fix in one step all links to several pages on that web site, and
similarly for other web sites.
(But actually, for this example at hand, the find & replace should be:
http://planetmath.org/encyclopedia/(.*)\.html   =>  https://planetmath.org/\1
e.g., https://planetmath.org/BellNumber
Namely, both "/encyclopedia" and the ".html" extension must be deleted, and
the link should work again, unless the title changed.)
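
The two replacements above could be sketched in Python as follows. This is
only an illustration: the function names are made up, and the dot before
"html" is escaped so that it matches a literal dot.

```python
import re

# Old-style PlanetMath link; group 1 captures the page title.
OLD_LINK = re.compile(r"http://planetmath\.org/encyclopedia/(.*)\.html")


def modernize(url: str) -> str:
    """Drop "/encyclopedia" and ".html"; other URLs are returned unchanged."""
    return OLD_LINK.sub(r"https://planetmath.org/\1", url)


def to_wayback(url: str, stamp: str = "20140101") -> str:
    """Alternative fix: point the link at an archived copy near `stamp`."""
    return f"https://web.archive.org/web/{stamp}/{url}"


print(modernize("http://planetmath.org/encyclopedia/BellNumber.html"))
# prints https://planetmath.org/BellNumber
```

The first function only helps while the page still exists under the same
title; to_wayback() is the fallback when it does not.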

- Maximilian

On Sun, Aug 2, 2020 at 9:44 PM jean-paul allouche <
jean-paul.allouche at imj-prg.fr> wrote:

> Dear all
>
> The discussion about broken links gave me a --possibly stupid-- idea.
> As you know, archive.org saves (some) sites, but it also lets you save
> an existing site yourself. A possibility could be, when adding a link to
> the OEIS, to save it on archive.org and to also put that link on the OEIS.
> Of course this makes twice as much work, but this is certainly better than
> doing it for a huge number of links afterwards. The possible problem I see
> is whether archive.org would ultimately accept such a huge number of
> individual saves of links.
>
> jean-paul
>
>
> On 02/08/2020 at 21:33, Georg.Fischer wrote:
> > Hi Elijah,
> >
> > though it is very desirable, this is a Sisyphean task.
> >
> > In 2009 we repaired several hundred broken links,
> > and at the beginning of 2019 I made another attempt at
> > a big broken-link action, but abandoned it after weeks.
> > I still have a list of some 380 host addresses which
> > are not accessible (many of the host/~name URLs are
> > endangered).
> >
> > What I can provide rather easily is a complete list
> > of all URLs referred to in the OEIS and pointing to
> > some site outside. Then you would not need to crawl
> > on the OEIS server (and cause load on it), but simply
> > check links to the outside world (you must obey robots.txt
> > rules - some sites block if you don't).
> >
> > The problem is not so much to detect the broken links, but
> > - to decide whether there is already a replacement link
> >   in parallel, and
> > - to find out whether
> >   . there is a simple replacement and the old link is obsolete,
> >   . there is some replacement, but the old link should be kept,
> >   . there is a replacement in the Internet Archive (Wayback Machine),
> >   . no replacement can be found,
> > - and then to edit the replacements in the many sequences
> >   which have the same broken link.
> >
> > We did the latter from time to time in specific cases,
> > often with the aid of the author (target link owner).
> > In all cases we should ask or involve Neil before
> > we attempt big repair actions. The editing work should
> > be split among several editors, and we need a workflow
> > control mechanism.
> >
> > Best regards - Georg
> >
> >
> > On 02.08.2020 at 20:41, michel.marcus at free.fr wrote:
> >> I think your file.txt should have the OEIS link line, to be able to
> >> search for the link title on the web.
> >> And the A_number, to know where the corrected URL must be entered.
> >> Best.
> >> MM
> >>
> >> ----- Original message -----
> >>
> >> From: "Elijah Beregovsky" <elijah.beregovsky at gmail.com>
> >> To: "Sequence Fanatics Discussion list" <seqfan at list.seqfan.eu>
> >> Sent: Sunday, 2 August 2020 18:14:27
> >> Subject: [seqfan] Broken link hunt
> >>
> >> Hi, Seqfans!
> >> Everyone knows that there are loads of rotten links in the OEIS. For the
> >> past couple of days I've been trying to locate and fix as many as I can.
> >> But then my father suggested I automate this process, so I did exactly
> >> that. I made a (not very sophisticated) crawler that finds and stores in
> >> a file all links throwing Error 404.
> >> (https://github.com/BIGfoot496/OEIS-crawler) After approximately an hour
> >> of searching it returned a file with over a hundred links (in attachment).
> >> That's definitely not all of the dead links and I'm going to run the code
> >> for a much longer time, but this is already too much work for me to do it
> >> alone. Let's fix them!
> >> Elijah
> >>
> >> PS: I wouldn't reject coding help, because the crawler isn't nearly
> >> optimal yet. It only catches 404s and slows down significantly after
> >> working for some time.
> >>
> >>
> >
>
>
> --
> Seqfan Mailing list - http://list.seqfan.eu/
>