python - Pdf to txt from http request -




I have a set of links to PDF files, for example:

https://www.duo.uio.no/bitstream/10852/9012/1/oppgave-2003-10-30.pdf

Some of them are restricted, meaning I won't be able to access the PDF file, while others go straight to the PDF file itself, like the link above.

I'm using the requests package (Python) to access the files, but there are far too many files for me to download, and I don't want the files in PDF anyway.

What I would like to do is go to each link, check whether the link is a PDF file, download the file (if necessary), turn it into a txt file, and delete the original PDF file.

I have a shell script that converts PDF to txt. Is it possible to run a shell script from Python?

Kieran Bristow has already answered the part of the question about how to run an external program from Python.
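For completeness, here is a minimal sketch of that step using the standard `subprocess` module. The function name `convert_and_remove` and the script name `./pdf2txt.sh` are hypothetical, and the sketch assumes the converter takes an input path and an output path as its two arguments; adjust it to whatever your script actually expects.

```python
import os
import subprocess


def convert_and_remove(pdf_path, converter_cmd):
    """Run an external converter on pdf_path, then delete the PDF.

    converter_cmd is a list such as ["./pdf2txt.sh"] (hypothetical name).
    Assumption: the converter is called as <cmd> <input.pdf> <output.txt>.
    """
    txt_path = os.path.splitext(pdf_path)[0] + ".txt"
    # check=True raises CalledProcessError on a non-zero exit status,
    # so the PDF is only removed after a successful conversion.
    subprocess.run(converter_cmd + [pdf_path, txt_path], check=True)
    os.remove(pdf_path)
    return txt_path
```

You would call it as `convert_and_remove("oppgave-2003-10-30.pdf", ["./pdf2txt.sh"])`. Passing the command as a list avoids going through a shell, so file names with spaces need no quoting.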

The other part of the question is about selectively downloading documents by checking whether the resource is a PDF document. Unless the remote server offers alternate representations of its documents (e.g. a text version), you will need to download the documents. To avoid downloading non-PDF documents, you can send an initial HEAD request and look at the reply headers to determine the content-type, like this:

import os.path

import requests

session = requests.Session()

for url in [
        'https://www.duo.uio.no/bitstream/10852/9012/1/oppgave-2003-10-30.pdf',
        'https://www.duo.uio.no/bitstream/10852abcd/90121023/1234/oppgave-2003-10-30.pdf']:
    try:
        # HEAD fetches only the headers, so nothing is downloaded yet.
        resp = session.head(url, allow_redirects=True)
        resp.raise_for_status()
        if resp.headers.get('content-type') == 'application/pdf':
            # Only now fetch the body and write it to the current directory.
            resp = session.get(url)
            if resp.ok:
                with open(os.path.basename(url), 'wb') as outfile:
                    outfile.write(resp.content)
                print("Saved {} to file {}".format(url, os.path.basename(url)))
            else:
                print('GET request for URL {} failed with HTTP status "{} {}"'.format(
                    url, resp.status_code, resp.reason))
    except requests.HTTPError as exc:
        print("HEAD failed for URL {} : {}".format(url, exc))
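One caveat with the exact comparison against 'application/pdf': a server may send the header with different casing or with extra parameters after a semicolon. A small helper along these lines (the name `is_pdf` is hypothetical, not part of requests) makes the check more tolerant; treat it as a sketch rather than a complete media-type parser.

```python
def is_pdf(content_type):
    """Return True if a Content-Type header value denotes a PDF."""
    if not content_type:
        # Header missing or empty: we cannot tell, so treat it as not a PDF.
        return False
    # Drop parameters such as "; charset=binary" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/pdf"
```

In the loop above it would be used as `if is_pdf(resp.headers.get('content-type')):`.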

