python - PDF to txt from HTTP request




I have a set of links to PDF files, such as:

https://www.duo.uio.no/bitstream/10852/9012/1/oppgave-2003-10-30.pdf

Some of them are restricted, meaning I won't be able to access the PDF file, while others go straight to the PDF file itself, like the link above.

I'm using the requests package (Python) to access the files, but there are far too many files for me to download, and I don't want the files in PDF format.

What I would like to do is go to each link, check whether the link is a PDF file, download that file (if necessary), turn it into a txt file, and delete the original PDF file.

I have a shell script for a PDF-to-txt converter. Is it possible to run a shell script from Python?

Kieran Bristow has answered the part of your question about how to run an external programme from Python.
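For completeness, here is a minimal sketch of that step using the standard library's subprocess module (Python 3.5+). The function name `convert_and_delete` and the script name `pdf2txt.sh` are hypothetical, and the sketch assumes the converter takes the input PDF path and the output txt path as its two arguments; adjust it to whatever your script actually expects.

```python
import os
import subprocess


def convert_and_delete(pdf_path, converter_cmd=("sh", "pdf2txt.sh")):
    """Run an external converter on pdf_path, then delete the PDF.

    converter_cmd is a hypothetical command prefix; it is assumed to accept
    the input PDF path and the output txt path as its last two arguments.
    """
    txt_path = os.path.splitext(pdf_path)[0] + ".txt"
    # An argument list (no shell=True) avoids quoting problems with odd file
    # names; check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(list(converter_cmd) + [pdf_path, txt_path], check=True)
    # Only reached when the converter reported success.
    os.remove(pdf_path)
    return txt_path
```

Because the PDF is removed only after the converter exits successfully, a failed conversion leaves the original file in place so it can be retried.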

The other part of your question is about selectively downloading documents by checking whether the resource is a PDF document. Unless the remote server offers alternate representations of its documents (e.g. a text version), you will need to download the documents. To avoid downloading non-PDF documents, you can send an initial HEAD request and look at the reply headers to determine the content-type, like this:

    import os.path

    import requests

    session = requests.Session()

    for url in [
            'https://www.duo.uio.no/bitstream/10852/9012/1/oppgave-2003-10-30.pdf',
            'https://www.duo.uio.no/bitstream/10852abcd/90121023/1234/oppgave-2003-10-30.pdf']:
        try:
            # HEAD fetches only the headers, so nothing is downloaded yet.
            resp = session.head(url, allow_redirects=True)
            resp.raise_for_status()
            if resp.headers['content-type'] == 'application/pdf':
                # The resource is a PDF, so fetch the body and save it.
                resp = session.get(url)
                if resp.ok:
                    with open(os.path.basename(url), 'wb') as outfile:
                        outfile.write(resp.content)
                    print("saved {} to file {}".format(url, os.path.basename(url)))
                else:
                    print('GET request for url {} failed with HTTP status "{} {}"'.format(
                        url, resp.status_code, resp.reason))
        except requests.HTTPError as exc:
            print("HEAD failed for url {} : {}".format(url, exc))
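One caveat with the exact string comparison in the loop above: a server may send the header with parameters or different capitalisation (for example `application/pdf; charset=binary`), in which case the equality test would skip a genuine PDF. A small helper like the following sketch is more forgiving. The name `is_pdf_content_type` is made up for illustration and is not part of requests.

```python
def is_pdf_content_type(header_value):
    """Return True if a Content-Type header value denotes a PDF.

    Hypothetical helper: it ignores parameters after ';' and letter case.
    """
    if not header_value:
        # Header missing or empty: treat the resource as not a PDF.
        return False
    # Keep only the media type, dropping parameters such as "; charset=...".
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type == "application/pdf"
```

In the loop, the check could then read `if is_pdf_content_type(resp.headers.get('content-type')):`, which also avoids a KeyError when the server sends no Content-Type header at all.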

python shell http pdf converter
