python - PDF to txt from HTTP request
I have a set of links to PDF files, such as:
https://www.duo.uio.no/bitstream/10852/9012/1/oppgave-2003-10-30.pdf
Some of them are restricted, meaning I won't be able to access the PDF file, while others go straight to the PDF file itself, like the link above.
I'm using the requests library (Python) to access the files, but there are far too many files for me to download by hand, and I don't want the files in PDF format.
What I would like to do is go to each link, check whether the link is a PDF file, download the file (if necessary), convert it to a txt file, and delete the original PDF file.
I have a shell script that converts PDF to txt. Is it possible to run a shell script from Python?
Kieran Bristow has answered the part of your question about how to run an external program from Python.
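For completeness, here is a minimal sketch of calling an external converter with the standard subprocess module. The script name ./pdf2txt.sh and its calling convention (input PDF path, then output txt path) are assumptions made for illustration; the function name convert_pdf_to_txt is hypothetical as well, so adjust both to match your actual script.

```python
import os
import subprocess


def convert_pdf_to_txt(pdf_path, converter=("./pdf2txt.sh",)):
    """Run an external converter as: <converter...> <pdf_path> <txt_path>.

    The default converter name is a placeholder (assumption); pass the
    command for your own script as a tuple or list of arguments.
    """
    txt_path = os.path.splitext(pdf_path)[0] + ".txt"
    # check=True raises subprocess.CalledProcessError if the converter
    # exits with a non-zero status, so failures are not silently ignored.
    subprocess.run(list(converter) + [pdf_path, txt_path], check=True)
    return txt_path


# Example call (assumes ./pdf2txt.sh exists and is executable):
# txt_file = convert_pdf_to_txt("oppgave-2003-10-30.pdf")
```

Passing the command as a list of arguments, rather than as one string with shell=True, avoids quoting problems when a file name contains spaces.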
The other part of your question is about selectively downloading documents by checking whether the resource is a PDF document. Unless the remote server offers alternate representations of your documents (e.g. a text version), you will need to download the documents. To avoid downloading non-PDF documents, you can send an initial HEAD request and look at the reply headers to determine the content type, like this:
    import os.path

    import requests

    session = requests.Session()

    for url in [
            'https://www.duo.uio.no/bitstream/10852/9012/1/oppgave-2003-10-30.pdf',
            'https://www.duo.uio.no/bitstream/10852abcd/90121023/1234/oppgave-2003-10-30.pdf']:
        try:
            # HEAD returns only the headers, so nothing is downloaded yet.
            resp = session.head(url, allow_redirects=True)
            resp.raise_for_status()
            if resp.headers.get('content-type') == 'application/pdf':
                # The resource is a PDF, so fetch the body and save it.
                resp = session.get(url)
                if resp.ok:
                    with open(os.path.basename(url), 'wb') as outfile:
                        outfile.write(resp.content)
                    print("Saved {} to file {}".format(url, os.path.basename(url)))
                else:
                    print('GET request for URL {} failed with HTTP status "{} {}"'.format(
                        url, resp.status_code, resp.reason))
        except requests.HTTPError as exc:
            print("HEAD failed for URL {} : {}".format(url, exc))
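Two details are worth sketching separately. Some servers send a Content-Type with parameters (for example "application/pdf; charset=binary"), which an exact string comparison would miss, and the question also asks for the PDF to be deleted after conversion. The helper names is_pdf and convert_then_delete below are hypothetical, and convert stands for whatever function wraps your converter script; treat this as a rough sketch rather than a tested recipe for your server.

```python
import os


def is_pdf(content_type):
    """Return True if a Content-Type header value denotes a PDF.

    Parameters after ';' and letter case are ignored; a missing header
    (None) is treated as "not a PDF".
    """
    media_type = (content_type or "").split(";")[0].strip().lower()
    return media_type == "application/pdf"


def convert_then_delete(pdf_path, convert):
    """Call convert(pdf_path), which must return the txt path, then delete the PDF.

    If convert raises an exception, os.remove is never reached, so the
    PDF stays on disk and the conversion can be retried.
    """
    txt_path = convert(pdf_path)
    os.remove(pdf_path)
    return txt_path
```

In the download loop, is_pdf(resp.headers.get('content-type')) could replace the exact comparison, and convert_then_delete could be called right after the file has been written.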
python shell http pdf converter