I'm trying to integrate the Yandex OCR translator tool into my code. Using Burp Suite, I found that the following request is the one used to send the image:

[Image: request photo]

I'm trying to emulate this request with the following code:

import requests
from requests_toolbelt import MultipartEncoder
files={
    'file':("blob",open("image_path", 'rb'),"image/jpeg")
    }

#(<filename>, <file object>, <content type>, <per-part headers>)
burp0_url = "https://translate.yandex.net:443/ocr/v1.1/recognize?srv=tr-image&sid=9b58493f.5c781bd4.7215c0a0&lang=en%2Cru"


m = MultipartEncoder(files, boundary='-----------------------------7652580604126525371226493196')

burp0_headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0", "Accept": "*/*", "Accept-Language": "en-US,en;q=0.5", "Accept-Encoding": "gzip, deflate", "Referer": "https://translate.yandex.com/", "Content-Type": "multipart/form-data; boundary=-----------------------------7652580604126525371226493196", "Origin": "https://translate.yandex.com", "DNT": "1", "Connection": "close"}

print(requests.post(burp0_url, headers=burp0_headers, files=m.to_string()).text)

Sadly, it yields the following output:

{"error":"BadArgument","description":"Bad argument: file"}

Does anyone know how this could be solved?

Many thanks in advance!

  • I would not reproduce the boundary for the multipart upload here even. I certainly would not reproduce every single header either. Commented Feb 28, 2019 at 20:40
  • @MartijnPieters Thank you very much for your quick reply. How would you reproduce it then? Commented Feb 28, 2019 at 20:44

1 Answer

You are passing the MultipartEncoder.to_string() result to the files parameter. That asks requests to multipart-encode data that the encoder has already multipart-encoded, which is one encoding too many.
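To see why no separate encoder is needed, here is a minimal sketch that prepares a request without sending it and inspects what requests builds from a files mapping. The URL, the fake image bytes, and the helper name build_upload are placeholders for illustration only, not part of the Yandex API.

```python
import io

import requests


def build_upload(url, image_bytes):
    """Prepare (but do not send) a multipart upload, so it can be inspected."""
    # Same tuple layout as in the question: (filename, file object, content type).
    files = {"file": ("blob", io.BytesIO(image_bytes), "image/jpeg")}
    return requests.Request("POST", url, files=files).prepare()


# Placeholder URL and payload; nothing goes over the network here.
prepared = build_upload("https://example.invalid/upload", b"fake jpeg bytes")

# requests picks the boundary itself and writes it into the Content-Type header.
content_type = prepared.headers["Content-Type"]
boundary = content_type.split("boundary=", 1)[1]

print(content_type)
print(boundary.encode("ascii") in prepared.body)
```

The boundary in the header and the one in the body come from the same place, so they always agree. That is the main reason not to hard-code a Content-Type with a boundary copied from a captured request.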

You don't need to replicate every byte here. Just post the file, and perhaps set the user agent, referer, and origin:

import requests

files = {
    'file': ("blob", open("image_path", 'rb'), "image/jpeg")
}

url = "https://translate.yandex.net:443/ocr/v1.1/recognize?srv=tr-image&sid=9b58493f.5c781bd4.7215c0a0&lang=en%2Cru"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0", 
    "Referer": "https://translate.yandex.com/",
    "Origin": "https://translate.yandex.com",
}

response = requests.post(url, headers=headers, files=files)
print(response.status_code)  # the attribute is status_code, not status
print(response.json())

The Connection header is best left to requests; it can control when a connection should be kept alive just fine. The Accept* headers are there to tell the server what your client can handle, and requests sets Accept and Accept-Encoding automatically too.
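As a quick check of which headers requests supplies by default, the sketch below prints them and then prepares a request with only Referer and Origin added. The example.invalid URL is a placeholder, and nothing is sent.

```python
import requests

# Headers that requests adds to every request unless you override them.
defaults = requests.utils.default_headers()
for name, value in defaults.items():
    print(f"{name}: {value}")

# Only the headers requests would not know about need to be passed explicitly.
custom = {
    "Referer": "https://translate.yandex.com/",
    "Origin": "https://translate.yandex.com",
}

# Preparing through a Session merges the defaults with the custom headers.
session = requests.Session()
prepared = session.prepare_request(
    requests.Request("GET", "https://example.invalid/", headers=custom)
)
print(dict(prepared.headers))
```

The merged result contains User-Agent, Accept, Accept-Encoding, and Connection from the defaults plus the two custom entries. Accept-Language is not among the defaults, so add it yourself if a server turns out to require it.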

I get a 200 OK response with that code:

200
{'data': {'blocks': []}, 'status': 'success'}

However, if you don't set additional headers (remove the headers=headers argument), the request also works, so Yandex doesn't appear to be filtering for robots here.
