AWS Lambda (python) for Batch processing of wav files from S3 to S3 #20

ted-pvh · 2024-09-30T16:28:58Z


Hello all, I'm trying to build an AWS solution to bulk-process 10,000 files each day.
As a first step I'm developing a single S3 object processor in AWS Lambda (Python).
(I think the appropriate choreography architecture later on might involve AWS Step Functions Distributed Map, but I'm not there yet.)
I've had some struggles and some successes.
I can't use the speechmatics-python library in AWS Lambda because it is too big when I package it as a layer (300 MB > 250 MB limit).
As Requests is no longer part of the base AWS Lambda runtime, I tried urllib.request and http.client, but I couldn't initially get multipart/form-data working (Speechmatics has since sent me a draft way to do this).
I did finally manage to package Requests as a layer, and that is now working (hooray).
I'm now getting this error:
"message": "Error in sending notification: unable to send notification: Response status: 403, retrying",
I'm using pre-signed URLs for both the s3.get_object and s3.put_object operations.
The transcript shows up successfully in the Speechmatics Portal when I direct the message to that server,
so I think the read (GET) from S3 is working.
But the resulting file never shows up in my S3 bucket, whether it comes from the Speechmatics server or from my own Virtual Appliance.
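
For anyone stuck on the same multipart/form-data step without Requests, here is a minimal sketch of building the body with the standard library only. It is not the draft Speechmatics sent; the helper name `encode_multipart` and the field names in the usage note are assumptions for illustration.

```python
import uuid


def encode_multipart(fields, files):
    """Build a multipart/form-data body using only the standard library.

    fields: dict mapping form field name -> str value
    files:  dict mapping form field name -> (filename, bytes, mime_type)
    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex  # random boundary, unlikely to occur in the payload
    lines = []
    for name, value in fields.items():
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",  # blank line separates part headers from part content
            value.encode("utf-8"),
        ]
    for name, (filename, data, mime) in files.items():
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"'.encode(),
            f"Content-Type: {mime}".encode(),
            b"",
            data,
        ]
    # Closing delimiter, followed by a final CRLF
    lines += [f"--{boundary}--".encode(), b""]
    body = b"\r\n".join(lines)
    return body, f"multipart/form-data; boundary={boundary}"
```

The returned pair can be passed to `urllib.request.Request(url, data=body, headers={"Content-Type": content_type}, method="POST")`. Which form fields the job-submission endpoint expects (for example a JSON config field and an audio file field) should be taken from the Speechmatics API documentation rather than from this sketch.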

It would be really great if there was a reference AWS Lambda that did the base case.
I'd be happy to work on this with anyone.
Thanks in advance - Ted

ted-pvh · 2024-09-30T18:48:16Z


Some additional context...
As you can imagine, the Speechmatics server needs permission (authority) to write to S3.
A draft pre-signed URL is created approximately like this:

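        import logging
        from botocore.exceptions import ClientError

        try:
            # Sign a PUT for exactly this bucket, key, and Content-Type;
            # the eventual upload has to match all three.
            presigned_url_out = s3_client.generate_presigned_url(
                ClientMethod="put_object",
                Params={"Bucket": s3_bucket_name, "Key": key_out, "ContentType": "application/json"},
                ExpiresIn=expiration,  # lifetime of the URL in seconds
                HttpMethod="PUT",
            )
        except ClientError as e:
            logging.error(e)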

What seems to be relevant:
the pre-signed URL apparently has to match the eventual write from the Speechmatics server.
But since I don't have the details of that write, I'm not sure what the HttpMethod (should it be POST instead of PUT?), ContentType, etc. should be.
Has anyone had success with Speechmatics writing to S3 using a pre-signed URL?
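
One way to narrow this down is to test the pre-signed URL yourself before handing it to Speechmatics. As far as I know, a URL generated with `ClientMethod="put_object"` is signed for an HTTP PUT (POST uploads use the separate `generate_presigned_post` call), and when `ContentType` is included in `Params` the upload has to send the same `Content-Type` header. The sketch below builds such a request with the standard library; the helper names are hypothetical, and it says nothing about what the Speechmatics server actually sends.

```python
import urllib.request


def build_presigned_put(presigned_url, body, content_type="application/json"):
    """Build a PUT request whose method and Content-Type match what was signed."""
    return urllib.request.Request(
        presigned_url,
        data=body,
        method="PUT",
        headers={"Content-Type": content_type},
    )


def upload(presigned_url, body, content_type="application/json"):
    """Send the request; raises urllib.error.HTTPError (e.g. 403) if S3 rejects it."""
    req = build_presigned_put(presigned_url, body, content_type)
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

If `upload(presigned_url_out, b"{}")` succeeds from your own machine but the Speechmatics write still fails with a 403, the URL itself is fine and the mismatch is in how the remote side uses it.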


ted-pvh · 2024-10-08T14:05:41Z


Some more information (for all):
The Speechmatics server appends the job_id and status (e.g. success) to the notification URL.
I think this invalidates the pre-signed URL.

I tried to work around this by appending a #fragment, which should cause everything after it to be ignored by the web server (a fragment is interpreted by the client and never sent to the server).
But it seems the Speechmatics server doesn't use the URL unmodified; it parses and rewrites it, dropping the #fragment suffix.
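
To illustrate the behaviour described above, here is a small sketch that models a server which parses the URL, appends query parameters, and re-serializes it without the fragment. This is only a hypothetical model of what I observed, not Speechmatics code; the URL, the signature value, and the parameter names are placeholders. My understanding is that a SigV4 query-string signature covers the query parameters, so any parameter that was not present at signing time would make the signature check fail.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def rewrite_like_a_server(url, extra_params):
    """Hypothetical model: parse the URL, append parameters, drop the fragment."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True) + list(extra_params.items())
    # The last element is the fragment; an empty string discards it
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))


def unsigned_params(url, signed_names):
    """Names of query parameters in url that were not part of the signed URL."""
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    return [name for name, _ in pairs if name not in signed_names]


# Placeholder URL: a real pre-signed URL carries more X-Amz-* parameters
signed_url = (
    "https://example-bucket.s3.amazonaws.com/out/a.json"
    "?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Signature=abc123#ignored"
)
signed_names = {name for name, _ in parse_qsl(urlsplit(signed_url).query)}

rewritten = rewrite_like_a_server(signed_url, {"job_id": "xyz", "status": "success"})
print(rewritten)
print(unsigned_params(rewritten, signed_names))
```

The fragment does not survive the rewrite, and the two appended parameters end up in the query string that S3 sees, which is consistent with the 403.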

I think Speechmatics' advice is to allow-list ('green-light') the Speechmatics source server IPs so that they can write to S3.
I haven't been able to get this to work either (frustrating).


ted-pvh · 2024-10-17T17:13:27Z


Also, does anyone monitor these topics? I'm not seeing much activity here.


ted-pvh · 2024-10-18T18:50:18Z


OK, so it seems confirmed that:
pre-signed URLs are invalidated when the Speechmatics server adds job_id and status as extra query parameters on the URL.
So the working method is:
use an unsigned URL and allow-list ('green-light') the IPs of the Speechmatics servers,
and not:
use a pre-signed URL.
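
For reference, a minimal sketch of what the allow-list approach could look like as an S3 bucket policy, built in Python. Everything specific here is an assumption: the bucket name, the key prefix, and the CIDR range (a documentation-only range) are placeholders, and the real Speechmatics source IPs have to come from Speechmatics. Your account's public-access settings may also affect whether such a policy is accepted.

```python
import json


def build_ip_allow_policy(bucket, prefix, cidrs):
    """Bucket policy allowing unsigned PutObject under a prefix from listed source IPs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowPutFromListedIPs",
                "Effect": "Allow",
                "Principal": "*",  # unsigned requests carry no AWS identity
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
                # Only requests from these source ranges are allowed
                "Condition": {"IpAddress": {"aws:SourceIp": list(cidrs)}},
            }
        ],
    }


# Placeholder values, not real Speechmatics addresses
policy = build_ip_allow_policy("my-transcripts-bucket", "transcripts/", ["203.0.113.0/24"])
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

The JSON string can then be applied with `s3_client.put_bucket_policy(Bucket="my-transcripts-bucket", Policy=policy_json)`. Restricting the `Resource` to a prefix keeps the unsigned write access as narrow as possible.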

