
Harnessing the Power of Scramjet Sequences for Streamlined Data Processing


29 February 2024

The Scramjet Cloud Platform (SCP) is a tool for developers who want to deploy, manage, and run their code in the cloud efficiently. Its central feature is the Scramjet Sequence, a ready-to-deploy package that bundles a user's source code with all of its dependencies. This article is a guide to creating, deploying, and managing Scramjet Sequences, built around a Python example that converts speech to text using AssemblyAI.

Understanding Scramjet Sequences

A Scramjet Sequence is essentially a container for your algorithm, encompassing the code and all necessary dependencies. It is designed to be executed on the Scramjet Cloud Platform, facilitating seamless data processing. Through the Scramjet Framework, Sequences gain the ability to asynchronously process data streams, enabling them to produce, consume, and transform data efficiently.

Preparing Your Sequence

To embark on this journey, let's explore the audio2text-input Python sample available on Scramjet's platform samples GitHub repository. This example demonstrates the integration of AssemblyAI for speech recognition within a Scramjet Sequence. To run this Sequence, you'll need your AssemblyAI token, which must be included when executing the start command:


si sequence start <Sequence-id> --args=[\"<AssemblyAI-token>\"]
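
The backslashes in that command only carry the double quotes past the shell; what si ends up receiving is the JSON array ["<AssemblyAI-token>"]. If you start Sequences from a script rather than by hand, serializing the arguments with json.dumps avoids the manual escaping. The sketch below is an illustration under assumptions: build_start_command is a hypothetical helper, not part of Scramjet's tooling, and it does nothing more than assemble the argument list for the command shown above.

```python
import json


def build_start_command(sequence_id, args):
    """Assemble the argv list for `si sequence start`.

    Hypothetical helper for illustration only. The Sequence arguments are
    serialized as a JSON array, mirroring the hand-escaped form used on
    the command line.
    """
    return ["si", "sequence", "start", sequence_id, "--args=" + json.dumps(list(args))]


if __name__ == "__main__":
    # Placeholder values; substitute your own Sequence id and AssemblyAI token.
    print(build_start_command("<Sequence-id>", ["<AssemblyAI-token>"]))
```

Passing the resulting list to subprocess.run(cmd, check=True) skips the shell entirely, so no quoting is needed.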

Retrieving the output, such as the transcribed text from an audio file, can be done using the Scramjet Interface (si) output command or the output API endpoint, further details of which can be found in the API reference.


si instance output <Instance-id>

Crafting Your Handler

The heart of a Scramjet Sequence lies in its ability to handle asynchronous data processing. This is achieved by defining an asynchronous run function that takes the context, the input stream, and any arguments passed at start (here, the AssemblyAI token), and processes them to produce the desired output.


import requests
import time
import json
from scramjet.streams import Stream


async def run(context, input, token):
    # Join the incoming audio chunks into a single bytes object.
    audio_file = await input.reduce(lambda a, b: a+b)
    base_url = "https://api.assemblyai.com/v2"

    headers = {
        "authorization": token
    }

    # Upload the audio; the response carries the URL of the stored file.
    response = requests.post(
        base_url + "/upload",
        headers=headers,
        data=audio_file
    )
    upload_url = response.json()["upload_url"]
    data = {
        "audio_url": upload_url
    }
    # Request a transcript of the uploaded file.
    url = base_url + "/transcript"
    response = requests.post(url, json=data, headers=headers)

    transcript_id = response.json()['id']
    polling_endpoint = f"https://api.assemblyai.com/v2/transcript/{transcript_id}"

    # Poll every 3 seconds until the transcript is ready or has failed.
    while True:
        transcription_result = requests.get(polling_endpoint, headers=headers).json()

        if transcription_result['status'] == 'completed':
            break

        elif transcription_result['status'] == 'error':
            raise RuntimeError(f"Transcription failed: {transcription_result['error']}")

        else:
            time.sleep(3)

    # Wrap the transcribed text in a stream and hand it to the Sequence output.
    return Stream.read_from(f"{transcription_result['text']} \n")

This code reads the input stream (chunks of binary audio data in WAV format) and joins the chunks into a single audio file with an asynchronous reduce call. The audio file is then uploaded to AssemblyAI's /upload endpoint with a POST request, the returned upload URL is submitted to the /transcript endpoint, and the transcript is polled every three seconds until its status is completed or error. Once the transcription is completed, the handler takes the transcribed text from transcription_result, wraps it in a stream with Stream.read_from, and returns it as the Sequence output. This output can be consumed through the Instance /output API endpoint, the Scramjet SDK, or si.
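
To make the two less obvious steps of the handler concrete (folding chunks into one payload, then polling for a result), here is a dependency-free sketch that runs without Scramjet or AssemblyAI. Everything in it is a stand-in: join_chunks imitates what input.reduce does for byte chunks, poll_until_done mirrors the handler's while loop, and fetch_status, interval, and max_attempts are hypothetical names introduced for the example. It differs from the handler in two ways: it waits with asyncio.sleep, which yields to the event loop where time.sleep(3) would block it, and it gives up after a fixed number of attempts.

```python
import asyncio


async def join_chunks(chunks):
    """Fold an async stream of byte chunks into a single bytes object."""
    audio = b""
    async for chunk in chunks:
        audio += chunk
    return audio


async def poll_until_done(fetch_status, interval=3.0, max_attempts=100):
    """Call fetch_status() until the job reports 'completed' or 'error'.

    fetch_status is a hypothetical async callable returning a dict with the
    same keys the handler reads: 'status', 'text', and 'error'.
    """
    for _ in range(max_attempts):
        result = await fetch_status()
        if result["status"] == "completed":
            return result["text"]
        if result["status"] == "error":
            raise RuntimeError(f"Transcription failed: {result['error']}")
        # Wait without blocking other coroutines on the event loop.
        await asyncio.sleep(interval)
    raise TimeoutError("Transcription did not finish in time")


async def demo():
    """Run both helpers against canned data instead of real services."""
    async def fake_chunks():
        for part in (b"RIFF", b"....", b"WAVE"):
            yield part

    replies = iter([
        {"status": "queued"},
        {"status": "processing"},
        {"status": "completed", "text": "hello world"},
    ])

    async def fake_status():
        return next(replies)

    audio = await join_chunks(fake_chunks())
    text = await poll_until_done(fake_status, interval=0)
    return audio, text


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a real handler, fetch_status would wrap the GET request to the polling endpoint; a fully non-blocking version would also need an asynchronous HTTP client in place of requests.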

Packaging Your Sequence

Before deployment, it's essential to package your Sequence into a .tar.gz file, incorporating the main.py script, any additional files, and dependencies. This process involves creating a requirements.txt file for Python dependencies, a package.json file for defining the Sequence, and organizing all necessary files into a coherent structure ready for packaging.

requirements.txt


scramjet-framework-py
requests
pyee
urllib3==1.26.6
pyOpenSSL

package.json


{
  "name": "audio2text-input",
  "version": "1.0.0",
  "main": "./main.py",
  "author": "Ray_Nawfal",
  "license": "GPL-3.0",
  "description": "Transcription of an audio file using AssemblyAI API.",
  "keywords": [
    "AudioToText",
    "Transcription",
    "AssemblyAI"
  ],
  "repository": {
    "type": "git",
    "url": "https://github.com/scramjetorg/platform-samples/tree/main/python/audio2text-input"
  },
  "engines": {
    "python3": "3.8.0"
  },
  "scripts": {
    "build": "mkdir -p dist/__pypackages__/ && pip3 install -t dist/__pypackages__/ -r requirements.txt && cp -t ./dist/ *.py *.json",
    "clean": "rm -rf ./dist"
  }
}

1. Create the __pypackages__ directory in the same directory as main.py:


mkdir __pypackages__

2. Install the dependencies into the __pypackages__ folder. If any of your packages include code written in C, install the dependencies on a Linux machine: Scramjet Cloud Platform runs on Linux, so compiled packages must be built for that operating system.


pip3 install -t __pypackages__ -r requirements.txt
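
Before packing, it can help to confirm that the folder contains everything the steps above produce. The following is a minimal sketch, assuming the layout described in this article (package.json, requirements.txt, and a populated __pypackages__ folder next to the file named in "main"). check_sequence_dir is a hypothetical helper, not a Scramjet tool, and the platform may apply checks of its own that are not reproduced here.

```python
import json
from pathlib import Path


def check_sequence_dir(path):
    """Return a list of problems found in a Sequence folder (empty if none)."""
    root = Path(path)
    problems = []

    # Files this article asks for alongside the main script.
    for name in ("package.json", "requirements.txt"):
        if not (root / name).is_file():
            problems.append(f"missing {name}")

    # Dependencies are expected to be installed into __pypackages__.
    packages = root / "__pypackages__"
    if not packages.is_dir() or not any(packages.iterdir()):
        problems.append("__pypackages__ is missing or empty")

    # The "main" entry in package.json should point at an existing file.
    manifest = root / "package.json"
    if manifest.is_file():
        main = json.loads(manifest.read_text()).get("main")
        if not main or not (root / main).is_file():
            problems.append(f"package.json 'main' does not point to a file: {main!r}")

    return problems


if __name__ == "__main__":
    # Check the current directory and print one problem per line.
    for problem in check_sequence_dir("."):
        print(problem)
```

An empty list means the folder matches the layout described here; it says nothing about whether the code itself runs.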

A Sequence can be packed manually into a tar.gz file from the command line before it is sent to the Scramjet Cloud Platform:


si sequence pack <path/to/Sequence/folder>
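
To see what such an archive contains, the sketch below builds a gzip-compressed tar file with Python's standard tarfile module and lists its members. It is an illustration, not a description of how si sequence pack works internally: pack_sequence and list_archive are hypothetical helpers, and storing files relative to the folder root is an assumption about the expected layout.

```python
import tarfile
from pathlib import Path


def pack_sequence(folder, archive_path):
    """Write the contents of `folder` into a .tar.gz archive.

    Paths are stored relative to the folder root (an assumption).
    Write the archive outside `folder`, otherwise it would be picked up
    by the directory walk while it is still being written.
    """
    folder = Path(folder)
    with tarfile.open(archive_path, "w:gz") as archive:
        for item in sorted(folder.rglob("*")):
            # recursive=False: every entry is added exactly once by the walk.
            archive.add(item, arcname=item.relative_to(folder).as_posix(), recursive=False)
    return archive_path


def list_archive(archive_path):
    """Return the member names stored in the archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()
```

Listing the members of a packed Sequence is a quick way to spot a missing main.py or an empty __pypackages__ folder before sending the file.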

Deploying on Scramjet Cloud Platform

Deployment involves packing your Sequence and sending it to the Scramjet Cloud Platform. You can run these steps one at a time, or use the si sequence deploy command, which packs, sends, and starts the Sequence in one go. Detailed steps for packaging and deployment are available in the CLI reference. To send an already packed Sequence manually:


si sequence send <path/to/filename.tar.gz> --progress

Monitoring and Logs

Once deployed, monitoring your Sequence's performance and output is crucial. The Scramjet Cloud Platform provides various methods to access logs, including through the SCP Console Panel, the logging library within your Python script, or API endpoints for stdout, stderr, and output logs. These tools offer insights into the execution and performance of your Sequence, enabling you to debug and optimize as necessary.
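
As one way to use the logging library mentioned above, the sketch below configures a standard-library logger that writes timestamped records to stderr. It relies on an assumption: that whatever the script writes to stderr is what the stderr endpoint exposes. The make_logger helper and the logger name audio2text are hypothetical, chosen for this example.

```python
import logging
import sys


def make_logger(name="audio2text", stream=None):
    """Return a logger that writes timestamped records to stderr by default."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records from being emitted twice via the root logger
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = make_logger()
    log.info("uploaded %d bytes", 1024)
    log.warning("transcript still processing")
```

Logging the size of the uploaded audio and each polling status makes it easier to tell a slow transcription from a stuck one.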

Utilizing Events for Enhanced Interactivity

Scramjet Sequences are not just about processing data; they're also about interaction. Sequences can communicate through Topics, or trigger and respond to Events asynchronously. This capability allows for complex workflows and interactions between different Sequences, enhancing the platform's flexibility and extensibility.

Final thoughts

Scramjet Sequences offer a powerful, flexible, and efficient way to handle data processing in the cloud. By following the steps outlined in this guide, you can leverage the Scramjet Cloud Platform to deploy and manage your Python-based data processing tasks with ease. Whether you're converting audio to text, analyzing streams of data, or integrating various cloud services, Scramjet provides the tools and infrastructure to bring your projects to life in the cloud.

Register now for your free trial.
Check out the Scramjet platform samples on GitHub.

Project co-financed by the European Union from the European Regional Development Fund under the Knowledge Education Development Program. The project is carried out as part of the National Centre for Research and Development competition: Szybka Ścieżka.