Harnessing the Power of Scramjet Sequences for Streamlined Data Processing

The Scramjet Cloud Platform (SCP) lets developers deploy, manage, and execute their code in the cloud. A central feature of the platform is the Scramjet Sequence, a mechanism that bundles a user's source code together with all its dependencies into a ready-to-deploy package. This article walks through creating, deploying, and managing Scramjet Sequences, using a Python-based example that converts speech to text with AssemblyAI.

Understanding Scramjet Sequences

A Scramjet Sequence is essentially a container for your algorithm, encompassing the code and all necessary dependencies. It is designed to be executed on the Scramjet Cloud Platform, facilitating seamless data processing. Through the Scramjet Framework, Sequences gain the ability to asynchronously process data streams, enabling them to produce, consume, and transform data efficiently.
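To make the produce, transform, and consume roles concrete, here is a minimal sketch in plain Python. It uses asyncio async generators as a stand-in and does not use the Scramjet Framework API; the function names produce, transform, and consume are chosen for illustration only.

```python
import asyncio


async def produce(items):
    """Producer: yield items one at a time, as a data source would."""
    for item in items:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield item


async def transform(source):
    """Transformer: upper-case each chunk as it arrives."""
    async for chunk in source:
        yield chunk.upper()


async def consume(source):
    """Consumer: collect the transformed chunks into a list."""
    return [chunk async for chunk in source]


result = asyncio.run(consume(transform(produce(["a", "b", "c"]))))
print(result)  # ['A', 'B', 'C']
```

Each stage handles a chunk as soon as it arrives rather than waiting for the whole input, which is the property that stream-based Sequences rely on.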

Preparing Your Sequence

As a working example, let's look at the audio2text-input Python sample in Scramjet's platform-samples GitHub repository. It shows how to use AssemblyAI for speech recognition inside a Scramjet Sequence. To run this Sequence, you need an AssemblyAI token, which is passed as an argument to the start command:


si sequence start <Sequence-id> --args=[\"<AssemblyAI-token>\"]

Retrieving the output, such as the transcribed text from an audio file, can be done using the Scramjet Interface (si) output command or the output API endpoint, further details of which can be found in the API reference.


si instance output <Instance-id>

Crafting Your Handler

The heart of a Scramjet Sequence is its handler, which processes data asynchronously. You define it as an async function (run in this example) that takes the context and the input stream, plus any arguments passed at start time (here, the AssemblyAI token), and returns the desired output.


import asyncio

import requests
from scramjet.streams import Stream


async def run(context, input, token):
    # Collect the incoming binary chunks into a single audio payload.
    audio_file = await input.reduce(lambda a, b: a + b)
    base_url = "https://api.assemblyai.com/v2"

    headers = {
        "authorization": token
    }

    # Upload the audio; the response contains a URL for the uploaded file.
    response = requests.post(
        base_url + "/upload",
        headers=headers,
        data=audio_file
    )
    upload_url = response.json()["upload_url"]

    # Request a transcription of the uploaded file.
    data = {
        "audio_url": upload_url
    }
    url = base_url + "/transcript"
    response = requests.post(url, json=data, headers=headers)

    transcript_id = response.json()["id"]
    polling_endpoint = f"{base_url}/transcript/{transcript_id}"

    # Poll until the transcription completes or fails.
    while True:
        transcription_result = requests.get(polling_endpoint, headers=headers).json()

        if transcription_result["status"] == "completed":
            break
        elif transcription_result["status"] == "error":
            raise RuntimeError(f"Transcription failed: {transcription_result['error']}")
        else:
            # Wait before the next poll without blocking the event loop.
            await asyncio.sleep(3)

    # Wrap the transcribed text in a stream for the Sequence output.
    return Stream.read_from(f"{transcription_result['text']} \n")

This code asynchronously reads the input stream (chunks of binary audio data in WAV format) and joins them into a single audio file using the reduce function. The audio file is then uploaded to AssemblyAI's /upload endpoint with a POST request, and a second POST request to /transcript starts the transcription. The handler polls the transcript endpoint until the status is completed, then reads the transcribed text from transcription_result, wraps it in a stream with Stream.read_from, and returns it as the Sequence output. This output can be consumed through the Instance /output API endpoint, the Scramjet SDK, or si.
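The handler's polling loop has no upper bound on how long it waits. The sketch below shows one way the two generic steps, joining chunks and polling for a result, could be written with a timeout. It is an illustration under assumptions rather than part of the sample: join_chunks, poll_until_done, and the fake_audio and replies stand-ins are hypothetical names, and fetch_status represents whatever call retrieves the transcript status.

```python
import asyncio


async def join_chunks(chunks):
    """Concatenate binary chunks from an async iterable into one bytes object."""
    buffer = bytearray()
    async for chunk in chunks:
        buffer.extend(chunk)
    return bytes(buffer)


async def poll_until_done(fetch_status, interval=3.0, timeout=300.0):
    """Call fetch_status() until it reports 'completed' or 'error'.

    fetch_status is any zero-argument callable returning a dict with a
    'status' key (a hypothetical stand-in for the GET on the transcript URL).
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        result = fetch_status()
        if result["status"] == "completed":
            return result
        if result["status"] == "error":
            raise RuntimeError(f"Transcription failed: {result.get('error')}")
        if loop.time() >= deadline:
            raise TimeoutError("Transcription did not finish in time")
        await asyncio.sleep(interval)  # non-blocking wait between polls


async def _demo():
    async def fake_audio():
        # Stand-in for the Sequence input stream.
        for chunk in (b"RIFF", b"....", b"WAVE"):
            yield chunk

    # Stand-in for successive responses from the transcript endpoint.
    replies = iter([
        {"status": "queued"},
        {"status": "processing"},
        {"status": "completed", "text": "hello world"},
    ])
    audio = await join_chunks(fake_audio())
    result = await poll_until_done(lambda: next(replies), interval=0.01)
    return audio, result["text"]


audio, text = asyncio.run(_demo())
print(len(audio), text)  # 12 hello world
```

Passing the status call in as a function keeps the loop testable without network access; in a real handler it would wrap the HTTP request.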

Packaging Your Sequence

Before deployment, it's essential to package your Sequence into a .tar.gz file, incorporating the main.py script, any additional files, and dependencies. This process involves creating a requirements.txt file for Python dependencies, a package.json file for defining the Sequence, and organizing all necessary files into a coherent structure ready for packaging.

requirements.txt


scramjet-framework-py
requests
pyee
urllib3==1.26.6
pyOpenSSL

package.json


{
  "name": "audio2text-input",
  "version": "1.0.0",
  "main": "./main.py",
  "author": "Ray_Nawfal",
  "license": "GPL-3.0",
  "description": "Transcription of an audio file using AssemblyAI API.",
  "keywords": [
    "AudioToText",
    "Transcription",
    "AssemblyAI"
  ],
  "repository": {
    "type": "git",
    "url": "https://github.com/scramjetorg/platform-samples/tree/main/python/audio2text-input"
  },
  "engines": {
    "python3": "3.8.0"
  },
  "scripts": {
    "build": "mkdir -p dist/__pypackages__/ && pip3 install -t dist/__pypackages__/ -r requirements.txt && cp -t ./dist/ *.py *.json",
    "clean": "rm -rf ./dist"
  }
}

1. Create the __pypackages__ directory in the same directory as main.py:


mkdir __pypackages__

2. Install the dependencies into the __pypackages__ folder. If any of your packages include code written in C, install the dependencies on a Linux machine, because Scramjet Cloud Platform runs on Linux and the compiled binaries must match that platform.


pip3 install -t __pypackages__ -r requirements.txt
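Before packing, it can help to confirm that the folder contains the pieces described above. The following is a minimal sketch of such a check, assuming the layout from the manual steps (package.json with a "main" entry, requirements.txt, and a __pypackages__ directory next to main.py). The helper check_sequence_folder is hypothetical and not part of the si CLI.

```python
import json
import tempfile
from pathlib import Path


def check_sequence_folder(folder):
    """Return a list of problems found in a Sequence folder (empty if none)."""
    folder = Path(folder)
    problems = []

    manifest = folder / "package.json"
    if not manifest.is_file():
        return ["package.json is missing"]

    # The "main" entry should point at an existing script.
    main = json.loads(manifest.read_text(encoding="utf-8")).get("main")
    if not main:
        problems.append('package.json has no "main" entry')
    elif not (folder / main).is_file():
        problems.append(f"main script {main} not found")

    if not (folder / "requirements.txt").is_file():
        problems.append("requirements.txt is missing")
    if not (folder / "__pypackages__").is_dir():
        problems.append("__pypackages__ directory is missing")
    return problems


# Demo on a throwaway folder where the dependencies were never installed.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "main.py").write_text("async def run(context, input): ...\n")
    (root / "requirements.txt").write_text("requests\n")
    (root / "package.json").write_text(json.dumps({"main": "./main.py"}))
    problems = check_sequence_folder(root)

print(problems)  # ['__pypackages__ directory is missing']
```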

You can pack a Sequence into a tar.gz file manually from the command line before sending it to Scramjet Cloud Platform:


si sequence pack <path/to/Sequence/folder>
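For readers curious about what a .tar.gz package amounts to, the sketch below builds one with Python's standard tarfile module. It only illustrates the archive format: the exact layout that si sequence pack produces is not confirmed here, storing files relative to the folder root is an assumption, and pack_folder is a hypothetical helper. For real deployments, use the si command.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_folder(folder, archive_path):
    """Write every file under folder into a gzip-compressed tar archive."""
    folder = Path(folder)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                # Store paths relative to the folder root (an assumption).
                tar.add(path, arcname=path.relative_to(folder).as_posix())
    return Path(archive_path)


# Demo: pack a throwaway folder and list what ended up in the archive.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as out:
    root = Path(src)
    (root / "main.py").write_text("# handler\n")
    (root / "package.json").write_text("{}\n")
    (root / "__pypackages__").mkdir()
    (root / "__pypackages__" / "dep.py").write_text("# dependency\n")

    archive = pack_folder(root, Path(out) / "sequence.tar.gz")
    with tarfile.open(archive, "r:gz") as tar:
        names = sorted(tar.getnames())

print(names)  # ['__pypackages__/dep.py', 'main.py', 'package.json']
```

The archive is written outside the source folder so that it does not end up packing itself.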

Deploying on Scramjet Cloud Platform

Deployment means packing your Sequence and sending it to the Scramjet Cloud Platform. You can do this step by step, or use the si sequence deploy command, which packs, sends, and starts the Sequence in one go. Detailed steps for packaging and deployment are available in the CLI reference. To send an already packed archive manually:


si sequence send <path/to/filename.tar.gz> --progress

Monitoring and Logs

Once deployed, monitoring your Sequence's performance and output is crucial. The Scramjet Cloud Platform provides various methods to access logs, including through the SCP Console Panel, the logging library within your Python script, or API endpoints for stdout, stderr, and output logs. These tools offer insights into the execution and performance of your Sequence, enabling you to debug and optimize as necessary.
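As a small illustration of the logging-library route, the sketch below configures Python's standard logging module to write timestamped lines to a stream. It assumes that whatever the script writes to stderr is what the platform exposes as stderr logs. The logger name audio2text and the helper make_logger are hypothetical.

```python
import io
import logging
import sys


def make_logger(name="audio2text", stream=sys.stderr):
    """Create a logger that writes timestamped lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate lines if called twice
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


# Demo: log to an in-memory buffer instead of stderr so the result can be inspected.
buffer = io.StringIO()
log = make_logger(stream=buffer)
log.info("upload finished, %d bytes sent", 1024)
log.debug("this line is below the INFO threshold and is dropped")
captured = buffer.getvalue()
print(captured, end="")
```

In a handler you would call make_logger() once with the default stream and log at the points you want to trace, such as after the upload and after each poll.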

Utilizing Events for Enhanced Interactivity

Scramjet Sequences are not just about processing data; they're also about interaction. Sequences can communicate through Topics, or trigger and respond to Events asynchronously. This capability allows for complex workflows and interactions between different Sequences, enhancing the platform's flexibility and extensibility.

Final thoughts

Scramjet Sequences offer a powerful, flexible, and efficient way to handle data processing in the cloud. By following the steps outlined in this guide, you can leverage the Scramjet Cloud Platform to deploy and manage your Python-based data processing tasks with ease. Whether you're converting audio to text, analyzing streams of data, or integrating various cloud services, Scramjet provides the tools and infrastructure to bring your projects to life in the cloud.

Project co-financed by the European Union from the European Regional Development Fund under the Knowledge Education Development Program. The project is carried out as part of the National Centre for Research and Development competition: Szybka Ścieżka.