<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: For large amounts of data, is it faster to use python to process JSON asynchronously into structured data? in Discussions</title>
    <link>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795926#M97235</link>
    <description>&lt;P&gt;Are you pulling in more data every minute?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Let's say you currently have 5000 downloaded JSON files. You should process those only once into a JMP table, database, or other store. After that, you just keep parsing the new files and appending the new data to wherever you are storing it.&lt;/P&gt;</description>
    <pubDate>Fri, 06 Sep 2024 05:21:53 GMT</pubDate>
    <dc:creator>jthi</dc:creator>
    <dc:date>2024-09-06T05:21:53Z</dc:date>
    <item>
      <title>For large amounts of data, is it faster to use python to process JSON asynchronously into structured data?</title>
      <link>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795615#M97215</link>
      <description>&lt;P&gt;JSON files with the following structure are easy to handle with JMP's JSL. The format is fixed, and only the 13 columns inside [] are extracted. (Only two files are listed here.)&lt;/P&gt;&lt;P&gt;JSON1&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;{"ZJB":4271175,"ZJS":-3443749,"trend":[["09:30",0,-444931,0,-444931,0,1,0,21100,0,0,0,444931],["09:33",2,1433022,1433022,0,2,0,67100,0,0,1433022,0,0],["09:34",3,-316128,0,-316128,0,1,0,14800,0,0,0,316128],["09:45",4,318570,318570,0,1,0,15000,0,0,318570,0,0],["09:52",5,403965,403965,0,1,0,19100,0,0,403965,0,0],["10:03",7,-345725,328755,-674480,1,1,15500,31800,328755,0,0,674480],["10:25",8,419440,419440,0,1,0,19600,0,419440,0,0,0],["10:32",9,-623500,0,-623500,0,1,0,29000,0,0,0,623500],["10:40",10,353925,353925,0,1,0,16500,0,0,353925,0,0],["13:52",11,-1065500,0,-1065500,0,1,0,50000,0,0,0,1065500],["14:17",12,332436,332436,0,1,0,15600,0,332436,0,0,0],["14:25",13,-319214,0,-319214,0,1,0,15000,0,0,319214,0],["14:54",14,681065,681065,0,1,0,31900,0,0,681065,0,0]],"active":1080631,"passive":3190547,"Active":-319214,"Passive":-3124539,"AvgPrice":21.32,"AvgPrice":21.3,"time2":1725535598,"ttag":0.004174999999999929,"errcode":"0"}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;JSON2&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;{"ZJB":1913404,"ZJS":-4366449,"trend":[["09:30",0,-730500,0,-730500,0,1,0,50000,0,0,0,730500],["09:34",1,402408,402408,0,1,0,27600,0,0,402408,0,0],["09:52",2,-442380,0,-442380,0,1,0,30300,0,0,0,442380],["10:51",3,-314545,0,-314545,0,1,0,21500,0,0,0,314545],["11:17",4,-339184,0,-339184,0,1,0,23200,0,0,0,339184],["13:06",5,-438600,0,-438600,0,1,0,30000,0,0,0,438600],["13:27",6,-337491,0,-337491,0,1,0,23100,0,0,337491,0],["13:47",7,-323676,0,-323676,0,1,0,22200,0,0,0,323676],["13:49",8,-447299,0,-447299,0,1,0,30700,0,0,0,447299],["14:00",9,-630448,0,-630448,0,1,0,43300,0,0,630448,0],["14:27",11,344796,707124,-362328,1,1,48400,24800,707124,0,0,362328],["14:31",12,320426,320426,0,1,0,21902,0,320426,0,0,0],["14:32",13,483449,483449,0,1,0,33000,0,0,483449,0,0]],"active":1027550,"passive":885857,"Active":-967939,"Passive":-3398512,"AvgPrice":14.62,"AvgPrice":14.6,"time2":1725535597,"ttag":0.0024069999999999925,"errcode":"0"}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;But is python's asynchronous processing faster when there are many such files?&lt;BR /&gt;I don't know how python would handle this JSON, so I asked ChatGPT for an answer, and I'm also asking the community experts. Thank you very much!&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Assume the files are in the C:\8 directory; I want to call python from JSL and concatenate all the files into one JMP table.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Each row should also carry an additional column with its source file name.&lt;/P&gt;</description>
      <pubDate>Thu, 05 Sep 2024 12:26:23 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795615#M97215</guid>
      <dc:creator>lala</dc:creator>
      <dc:date>2024-09-05T12:26:23Z</dc:date>
    </item>
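Because the format above is fixed, the core extraction in the question can be done with the stdlib `json` module alone. A minimal sketch; the `flatten` helper, the truncated `sample` string, and the `file_0001` label are illustrative, not from the thread:

```python
import json

# A sample in the same fixed format as the files above, truncated to two rows
sample = ('{"ZJB": 4271175, "trend": ['
          '["09:30",0,-444931,0,-444931,0,1,0,21100,0,0,0,444931],'
          '["09:33",2,1433022,1433022,0,2,0,67100,0,0,1433022,0,0]'
          '], "errcode": "0"}')

def flatten(text, label):
    """Return the 13-column 'trend' rows with the source file name appended."""
    data = json.loads(text)
    return [row + [label] for row in data.get('trend', [])]

rows = flatten(sample, 'file_0001')  # the label stands in for the JSON file name
print(len(rows), rows[0][0], rows[0][-1])
```

Each output row then has 14 columns: the 13 fixed columns plus the file-name tag the question asks for.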
    <item>
      <title>Re: For large amounts of data, is it faster to use python to process JSON asynchronously into structured data?</title>
      <link>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795626#M97216</link>
      <description>&lt;P&gt;ChatGPT&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;JSL&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;// Define the Python code block (JSL multi-line string literal)
pythonCode = "\[
import os
import json
import asyncio
import aiofiles  # third-party package: pip install aiofiles

# Directory containing the downloaded JSON files
directory = 'C:/8/'

# Read one file and return its 'trend' rows, tagged with the file name
async def process_file(file_path, label):
    async with aiofiles.open(file_path, mode='r', encoding='utf-8') as file:
        content = await file.read()
    data = json.loads(content)
    rows = []
    for entry in data.get('trend', []):
        entry.append(label)  # add the source file name as an extra column
        rows.append(entry)
    return rows

# Schedule every file at once so the reads actually overlap
async def process_all_files():
    tasks = []
    for filename in os.listdir(directory):
        if filename.endswith('.json'):
            file_path = os.path.join(directory, filename)
            label = os.path.splitext(filename)[0]  # file name without extension
            tasks.append(process_file(file_path, label))
    all_data = []
    for rows in await asyncio.gather(*tasks):
        all_data.extend(rows)
    return all_data

# Run the asynchronous tasks and collect the combined rows
data = asyncio.run(process_all_files())
]\";

// Submit the Python code to JMP's Python environment
Python Submit( pythonCode );

// Retrieve the processed data from Python
results = Python Get( data );&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 05 Sep 2024 12:30:58 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795626#M97216</guid>
      <dc:creator>lala</dc:creator>
      <dc:date>2024-09-05T12:30:58Z</dc:date>
    </item>
    <item>
      <title>Re: For large amounts of data, is it faster to use python to process JSON asynchronously into structured data?</title>
      <link>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795650#M97220</link>
      <description>&lt;P&gt;Someone from JMP might know a more technical answer; I just have some questions for you:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;What do you consider a large file?&lt;/LI&gt;
&lt;LI&gt;How many files do you have?&lt;/LI&gt;
&lt;LI&gt;Have you tried with different file sizes?&lt;/LI&gt;
&lt;LI&gt;Have you tried different methods in JMP?&lt;BR /&gt;
&lt;UL&gt;
&lt;LI&gt;Open single file&lt;/LI&gt;
&lt;LI&gt;Load as text&lt;/LI&gt;
&lt;LI&gt;Multiple File Import&lt;/LI&gt;
&lt;LI&gt;Python integration&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;Have you tried Python without the JMP integration?&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Thu, 05 Sep 2024 14:17:37 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795650#M97220</guid>
      <dc:creator>jthi</dc:creator>
      <dc:date>2024-09-05T14:17:37Z</dc:date>
    </item>
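One practical note behind these questions: `json.loads` is CPU-bound, so asyncio mainly overlaps the file reads, not the parsing itself. A stdlib-only sketch of overlapping the I/O-heavy part with a thread pool; the in-memory `files` dict and `parse_one` helper are stand-ins for real reads from C:\8:

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for file contents; in practice each value would be read from disk
files = {
    'a.json': '{"trend": [["09:30", 1], ["09:31", 2]]}',
    'b.json': '{"trend": [["09:30", 3]]}',
}

def parse_one(item):
    """Parse one file's text and tag its trend rows with the file name."""
    name, text = item
    return [row + [name] for row in json.loads(text).get('trend', [])]

# Threads overlap the (I/O-heavy) reads; json.loads itself stays single-core
with ThreadPoolExecutor(max_workers=4) as pool:
    all_rows = [r for rows in pool.map(parse_one, files.items()) for r in rows]

print(len(all_rows))
```

Timing this against a plain loop, at realistic file sizes, would answer the thread's question more directly than guessing.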
    <item>
      <title>Re: For large amounts of data, is it faster to use python to process JSON asynchronously into structured data?</title>
      <link>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795664#M97221</link>
      <description>&lt;P&gt;Thanks, jthi!&lt;/P&gt;&lt;P&gt;I am currently dealing with more than 5000 such JSON files, and the key is processing them in a short time.&lt;BR /&gt;I only recently started using python's asyncio for concurrent downloads (via ChatGPT) and found it faster than JMP's concurrent download; these JSON files are downloaded that way.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;So I now use python asyncio first to quickly download the raw JSON files to the computer.&lt;BR /&gt;I also wanted to try python's asynchronous handling of the JSON, but save the results directly to JMP tables.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;I just wanted to try it first and didn't think much about anything else.&lt;BR /&gt;I hope experienced experts can offer guidance. Thank you very much!&lt;/P&gt;</description>
      <pubDate>Thu, 05 Sep 2024 14:35:10 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795664#M97221</guid>
      <dc:creator>lala</dc:creator>
      <dc:date>2024-09-05T14:35:10Z</dc:date>
    </item>
    <item>
      <title>Re: For large amounts of data, is it faster to use python to process JSON asynchronously into structured data?</title>
      <link>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795675#M97222</link>
      <description>&lt;P&gt;&lt;SPAN class=""&gt;I only know a little about JSL.&lt;/SPAN&gt;&lt;SPAN class=""&gt;So I'm not familiar with how JSL and python can handle data better and faster in memory.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN class=""&gt;Thanks!&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 05 Sep 2024 14:43:20 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795675#M97222</guid>
      <dc:creator>lala</dc:creator>
      <dc:date>2024-09-05T14:43:20Z</dc:date>
    </item>
    <item>
      <title>Re: For large amounts of data, is it faster to use python to process JSON asynchronously into structured data?</title>
      <link>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795677#M97223</link>
      <description>&lt;P&gt;Maybe this could be a good time to start thinking a bit more about this?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;More questions (and a few I asked earlier):&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;What is a large file?&lt;/LI&gt;
&lt;LI&gt;What is "short time" for processing?&lt;/LI&gt;
&lt;LI&gt;Is the issue getting the data from JSON to JMP or getting the data downloaded?&lt;/LI&gt;
&lt;LI&gt;Have you tried loading the JSON using JMP, for example with Multiple File Import?&lt;/LI&gt;
&lt;LI&gt;Do you always have batches of 5000+ files?&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;And going a bit further:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Maybe the data should be stored to a database on a schedule? (At that point this stops being a JMP question; JMP can load the data from the database if needed.)&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Thu, 05 Sep 2024 14:54:01 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795677#M97223</guid>
      <dc:creator>jthi</dc:creator>
      <dc:date>2024-09-05T14:54:01Z</dc:date>
    </item>
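The scheduled-database idea above can be prototyped with the stdlib `sqlite3` module. A minimal sketch; the `trend` table, its three columns, and the sample rows are hypothetical, and a real schema would carry all 13 columns plus the file name:

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # use a file path for persistent storage
conn.execute('CREATE TABLE trend (time TEXT, value INTEGER, source TEXT)')

# Rows as produced by the JSON flattening step (values hypothetical)
rows = [('09:30', -444931, 'file_0001'), ('09:33', 1433022, 'file_0001')]
conn.executemany('INSERT INTO trend VALUES (?, ?, ?)', rows)
conn.commit()

# JMP (or anything else) can later query just the slice it needs
total = conn.execute('SELECT COUNT(*) FROM trend').fetchone()[0]
print(total)
```

SQLite needs no server, so the ingest script can append rows each minute while JMP reads summaries on demand.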
    <item>
      <title>Re: For large amounts of data, is it faster to use python to process JSON asynchronously into structured data?</title>
      <link>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795815#M97233</link>
      <description>&lt;P&gt;Thank you, experts, for looking at this from a higher level.&lt;BR /&gt;To explain further: new data arrives every minute, and the key is to run calculations right after downloading.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;So I'm still processing one batch at a time.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;I found that going from JSON to tables takes a long time, so I want to try asynchronous processing in python.&lt;BR /&gt;I hope the experts can give specific guidance. Thanks!&lt;/P&gt;</description>
      <pubDate>Thu, 05 Sep 2024 23:09:42 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795815#M97233</guid>
      <dc:creator>lala</dc:creator>
      <dc:date>2024-09-05T23:09:42Z</dc:date>
    </item>
    <item>
      <title>Re: For large amounts of data, is it faster to use python to process JSON asynchronously into structured data?</title>
      <link>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795926#M97235</link>
      <description>&lt;P&gt;Are you pulling in more data every minute?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Let's say you currently have 5000 downloaded JSON files. You should process those only once into a JMP table, database, or other store. After that, you just keep parsing the new files and appending the new data to wherever you are storing it.&lt;/P&gt;</description>
      <pubDate>Fri, 06 Sep 2024 05:21:53 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795926#M97235</guid>
      <dc:creator>jthi</dc:creator>
      <dc:date>2024-09-06T05:21:53Z</dc:date>
    </item>
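The process-once-then-incrementally advice above amounts to remembering which files a previous run already handled. A stdlib-only sketch; the `processed_demo.json` state file, `new_files` helper, and file names are hypothetical:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical location for the list of already-processed file names
state_file = Path(tempfile.gettempdir()) / 'processed_demo.json'
state_file.unlink(missing_ok=True)  # start clean for this demo

def new_files(all_names):
    """Return only the names not seen on a previous run, and record them."""
    seen = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    fresh = [n for n in all_names if n not in seen]
    state_file.write_text(json.dumps(sorted(seen | set(fresh))))
    return fresh

first = new_files(['a.json', 'b.json'])              # both are new on the first run
second = new_files(['a.json', 'b.json', 'c.json'])   # only c.json is new now
print(first, second)
```

Each minute's run then parses only `new_files(...)`, so the 5000 historical files are never reprocessed.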
    <item>
      <title>Re: For large amounts of data, is it faster to use python to process JSON asynchronously into structured data?</title>
      <link>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795932#M97237</link>
      <description>&lt;P class=""&gt;&lt;SPAN class=""&gt;Thank the experts for their patient follow-up.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;I look like this:&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;Processing 5000 JSON files per minute.&lt;/SPAN&gt;&lt;SPAN class=""&gt;Centralize the calculations in JMP tables, save only a few summarized results, and use another JMP file (which can be easily handled in JSL).&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;The raw JSON is not saved, and neither is the merged JMP data table.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;So I wanted to speed things up by processing data in memory.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;Thanks!&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 06 Sep 2024 06:34:42 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795932#M97237</guid>
      <dc:creator>lala</dc:creator>
      <dc:date>2024-09-06T06:34:42Z</dc:date>
    </item>
    <item>
      <title>Re: For large amounts of data, is it faster to use python to process JSON asynchronously into structured data?</title>
      <link>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795947#M97238</link>
      <description>&lt;P&gt;So you are performing 5000 HTTP requests a minute (7.2 million a day), each returning a JSON response? That feels like a lot of requests to a single endpoint from one IP. Or are you getting those 5000 separate files a minute some other way?&lt;/P&gt;</description>
      <pubDate>Fri, 06 Sep 2024 07:16:47 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795947#M97238</guid>
      <dc:creator>jthi</dc:creator>
      <dc:date>2024-09-06T07:16:47Z</dc:date>
    </item>
    <item>
      <title>Re: For large amounts of data, is it faster to use python to process JSON asynchronously into structured data?</title>
      <link>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795952#M97239</link>
      <description>&lt;UL&gt;&lt;LI&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;200 a minute is all I can handle in real time right now.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;A lot of it is processed after the fact.&lt;/SPAN&gt;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 06 Sep 2024 07:25:51 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/795952#M97239</guid>
      <dc:creator>lala</dc:creator>
      <dc:date>2024-09-06T07:25:51Z</dc:date>
    </item>
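When only ~200 requests a minute can be handled in real time, capping download concurrency keeps the pipeline from overrunning itself; `asyncio.Semaphore` is the standard tool. A stdlib-only sketch in which `asyncio.sleep` stands in for the real HTTP call (e.g. via aiohttp, not shown), and the URLs are made up:

```python
import asyncio

async def fetch(url, limit):
    # The sleep stands in for a real HTTP request
    async with limit:  # at most `max_workers` fetches run at once
        await asyncio.sleep(0.01)
        return f'{url}: ok'

async def fetch_all(urls, max_workers=10):
    """Download all URLs concurrently, but never more than max_workers at a time."""
    limit = asyncio.Semaphore(max_workers)
    return await asyncio.gather(*(fetch(u, limit) for u in urls))

results = asyncio.run(fetch_all([f'url{i}' for i in range(25)]))
print(len(results))
```

`gather` preserves input order, so results line up with the URL list even though completion order varies.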
    <item>
      <title>Re: For large amounts of data, is it faster to use python to process JSON asynchronously into structured data?</title>
      <link>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/802476#M97893</link>
      <description>&lt;P class=""&gt;&lt;SPAN class=""&gt;Well, by comparing the consumption of each step, less time to download the data each time.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;However, it takes more time to assemble and sort many data each time.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;Ask experts: Which database has a speed advantage in this regard: splicing, sorting.&lt;/SPAN&gt;&lt;SPAN class=""&gt;The key is that this is only intermediate data, not stored.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;I found that JMP18's splicing speed is significantly not as fast as JMP14's.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;Thanks Experts!&lt;/P&gt;</description>
      <pubDate>Sun, 29 Sep 2024 02:31:24 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/For-large-amounts-of-data-is-it-faster-to-use-python-to-process/m-p/802476#M97893</guid>
      <dc:creator>lala</dc:creator>
      <dc:date>2024-09-29T02:31:24Z</dc:date>
    </item>
  </channel>
</rss>

