ron_horne
Super User (Alumni)

utf-8 encoding - cant read html

 

Dear members of the community,

I was trying to open the page amazon.co.uk in the following way, but without success:

Names Default To Here( 1 );
webpage = Load Text File( "https://www.amazon.co.uk" );

This is what I get in the log:


JMP could not read all the text in 'www.amazon.co.uk' correctly.
The document might use a text encoding that JMP does not recognize; or JMP might have chosen the wrong encoding (utf-8). Text that JMP could not understand has been removed.  The JMP character set preference is set to utf-8.

The location of the first read error is 4 and the byte at that location was 0. 

"��t������ ��`��L&��dv'�B^�V ��f��
�;�焎�3�=�wEG1E�����p��������!���m��X�f{����$�f���6=�ٓί�g!(j
�ŨI*|T�M�[ٗ�\!"��HA� �\��Y �$o������&����P�� ��*K&�;\!"��C��G^b��� E�i J����f:{{�!7(G�A��U���z+�7�D�<�7U�Y

......

If I attempt the same command with amazon.com, all is fine.


I checked the HTML source of the page and it contains the following about 7 times, so I am assuming it is UTF-8:

meta charset="utf-8"

I also tried changing the preference Text data file >> Open text file charset >> Best guess, but it still doesn't work.
I would be grateful for any ideas.

 

 

 

Accepted Solutions
Craige_Hales
Super User

Re: utf-8 encoding - cant read html

Interesting. Amazon is compressing the data. There are two ways to fix it; the newer HTTP request way also lets you check whether the response is compressed:

// read the raw bytes as a blob, then decompress and convert them to text
webpage = Load Text File( "https://www.amazon.co.uk", blob() );
webpage = Blob To Char( Gzip Uncompress( webpage ) );
Write( webpage );

// the better way...

request = New HTTP Request( URL( "https://www.amazon.co.uk" ), Method( "GET" ) );
fetch = request << Send;
Show( (request << Get Response Headers)["content-encoding"] );// gzip
webpage = Blob To Char( Gzip Uncompress( fetch ) );
Write( webpage );

 

Craige

lala
Level VIII

Re: utf-8 encoding - cant read html

  • Can JSL detect if fetch is in compressed format?

Thanks Experts!


Craige_Hales
Super User

Re: utf-8 encoding - cant read html

Line 1 creates an HTTP Request object and stores it in the request variable. The Amazon URL is in the object's data, but no communication has happened yet, so the request object doesn't know what will happen.

Line 2 sends the GET request to Amazon; the payload part of Amazon's answer is copied into the fetch variable. You'll see a note in the log that a blob was stored because the data was binary rather than proper Unicode. Amazon returns other metadata as well; the response headers are kept in the HTTP Request object.

Line 3 prints a value from the stored headers, specifically the content-encoding header. This needs more if(...) logic: test whether there is a content-encoding header in the associative array, and then whether its value is "gzip", before choosing to decompress the payload.

Line 4 depends on fetch holding a gzipped blob for the payload. It also depends on the unzipped binary data being valid UTF-8 that can be converted to characters.

Line 5 prints all the payload data; there is a lot of it, and JMP will truncate it in the log if you don't explicitly write(...) it.
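
The extra if(...) logic described for Line 3 might look roughly like the sketch below. It is not tested against Amazon's current responses and rests on two assumptions: that the header key is stored in lowercase as "content-encoding" (as in the Show output above), and that an uncompressed text response comes back from Send as an ordinary string that needs no conversion. The headers variable is just a name introduced here for illustration.

```jsl
Names Default To Here( 1 );

request = New HTTP Request( URL( "https://www.amazon.co.uk" ), Method( "GET" ) );
fetch = request << Send;
headers = request << Get Response Headers; // associative array of response headers

// Assumption: when the payload is not compressed, fetch can be used as it is.
webpage = fetch;

// Decompress only if the header exists and says the payload is gzipped.
// Assumption: the key is lowercase, as shown by the Show() call in the thread.
If( Contains( headers, "content-encoding" ),
	If( Lowercase( headers["content-encoding"] ) == "gzip",
		webpage = Blob To Char( Gzip Uncompress( fetch ) )
	)
);

Write( webpage );
```

The nested If keeps the header lookup from running when the key is missing. Other content-encoding values (for example "br" or "deflate") are not handled here and would need their own branch.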

Craige
Craige_Hales
Super User

Re: utf-8 encoding - cant read html

I sent a note to the support team. I think JMP should have handled this, either with a better message or by detecting the compression and doing the extra steps. Automatic detection is a little problematic, maybe, but it would probably work 99% of the time, until something other than gzipped text shows up.

Craige