retrieving text from html5 page?
v***@sbt.net.au
2014-01-08 23:52:57 UTC
I have a script like:

wget -O page.html url
lynx -dump page.html > page.txt

that worked TILL web server was redeveloped;

now they use html5 stuff, and page.html has the data I want, but page.txt
only has the 'labels' and not the data contents. Any thoughts on how I can
get that...?

when displayed on screen, the data shows; in the text file, it does not

looking at page.html it has like:


/snip/
<label class="pfbc-label">Suburb</label><input type="text"
name="SYS_Addresses_e_address_i_0_e_district_tx" value="SYDNEY"
readonly="readonly" class="ro pfbc-textbox"/>

<label class="pfbc-label">State</label><input type="hidden" value="NSW"
name="SYS_Addresses_e_address_i_0_e_state_cd"><input type="text"
name="SYS_Addresses_e_address_i_0_e_state_cd_d" value="NSW"
readonly="readonly" class="ro pfbc-textbox"/>

<label class="pfbc-label">Postcode</label><input type="text"
name="SYS_Addresses_e_address_i_0_e_postcode_tx" value="2000"
readonly="readonly" class="ro pfbc-textbox"/>
Klaus-Peter Wegge
2014-01-09 08:11:57 UTC
Dear Lynx experts,

with the update to Debian 7 (wheezy) I was automatically updated to lynx
2.8.8dev.12.
Besides some improvements I have found the following problem in the
print menu:
When printing to a local file (first entry) the suffix of the suggested
file name is now "*.htm" and no longer "*.txt" as it was in earlier
versions. The suffix "*.htm" should only be suggested when viewing source.
Moreover, if the file already exists, you can no longer abort
by entering an empty file name (ctrl-u). If the filename is empty, the message
"file already exists" still appears.

Example:
lynx myfile.htm
select "Save file" in print menu
The suggested filename is "myfile.htm" (expected myfile.txt).
press return to save.
Message "file already exists" (correct, it's the source file from the command line)
ctrl-u for empty file name
press return: filename is empty, message "file already exists"...

I don't know in which lynx version this "logic" changed.

Kind regards

Klaus
fa-ml
2014-01-09 11:41:25 UTC
Post by v***@sbt.net.au
now they use html5 stuff, and, page.html has data I want, but, page.txt
only has 'labels' but not data contents, any thoughts how I can do
that...?
Not a solution, but possibly a workaround:
Post by v***@sbt.net.au
<label class="pfbc-label">Postcode</label><input type="text"
name="SYS_Addresses_e_address_i_0_e_postcode_tx" value="2000"
readonly="readonly" class="ro pfbc-textbox"/>
If the data is as structured as what you pasted, it would make sense to use
a parsing library to extract the needed info from the markup.
It would require a bit more effort, but would probably be safer than handling
freeform text.
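
As a rough illustration of the parsing-library idea, here is a minimal Python
sketch using the standard library's html.parser. It assumes every field
follows the label-then-input pattern of the pasted snippet; the names
LabelValueParser and extract_fields are made up for this example, and the
real page may well need adjustments.

```python
from html.parser import HTMLParser

# Sample markup copied from the snippet quoted earlier in the thread.
SAMPLE = """
<label class="pfbc-label">Suburb</label><input type="text"
name="SYS_Addresses_e_address_i_0_e_district_tx" value="SYDNEY"
readonly="readonly" class="ro pfbc-textbox"/>

<label class="pfbc-label">State</label><input type="hidden" value="NSW"
name="SYS_Addresses_e_address_i_0_e_state_cd"><input type="text"
name="SYS_Addresses_e_address_i_0_e_state_cd_d" value="NSW"
readonly="readonly" class="ro pfbc-textbox"/>

<label class="pfbc-label">Postcode</label><input type="text"
name="SYS_Addresses_e_address_i_0_e_postcode_tx" value="2000"
readonly="readonly" class="ro pfbc-textbox"/>
"""


class LabelValueParser(HTMLParser):
    """Pair each <label> text with the value of the next non-hidden <input>.

    Hypothetical helper: assumes the label-then-input layout shown above.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []          # collected (label, value) tuples
        self._in_label = False   # True while inside <label>...</label>
        self._label_text = []    # text fragments of the current label
        self._pending = None     # label still waiting for its input value

    def handle_starttag(self, tag, attrs):
        if tag == "label":
            self._in_label = True
            self._label_text = []
        elif tag == "input" and self._pending is not None:
            attr = dict(attrs)
            # Skip hidden inputs; the sample has one before the State field.
            if attr.get("type") != "hidden":
                self.pairs.append((self._pending, attr.get("value") or ""))
                self._pending = None

    def handle_endtag(self, tag):
        if tag == "label" and self._in_label:
            self._in_label = False
            self._pending = "".join(self._label_text).strip()

    def handle_data(self, data):
        if self._in_label:
            self._label_text.append(data)


def extract_fields(html_text):
    """Return a list of (label, value) tuples found in html_text."""
    parser = LabelValueParser()
    parser.feed(html_text)
    parser.close()
    return parser.pairs


if __name__ == "__main__":
    for label, value in extract_fields(SAMPLE):
        print(f"{label}: {value}")
```

To run it against the downloaded file instead of the sample, pass the contents
of page.html to extract_fields(). Fields held in <textarea> elements are not
covered by this sketch.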
Ian Collier
2014-01-09 12:34:22 UTC
Post by v***@sbt.net.au
now they use html5 stuff, and, page.html has data I want, but, page.txt
only has 'labels' but not data contents, any thoughts how I can do
that...?
I note that on one system where I have Lynx 2.8.7 the text field contents
don't show up in the dump, but on another where I have Lynx 2.8.8dev15
the contents do show up. Therefore it might be worth upgrading Lynx.

imc
v***@sbt.net.au
2014-01-11 01:19:57 UTC
Post by Ian Collier
I note that on one system where I have Lynx 2.8.7 the text field contents
don't show up in the dump, but on another where I have Lynx 2.8.8dev15
the contents do show up. Therefore it might be worth upgrading Lynx.
thanks!!
after succeeding in installing latest dev build from home page, I'm
getting most of the text:

in -dump text, some fields have been 'trimmed off' at the end, missing one
or more trailing chars;

in screen view, these fields do not fit on screen, BUT, allow me to enter
each field, and, scroll left/right to see all data,
when a field is entered, field displays 'scroll indicators' '<' '>' to
indicate 'there is more'

is there any switch to get all data dumped ?

/usr/local/bin/lynx --version
Lynx Version 2.8.8dev.17 (28 Nov 2013)
libwww-FM 2.14, ncurses 5.5.20060715
Built on linux-gnu Jan 10 2014 22:42:53
v***@sbt.net.au
2014-01-11 06:02:10 UTC
Post by v***@sbt.net.au
in -dump text, some fields have been 'trimmed off' at the end, missing
one or more trailing chars;
in screen view, these fields do not fit on screen, BUT, allow me to enter
each field, and, scroll left/right to see all data, when a field is
entered, field displays 'scroll indicators' '<' '>' to indicate 'there is
more'
looking at the source file, if I edit cols="30" below to a higher value, I get
my text. Is there a way to pass such a value with the -dump switch?

<label class="pfbc-label">Special requirements </label><textarea
name="spec_req_tx" readonly="readonly" class="ro pfbc-textarea" rows="6"
cols="30"></textarea>
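
The manual edit described above could be scripted before running lynx -dump.
The following is only a sketch under that assumption: widen_textareas is a
made-up helper name, the default of 200 columns is arbitrary, and a regular
expression is used for brevity rather than a real HTML parser.

```python
import re

# Matches the cols="N" attribute inside a <textarea ...> start tag.
# [^>]*? also spans line breaks, since the tag is wrapped across lines.
_TEXTAREA_COLS = re.compile(r'(<textarea\b[^>]*?\bcols=")\d+(")', re.IGNORECASE)


def widen_textareas(html_text, cols=200):
    """Return html_text with each textarea's cols attribute set to `cols`."""
    # A function is used as the replacement so the digits of `cols` cannot
    # be misread as part of a group reference.
    return _TEXTAREA_COLS.sub(
        lambda m: f"{m.group(1)}{cols}{m.group(2)}", html_text
    )


if __name__ == "__main__":
    sample = (
        '<label class="pfbc-label">Special requirements </label><textarea\n'
        'name="spec_req_tx" readonly="readonly" class="ro pfbc-textarea" rows="6"\n'
        'cols="30"></textarea>'
    )
    print(widen_textareas(sample))
```

The rewritten text would then be saved to a new file and that file passed to
lynx -dump in place of page.html.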
Thomas Dickey
2014-01-12 01:23:20 UTC
Post by v***@sbt.net.au
Post by Ian Collier
I note that on one system where I have Lynx 2.8.7 the text field contents
don't show up in the dump, but on another where I have Lynx 2.8.8dev15
the contents do show up. Therefore it might be worth upgrading Lynx.
thanks!!
after succeeding in installing latest dev build from home page, I'm
getting most of the text:
in -dump text, some fields have been 'trimmed off' at the end, missing one
or more trailing chars;
With -dump, you get just a picture of what is on the screen.
You might (depending on the form) get bigger fields by setting
the -width parameter (but that only allows up to 1000 columns).

A -source of course gives the whole file - but not formatted.
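
For scripted use, the -width suggestion might be wrapped as below. This is a
sketch only: lynx_dump_command and dump_to_text are made-up names, the default
of 200 columns is arbitrary, and the cap of 1000 simply mirrors the limit
mentioned above.

```python
import subprocess

MAX_WIDTH = 1000  # upper limit for -width as stated in the reply above


def lynx_dump_command(path, width=200):
    """Build the argument list for a formatted dump at the given width."""
    width = max(1, min(int(width), MAX_WIDTH))  # keep within 1..MAX_WIDTH
    return ["lynx", "-dump", f"-width={width}", path]


def dump_to_text(path, width=200):
    """Run lynx and return the formatted text (needs lynx on the PATH)."""
    result = subprocess.run(
        lynx_dump_command(path, width),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling dump_to_text("page.html", width=300) would return the dump as a
string; whether the wider setting actually enlarges a given form field depends
on the form, as noted above.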
--
Thomas E. Dickey <***@invisible-island.net>
http://invisible-island.net
ftp://invisible-island.net
v***@sbt.net.au
2014-01-09 13:08:43 UTC
Post by Ian Collier
I note that on one system where I have Lynx 2.8.7 the text field contents
don't show up in the dump, but on another where I have Lynx 2.8.8dev15
the contents do show up. Therefore it might be worth upgrading Lynx.
Ian, thanks for this

trying latest 'dev 2-8-8', get:
...
checking for screen type... curses
checking for specific curses-directory... no
checking for extra include directories... no
checking if we have identified curses headers... none
configure: error: No curses header-files found

./configure --with-screen=ncurses
...
checking for screen type... ncurses
checking for specific curses-directory... no
Looking for ncurses-config
checking for ncurses6-config... no
checking for ncurses5-config... no
checking for ncurses header in include-path... no
checking for ncurses include-path... configure: error: not found

I'll try again tomorrow, getting late here

thanks