How to Automate Web Capturing / Web Archiving Tasks (HTML, PDF or MHT)?

Thomas

May 13, 2008
I love Adobe Acrobat's Web Capture feature: it's a very good way to make an archived version of a web page, and I can save it with working links, so later I can click the links to open them. But how can I automate this task, and also organize the created files by date or time automatically?

Here is an example:
I would like to capture (archive) a news portal's front page (like cnn.com or msn.com) 3 or 4 times a day for later review, so I don't miss any headlines or any state of the page. I would like to automate this task (a command-line solution is fine too) so it captures the front page, but I can still click the headlines (links) to read the full article content if I want to.

Right now I have two solutions to automate the web-capture task, but they are NOT PERFECT:

Solution #1:
SiteBot: I tried many programs, but this was the only one that can capture a website and organize the tasks into separate folders by time and date. I only used the trial version, but as far as I tried, it can be made to do the tasks automatically without any problem. It's a nice program, but it creates a rather complicated file structure, and I don't want to click so many times just for one HTML file, with all the images stored in another folder. I would prefer it to be one file with no separate image folder, or at least the file renamed to the date.

Solution #2:
This one is better. I made an automated batch task with SiteShoter, which captures the web page as a JPG, and I made a task in Windows so it takes shots of the page about 4 times a day. This works well, but it cannot capture the links, so I can read the headlines but I cannot click on them (because it's just an image), so I cannot read the full article content behind the headlines/links.
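
For reference, here is roughly what my SiteShoter batch looks like (the paths are just examples from my machine, and the /URL and /Filename switches are from SiteShoter's readme, so double-check them against your version):

@echo off
rem build a yyyymmdd_hhmmss stamp that does not depend on the regional date format
for /f "tokens=2 delims==" %%I in ('wmic os get localdatetime /value') do set dt=%%I
set stamp=%dt:~0,8%_%dt:~8,6%
rem capture the front page as an image, named after the capture time
"C:\Tools\SiteShoter\SiteShoter.exe" /URL "http://www.cnn.com/" /Filename "D:\WebArchive\cnn_%stamp%.jpg"

And the scheduled task that runs it every 6 hours (about 4 times a day):

schtasks /Create /SC HOURLY /MO 6 /TN "WebCapture" /TR "D:\WebArchive\capture.bat"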

What I want:
- archive a browsable version of a website with working hyperlinks
- the procedure runs automatically (3-4 times a day)
- the archived website's file is renamed to the date and time it was created (like SiteShoter does it)
- the website does not need to be fully browsable offline, only the first page, i.e. the actual state of the main page (as I said: cnn.com, msn.com) - see the sketch right after this list for what I mean
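
Just to make it completely clear what I'm after, this is the kind of batch I would like to end up with. Note that html2pdf.exe is only a placeholder for whatever program can actually do this (convert the page and keep the hyperlinks clickable), not a real tool I have:

@echo off
rem same date stamp trick as in the SiteShoter batch above
for /f "tokens=2 delims==" %%I in ('wmic os get localdatetime /value') do set dt=%%I
set stamp=%dt:~0,8%_%dt:~8,6%
rem html2pdf.exe is a placeholder -- this is the program i am looking for
html2pdf.exe "http://www.cnn.com/" "D:\WebArchive\cnn_%stamp%.pdf"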

Programs I tried without success:
Website archivers:
- Offline browser / Offline Explorer
- HTTrack Website Copier
- WebStripper
- Local Website Archive / WebSite-Watcher
- Teleport Pro
- WebZip / WinMHT
- Batch HTML to MHT

Website to PDF apps:
- Winnovative HTML to PDF Converter
- Website2PDF
- HTML convert (almost good, but no hyperlinks in the PDF)
- Adobe Acrobat - Web Capture feature (perfect, but not automatic)

If possible, I don't want to use macros (recorded control), because the PC is used for other things as well, so the processes should run in the background.

The PC is a Windows 7 system and runs 24 hours a day (a nettop used as an HTPC and FTP server).
 