Forum Moderators: phranque

Message Too Old, No Replies

How do I pull specific data from my HTML FrontPage / Expression Web

         

littledobby

6:08 am on Jul 19, 2008 (gmt 0)

10+ Year Member



How do I pull specific data from my HTML FrontPage / Expression Web website & put in Excel or Text File?

I have an ecommerce database I'm trying to build in Excel for our website. I have never made a database-driven website before. We currently have a website with nearly 800 different product pages that we built manually using FrontPage 2003. We need to change our site completely so that it is database-driven / dynamically builds the product pages. I'm using Excel 2007 to at least try to compile the database, since I've already been using a large Excel workbook to keep track of product pricing, shipping costs, etc. I don't know PHP or ASP just yet, and I'm not sure which one I'll end up using. So I'm starting with what I know how to do - I'm decently familiar with using Excel. And I know I need to build the database at the very least before I can have someone help me figure out how to use PHP or ASP and pull from the database. I want to have the database built in Excel, then exported to CSV and uploaded to a MySQL database online that the PHP or ASP Programming pulls from to build the product pages.

As mentioned, I already have a website now (built using FrontPage 2003, but I've already ordered the upgrade for Expression Web 2 that I should receive very soon). For our current site, we have manually created every product page. I need to convert all those pre-made product pages to have their info. in this same database where I already have the items listed by rows with columns for each of the important things (Long Description with Bullets, Short Description, Title, Price, Shipping Charge, URL for large image, URL for small image, etc.).

I need a way to automate pulling data from our current site's HTML and putting it into an Excel or csv or txt file. There is a lot of info. in the HTML files I don't need (tables info., includes pages, etc.). I would especially need to pull the item title, item description, the bullets with their <ul><li></li></ul> tags (some products have 15-20 bullets of info.), the image URLs, and if possible, the metatag data I've already made.

The only way I know how to do this is to go into FrontPage or our site online for every single product page we already have made and copy / paste the product description into a cell in Excel and then copy / paste the html code for the product's bulleted list into the formula bar of Excel for the next cell of that row, and do the same for the images and the metatag data. Our current website has nearly 800 product pages right now. This will take forever to do it manually. Does anyone know of a way I can automate some or all of this giant task? (currently NOT using a DWT - every page was created manually). Any info. at all will be greatly appreciated. Thank you!

[edited by: phranque at 6:26 am (utc) on July 19, 2008]
[edit reason] No urls, please. See TOS [webmasterworld.com] [/edit]

SteveWh

9:57 pm on Jul 19, 2008 (gmt 0)

10+ Year Member



First, maybe try a web search for an application that extracts data from web pages for import into a database. This is probably a task many others have had to do, so maybe someone wrote a program to help import data to Joomla or one of the other big CMS's. That's probably a longshot, but it's worth a quick search.

Failing that, FrontPage and Expression Web aren't going to be any help on their own. This isn't so much a web design problem as a data manipulation problem. The source data is your web pages treated as text.

If the existing pages are similar to each other, the key to success is going be regular expressions that allow you to extract the data you want and leave behind what you don't.

The method that would give you complete control is to write a C++ program to do the extraction. The .NET Framework (library) provides good regular expressions capabilities. It would scan all your site folders for .htm files, use regular expressions to extract the bits you want from each file, and write the data to a tab-delimited text file which you would then import to Excel. The same could also be done in Visual Basic or C# (both using .NET), and even Perl or PHP (using their facilities, not .NET). You can get Express Editions of Microsoft C++, C#, and VB free online.

Before starting to write the program, you could experiment with a tool like GREP to test how successfully regex can be used to pull the desired data out of the files. Developing the correct regex will be the key, and the tricky part.

Edit:

A 2nd alternative that uses regex differently:

Make a copy of the website, and use a long series of FP or EW regular expressions search and replaces to remove from every file all the content that you don't need to keep. When you're finished, each file will only contain data destined for the database.

That still leaves the problem of getting the data into tab-delimited format, and then copying it from the multiple source files into one big text file.

If there is a Windows equivalent of the Linux SED program, that might do the reformatting, and a series of COPY commands (Windows or Linux) would concatenate the files into one big file.

Anyway, those are at least the types of solutions I think are called for here.

[edited by: SteveWh at 10:12 pm (utc) on July 19, 2008]

littledobby

2:44 am on Jul 20, 2008 (gmt 0)

10+ Year Member



I don't know anything about programming and have taught myself the bits of HTML by using FrontPage and solving issues that I came across in doing so. I like your suggestion here, "Make a copy of the website, and use a long series of FP or EW regular expressions search and replaces to remove from every file all the content that you don't need to keep. When you're finished, each file will only contain data destined for the database." I forgot about trying that - I was going to try something like that a long time ago. That might work to at least help me get rid of the junk. But then, like you said, getting all the data from each of the htm files into just one csv or Excel file isn't something I know how to do. But I will at least try the part about doing a Find and Replace for data I know I don't need. Thank you for your suggestions in the meantime!

gbs1230

3:10 pm on Oct 26, 2008 (gmt 0)

10+ Year Member



Hey there,

May I suggest you do this with Perl. It is specifically designed for web, regular expression matching and file writing.

[edited by: phranque at 12:10 am (utc) on Oct. 27, 2008]

littledobby

4:47 pm on Oct 26, 2008 (gmt 0)

10+ Year Member



We have never used Perl before. We kind of gave up, as web scrapers didn't seem to be what we needed, either. We ended up manually copying and pasting the info. into Excel ourselves, but we aren't done yet. We haven't done any of the items and there are a bunch we haven't done yet under another section entirely. But I know at this point we mostly need all the items. Thank you for your suggestions and any help you can provide. Manually copying and pasting is a stupid way to do it!

[edited by: eelixduppy at 11:35 pm (utc) on Oct. 26, 2008]
[edit reason] removed URL [/edit]

phranque

12:11 am on Oct 27, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



welcome to WebmasterWorld [webmasterworld.com], gbs1230!