Working on a Microsoft Security Bulletin Parser and Statistics
Microsoft is the biggest software vendor worldwide. There is almost no one in the western world who hasn’t heard of Microsoft Office and Microsoft Windows. As part of the support for its countless customers, Microsoft has published so-called “Security Bulletins” for its desktop, workstation and server products since 2000. As of a few days ago, Microsoft has released 649 such updates, starting with MS00-001 on 4 January 2000.
Every update contains a description of the security vulnerability addressed, the affected software, and a workaround or a link to the security update. All this data is squeezed into tables on single HTML pages; it is neither in a parseable format nor available as a download, e.g. as XML.
As I have stated several times that e.g. Microsoft Internet Explorer 6 is the most insecure browser, I wanted to prove what I had guessed. So I spent two hours today building a parser for the Microsoft Security Bulletin data. My current status is that I have 1) been able to crawl all the data, 2) cleaned most of the HTML content, which has already reduced the size of the data by 30%, and 3) extracted the title and the description of each Security Bulletin.
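Steps 2 and 3 could look roughly like the following. This is only a minimal sketch in Python, not the PHP code of the actual parser. The function names `clean_html` and `extract_title` are made up for illustration, and the sketch assumes that the bulletin title sits in the page's `<title>` element, which the real pages may not guarantee.

```python
import html
import re

# Hypothetical patterns: the real bulletin markup is not shown in this post.
SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
WHITESPACE = re.compile(r"\s+")


def clean_html(page: str) -> str:
    """Drop scripts, styles and comments, then collapse runs of whitespace.

    All other tags (including the tables) are kept for later parsing.
    """
    page = SCRIPT_STYLE.sub(" ", page)
    page = COMMENT.sub(" ", page)
    return WHITESPACE.sub(" ", page).strip()


def extract_title(page: str) -> str:
    """Return the text of the <title> element, or "" if there is none."""
    match = TITLE.search(page)
    if match is None:
        return ""
    text = html.unescape(match.group(1))
    return WHITESPACE.sub(" ", text).strip()
```

Regular expressions are brittle on real-world HTML, so a sketch like this only works as long as the pages stay reasonably regular.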
The next step will be to extract the tables that list all affected software and affected operating systems. As the layout of these tables has changed over the years (including changes in the information provided), it will be quite tough to get complete data for all 649 updates. Once the data exists in a parseable format, I will compile some nice statistics and present them right here.
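One possible way to pull the tables out, again sketched in Python and not taken from the actual parser, is the standard library's `html.parser.HTMLParser`. The class name `TableExtractor` and the helper `extract_tables` are hypothetical. The sketch assumes flat tables without nesting and says nothing about which table on a bulletin page holds the affected software, since that is exactly the part that changed over the years.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell texts.

    Nested tables are not handled; this is a sketch for flat tables only.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and normalise whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def extract_tables(page: str):
    """Return all tables of a page as nested lists of cell strings."""
    parser = TableExtractor()
    parser.feed(page)
    parser.close()
    return parser.tables
```

Each layout generation of the bulletins would then need its own small rule for picking the right table and the right columns out of this generic structure.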
First results show that there is a lot of standard software, e.g. Office 2000, Internet Explorer 6 and Windows XP and Server. But I have also already seen some exotic things like CAPICOM, Greetings 2000 and more. The final results will show a complete list of the software that has been updated along with the total counts, and maybe also graphs of how the updates are spread over time.
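Once the affected software is available per bulletin, the counting itself is simple. The following Python sketch assumes a hypothetical data structure, a dict mapping each bulletin ID to the list of product names it affects, and it assumes that the two digits after "MS" in an ID such as MS00-001 are the year. The function names are invented for illustration.

```python
import re
from collections import Counter

# Assumed ID format: "MS", two-digit year, dash, three-digit running number.
BULLETIN_ID = re.compile(r"^MS(\d{2})-(\d{3})$")


def bulletin_year(bulletin_id: str) -> int:
    """Derive the year from an ID; the bulletins start in 2000."""
    match = BULLETIN_ID.match(bulletin_id)
    if match is None:
        raise ValueError(f"not a bulletin id: {bulletin_id!r}")
    return 2000 + int(match.group(1))


def count_by_product(bulletins):
    """Count in how many bulletins each product appears."""
    counts = Counter()
    for products in bulletins.values():
        # set() so a product listed twice in one bulletin counts once.
        counts.update(set(products))
    return counts


def count_by_year(bulletins):
    """Count how many bulletins were released per year."""
    return Counter(bulletin_year(bulletin_id) for bulletin_id in bulletins)
```

`count_by_product(...).most_common()` would give the list of software ordered by number of bulletins, and `count_by_year(...)` the raw numbers for an over-time graph.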
EDIT:
What I actually forgot to write is, of course, what software I’m using: the whole parser is split into several PHP classes, filesystem-based data storage and a lot of regular expressions.
Sources:
(1) http://www.microsoft.com/technet/security/current.aspx