Imputation strategies for missing binary outcomes in cluster randomized trials
Background: Attrition, which leads to missing data, is a common problem in cluster randomized trials (CRTs), where groups of patients rather than individual patients are randomized. Standard multiple imputation (MI) strategies may not be appropriate for imputing missing data from CRTs, since they assume independent data.

Methods: In this paper, under the assumptions of missing completely at random and covariate-dependent missingness, we compared six MI strategies that account for the intra-cluster correlation for missing binary outcomes in CRTs with the standard imputation strategies and the complete case analysis approach, using a simulation study. We considered three within-cluster and three across-cluster MI strategies for missing binary outcomes in CRTs. The three within-cluster MI strategies are the logistic regression method, the propensity score method, and the Markov chain Monte Carlo (MCMC) method, which apply standard MI strategies within each cluster. The three across-cluster MI strategies are the propensity score method, the random-effects (RE) logistic regression approach, and logistic regression with cluster as a fixed effect. Based on the community hypertension assessment trial (CHAT), which has complete data, we designed a simulation study to investigate the performance of the above MI strategies.

Results: The estimated treatment effect and its 95% confidence interval (CI) from the generalized estimating equations (GEE) model based on the complete CHAT dataset are 1.14 (0.76, 1.70). When 30% of the binary outcomes are missing completely at random, the simulation study shows that the estimated treatment effects and the corresponding 95% CIs from the GEE model are 1.15 (0.76, 1.75) if the complete case analysis is used, 1.12 (0.72, 1.73) if the within-cluster MCMC method is used, 1.21 (0.80, 1.81) if the across-cluster RE logistic regression is used, and 1.16 (0.82, 1.64) if standard logistic regression, which does not account for clustering, is used.

Conclusion: When the percentage of missing data is low or the intra-cluster correlation coefficient is small, the different approaches for handling missing binary outcome data generate quite similar results. When the percentage of missing data is high, standard MI strategies, which do not take the intra-cluster correlation into account, underestimate the variance of the treatment effect. The within-cluster and across-cluster MI strategies (except for the random-effects logistic regression MI strategy), which take the intra-cluster correlation into account, appear more appropriate for handling the missing outcome from CRTs. Under the same imputation strategy and percentage of missingness, the treatment effect estimates from the GEE and RE logistic regression models are similar.

1. Introduction

Cluster randomized trials (CRTs), in which groups of participants rather than individuals are randomized, are widely used in health promotion and health services research [1]. When participants have to be managed in the same setting, such as a hospital, a community, or a family physician practice, this randomization strategy is usually adopted to reduce potential treatment contamination between intervention and control participants. It is also used when individual-level randomization may be inappropriate, unethical, or infeasible [2]. The main consequence of a cluster randomized design is that participants cannot be assumed to be independent, because of the similarity of participants from the same cluster. This similarity is quantified by the intra-cluster correlation coefficient (ICC). Considering the two components of variation in the outcome, the between-cluster and within-cluster variances, the ICC can be interpreted as the proportion of the total variance of the outcome that is explained by the between-cluster variation [3]. It can also be interpreted as the correlation between the outcomes of any two participants in the same cluster. It is well established that failing to account for the intra-cluster correlation in the analysis can increase the chance of obtaining statistically significant but spurious results [4].
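Written out explicitly (the symbols are ours, not the trial report's), the ICC described here is the share of total outcome variance attributable to clusters:

\rho_{ICC} = \frac{\sigma_{b}^{2}}{\sigma_{b}^{2} + \sigma_{w}^{2}}

where \sigma_{b}^{2} is the between-cluster variance and \sigma_{w}^{2} is the within-cluster variance; values near 0 mean clustering is negligible, while larger values mean members of the same cluster are more alike than members of different clusters.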
The risk of attrition may be very high in some CRTs because of the lack of direct contact with individual participants and lengthy follow-up [5]. In addition to missing individuals, entire clusters may be missing, which further complicates the handling of missing data in CRTs. The impact of missing data on the results of the statistical analysis depends on the mechanism that caused the data to be missing and on the way the missing data are handled. The default approach is complete case analysis (also called listwise deletion), that is, excluding participants with missing data from the analysis. Although this approach is easy to use and is the default option in most statistical packages, it may substantially weaken the statistical power of the trial and can lead to biased results, depending on the missing data mechanism.

In general, missingness falls into four categories: missing completely at random (MCAR), missing at random (MAR), covariate-dependent (CD) missing, and missing not at random (MNAR) [6]. Understanding these categories is important because the appropriate solutions may differ depending on the nature of the missingness. MCAR means that the missing data mechanism, i.e., the probability of missingness, does not depend on the observed or unobserved data. Both the MAR and CD mechanisms imply that the causes of the missing data are unrelated to the missing values themselves, but may be related to the observed values. In the context of longitudinal data, where sequential measurements are taken on each individual, MAR means that the probability of a missing response at a given visit is related either to the observed responses at previous visits or to the covariates, whereas CD missingness, a special case of MAR, means that the probability of a missing response depends only on the covariates. MNAR means that the probability of missing data depends on the unobserved data; it typically occurs when people drop out of the study because of poor or good health outcomes. A key distinction among these categories is that MNAR is non-ignorable, whereas the other three (MCAR, CD, and MAR) are ignorable [7]. Under ignorable missingness, imputation strategies such as mean imputation, hot deck imputation, last observation carried forward, or multiple imputation (MI), which replace each missing value with one or several plausible values, can produce a complete dataset that is not adversely biased [8, 9]. Non-ignorable missing data are more challenging and require a different approach [10].

The two main approaches to handling missing outcomes are likelihood-based analyses and imputation [10]. In this paper, we focus on MI strategies, which take into account the variability or uncertainty of the missing data, to impute the missing binary outcome in CRTs. Under the MAR assumption, MI strategies replace each missing value with a set of plausible values to create multiple imputed datasets, usually between 3 and 10 [11]. These multiply imputed datasets are analyzed using standard complete-data procedures, and the results from the imputed datasets are then combined to generate the final inference. Standard MI procedures are available in many statistical software packages, such as SAS (Cary, NC), SPSS (Chicago, IL), and Stata (College Station, TX). However, these procedures assume that observations are independent and may not be appropriate for CRTs, because they do not account for the intra-cluster correlation.

To our knowledge, only limited investigation has been conducted on strategies for imputing missing binary or categorical outcomes in CRTs. Yi and Cook reported marginal methods for missing longitudinal data arising in clusters [12]. Hunsberger et al. [13] described three strategies for handling missing data in a CRT: 1) a multiple imputation procedure in which missing values are replaced by values resampled from the observed data; 2) a median procedure based on the Wilcoxon rank sum test in which the missing data in the intervention group are assigned the worst ranks; and 3) a multiple imputation procedure in which missing values are replaced by predicted values from a regression equation. Nixon et al. [14] proposed strategies for imputing a missing true endpoint from a surrogate.
In an analysis of a continuous outcome from the Community Intervention Trial for Smoking Cessation (COMMIT), Green et al. stratified the individual participants into subgroups that were more homogeneous with respect to the expected outcome and, within each stratum, imputed the missing outcome using the observed data [15, 16]. Taljaard et al. [17] compared several imputation strategies for missing continuous outcomes in CRTs under the assumption of missing completely at random. These strategies include cluster mean imputation, within-cluster MI using the approximate Bayesian bootstrap (ABB) method, pooled MI using the ABB method, standard regression MI, and mixed-effects regression MI. As Kenward et al. pointed out, if a substantive model that reflects the data structure is used, such as a generalized linear mixed model, it is important that the imputation model also reflects this structure [18].

The objectives of this paper are: 1) to investigate the performance of various imputation strategies for missing binary outcomes in CRTs under different percentages of missingness, assuming a missing completely at random or covariate-dependent missing mechanism; 2) to compare the agreement between the complete dataset and the imputed datasets obtained from the different imputation strategies; and 3) to compare the robustness of the results under two commonly used statistical analysis methods, generalized estimating equations (GEE) and random-effects (RE) logistic regression, under the different imputation strategies.

2. Methods

In this paper, we consider three within-cluster and three across-cluster MI strategies for missing binary outcomes in CRTs. The three within-cluster strategies are the logistic regression method, the propensity score method, and the MCMC method, which apply standard MI strategies within each cluster. The three across-cluster MI strategies are the propensity score method, the random-effects logistic regression method, and logistic regression with cluster as a fixed effect. Based on the complete dataset from the community hypertension assessment trial (CHAT), we conducted a simulation study to investigate the performance of the above MI strategies. We used the kappa statistic to compare the agreement between the imputed datasets and the complete dataset. We also used the estimated treatment effects obtained from the GEE and RE logistic regression models [19] to assess the robustness of the results under different percentages of missing binary outcomes under the MCAR and CD missing assumptions.

2.1. Complete case analysis

With this approach, only patients with complete data are included in the analysis, while patients with missing data are excluded. When the data are MCAR, the complete case analysis approach, using either a likelihood-based analysis such as RE logistic regression or a marginal model such as the GEE approach, is valid for analyzing binary outcomes from CRTs because the missing data mechanism is independent of the outcome. When the data are CD missing, both RE logistic regression and the GEE approach are valid provided the known covariates associated with the missing data mechanism are adjusted for. The approach can be implemented using the GENMOD and NLMIXED procedures in SAS.

2.2. Standard multiple imputation

Assuming that the observations are independent, we can apply the standard MI procedures provided by any standard statistical software such as SAS. Three widely used MI methods are the predictive model method (the logistic regression method for binary data), the propensity score method, and the MCMC method [20]. In general, both the propensity score method and the MCMC method are recommended for imputing continuous variables [21]. A dataset is said to have a monotone missing pattern when the event that a measurement Y_j is missing for an individual implies that all subsequent measurements Y_k, k > j, are also missing for that individual. When the data have a monotone missing pattern, either a parametric predictive model or a nonparametric method based on propensity scores, as well as the MCMC method, is appropriate [21]. For arbitrary missing data patterns, the MCMC method, which assumes multivariate normality, can be used [10].
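To make the standard MI workflow concrete, here is a minimal SAS sketch of imputing a single incomplete binary outcome and pooling a GEE analysis over the imputations. It is our illustration, not the authors' code (their code for the across-cluster random-effects method is in Appendix A): the dataset name chat, the variable names (bpcontrolled, group, age, sex, diabbase, hdbase, basebpcontrolled, fp), and the seed are placeholders, and this standard approach deliberately ignores the clustering that the rest of the paper is concerned with.

/* Step 1: m = 5 imputations of the binary outcome with the (monotone) logistic regression method.
   The paper imputes separately by treatment arm; sorting by group and adding a BY statement would mimic that. */
proc mi data=chat out=chat_mi nimpute=5 seed=2011;
   class bpcontrolled;
   var group age sex diabbase hdbase basebpcontrolled bpcontrolled;
   monotone logistic(bpcontrolled = group age sex diabbase hdbase basebpcontrolled);
run;

/* Step 2: analyze each imputed dataset with a GEE logistic model
   (exchangeable working correlation within clusters). */
proc genmod data=chat_mi;
   by _imputation_;
   class fp;
   model bpcontrolled(event='1') = group age sex diabbase hdbase basebpcontrolled / dist=bin link=logit;
   repeated subject=fp / type=exch;
   ods output GEEEmpPEst=gee_est;
run;

/* Step 3: combine the per-imputation estimates with Rubin's rules. */
proc mianalyze parms=gee_est;
   modeleffects Intercept group age sex diabbase hdbase basebpcontrolled;
run;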
These MI strategies are implemented using the MI, MIANALYZE, GENMOD, and NLMIXED procedures in SAS, separately for each intervention group.

2.2.1. Logistic regression method

In this approach, a logistic regression model is fitted using the observed outcomes and covariates [21]. Based on the parameter estimates and the associated covariance matrix, the posterior predictive distribution of the parameters can be constructed. A new logistic regression model is then simulated from the posterior predictive distribution of the parameters and used to impute the missing values.

2.2.2. Propensity score method

The propensity score is the conditional probability of being missing given the observed data. It can be estimated by a logistic regression model with a binary outcome indicating whether the data are missing or not. The observations are then divided into a number of strata based on these propensity scores, and the approximate Bayesian bootstrap (ABB) procedure [22] is applied within each stratum. The ABB imputation first draws from the observed data with replacement to create a new dataset, which is a nonparametric analogue of drawing parameters from the posterior predictive distribution of the parameters, and then draws the imputed values randomly with replacement from this new dataset.

2.2.3. Markov chain Monte Carlo method

With the MCMC method, pseudo-random samples are drawn from a target probability distribution [21]. The target distribution is the joint conditional distribution of Y_mis and θ given Y_obs when the missing data have a non-monotone pattern, where Y_mis and Y_obs denote the missing and observed data, respectively, and θ denotes the unknown parameters. The MCMC method proceeds as follows: Y_mis is replaced by some assumed values, and θ is then simulated from the resulting complete-data posterior distribution P(θ | Y_obs, Y_mis). Let θ(t) be the current simulated value; Y_mis(t+1) can then be drawn from the conditional predictive distribution P(Y_mis | Y_obs, θ(t)). Conditioning on Y_mis(t+1), the next simulated value θ(t+1) can be drawn from its complete-data posterior distribution P(θ | Y_obs, Y_mis(t+1)). By repeating this procedure, we can generate a Markov chain that converges in distribution to P(Y_mis, θ | Y_obs). This method is attractive because it avoids a complicated analytical calculation of the posterior distribution of θ and Y_mis. However, convergence in distribution is an issue that researchers need to address. In addition, the method is based on the multivariate normality assumption. When it is used to impute binary variables, the imputed values can be any real values; most fall between 0 and 1, but some lie outside this range. We round the imputed values to 0 if they are less than 0.5 and to 1 otherwise. This multiple imputation method is implemented using the MI procedure in SAS. We use a single chain and a noninformative prior for all imputations, and the expectation-maximization (EM) algorithm to find the maximum likelihood estimates in parametric models for incomplete data and to derive parameter estimates from the posterior mode. The iterations are considered to have converged when the change in the parameter estimates between iteration steps is less than 0.0001 for each parameter.

2.3. Within-cluster multiple imputation

Standard MI strategies may not be appropriate for handling missing data from CRTs because of the assumption of independent observations. For within-cluster imputation, we implement the standard MI described above, using the logistic regression method, the propensity score method, and the MCMC method, separately for each cluster. The missing values are therefore imputed using the observed data from the same cluster as the missing values. Given that subjects within the same cluster are more likely to be similar to each other than subjects in different clusters, within-cluster imputation can be regarded as a strategy for imputing missing values that accounts for the intra-cluster correlation. These MI strategies are implemented using the MI, MIANALYZE, GENMOD, and NLMIXED procedures in SAS.

2.4. Across-cluster multiple imputation

2.4.1. Propensity score method

Compared with standard multiple imputation using the propensity score method, we added the cluster as one of the covariates when estimating the propensity score for each observation.
As a result, patients within the same cluster are more likely to be classified into the same propensity score stratum. The intra-cluster correlation is therefore taken into account when the ABB procedure is applied within each stratum to generate the imputed values for the missing data. This multiple imputation strategy is implemented using the MI, MIANALYZE, GENMOD, and NLMIXED procedures in SAS.

2.4.2. Random-effects logistic regression

Compared with the predictive model using the standard logistic regression method, we assume that the binary outcome is modeled by a random-effects logistic model:

logit(Pr(Y_ijl = 1)) = X_ijl β + u_ij,   u_ij ~ N(0, σ_B^2)

where Y_ijl is the binary outcome of patient l in cluster j of intervention group i, X_ijl is the matrix of fully observed individual-level or cluster-level covariates, u_ij is the cluster-level random effect, and σ_B^2 is the between-cluster variance. Both β and σ_B^2 can be estimated by fitting the random-effects logistic regression model to the observed outcomes and covariates. The MI strategy using the random-effects logistic regression method obtains the imputed values in three steps: 1) fit the random-effects logistic regression model described above using the observed outcome and covariates; 2) based on the estimates of β and σ_B obtained from step 1 and the associated covariance matrix, construct the posterior predictive distribution of these parameters; and 3) fit a new random-effects logistic regression using parameters simulated from the posterior predictive distribution and the observed covariates to impute the missing outcomes. The MI strategy using random-effects logistic regression accounts for the between-cluster variability, which is ignored in the MI strategy using standard logistic regression, and may therefore be valid for imputing missing binary data in CRTs. We provide the SAS code for this method in Appendix A.

2.4.3. Logistic regression with cluster as a fixed effect

Compared with the predictive model using the standard logistic regression method, we add the cluster as a fixed effect to account for the clustering effect. This multiple imputation strategy is implemented using the MI, MIANALYZE, GENMOD, and NLMIXED procedures in SAS.

3. Simulation study

3.1. The community hypertension assessment trial

The CHAT study has been reported in detail elsewhere [23]. In brief, it was a cluster randomized controlled trial that evaluated the effectiveness of pharmacy-based blood pressure (BP) clinics led by peer health educators, with feedback to family physicians (FPs), on the management and monitoring of BP among patients aged 65 years or older. The FP was the unit of randomization, and patients from the same FP received the same intervention. In total, 28 FPs participated in the study: 14 were randomly allocated to the intervention group (pharmacy BP clinics) and 14 to the control group (no BP clinics offered). Fifty-five patients were randomly selected from the roster of each FP, so 1540 patients participated in the study. All eligible patients in both the intervention and control groups received the usual health service at their FP's office. Patients in the practices allocated to the intervention group were invited to visit the community BP clinics, where peer health educators assisted them in measuring their BP and reviewing their cardiovascular risk factors. Research nurses conducted the baseline and end-of-trial (12 months after randomization) audits of the health records of the 1540 patients who participated in the study. The primary outcome of the CHAT study was a binary outcome indicating whether or not a patient's BP was controlled at the end of the trial. A patient's BP was considered controlled if, at the end of the trial, systolic BP was at most 140 mm Hg and diastolic BP at most 90 mm Hg for a patient without diabetes or target organ damage, or systolic BP was at most 130 mm Hg and diastolic BP at most 80 mm Hg for a patient with diabetes or target organ damage.
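For the treatment effects reported next, the two analysis models referenced throughout (GEE and random-effects logistic regression) can be fit in SAS roughly as follows. This is a hedged sketch with assumed names (dataset chat, outcome bpcontrolled, treatment indicator group, cluster identifier fp, and the baseline covariates listed in the next paragraph); the paper's Appendix A fits the random-effects model with PROC NLMIXED, and PROC GLIMMIX is used here only because it is more compact.

/* GEE logistic regression: population-averaged odds ratio,
   exchangeable working correlation within family practices. */
proc genmod data=chat;
   class fp;
   model bpcontrolled(event='1') = group age sex diabbase hdbase basebpcontrolled / dist=bin link=logit;
   repeated subject=fp / type=exch;
run;

/* Random-effects logistic regression: cluster-specific odds ratio
   with a random intercept for each family practice. */
proc glimmix data=chat method=quad;
   class fp;
   model bpcontrolled(event='1') = group age sex diabbase hdbase basebpcontrolled / dist=binary link=logit solution;
   random intercept / subject=fp;
run;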
Besides the intervention group, the other predictors investigated in this paper were age (continuous), sex (binary), diabetes at baseline (binary), heart disease at baseline (binary), and whether the patient's BP was controlled at baseline (binary). At the end of the trial, BP was controlled for 55% of the patients. Without including any other predictors in the model, the treatment effects and 95% confidence intervals (CIs) estimated from the GEE and RE models were 1.14 (0.72, 1.80) and 1.10 (0.65, 1.86), respectively, and the estimated ICC was 0.077. After adjusting for the covariates listed above, the treatment effects and 95% CIs estimated from the GEE and RE models were 1.14 (0.76, 1.70) and 1.12 (0.72, 1.76), respectively, and the estimated ICC was 0.055. Since there are no missing data in the CHAT dataset, it provides a convenient platform for designing a simulation study to compare the imputed values with the observed values and to investigate the performance of different multiple imputation strategies under different missing data mechanisms and percentages of missingness.

3.2. Generating datasets with missing binary outcomes

Using the CHAT study dataset, we investigated the performance of the different MI strategies for missing binary outcomes under the MCAR and CD missing mechanisms. Under the MCAR assumption, we generated datasets with a given percentage of the binary outcome missing, where the outcome indicates whether or not each patient's BP was controlled at the end of the trial; the probability of missingness for each patient was completely random, i.e., it did not depend on any observed or unobserved CHAT data. Under the CD missing assumption, we took sex, treatment group, and whether or not the patient's BP was controlled at baseline (factors commonly associated with dropout in clinical trials and observational studies [24-26]) to be associated with the probability of missingness. We assumed that male patients were 1.2 times more likely to have a missing outcome than female patients, that patients allocated to the control group were 1.3 times more likely to have a missing outcome than patients in the intervention group, and that patients whose BP was not controlled at baseline were 1.4 times more likely to have a missing outcome than patients whose BP was controlled at baseline.

3.3. Simulation study design

First, we compared the agreement between the imputed values of the outcome variable and the true values of the outcome variable using the kappa statistic. The kappa statistic is the most commonly used statistic for assessing agreement between two observers or methods, taking into account the fact that they will sometimes agree or disagree simply by chance [27]. It is calculated from the difference between the amount of agreement actually present and the amount of agreement expected by chance alone. A kappa of 1 indicates perfect agreement, and 0 indicates agreement equivalent to chance. The kappa statistic has been widely used to assess the performance of different imputation techniques for imputing missing categorical data [28, 29]. Second, under MCAR and CD missingness, we compared the treatment effect estimates from the RE and GEE methods under the following scenarios: 1) excluding the missing values from the analysis, i.e., complete case analysis; 2) applying standard multiple imputation strategies that do not take the intra-cluster correlation into account; 3) applying within-cluster MI strategies; and 4) applying across-cluster MI strategies.

We designed the simulation study according to the following steps: 1) Generate 5%, 10%, 15%, 20%, 30%, and 50% missing outcomes under each of the MCAR and CD missing assumptions; these amounts of missingness were chosen to cover the range of missingness likely in practice [30]. 2) Apply the multiple imputation strategies above to generate m = 5 imputed datasets; according to Rubin, the relative efficiency of MI does not increase much when more than 5 imputed datasets are generated [11]. 3) Compute the kappa statistic to assess the agreement between the imputed values of the outcome variable and its true values. 4) Obtain a single treatment effect estimate by combining the effect estimates from the 5 imputed datasets using the GEE and RE models.
Repeat the above four steps 1000 times, i.e., conduct 1000 simulation runs. Compute the overall kappa statistic by averaging the kappa statistics from the 1000 simulation runs. Compute the overall treatment effect and its standard error by averaging the treatment effects and their standard errors from the 1000 simulation runs.

4. Results

4.1. Results when data are missing completely at random

With 5%, 10%, 15%, 20%, 30%, or 50% missingness under the MCAR assumption, the estimated kappa for all of the different imputation strategies is slightly above 0.95, 0.90, 0.85, 0.80, 0.70, and 0.50, respectively. The kappa estimates for the different imputation strategies at the different percentages of missing outcomes under the MCAR assumption are presented in detail in Table 1 (kappa statistics for different imputation strategies when data are missing completely at random). A further table reports the estimated treatment effect from the random-effects logistic regression when 30% of the data are covariate-dependent missing.

5. Discussion

In this paper, under the MCAR and CD missing assumptions, we compared six MI strategies that account for the intra-cluster correlation for missing binary outcomes in CRTs with the standard imputation strategies and the complete case analysis approach, using a simulation study. Our results show that when the percentage of missing data is low or the intra-cluster correlation coefficient is small, the different imputation strategies and the complete case analysis approach generate quite similar results. Second, the standard MI strategies, which do not take the intra-cluster correlation into account, underestimate the variance of the treatment effect; they may therefore lead to statistically significant but spurious conclusions when used to handle missing data from CRTs. Third, under the MCAR and CD missing assumptions, the point estimates (ORs) are quite similar across the different approaches for handling missing data, except for the random-effects logistic regression MI strategy. Fourth, the within-cluster and across-cluster MI strategies take the intra-cluster correlation into account and provide more conservative treatment effect estimates than the MI strategies that ignore the clustering effect. Fifth, the within-cluster imputation strategies yield wider CIs than the across-cluster imputation strategies, especially when the percentage of missingness is high; this may be because the within-cluster strategies use only a small portion of the data, which leads to large variation in the estimated treatment effect. Sixth, a larger estimated kappa, indicating higher agreement between the imputed and observed values, is associated with better performance of the MI strategies in terms of generating an estimated treatment effect and 95% CI closer to those obtained from the complete CHAT dataset. Seventh, under the same imputation strategy and percentage of missingness, the treatment effect estimates from the GEE and RE logistic regression models are similar.

To our knowledge, limited work has been done on comparing different multiple imputation strategies for missing binary outcomes in CRTs. Taljaard et al. [17] compared four MI strategies (pooled ABB, within-cluster ABB, standard regression, and mixed-effects regression) for missing continuous outcomes in CRTs when missingness is completely at random; their results are similar to ours. It should be noted that the within-cluster strategies may be applicable only when the cluster size is large enough and the percentage of missingness is relatively small. In the CHAT study, there were 55 patients in each cluster, which provided enough data to implement the within-cluster imputation strategies using the propensity score and MCMC methods. However, the logistic regression method failed when the percentage of missingness was high. This was because, when generating a large percentage (20% or more) of missing outcomes, all patients with a binary outcome of 0 were simulated as missing in some clusters, and the logistic regression model therefore failed for those particular clusters. In addition, our results show that the complete case analysis approach performs relatively well even with 50% missingness.
We believe that, because of the intra-cluster correlation, one would not expect the missing values to have a large impact as long as a large proportion of each cluster is still present. However, further investigation of this issue using a simulation study would be useful. Our results also show that the across-cluster random-effects logistic regression strategy leads to a potentially biased estimate, especially when the percentage of missingness is high. As mentioned in Section 2.4.2, we assume that the cluster-level random effects follow a normal distribution, i.e., u_ij ~ N(0, σ_B^2). Researchers have shown that misspecification of the distributional shape has little effect on inferences about the fixed effects [31]. Incorrectly assuming that the random-effects distribution is independent of the cluster size may affect inferences about the intercept, but does not seriously affect inferences about the regression coefficients. However, incorrectly assuming that the random-effects distribution is independent of the covariates may seriously affect inferences about the regression parameters [32]. In our dataset, the mean of the random-effects distribution could be related to a covariate, or its variance could be related to a covariate, which may explain the potential bias of the across-cluster random-effects logistic regression strategy. In contrast, the across-cluster MI strategy using logistic regression with cluster as a fixed effect performs better; however, it may only be applied when the cluster size is large enough to provide a stable estimate of the cluster effect.

For multiple imputation, the overall variance of the estimated treatment effect consists of two parts: the within-imputation variance U and the between-imputation variance B. The total variance T is calculated as T = U + (1 + 1/m)B, where m is the number of imputed datasets [10]. Since standard MI strategies ignore the between-cluster variance and fail to account for the intra-cluster correlation, the within-imputation variance may be underestimated, which could lead to underestimation of the total variance and consequently a narrower confidence interval. In addition, the adequacy of standard MI strategies depends on the ICC. In our study, the ICC of the CHAT dataset is 0.055 and the cluster effect in the random-effects model is statistically significant. Among the three imputation methods (the predictive model or logistic regression method, the propensity score method, and the MCMC method), the MCMC method is the most popular method for multiple imputation of missing data and is the default method implemented in SAS. Although it is widely used to impute binary and polytomous data, there are concerns about the consequences of violating the normality assumption. Experience has repeatedly shown that multiple imputation using the MCMC method tends to be quite robust even when the real data depart from the multivariate normal distribution [20]. Therefore, when handling missing binary or ordered categorical variables, it is acceptable to impute under a normality assumption and then round off the continuous imputed values to the nearest category. For example, the imputed values for a missing binary variable can be any real value rather than being restricted to 0 and 1; we rounded the imputed values so that values greater than or equal to 0.5 were set to 1 and values less than 0.5 were set to 0 [34]. Horton et al. [35] showed that such rounding may produce biased estimates of proportions when the true proportion is near 0 or 1, but does well under most other conditions. The propensity score method was originally designed to impute missing values on the response variables from a randomized experiment with repeated measures [21].
Since it uses only the covariate information associated with the missingness but ignores the correlations among the variables, it may produce badly biased estimates of regression coefficients when data on predictor variables are missing. In addition, with small sample sizes and a relatively large number of propensity score groups, application of the ABB method is problematic, especially for binary variables; in this case, a modified version of the ABB should be used [36].

There are some limitations that need to be acknowledged and addressed regarding the present study. First, the simulation study is based on a real dataset, which has a relatively large cluster size and a small ICC; further research should investigate the performance of different imputation strategies under different design settings. Second, the scenario of missing an entire cluster is not investigated in this paper; the proposed within-cluster and across-cluster MI strategies may not apply to that scenario. Third, we investigate the performance of the different MI strategies assuming MCAR and CD missing data mechanisms, so the results cannot be generalized to MAR or MNAR scenarios. Fourth, since the estimated treatment effects are similar under the different imputation strategies, we only presented the OR and 95% CI for each simulation scenario; estimates of standardized bias and coverage would be more informative and would also provide a quantitative guideline for assessing the adequacy of the imputations [37].

6. Conclusions

When the percentage of missing data is low or the intra-cluster correlation coefficient is small, different imputation strategies or the complete case analysis approach generate quite similar results. When the percentage of missing data is high, standard MI strategies, which do not take into account the intra-cluster correlation, underestimate the variance of the treatment effect. Within-cluster and across-cluster MI strategies (except for the random-effects logistic regression MI strategy), which take the intra-cluster correlation into account, seem to be more appropriate for handling the missing outcome from CRTs. Under the same imputation strategy and percentage of missingness, the estimates of the treatment effect from the GEE and RE logistic regression models are similar.

Appendix A: SAS code for the across-cluster random-effects logistic regression method

%let maximum = 1000;
ods listing close;
proc nlmixed data=mcar&percent&index cov;
   parms b0=-0.0645 bgroup=-0.1433 bdiabbase=-0.04 bhdbase=0.1224
         bage=-0.0066 bbasebpcontrolled=1.1487 bsex=0.0873 s2u=0.5;

Population Health Research Institute, Hamilton Health Sciences

References

1. Campbell MK, Grimshaw JM: Cluster randomised trials: time for improvement. The implications of adopting a cluster design are still largely being ignored. BMJ 1998, 317(7167):1171-1172.
2. COMMIT Research Group: Community Intervention Trial for Smoking Cessation (COMMIT): I. Cohort results from a four-year community intervention. Am J Public Health 1995, 85:183-192.
3. Donner A, Klar N: Design and Analysis of Cluster Randomisation Trials in Health Research. London: Arnold; 2000.
4. Cornfield J: Randomization by group: a formal analysis. Am J Epidemiol 1978, 108(2):100-102.
5. Donner A, Brown KS, Brasher P: A methodological review of non-therapeutic intervention trials employing cluster randomization, 1979-1989. Int J Epidemiol 1990, 19(4):795-800.
6. Rubin DB: Inference and missing data. Biometrika 1976, 63:581-592.
7. Allison PD: Missing Data. SAGE Publications Inc; 2001.
8. Schafer JL, Olsen MK: Multiple imputation for multivariate missing-data problems: a data analyst's perspective. Multivariate Behavioral Research 1998, 33:545-571.
9. McArdle JJ: Structural factor analysis experiments with incomplete data. Multivariate Behavioral Research 1994, 29:409-454.
10. Little RJA, Rubin DB: Statistical Analysis with Missing Data. 2nd edition. New York: John Wiley; 2002.
11. Rubin DB: Multiple Imputation for Nonresponse in Surveys. New York, NY: John Wiley & Sons, Inc; 1987.
12. Yi GYY, Cook RJ: Marginal methods for incomplete longitudinal data arising in clusters. Journal of the American Statistical Association 2002, 97(460):1071-1080.
13. Hunsberger S, Murray D, Davis CE, Fabsitz RR: Imputation strategies for missing data in a school-based multi-centre study: the Pathways study. Stat Med 2001, 20(2):305-316.
14. Nixon RM, Duffy SW, Fender GR: Imputation of a true endpoint from a surrogate: application to a cluster randomized controlled trial with partial information on the true endpoint. BMC Med Res Methodol 2003, 3:17.
15. Green SB, Corle DK, Gail MH, Mark SD, Pee D, Freedman LS, Graubard BI, Lynn WR: Interplay between design and analysis for behavioral intervention trials with community as the unit of randomization. Am J Epidemiol 1995, 142(6):587-593.
16. Green SB: The advantages of community-randomized trials for evaluating lifestyle modification. Control Clin Trials 1997, 18(6):506-513; discussion 514-516.
17. Taljaard M, Donner A, Klar N: Imputation strategies for missing continuous outcomes in cluster randomized trials. Biom J 2008, 50(3):329-345.
18. Kenward MG, Carpenter J: Multiple imputation: current perspectives. Stat Methods Med Res 2007, 16(3):199-218.
19. Dobson AJ: An Introduction to Generalized Linear Models. 2nd edition. Boca Raton: Chapman & Hall/CRC; 2002.
20. Schafer JL: Analysis of Incomplete Multivariate Data. London: Chapman and Hall; 1997.
21. SAS Publishing: SAS/STAT 9.1 User's Guide.
22. Rubin DB, Schenker N: Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. Journal of the American Statistical Association 1986, 81(394):366-374.
23. Ma J, Thabane L, Kaczorowski J, Chambers L, Dolovich L, Karwalajtys T, Levitt C: Comparison of Bayesian and classical methods in the analysis of cluster randomized controlled trials with a binary outcome: the Community Hypertension Assessment Trial (CHAT). BMC Med Res Methodol 2009, 9:37.
24. Levin KA: Study design VII. Randomised controlled trials. Evid Based Dent 2007, 8(1):22-23.
25. Matthews FE, Chatfield M, Freeman C, McCracken C, Brayne C, MRC CFAS: Attrition and bias in the MRC cognitive function and ageing study: an epidemiological investigation. BMC Public Health 2004, 4:12.
26. Ostbye T, Steenhuis R, Wolfson C, Walton R, Hill G: Predictors of five-year mortality in older Canadians: the Canadian Study of Health and Aging. J Am Geriatr Soc 1999, 47(10):1249-1254.
27. Viera AJ, Garrett JM: Understanding interobserver agreement: the kappa statistic. Fam Med 2005, 37(5):360-363.
28. Laurenceau JP, Stanley SM, Olmos-Gallo A, Baucom B, Markman HJ: Community-based prevention of marital dysfunction: multilevel modeling of a randomized effectiveness study. J Consult Clin Psychol 2004, 72(6):933-943.
29. Shrive FM, Stuart H, Quan H, Ghali WA: Dealing with missing data in a multi-question depression scale: a comparison of imputation methods. BMC Med Res Methodol 2006, 6:57.
30. Elobeid MA, Padilla MA, McVie T, Thomas O, Brock DW, Musser B, Lu K, Coffey CS, Desmond RA, St-Onge MP, Gadde KM, Heymsfield SB, Allison DB: Missing data in randomized clinical trials for weight loss: scope of the problem, state of the field, and performance of statistical methods. PLoS One 2009, 4(8):e6624.
31. McCulloch CE, Neuhaus JM: Prediction of random effects in linear and generalized linear models under model misspecification. Biometrics.
32. Neuhaus JM, McCulloch CE: Separating between- and within-cluster covariate effects using conditional and partitioning methods. Journal of the Royal Statistical Society, Series B 2006, 68:859-872.
33. Heagerty PJ, Kurland BF: Misspecified maximum likelihood estimates and generalised linear mixed models. Biometrika 2001, 88(4):973-985.
34. Christopher FA: Rounding after multiple imputation with non-binary categorical covariates. SAS Focus Session, SUGI 30, 2004.
35. Horton NJ, Lipsitz SR, Parzen M: A potential for bias when rounding in multiple imputation. American Statistician 2003, 57:229-232.
36. Li X, Mehrotra DV, Barnard J: Analysis of incomplete longitudinal binary data using multiple imputation. Stat Med 2006, 25(12):2107-2124.
37. Collins LM, Schafer JL, Kam CM: A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychol Methods 2001, 6(4):330-351.

Ma et al.; licensee BioMed Central Ltd. 2011. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Multiple Imputation in Stata: Imputing

This is part four of the Multiple Imputation in Stata series. For a list of topics covered by this series, see the Introduction. This section will talk you through the details of the imputation process. Be sure you've read at least the previous section, Creating Imputation Models,
so you have a sense of what issues can affect the validity of your results.

Example Data

To illustrate the process, we'll use a fabricated data set. Unlike those in the examples section, this data set is designed to have some resemblance to real world data.

female (binary)
race (categorical, three values)
urban (binary)
edu (ordered categorical, four values)
exp (continuous)
wage (continuous)

Missingness: each value of all the variables except female has a 10% chance of being missing completely at random, but of course in the real world we won't know that it is MCAR ahead of time. Thus we will check whether it is MCAR or MAR (MNAR cannot be checked by looking at the observed data) using the procedure outlined in Deciding to Impute:

unab numvars: _all
unab missvars: urban-wage
misstable sum, gen(miss_)
foreach var of local missvars {
    local covars: list numvars - var
    display _newline(3) "logit missingness of `var' on `covars'"
    logit miss_`var' `covars'
    foreach nvar of local covars {
        display _newline(3) "ttest of `nvar' by missingness of `var'"
        ttest `nvar', by(miss_`var')
    }
}

See the log file for results. Our goal is to regress wages on sex, race, education level, and experience. To see the "right" answers, open the do file that creates the data set and examine the gen command that defines wage. Complete code for the imputation process can be found in the following do file. The imputation process creates a lot of output. We'll put highlights in this page; a complete log file including the associated graphs can be found here. Each section of this article will have links to the relevant section of the log. Click "back" in your browser to return to this page.

Setting up

The first step in using mi commands is to mi set your data. This is somewhat similar to svyset, tsset, or xtset. The mi set command tells Stata how it should store the additional imputations you'll create. We suggest using the wide format, as it is slightly faster. On the other hand, mlong uses slightly less memory.

To have Stata use the wide data structure, type:

mi set wide

To have Stata use the mlong (marginal long) data structure, type:

mi set mlong

The wide vs. long terminology is borrowed from reshape and the structures are similar. However, they are not equivalent and you would never use reshape to change the data structure used by mi. Instead, type mi convert wide or mi convert mlong (add ", clear" if the data have not been saved since the last change). Most of the time you don't need to worry about how the imputations are stored: the mi commands figure out automatically how to apply whatever you do to each imputation. But if you need to manipulate the data in a way mi can't do for you, then you'll need to learn about the details of the structure you're using. You'll also need to be very, very careful. If you're interested in such things (including the rarely used flong and flongsep formats) run this do file and read the comments it contains while examining the data browser to see what the data look like in each form.

Registering Variables

The mi commands recognize three kinds of variables:

Imputed variables are variables that mi is to impute or has imputed.
Regular variables are variables that mi is not to impute, either by choice or because they are not missing any values.
Passive variables are variables that are completely determined by other variables. For example, log wage is determined by wage, or an indicator for obesity might be determined by a function of weight and height.
Interaction terms are also passive variables, though if you use Stata's interaction syntax you won't have to declare them as such. Passive variables are often problematic: the examples on transformations, non-linearity, and interactions show how using them inappropriately can lead to biased estimates. If a passive variable is determined by regular variables, then it can be treated as a regular variable since no imputation is needed. Passive variables only have to be treated as such if they depend on imputed variables.

Registering a variable tells Stata what kind of variable it is. Imputed variables must always be registered:

mi register imputed varlist

where varlist should be replaced by the actual list of variables to be imputed. Regular variables often don't have to be registered, but it's a good idea:

mi register regular varlist

Passive variables must be registered:

mi register passive varlist

However, passive variables are more often created after imputing. Do so with mi passive and they'll be registered as passive automatically.

In our example data, all the variables except female need to be imputed. The appropriate mi register command is:

mi register imputed race-wage

(Note that you cannot use _all as your varlist even if you have to impute all your variables, because that would include the system variables added by mi set to keep track of the imputation structure.) Registering female as regular is optional, but a good idea:

mi register regular female

Checking the Imputation Model

Based on the types of the variables, the obvious imputation methods are:

race (categorical, three values): mlogit
urban (binary): logit
edu (ordered categorical, four values): ologit
exp (continuous): regress
wage (continuous): regress

female does not need to be imputed, but should be included in the imputation models both because it is in the analysis model and because it's likely to be relevant. Before proceeding to impute we will check each of the imputation models. Always run each of your imputation models individually, outside the mi impute chained context, to see if they converge and (insofar as it is possible) verify that they are specified correctly. Code to run each of these models is:

mlogit race i.urban exp wage i.edu i.female
logit urban i.race exp wage i.edu i.female
ologit edu i.urban i.race exp wage i.female
regress exp i.urban i.race wage i.edu i.female
regress wage i.urban i.race exp i.edu i.female

Note that when categorical variables (ordered or not) appear as covariates, i. expands them into sets of indicator variables. As we'll see later, the output of the mi impute chained command includes the commands for the individual models it runs. Thus a useful shortcut, especially if you have a lot of variables to impute, is to set up your mi impute chained command with the dryrun option to prevent it from doing any actual imputing, run it, and then copy the commands from the output into your do file for testing.

Convergence Problems

The first thing to note is that all of these models run successfully. Complex models like mlogit may fail to converge if you have large numbers of categorical variables, because that often leads to small cell sizes. To pin down the cause of the problem, remove most of the variables, make sure the model works with what's left, and then add variables back one at a time or in small groups until it stops working. With some experimentation you should be able to identify the problem variable or combination of variables.
At that point you'll have to decide if you can combine categories or drop variables or make other changes in order to create a workable model.

Perfect Prediction

Perfect prediction is another problem to note. The imputation process cannot simply drop the perfectly predicted observations the way logit can. You could drop them before imputing, but that seems to defeat the purpose of multiple imputation. The alternative is to add the augment (or just aug) option to the affected methods. This tells mi impute chained to use the "augmented regression" approach, which adds fake observations with very low weights in such a way that they have a negligible effect on the results but prevent perfect prediction. For details see the section "The issue of perfect prediction during imputation of categorical data" in the Stata MI documentation.

Checking for Misspecification

You should also try to evaluate whether the models are specified correctly. A full discussion of how to determine whether a regression model is specified correctly or not is well beyond the scope of this article, but use whatever tools you find appropriate. Here are some examples:

Residual vs. Fitted Value Plots

For continuous variables, residual vs. fitted value plots (easily done with rvfplot) can be useful; several of the examples use them to detect problems. Consider the plot for experience:

regress exp i.urban i.race wage i.edu i.female
rvfplot

Note how a number of points are clustered along a line in the lower left, and no points are below it. This reflects the constraint that experience cannot be less than zero, which means that the fitted values must always be greater than or equal to the residuals, or alternatively that the residuals must be greater than or equal to the negative of the fitted values. (If the graph had the same scale on both axes, the constraint line would be a 45 degree line.) If all the points were below a similar line rather than above it, this would tell you that there was an upper bound on the variable rather than a lower bound. The y-intercept of the constraint line tells you the limit in either case. You can also have both a lower bound and an upper bound, putting all the points in a band between them. The "obvious" model, regress, is inappropriate for experience because it won't apply this constraint. It's also inappropriate for wages for the same reason. Alternatives include truncreg, ll(0) and pmm (we'll use pmm).

Adding Interactions

In this example, it seems plausible that the relationships between variables may vary between race, gender, and urban/rural groups. Thus one way to check for misspecification is to add interaction terms to the models and see whether they turn out to be important. For example, we'll compare the obvious model:

regress exp i.race wage i.edu i.urban i.female

with one that includes interactions:

regress exp (i.race i.urban i.female)##(c.wage i.edu)

We'll run similar comparisons for the models of the other variables. This creates a great deal of output, so see the log file for results. Interactions between female and other variables are significant in the models for exp, wage, edu, and urban. There are a few significant interactions between race or urban and other variables, but not nearly as many (and keep in mind that with this many coefficients we'd expect some false positives using a significance level of .05). We'll thus impute the men and women separately. This is an especially good option for this data set because female is never missing.
If it were, we'd have to drop those observations which are missing female because they could not be placed in one group or the other. In the imputation command this means adding the by(female) option. When testing models, it means starting the commands with the by female: prefix (and removing female from the lists of covariates). The improved imputation models are thus:

bysort female: reg exp i.urban i.race wage i.edu
by female: logit urban exp i.race wage i.edu
by female: mlogit race exp i.urban wage i.edu
by female: reg wage exp i.urban i.race i.edu
by female: ologit edu exp i.urban i.race wage

pmm itself cannot be run outside the imputation context, but since it's based on regression you can use regular regression to test it. These models should be tested again, but we'll omit that process.

The basic syntax for mi impute chained is:

mi impute chained (method1) varlist1 (method2) varlist2 ... = regvars

Each method specifies the method to be used for imputing the following varlist. The possibilities for method are regress, pmm, truncreg, intreg, logit, ologit, mlogit, poisson, and nbreg. regvars is a list of regular variables to be used as covariates in the imputation models but not imputed (there may not be any). The basic options are:

add(N) rseed(R) savetrace(tracefile, replace)

N is the number of imputations to be added to the data set. R is the seed to be used for the random number generator; if you do not set this you'll get slightly different imputations each time the command is run. The tracefile is a dataset in which mi impute chained will store information about the imputation process; we'll use this dataset to check for convergence. Options that are relevant to a particular method go with the method, inside the parentheses but following a comma (e.g. (mlogit, aug)). Options that are relevant to the imputation process as a whole (like by(female)) go at the end, after the comma. For our example, the command would be:

mi impute chained (logit) urban (mlogit) race (ologit) edu (pmm) exp wage, add(5) rseed(4409) by(female)

Note that this does not include a savetrace() option. As of this writing, by() and savetrace() cannot be used at the same time, presumably because it would require one trace file for each by group. Stata is aware of this problem and we hope this will be changed soon. For purposes of this article, we'll remove the by() option when it comes time to illustrate use of the trace file. If this problem comes up in your research, talk to us about work-arounds.

Choosing the Number of Imputations

There is some disagreement among authorities about how many imputations are sufficient. Some say 3-10 in almost all circumstances, the Stata documentation suggests at least 20, while White, Royston, and Wood argue that the number of imputations should be roughly equal to the percentage of cases with missing values. However, we are not aware of any argument that increasing the number of imputations ever causes problems (just that the marginal benefit of another imputation asymptotically approaches zero). Increasing the number of imputations in your analysis takes essentially no work on your part: just change the number in the add() option to something bigger. On the other hand, it can be a lot of work for the computer; multiple imputation has introduced many researchers into the world of jobs that take hours or days to run. You can generally assume that the amount of time required will be proportional to the number of imputations used (e.g.
if a do file takes two hours to run with five imputations, it will probably take about four hours to run with ten imputations). So here's our suggestion:

Start with five imputations (the low end of what's broadly considered legitimate). Work on your research project until you're reasonably confident you have the analysis in its final form. Be sure to do everything with do files so you can run it again at will.
Note how long the process takes, from imputation to final analysis.
Consider how much time you have available and decide how many imputations you can afford to run, using the rule of thumb that time required is proportional to the number of imputations. If possible, make the number of imputations roughly equal to the percentage of cases with missing data (a high end estimate of what's required). Allow time to recover if things go wrong, as they generally do.
Increase the number of imputations in your do file and start it. Do something else while the do file runs, like write your paper. Adding imputations shouldn't change your results significantly, and in the unlikely event that they do, consider yourself lucky to have found that out before publishing.

Speeding up the Imputation Process

Multiple imputation has introduced many researchers into the world of jobs that take hours, days, or even weeks to run. Usually it's not worth spending your time to make Stata code run faster, but multiple imputation can be an exception.

Use the fastest computer available to you. For SSCC members that means learning to run jobs on Linstat, the SSCC's Linux computing cluster. Linux is not as difficult as you may think: Using Linstat has instructions.

Multiple imputation involves more reading and writing to disk than most Stata commands. Sometimes this includes writing temporary files in the current working directory. Use the fastest disk space available to you, both for your data set and for the working directory. In general local disk space will be faster than network disk space, and on Linstat ramdisk (a "directory" that is actually stored in RAM) will be faster than local disk space. On the other hand, you would not want to permanently store data sets anywhere but network disk space. So consider having your do file do something like the following: Windows (Winstat or your own PC)

This applies when you're using imputed data as well. If your data set is large enough that working with it after imputation is slow, the above procedure may help.

Checking for Convergence

MICE is an iterative process. In each iteration, mi impute chained first estimates the imputation model, using both the observed data and the imputed data from the previous iteration. It then draws new imputed values from the resulting distributions. Note that as a result, each iteration has some autocorrelation with the previous imputation.

The first iteration must be a special case: in it, mi impute chained first estimates the imputation model for the variable with the fewest missing values based only on the observed data and draws imputed values for that variable. It then estimates the model for the variable with the next fewest missing values, using both the observed values and the imputed values of the first variable, and proceeds similarly for the rest of the variables. Thus the first iteration is often atypical, and because iterations are correlated it can make subsequent iterations atypical as well.
To avoid this, mi impute chained by default goes through ten iterations for each imputed data set you request, saving only the results of the tenth iteration. The first nine iterations are called the burn-in period. Normally this is plenty of time for the effects of the first iteration to become insignificant and for the process to converge to a stationary state. However, you should check for convergence and, if necessary, increase the number of iterations using the burnin() option to ensure it. To do so, examine the trace file saved by mi impute chained. It contains the mean and standard deviation of each imputed variable in each iteration. These will vary randomly, but they should not show any trend. An easy way to check is with tsline, but it requires reshaping the data first.

Our preferred imputation model uses by(), so it cannot save a trace file. Thus we'll remove by() for the moment. We'll also increase the burnin() option to 100 so it's easier to see what a stable trace looks like. We'll then use reshape and tsline to check for convergence:

preserve
mi impute chained (logit) urban (mlogit) race (ologit) edu (pmm) exp wage = female, add(5) rseed(88) savetrace(extrace, replace) burnin(100)
use extrace, replace
reshape wide *mean *sd, i(iter) j(m)
tsset iter
tsline exp_mean*, title("Mean of Imputed Values of Experience") note("Each line is for one imputation") legend(off)
graph export conv1.png, replace
tsline exp_sd*, title("Standard Deviation of Imputed Values of Experience") note("Each line is for one imputation") legend(off)
graph export conv2.png, replace
restore

The resulting graphs do not show any obvious problems. If you do see signs that the process may not have converged after the default ten iterations, increase the number of iterations performed before saving imputed values with the burnin() option. If convergence is never achieved this indicates a problem with the imputation model.

Checking the Imputed Values

After imputing, you should check to see if the imputed data resemble the observed data. Unfortunately there's no formal test to determine what's "close enough." Of course if the data are MAR but not MCAR, the imputed data should be systematically different from the observed data. Ironically, the fewer missing values you have to impute, the more variation you'll see between the imputed data and the observed data (and between imputations). For binary and categorical variables, compare frequency tables. For continuous variables, comparing means and standard deviations is a good starting point, but you should look at the overall shape of the distribution as well. For that we suggest kernel density graphs or perhaps histograms. Look at each imputation separately rather than pooling all the imputed values so you can see if any one of them went wrong.

The mi xeq: prefix tells Stata to apply the subsequent command to each imputation individually. It also applies to the original data, the "zeroth imputation." Thus:

mi xeq: tab race

will give you six frequency tables: one for the original data, and one for each of the five imputations. However, we want to compare the observed data to just the imputed data, not the entire data set. This requires adding an if condition to the tab commands for the imputations, but not the observed data.
Add a number or numlist to have mi xeq act on particular imputations:

mi xeq 0: tab race
mi xeq 1/5: tab race if miss_race

This creates frequency tables for the observed values of race and then the imputed values in all five imputations. If you have a significant number of variables to examine you can easily loop over them:

foreach var of varlist urban race edu {
    mi xeq 0: tab `var'
    mi xeq 1/5: tab `var' if miss_`var'
}

For results see the log file. Running summary statistics on continuous variables follows the same process, but creating kernel density graphs adds a complication: you need to either save the graphs or give yourself a chance to look at them. mi xeq: can carry out multiple commands for each imputation: just place them all on one line with a semicolon (;) at the end of each. (This will not work if you've changed the general end-of-command delimiter to a semicolon.) The sleep command tells Stata to pause for a specified period, measured in milliseconds.

mi xeq 0: kdensity wage; sleep 1000
mi xeq 1/5: kdensity wage if miss_wage; sleep 1000

Again, this can all be automated:

foreach var of varlist wage exp {
    mi xeq 0: sum `var'
    mi xeq 1/5: sum `var' if miss_`var'
    mi xeq 0: kdensity `var'; sleep 1000
    mi xeq 1/5: kdensity `var' if miss_`var'; sleep 1000
}

Saving the graphs turns out to be a bit trickier, because you need to give the graph from each imputation a different file name. Unfortunately you cannot access the imputation number within mi xeq. However, you can do a forvalues loop over imputation numbers, then have mi xeq act on each of them:

forval i = 1/5 {
    mi xeq `i': kdensity exp if miss_exp
    graph export exp`i'.png, replace
}

Integrating this with the previous version gives:

foreach var of varlist wage exp {
    mi xeq 0: sum `var'
    mi xeq 1/5: sum `var' if miss_`var'
    mi xeq 0: kdensity `var'
    graph export chk`var'0.png, replace
    forval i = 1/5 {
        mi xeq `i': kdensity `var' if miss_`var'
        graph export chk`var'`i'.png, replace
    }
}

For results, see the log file. It's troublesome that in all imputations the mean of the imputed values of wage is higher than the mean of the observed values of wage, and the mean of the imputed values of exp is lower than the mean of the observed values of exp. We did not find evidence that the data are MAR but not MCAR, so we'd expect the means of the imputed data to be clustered around the means of the observed data. There is no formal test to tell us definitively whether this is a problem or not. However, it should raise suspicions, and if the final results with these imputed data differ from the results of complete case analysis, it raises the question of whether the difference is due to problems with the imputation model. Last Revised: 8/23/2012

Statistical Computing Seminars: Missing Data in SAS, Part 1

Note: A PowerPoint presentation of this webpage can be downloaded here.

Introduction

Missing data is a common issue, and more often than not, we deal with the matter of missing data in an ad hoc fashion. The purpose of this seminar is to discuss commonly used techniques for handling missing data and common issues that could arise when these techniques are used.
In particular, we will focus on one of the most popular methods, multiple imputation. We are not advocating in favor of any one technique for handling missing data; depending on the type of data and model you will be using, other techniques such as direct maximum likelihood may better serve your needs. We have chosen to explore multiple imputation through an examination of the data, a careful consideration of the assumptions needed to implement this method, and a clear understanding of the analytic model to be estimated. We hope this seminar will help you better understand the scope of the issues you might face when dealing with missing data using this method. The data set hsbmar.sas7bdat, which is based on hsb2.sas7bdat and is used for this seminar, can be downloaded by following the link. The SAS code for this seminar was developed using SAS 9.4 and SAS/STAT 13.1. Some of the variables have value labels (formats) associated with them, so a short setup step is needed to read the value labels correctly. The goals of statistical analysis with missing data are to minimize bias, maximize use of the available information, and obtain appropriate estimates of uncertainty.

Exploring missing data mechanisms

The missing data mechanism describes the process that is believed to have generated the missing values. Missing data mechanisms generally fall into one of three main categories. There are precise technical definitions for these terms in the literature; the following explanation necessarily contains simplifications.

Missing completely at random (MCAR): A variable is missing completely at random if neither the variables in the dataset nor the unobserved value of the variable itself predict whether a value will be missing. Missing completely at random is a fairly strong assumption and may be relatively rare. One relatively common situation in which data are missing completely at random occurs when a subset of cases is randomly selected to undergo additional measurement; this is sometimes referred to as "planned missing." For example, in some health surveys, some subjects are randomly selected to undergo a more extensive physical examination, so only a subset of participants will have complete information for those variables. Missing completely at random also allows missingness on one variable to be related to missingness on another, e.g. var1 is missing whenever var2 is missing, as when a husband and wife are both missing information on height.

Missing at random (MAR): A variable is said to be missing at random if other variables (but not the variable itself) in the dataset can be used to predict missingness on that variable. For example, in surveys, men may be more likely to decline to answer some questions than women (i.e. gender predicts missingness on another variable). MAR is a less restrictive assumption than MCAR. Under this assumption the probability of missingness does not depend on the true values after controlling for the observed variables. MAR is also related to ignorability. The missing data mechanism is said to be ignorable if the data are missing at random and the probability of missingness does not depend on the missing information itself. The assumption of ignorability is needed for optimal estimation of missing information and is a required assumption for both of the missing data techniques we will discuss.

Missing not at random (MNAR): Finally, data are said to be missing not at random if the value of the unobserved variable itself predicts missingness. A classic example of this is income.
Individuals with very high incomes are more likely to decline to answer questions about their income than individuals with more moderate incomes. An understanding of the missing data mechanism(s) present in your data is important because different types of missing data require different treatments. When data are missing completely at random, analyzing only the complete cases will not result in biased parameter estimates (e.g. regression coefficients). However, the sample size for an analysis can be substantially reduced, leading to larger standard errors. In contrast, analyzing only complete cases for data that are either missing at random or missing not at random can lead to biased parameter estimates. Multiple imputation and other modern methods such as direct maximum likelihood generally assume that the data are at least MAR, meaning that these procedures can also be used on data that are missing completely at random. Statistical models have also been developed for modeling MNAR processes; however, these models are beyond the scope of this seminar. For more information on missing data mechanisms please see: Allison, 2002; Enders, 2010; Little & Rubin, 2002; Rubin, 1976; Schafer & Graham, 2002.

Full data: Below is a regression model predicting read using the complete data set (hsb2) used to create hsbmar. We will use these results for comparison.

Common techniques for dealing with missing data

In this section, we discuss some common techniques for dealing with missing data and briefly discuss their limitations: complete case analysis (listwise deletion), available case analysis (pairwise deletion), mean imputation, single imputation, and stochastic imputation.

1. Complete Case Analysis: This method involves deleting cases in a particular dataset that are missing data on any variable of interest. It is a common technique because it is easy to implement and works with any type of analysis. Below we look at some of the descriptive statistics of the data set hsbmar, which contains test scores, as well as demographic and school information, for 200 high school students. Note that although the dataset contains 200 cases, six of the variables have fewer than 200 observations. The missing information varies between 4.5% (read) and 9% (female and prog) of cases, depending on the variable. This doesn't seem like a lot of missing data, so we might be inclined to analyze the observed data as they are, a strategy sometimes referred to as complete case analysis. Below is a regression model where the dependent variable read is regressed on write, math, female, and prog (sketched after this paragraph). Notice that the default behavior of proc glm is complete case analysis (also referred to as listwise deletion). Looking at the output, we see that only 130 cases were used in the analysis; in other words, more than one third of the cases in our dataset (70/200) were excluded from the analysis because of missing data. The reduction in sample size (and statistical power) alone might be considered a problem, but complete case analysis can also lead to biased estimates. Specifically, you will see below that the estimates for the intercept, write, math, and prog are different from the regression model on the complete data. Also, the standard errors are all larger due to the smaller sample size, resulting in the parameter estimate for female almost becoming non-significant. Unfortunately, unless the mechanism of missing data is MCAR, this method will introduce bias into the parameter estimates.
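A minimal sketch of the complete case regression just described, assuming hsbmar has already been read into the WORK library; proc glm silently drops every observation with a missing value on any variable in the model.

* Complete case analysis (listwise deletion) is proc glm's default behavior;
proc glm data = hsbmar;
  class female prog;
  model read = write math female prog / solution;
run;
quit;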
2. Available Case Analysis: This method involves estimating means, variances, and covariances based on all available non-missing cases, meaning that a covariance (or correlation) matrix is computed where each element is based on the full set of cases with non-missing values for that pair of variables. This method became popular because the loss of power due to missing information is not as substantial as with complete case analysis. Below we look at the pairwise correlations between the outcome read and each of the predictors write, prog, female, and math. Depending on the pairwise comparison examined, the sample size changes based on the amount of missingness present in one or both variables. Because proc glm does not accept covariance matrices as data input, the following example is done with proc reg. This requires us to create dummy variables for our categorical predictor prog, since there is no class statement in proc reg. By default, proc corr uses pairwise deletion to estimate the correlation table. The cov and outp= options on the proc corr statement output a variance-covariance matrix based on pairwise deletion that is then used in the subsequent regression model. The first thing you should see is the note that SAS prints to your log file stating "N not equal across variables in data set. This may not be appropriate. The smallest value will be used." One of the main drawbacks of this method is the lack of a consistent sample size. You will also notice that the parameter estimates presented here are different from the estimates obtained from the analysis on the full data and from the listwise deletion approach. For instance, the variable female had an estimated effect of -2.7 with the full data but was attenuated to -1.85 in the available case analysis. Unless the mechanism of missing data is MCAR, this method will introduce bias into the parameter estimates. Therefore, this method is not recommended.

3. Unconditional Mean Imputation: This method involves replacing the missing values of an individual variable with its overall estimated mean from the available cases. While this is a simple and easily implemented method for dealing with missing values, it has some unfortunate consequences. The most important problem with mean imputation, also called mean substitution, is that it results in an artificial reduction in variability, because you are imputing values at the center of the variable's distribution. This also has the unintended consequence of changing the magnitude of correlations between the imputed variable and other variables. We can demonstrate this phenomenon in our data. Below are tables of the means and standard deviations of the four variables in our regression model BEFORE and AFTER a mean imputation, as well as their corresponding correlation matrices. We will again utilize the prog dummy variables we created previously. You will notice that there is very little change in the mean (as you would expect); however, the standard deviation is noticeably lower after substituting in mean values for the observations with missing information. This is because you reduce the variability in your variables when you impute everyone at the mean. Moreover, you can see in the table of "Pearson Correlation Coefficients" that the correlations between each of our predictors of interest (write, math, female, and prog), as well as between the predictors and the outcome read, have now been attenuated. Therefore, regression models that seek to estimate the associations between these variables will also see their effects weakened. A sketch of a simple mean imputation follows.
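Here is a minimal sketch of mean substitution, assuming hsbmar is in the WORK library. It uses proc stdize with the reponly option, which fills in only the missing values with the variable means and leaves observed values untouched. The output dataset name hsb_meanimp is ours, and the prog dummy variables used in the seminar are omitted because their names are not shown here.

* Means, standard deviations, and correlations before mean imputation;
proc corr data = hsbmar;
  var read write math;
run;

* Replace each missing value with that variable's mean (mean substitution only);
proc stdize data = hsbmar out = hsb_meanimp reponly method = mean;
  var read write math;
run;

* Standard deviations and correlations shrink after mean imputation;
proc corr data = hsb_meanimp;
  var read write math;
run;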
4. Single or Deterministic Imputation: A slightly more sophisticated type of imputation is regression (conditional mean) imputation, which replaces missing values with predicted scores from a regression equation. The strength of this approach is that it uses complete information to impute values. The drawback is that all the predicted values fall directly on the regression line, once again decreasing variability, just not as much as with unconditional mean imputation. Moreover, statistical models cannot distinguish between observed and imputed values and therefore do not incorporate into the model the error or uncertainty associated with an imputed value. Additionally, this method inflates the associations between variables because it imputes values that are perfectly correlated with one another. Unfortunately, even under the assumption of MCAR, regression imputation will upwardly bias correlations and R-squared statistics. Further discussion and an example of this can be found in Craig Enders' book "Applied Missing Data Analysis" (2010).

5. Stochastic Imputation: In recognition of the problems with regression imputation and the reduced variability associated with that approach, researchers developed a technique to incorporate, or "add back," the lost variability. A residual term, randomly drawn from a normal distribution with mean zero and variance equal to the residual variance from the regression model, is added to the predicted scores from the regression imputation, thus restoring some of the lost variability. This method is superior to the previous methods, as it produces unbiased coefficient estimates under MAR. However, the standard errors produced during regression estimation, while less biased than those from the single imputation approach, will still be attenuated. While you might be inclined to use one of these more traditional methods, consider this statement: "Missing data analyses are difficult because there is no inherently correct methodological procedure. In many (if not most) situations, blindly applying maximum likelihood estimation or multiple imputation will likely lead to a more accurate set of estimates than using one of the previously mentioned missing data handling techniques" (p. 344, Applied Missing Data Analysis, 2010).

Multiple Imputation

Multiple imputation is essentially an iterative form of stochastic imputation. However, instead of filling in a single value, the distribution of the observed data is used to estimate multiple values that reflect the uncertainty around the true value. These values are then used in the analysis of interest, such as an OLS model, and the results are combined. Each imputed value includes a random component whose magnitude reflects the extent to which the other variables in the imputation model cannot predict its true value (Johnson and Young, 2011; White et al., 2010), thus building into the imputed values a level of uncertainty around the "truthfulness" of the imputed values. A common misconception about missing data methods is the assumption that imputed values should represent "real" values. The purpose when addressing missing data is to correctly reproduce the variance-covariance matrix we would have observed had our data not had any missing information. MI has three basic phases:
1. Imputation or Fill-in Phase: The missing data are filled in with estimated values and a complete data set is created. This process is repeated m times.

2. Analysis Phase: Each of the m complete data sets is then analyzed using the statistical method of interest (e.g. linear regression).

3. Pooling Phase: The parameter estimates (e.g. coefficients and standard errors) obtained from each analyzed data set are then combined for inference.

The imputation method you choose depends on the pattern of missing information as well as the type of variable(s) with missing information.

Imputation Model, Analytic Model and Compatibility: When developing your imputation model, it is important to assess whether your imputation model is "congenial" with, or consistent with, your analytic model. Consistency means that your imputation model includes (at the very least) the same variables that are in your analytic or estimation model. This includes any transformations of variables that will be needed to assess your hypothesis of interest, such as log transformations, interaction terms, or recodes of a continuous variable into a categorical form, if that is how it will be used in the later analysis. The reason for this relates back to the earlier comments about the purpose of multiple imputation. Since we are trying to reproduce the proper variance-covariance matrix for estimation, all relationships between our analytic variables should be represented and estimated simultaneously. Otherwise, you are imputing values under the assumption that they have a correlation of zero with the variables you did not include in your imputation model. This results in underestimating the associations between parameters of interest in your analysis and a loss of power to detect properties of your data that may be of interest, such as non-linearities and statistical interactions. For additional reading on this particular topic see: von Hippel, 2009; von Hippel, 2013; White et al., 2010.

Preparing to conduct MI

First step: Examine the number and proportion of missing values among your variables of interest. The proc means procedure in SAS has an option called nmiss that counts the number of missing values for the variables specified. You can also create missing data flags (indicator variables) for the missing information to assess the proportion of missingness.

Second step: Examine the patterns of missing information. The "Missing Data Patterns" table can be requested without actually performing a full imputation by specifying the option nimpute=0 (zero imputed datasets to be created) on the proc mi statement. Each "group" in the table represents a set of observations in the data set that share the same pattern of missing information. For example, group 1 represents the 130 observations in the data that have complete information on all 5 variables of interest. The procedure also provides means for each variable within each group. You can see that there are a total of 12 patterns for the specified variables. The estimated means associated with each missing data pattern can also give you an indication of whether the MCAR or MAR assumption is appropriate. If you begin to observe that those with certain missing data patterns appear to have a very different distribution of values, this is an indication that your data may not be MCAR. Moreover, depending on the nature of the data, you may recognize patterns such as monotone missingness, which can be observed in longitudinal data when an individual drops out at a particular time point so that all data after that point are missing. Additionally, you may identify skip patterns that were missed in your original review of the data and that should be dealt with before moving forward with the multiple imputation. A sketch of these first two steps follows.
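A minimal sketch of these two preparatory steps, assuming hsbmar is in the WORK library; the flag names (miss_read, miss_math, and so on) and the dataset name hsbmar_flags are our own and simply illustrate the idea of missing-data indicators.

* Step 1: count missing values per variable;
proc means data = hsbmar n nmiss;
  var read write math female prog;
run;

* Missing-data flags: 1 if the value is missing, 0 otherwise;
data hsbmar_flags;
  set hsbmar;
  miss_read   = missing(read);
  miss_write  = missing(write);
  miss_math   = missing(math);
  miss_female = missing(female);
  miss_prog   = missing(prog);
run;

* Step 2: request the Missing Data Patterns table without creating any imputations;
proc mi data = hsbmar nimpute = 0;
  var read write math female prog;
run;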
Third step: If necessary, identify potential auxiliary variables. Auxiliary variables are variables in your data set that are either correlated with a missing variable (the recommendation is r > 0.4) or believed to be associated with missingness. These are factors that are not of particular interest in your analytic model, but they are added to the imputation model to increase power and/or to help make the assumption of MAR more plausible. These variables have been found to improve the quality of the imputed values generated by multiple imputation. Moreover, research has demonstrated their particular importance when imputing a dependent variable and/or when you have variables with a high proportion of missing information (Johnson and Young, 2011; Young and Johnson, 2010; Enders, 2010). You may know a priori of several variables you believe would make good auxiliary variables based on your knowledge of the data and subject matter. Additionally, a good review of the literature can often help identify them as well. However, if you're not sure which variables in the data would be potential candidates (this is often the case when conducting secondary data analysis), you can use some simple methods to help identify candidates. One way to identify these variables is by examining associations between write, read, female, and math and other variables in the dataset. For example, let's take a look at the correlation matrix between our four variables of interest and two other test score variables, science and socst. Science and socst both appear to be good auxiliaries because they are well correlated (r > 0.4) with all the other test score variables of interest. You will also notice that they are not well correlated with female; a good auxiliary does not have to be correlated with every variable to be used. You will also notice that science has missing information of its own. A good auxiliary is not required to have complete information to be valuable: it can have missing values and still be effective in reducing bias (Enders, 2010). One area that is still under active research is whether it is beneficial to include a variable as an auxiliary if it does not pass the 0.4 correlation threshold with any of the variables to be imputed. Some researchers believe that including these types of items introduces unnecessary error into the imputation model (Allison, 2012), while others do not believe that there is any harm in this practice (Enders, 2010). Thus, we leave it up to you as the researcher to use your best judgment. Good auxiliary variables can also be correlates or predictors of missingness. Let's use the missing data flags we made earlier to help us identify some variables that may be good correlates. We examine whether our potential auxiliary variable socst also appears to predict missingness. Below is a set of t-tests checking whether the mean socst or science scores differ significantly between those with missing information and those without. The only significant difference was found when examining missingness on math with socst: the mean socst score is significantly lower among the respondents who are missing on math. This suggests that socst is a potential correlate of missingness (Enders, 2010) and may help us satisfy the MAR assumption for multiple imputation by including it in our imputation model.
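A minimal sketch of this auxiliary-variable screening, reusing the hypothetical miss_ flags created in the earlier sketch:

* Correlations between the analysis variables and candidate auxiliaries;
proc corr data = hsbmar_flags;
  var science socst;
  with read write math female;
run;

* Do auxiliary scores differ between cases missing and not missing on math?;
proc ttest data = hsbmar_flags;
  class miss_math;
  var socst science;
run;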
Example 1: MI using the multivariate normal distribution (MVN)

When choosing to impute one or many variables, one of the first decisions you will make is the type of distribution under which you want to impute your variable(s). One method available in SAS uses Markov chain Monte Carlo (MCMC), which assumes that all the variables in the imputation model have a joint multivariate normal distribution. This is probably the most common parametric approach for multiple imputation. The specific algorithm used is called the data augmentation (DA) algorithm, which belongs to the family of MCMC procedures. The algorithm fills in missing data by drawing from a conditional distribution, in this case a multivariate normal, of the missing data given the observed data. In most cases, simulation studies have shown that assuming an MVN distribution leads to reliable estimates even when the normality assumption is violated, given a sufficient sample size (Demirtas et al., 2008; KJ Lee, 2010). However, biased estimates have been observed when the sample size is relatively small and the fraction of missing information is high. Note: since we are using a multivariate normal distribution for imputation, decimal and negative imputed values are possible. These values are not a problem for estimation; however, we will need to create dummy variables for the nominal categorical variables so that the parameter estimates for each level can be interpreted.

Imputation in SAS requires three procedures. The first is proc mi, where the user specifies the imputation model to be used and the number of imputed datasets to be created. The second procedure runs the analytic model of interest (here a linear regression using proc glm) within each of the imputed datasets. The third step runs proc mianalyze, which combines all the estimates (coefficients and standard errors) across the imputed datasets and outputs one set of parameter estimates for the model of interest. On the proc mi statement we use the nimpute option to specify the number of imputations to be performed. The imputed datasets are output with the out= option and stored appended, or "stacked," together in a dataset called "mi_mvn". An indicator variable called _Imputation_ is automatically created by the procedure to number each new imputed dataset. On the var statement, all the variables for the imputation model are specified, including all the variables in the analytic model as well as any auxiliary variables. The seed option is not required, but since MI is designed to be a random process, setting a seed will allow you to obtain the same imputed datasets each time. The analysis step then estimates the linear regression model for each imputed dataset individually, using the by statement and the indicator variable created previously. You will observe in the Results Viewer that SAS outputs the parameter estimates for each of the 10 imputations. The ODS output statement stores the parameter estimates from the regression models in a dataset named "a_mvn"; this dataset will be used in the next step of the process, the pooling phase. A sketch of all three steps follows.
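Here is a minimal sketch of the three-step workflow just described under the MVN model. The seed value and the prog dummy-variable names (prog_acad, prog_voc) are assumptions made for illustration; the seminar's own dummy names are not shown in the text.

* Step 1: impute under a joint multivariate normal model (data augmentation);
proc mi data = hsbmar nimpute = 10 out = mi_mvn seed = 54321;
  var read write math female prog_acad prog_voc science socst;
run;

* Step 2: fit the analysis model separately within each imputed dataset;
proc glm data = mi_mvn;
  model read = write math female prog_acad prog_voc / solution;
  by _Imputation_;
  ods output ParameterEstimates = a_mvn;
run;
quit;

* Step 3: pool the coefficients and standard errors with Rubin's rules;
proc mianalyze parms = a_mvn;
  modeleffects Intercept write math female prog_acad prog_voc;
run;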
Proc mianalyze uses the dataset "a_mvn", which contains the parameter estimates and associated standard errors for each imputation; the variance information is needed to estimate the pooled standard errors. This step combines the parameter estimates into a single set of statistics that appropriately reflect the uncertainty associated with the imputed values. The combined coefficients are simply the arithmetic mean of the individual coefficients estimated for each of the 10 regression models. Averaging the parameter estimates dampens the variation, thus increasing efficiency and decreasing sampling variation. Estimation of the standard error for each variable is a little more complicated and is discussed below. If you compare these estimates to those from the complete data, you will observe that they are, in general, quite comparable. The variables write, female, and math are significant in both sets of data. You will also observe a small inflation in the standard errors, which is to be expected, since the multiple imputation process is designed to build additional uncertainty into our estimates.

2. Imputation Diagnostics: Above the "Parameter Estimates" table in the SAS output you will see a table called "Variance Information". It is important to examine the output from proc mianalyze, as several pieces of this information can be used to assess how well the imputation performed. Below we discuss each piece.

Variance Between (V_B): This is a measure of the variability in the parameter estimates (coefficients) obtained from the 10 imputed datasets. For example, if you took all 10 of the parameter estimates for write and calculated their variance, this would equal V_B = 0.000262. This variability estimates the additional variation (uncertainty) that results from missing data.

Variance Within (V_W): This is simply the arithmetic mean of the sampling variances (squared standard errors) from each of the 10 imputed datasets. For example, if you squared the standard errors for write for all 10 imputations and averaged them, this would equal V_W = 0.006014. This estimates the sampling variability we would have expected had there been no missing data.

Variance Total (V_T): The primary usefulness of MI comes from how the total variance is estimated. While regression coefficients are simply averaged across imputations, Rubin's formula (Rubin, 1987) partitions the variance into a "within imputation" part capturing the expected uncertainty and a "between imputation" part capturing the estimation variability due to missing information (Graham, 2007; White et al., 2010). The total variance is the sum of three sources of variance: the within, the between, and an additional source of sampling variance. For example, the total variance for the variable write is calculated as

V_T = V_W + V_B + V_B/m = 0.006014 + 0.000262 + 0.000262/10 = 0.006302

The additional sampling variance is literally the between-imputation variance divided by m. This value represents the sampling error associated with the overall (average) coefficient estimates and serves as a correction factor for using a specific number of imputations. It becomes smaller as more imputations are conducted, the idea being that the larger the number of imputations, the more precise the parameter estimates will be. Bottom line: the main difference between multiple imputation and single imputation methods is in the estimation of the variances. The standard error for each parameter estimate is the square root of its V_T.
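As a quick check of the arithmetic above, this small data step (names are ours) reproduces the total variance and pooled standard error for write from the reported V_B and V_W:

data pool_write;
  m   = 10;                    * number of imputations;
  v_b = 0.000262;              * between-imputation variance;
  v_w = 0.006014;              * within-imputation variance;
  v_t = v_w + v_b + v_b / m;   * total variance under Rubin's rules;
  se  = sqrt(v_t);             * pooled standard error;
run;

proc print data = pool_write noobs;
run;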
Degrees of Freedom (DF): Unlike analysis with non-imputed data, sample size does not directly influence the estimate of DF; DF actually continues to increase as the number of imputations increases. The standard formula used to calculate DF can produce fractional estimates as well as estimates that far exceed the DF that would have resulted had the data been complete. By default the DF is infinity. Note: starting in SAS v8, a formula that adjusts for the problem of inflated DF has been implemented (Barnard and Rubin, 1999); use the EDF option on the proc mianalyze line to tell SAS the proper (complete-data) DF. Bottom line: the standard formula assumes that the estimator has a normal distribution, i.e. a t-distribution with infinite degrees of freedom. In large samples this is not usually an issue, but it can be with smaller sample sizes; in that case, the corrected formula should be used (Lipsitz et al., 2002).

Relative Increase in Variance (RIV/RVI): The proportional increase in total sampling variance that is due to missing information, (V_B + V_B/m) / V_W. For example, the RVI for write is 0.048, which means that the estimated sampling variance for write is 4.8% larger than its sampling variance would have been had the data on write been complete. Bottom line: variables with large amounts of missingness and/or that are weakly correlated with other variables in the imputation model will tend to have high RVIs.

Fraction of Missing Information (FMI): Directly related to RVI. The proportion of the total sampling variance that is due to missing data, (V_B + V_B/m) / V_T. It is estimated based on the percentage missing for a particular variable and how correlated that variable is with other variables in the imputation model. The interpretation is similar to an R-squared, so an FMI of 0.046 for write means that 4.6% of the total sampling variance is attributable to missing data. The accuracy of the FMI estimate increases as the number of imputations increases, because the variance estimates become more stable. This is especially important in the presence of variables with a high proportion of missing information. If convergence of your imputation model is slow, examine the FMI estimates for each variable in your imputation model; a high FMI can indicate a problematic variable. Bottom line: if FMI is high for any particular variable(s), consider increasing the number of imputations. A good rule of thumb is to have the number of imputations (at least) equal the highest FMI percentage.

Relative Efficiency (RE): The relative efficiency of an imputation (how well the true population parameters are estimated) is related to both the amount of missing information and the number (m) of imputations performed. When the amount of missing information is very low, good efficiency may be achieved by performing only a few imputations (the minimum number given in most of the literature is 5). However, when there is a high amount of missing information, more imputations are typically necessary to achieve adequate efficiency for parameter estimates. You can obtain relatively good efficiency even with a small m; however, this does not mean that the standard errors will be estimated well. More imputations are often necessary for proper standard error estimation, as the variability between imputed datasets incorporates the necessary amount of uncertainty around the imputed values. The direct relationship between RE, m, and the FMI is RE = 1 / (1 + FMI/m). This formula gives the RE of using m imputations versus an infinite number of imputations. To get an idea of what this looks like in practice, take a look at the relative-efficiency figure in the SAS documentation, where m is the number of imputations and lambda is the FMI. Bottom line: it may appear that you can get good RE with a few imputations; however, it often takes more imputations to get good estimates of the variances than good estimates of parameters like means or regression coefficients.
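To see that relationship numerically rather than graphically, this small sketch (names are ours) tabulates RE = 1/(1 + FMI/m) over a grid of FMI and m values:

data re_grid;
  do fmi = 0.1, 0.3, 0.5, 0.7, 0.9;    * fraction of missing information;
    do m = 3, 5, 10, 20, 50;           * number of imputations;
      re = 1 / (1 + fmi / m);          * relative efficiency vs. infinite m;
      output;
    end;
  end;
run;

proc print data = re_grid noobs;
run;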
After performing an imputation it is also useful to look at means, frequencies, and box plots comparing observed and imputed values, to assess whether the range appears reasonable. You may also want to examine plots of residuals and outliers for each imputed dataset individually. If anomalies are evident in only a small number of imputations, this indicates a problem with the imputation model (White et al., 2010). You should also assess convergence of your imputation model. This should be done for each imputed variable, but especially for variables with a high proportion of missing information (i.e. a high FMI). Convergence of the proc mi procedure means that the DA algorithm has reached an appropriate stationary posterior distribution. Convergence for each imputed variable can be assessed using trace plots, which can be requested on the mcmc statement of the proc mi procedure. Long-term trends in trace plots and high serial dependence are indicative of slow convergence to stationarity; a stationary process has a mean and variance that do not change over time. By default SAS provides trace plots of the estimated means for each variable, but you can also request them for the standard deviations. You can take a look at examples of good and bad trace plots in the SAS User's Guide section on "Assessing Markov Chain Convergence". Consider the trace plot for the mean social studies (socst) score. There are two main things you want to note in a trace plot. First, assess whether the algorithm appears to have reached a stable posterior distribution by examining whether the mean remains relatively constant and there is an absence of any sort of trend (indicating a sufficient amount of randomness in the means between iterations); in our case this looks to be true. Second, examine the plot to see how long it takes to reach this stationary phase; in our example it happens almost immediately, indicating good convergence. The dotted lines mark the iterations at which each imputed dataset is drawn. By default the burn-in period (the number of iterations before the first set of imputed values is drawn) is 200; this can be increased with the nbiter option on the mcmc statement if it appears that proper convergence is not achieved. Another plot that is very useful for assessing convergence is the autocorrelation plot, also requested on the mcmc statement using plots=acf. This helps us assess possible autocorrelation of parameter values between iterations. Say you noticed a trend in the mean social studies score in the trace plot; you might want to assess the magnitude of the dependency of scores across iterations, and the autocorrelation plot will show you that. In such a plot you would typically see that the correlation is perfect when the MCMC algorithm starts but quickly drops to near zero after a few iterations, indicating almost no correlation between iterations and therefore no correlation between values in adjacent imputed datasets. By default SAS draws an imputed dataset every 100 iterations; if the correlation appears high for longer than that, you will need to increase the number of iterations between imputed datasets using the niter option. Take a look at the SAS 9.4 proc mi documentation for more information about this and other options. Note: the amount of time it takes to get to zero (or near zero) correlation is an indication of convergence time (Enders, 2010). For more information on these and other diagnostic tools, please see Enders (2010) and Rubin (1987).
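A minimal sketch of requesting these MVN diagnostics; the particular plot requests and option values shown are illustrative assumptions consistent with the mcmc statement options just described (nbiter, niter, plots=), not a verbatim reproduction of the seminar's code.

ods graphics on;
* Trace and autocorrelation plots for the mean of socst, with a longer;
* burn-in (nbiter) and more iterations between imputed datasets (niter);
proc mi data = hsbmar nimpute = 10 out = mi_mvn seed = 54321;
  mcmc nbiter = 500 niter = 200 plots = (trace(mean(socst)) acf(mean(socst)));
  var read write math female prog_acad prog_voc science socst;
run;
ods graphics off;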
Example 2: MI using fully conditional specification (also known as imputation by chained equations, ICE, or sequential generalized regression)

A second method available in SAS imputes missing variables using the fully conditional specification (FCS) method, which does not assume a joint distribution but instead uses a separate conditional distribution for each imputed variable. This specification may be necessary if you are imputing a variable that must take on only specific values, such as a binary outcome for a logistic model or a count variable for a Poisson model. In simulation studies (Lee & Carlin, 2010; Van Buuren, 2007), FCS has been shown to produce estimates that are comparable to the MVN method. Later we will discuss some diagnostic tools that can be used to assess whether convergence was reached when using FCS. The FCS methods available in SAS are the discriminant function and logistic regression for binary/categorical variables, and linear regression and predictive mean matching for continuous variables. If you do not specify a method, the discriminant function and regression are used by default. Some interesting properties of each of these options are:

1. The discriminant function method allows the user to specify prior probabilities of group membership. By default, only continuous variables can be covariates in the discriminant function method; to change this default, use the classeffects option.
2. The logistic regression method assumes an ordering of the class variables if there are more than two levels.
3. The default imputation method for continuous variables is regression. The regression method allows the use of ranges and rounding for imputed values. These options are problematic and typically introduce bias (Horton et al., 2003; Allison, 2005). Take a look at the "Other Issues" section below for further discussion of this topic.
4. The predictive mean matching method provides imputed values that are consistent with observed values. If plausible values are necessary, this is a better choice than using bounds or rounding values produced by regression.

For more information on these methods and the options associated with them, see the SAS Help and Documentation on the FCS statement. The basic set-up for conducting an imputation is shown below. The var statement includes all the variables that will be used in the imputation model. If you want to impute a variable using a method other than the default, you can specify which variable(s) are to be imputed and by what method on the FCS statement. In this example we are imputing the binary variable female and the categorical variable prog using the discriminant function method. Since they are both categorical, we also list female and prog on the class statement.
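A minimal sketch of that basic FCS set-up, with our own choices of seed and burn-in; female and prog are imputed with the discriminant function method (classeffects=include so that categorical variables are also used as predictors), and the remaining variables fall back to the default regression method.

proc mi data = hsbmar nimpute = 20 out = mi_fcs seed = 54321;
  class female prog;
  * Discriminant function method for the categorical variables, 20 burn-in iterations;
  fcs nbiter = 20 discrim(female prog / classeffects = include);
  var socst science write read female math prog;
run;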
Note: Because we are using the discriminant function method to impute prog, we no longer need to create dummy variables. Additionally, we use the classeffects=include option so that all continuous and categorical variables will be used as predictors when imputing female and prog. All the other variables on the var statement will be imputed using regression, since a different distribution was not specified. The ordering of variables on the var statement controls the order in which variables will be imputed. With multiple imputation using FCS, a single imputation is conducted during an initial fill-in stage. After the initial stage, the variables with missing values are imputed in the order specified on the var statement, with each subsequent variable imputed using the observed and imputed values of the variables that preceded it. For more information on this, see White et al. (2010). As in the previous proc mi example using MVN, we can also specify the number of burn-in iterations using the nbiter option. The FCS statement also allows users to specify which variables to use as predictors; if no covariates are given for an imputed variable, SAS assumes that all the other variables on the var statement are used as its predictors. Multiple conditional distributions can be specified in the same FCS statement. For example, one specification imputes female and prog under a generalized logit distribution, which is appropriate for unordered categorical variables, instead of the default cumulative logit, which is appropriate for ordered variables. A second specification imputes female and prog under a generalized logit distribution and uses predictive mean matching to impute math, read, and write instead of the default regression method. A third specification indicates that prog and female should be imputed using different sets of predictors.

2. Analysis and Pooling Phase: Once the 20 multiply imputed datasets have been created, we can run our linear regression using proc genmod. Since we imputed female and prog under a distribution appropriate for categorical outcomes, the imputed values will now be true integer values. Take a look at the results of proc freq for female and prog in the second imputed dataset as compared to the original data with missing values: the FCS method has imputed "real" values for our categorical variables. Prog and female can now be used in the class statement below, and we no longer need to create dummy variables for prog. As with the previous example using MVN, we run our model on each imputed dataset stored in mi_fcs. We also use an ODS output statement to save the parameter estimates from our 20 regressions. A proc print of the parameter estimates in gm_fcs for the first two imputed datasets shows that _Imputation_ indicates which imputed dataset each set of parameter estimates belongs to, and "Level1" indicates the levels or categories of our class variables. The mianalyze procedure now requires some additional specification in order to combine the parameter estimates properly. The parameter estimates for variables used in our model's class statement have one row for each level, and a column called "Level1" specifies the name or label associated with each category. In order for mianalyze to estimate the combined estimates appropriately for the class variables, we need to add some options to the proc mianalyze line. As before, parms= refers to the input SAS data set that contains the parameter estimates computed from each imputed data set. However, we also need the classvar= option; this option is only appropriate when the model effects contain classification variables. Since proc genmod names the column indicating the classification level "Level1", we specify classvar=level. Note: different procedures in SAS require different classvar= options. A sketch of the analysis and pooling steps follows.
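Here is a minimal sketch of that analysis and pooling workflow for the FCS-imputed data; the dataset names follow the ones used above, and the model matches the seminar's analytic model.

* Analysis phase: fit the linear regression within each of the 20 imputed datasets;
proc genmod data = mi_fcs;
  class female prog;
  model read = write math female prog;
  by _Imputation_;
  ods output ParameterEstimates = gm_fcs;
run;

* Pooling phase: combine estimates; classvar=level matches genmod's "Level1" column;
proc mianalyze parms(classvar = level) = gm_fcs;
  class female prog;
  modeleffects Intercept write math female prog;
run;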
If you compare these estimates to those from the full data, you will see that the magnitudes of the write, female, and math parameter estimates using the FCS data are very similar to the results from the full data. Additionally, the overall significance or non-significance of specific variables remains unchanged. As with the MVN model, the standard errors are larger due to the incorporation of uncertainty around the parameter estimates, but they are still smaller than those observed in the complete case analysis.

4. Imputation Diagnostics: Like the previous imputation method using MVN, the FCS statement will output trace plots. These can be examined for the mean and standard deviation of each continuous variable in the imputation model. As before, the dashed vertical line indicates the final iteration, where the imputation occurred. Each line represents a different imputation, so all 20 imputation chains are overlaid on top of one another. Autocorrelation plots are only available with the mcmc statement when assuming a joint multivariate normal distribution; they are not available when using the FCS statement.

Other Issues

1. Why auxiliary variables? One question you may be asking yourself is why auxiliary variables are necessary or even important. First, they can help improve the likelihood of meeting the MAR assumption (White et al., 2011; Johnson and Young, 2011; Allison, 2012). Remember, a variable is said to be missing at random if other variables in the dataset can be used to predict missingness on a given variable, so you want your imputation model to include all the variables you think are associated with or predict missingness in your variable in order to fulfill the assumption of MAR. Second, including auxiliaries has been shown to help yield more accurate and stable estimates and thus reduce the estimated standard errors in analytic models (Enders, 2010; Allison, 2012; von Hippel and Lynch, 2013). This is especially true in the case of missing outcome variables. Third, including these variables can also help to increase power (Reis and Judd, 2000; Enders, 2010). In general, there is almost always a benefit to adopting a more "inclusive analysis strategy" (Enders, 2010; Allison, 2012).

2. Selecting the number of imputations (m): Historically, the recommendation was for three to five MI datasets. Relatively low values of m may still be appropriate when the fraction of missing information is low and the analysis techniques are relatively simple. Recently, however, larger values of m are often being recommended. To some extent, this change in the recommended number of imputations is based on the radical increase in the computing power available to the typical researcher, making it more practical to create and analyze multiply imputed datasets with a larger number of imputations. Recommendations for the number of imputations vary.
For example, five to 20 imputations may suffice for low fractions of missing information, and as many as 50 (or more) imputations may be needed when the proportion of missing data is relatively high. Remember that estimates of coefficients stabilize at much lower values of m than estimates of variances and covariances of error terms (i.e. standard errors). Thus, in order to get appropriate estimates of these parameters, you may need to increase m. A larger number of imputations may also allow hypothesis tests with less restrictive assumptions (i.e. tests that do not assume equal fractions of missing information for all coefficients). Multiple runs of m imputations are recommended to assess the stability of the parameter estimates. Graham et al. (2007) conducted a simulation demonstrating the effect on power, efficiency, and parameter estimates across different fractions of missing information as m decreases. The authors found that: 1) mean square error and standard error increased; 2) power was reduced, especially when the FMI is greater than 50% and the effect size is small, even for a large number of imputations (20 or more); and 3) the variability of the estimate of FMI increased substantially. In general, the estimation of FMI improves with an increased m. Another factor to consider is the importance of reproducibility between analyses using the same data. White et al. (2010), assuming the true FMI for any variable would be less than or equal to the percentage of cases that are incomplete, use the rule that m should equal the percentage of incomplete cases; thus if the FMI for a variable is 20%, you need 20 imputed datasets. An analysis by Bodner (2008) makes a similar recommendation. White et al. (2010) also found that, under this assumption, the error associated with estimating the regression coefficients, standard errors, and the resulting p-values was considerably reduced, resulting in an adequate level of reproducibility.

3. Maximum, minimum and rounding: This issue often comes up in the context of using MVN to impute variables that normally have integer values or bounds. Intuitively, it makes sense to round values or incorporate bounds to give "plausible" values. However, these methods have been shown to decrease efficiency and increase bias by altering the correlations or covariances between variables estimated during the imputation process. Additionally, these changes will often result in an underestimation of the uncertainty around imputed values. Remember that imputed values are NOT equivalent to observed values and serve only to help estimate the covariances between variables needed for inference (Johnson and Young, 2011). Leaving the imputed values as they are in the imputation model is perfectly fine for your analytic models. If plausible values are needed to perform a specific type of analysis, then you may want to use a different imputation algorithm such as FCS.

Isn't multiple imputation just making up data? No. This argument can be made about missing data methods that use a single imputed value, because that value is treated like observed data, but it is not true of multiple imputation. Unlike single imputation, multiple imputation builds into the model the uncertainty (error) associated with the missing data, so the process and subsequent estimation never depend on a single value. Additionally, another method for dealing with missing data, maximum likelihood, produces almost identical results to multiple imputation, and it does not require the missing information to be filled in.
What is passive imputation? Passive variables are functions of imputed variables. For example, suppose we have a variable X with missing information, but in the analytic model we need to use X squared. In passive imputation we would impute X and then use those imputed values to create the quadratic term. This method is called "impute then transform" (von Hippel, 2009). While this appears to make sense, additional research (Seaman et al., 2012; Bartlett et al., 2014) has shown that using this method is actually a misspecification of your imputation model and will lead to biased parameter estimates in your analytic model. There are better ways of dealing with transformations.

How do I treat variable transformations such as logs, quadratics and interactions? Most of the current literature on multiple imputation supports the method of treating variable transformations as "just another variable." For example, suppose you know that in your subsequent analytic model you are interested in looking at the modifying effect of Z on the association between X and Y (i.e. an interaction between X and Z). This is a property of your data that you want to be maintained in the imputation. Using something like passive imputation, where the interaction is created after you impute X and/or Z, means that the filled-in values are imputed under a model that assumes Z is not a moderator of the association between X and Y. Thus, your imputation model is now misspecified.

Should I include my dependent variable (DV) in my imputation model? Yes, an emphatic YES, unless you would like to impute your independent variables (IVs) assuming they are uncorrelated with your DV (Enders, 2010), causing the estimated associations between your DV and IVs to be biased toward the null (i.e. underestimated). Additionally, using imputed values of your DV is considered perfectly acceptable when you have good auxiliary variables in your imputation model (Enders, 2010; Johnson and Young, 2011; White et al., 2010). However, if good auxiliary variables are not available, you should still INCLUDE your DV in the imputation model and then restrict your analysis to only those observations with an observed DV value. Research has shown that imputing DVs when auxiliary variables are not present can add unnecessary random variation to your imputed values (Allison, 2012).

How much missingness can I have and still get good estimates using MI? Simulations have indicated that MI can perform well, under certain circumstances, with even up to 50% missing observations (Allison, 2002). However, the larger the amount of missing information, the higher the chance you will run into estimation problems during the imputation process and the lower the chance of meeting the MAR assumption, unless the missingness was planned (Johnson and Young, 2011). Additionally, as discussed above, the higher the FMI, the more imputations are needed to reach good relative efficiency for effect estimates, especially standard errors.

What should I report in my methods about my imputation? Most papers mention that they performed multiple imputation but give very few, if any, details of how they implemented the method. In general, a basic description should include: which statistical program was used to conduct the imputation; the type of imputation algorithm used (i.e. MVN or FCS); some justification for choosing a particular imputation method; the number of imputed datasets (m) created; the proportion of missing observations for each imputed variable;
and the variables used in the imputation model (and why), so your audience will know whether you used a more inclusive strategy. This is particularly important when using auxiliary variables. This may seem like a lot, but it probably would not require more than 4-5 sentences. Enders (2010) provides some examples of write-ups for particular scenarios. Additionally, MacKinnon (2010) discusses the reporting of MI procedures in medical journals.

Main takeaways from this seminar: Multiple imputation is superior to any of the single imputation methods because a single imputed value is never used and the variance estimates reflect the appropriate amount of uncertainty surrounding parameter estimates. There are several decisions to be made before performing a multiple imputation, including the distribution, the auxiliary variables, and the number of imputations, all of which can affect the quality of the imputation. Remember that multiple imputation is not magic; while it can help increase power, it should not be expected to produce "significant" effects when other techniques like listwise deletion fail to find significant associations. Multiple imputation is one tool for researchers to address the very common problem of missing data.

References

Allison (2002). Missing Data. Sage Publications.
Allison (2012). Handling Missing Data by Maximum Likelihood. SAS Global Forum: Statistics and Data Analysis.
Allison (2005). Imputation of Categorical Variables with PROC MI. SUGI 30 Proceedings, Philadelphia, Pennsylvania, April 10-13, 2005.
Barnard and Rubin (1999). Small-sample degrees of freedom with multiple imputation. Biometrika, 86(4), 948-955.
Bartlett et al. (2014). Multiple imputation of covariates by fully conditional specification: Accommodating the substantive model. Statistical Methods in Medical Research.
Bodner (2008). What improves with increased missing data imputations? Structural Equation Modeling: A Multidisciplinary Journal, 15(4), 651-675.
Demirtas et al. (2008). Plausibility of multivariate normality assumption when multiply imputing non-Gaussian continuous outcomes: a simulation assessment. Journal of Statistical Computation & Simulation, 78(1).
Enders (2010). Applied Missing Data Analysis. The Guilford Press.
Graham et al. (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science, 8, 206-213.
Horton et al. (2003). A potential for bias when rounding in multiple imputation. American Statistician, 57, 229-232.
Lee and Carlin (2010). Multiple imputation for missing data: Fully conditional specification versus multivariate normal imputation. American Journal of Epidemiology, 171(5), 624-632.
Lipsitz et al. (2002). A degrees-of-freedom approximation in multiple imputation. Journal of Statistical Computation and Simulation, 72(4), 309-318.
Little and Rubin (2002). Statistical Analysis with Missing Data, 2nd edition. New York: John Wiley.
Johnson and Young (2011). Toward best practices in analyzing datasets with missing data: Comparisons and recommendations. Journal of Marriage and Family, 73(5), 926-945.
MacKinnon (2010). The use and reporting of multiple imputation in medical research: a review. Journal of Internal Medicine, 268, 586-593.
Reis and Judd, editors (2000). Handbook of Research Methods in Social and Personality Psychology.
Rubin (1976). Inference and missing data. Biometrika, 63(3), 581-592.
Rubin (1987). Multiple Imputation for Nonresponse in Surveys. New York: J. Wiley & Sons.
Seaman et al. (2012).
Multiple imputation of missing covariates with non-linear effects: an evaluation of statistical methods. BMC Medical Research Methodology, 12(46).
Schafer and Graham (2002). Missing data: our view of the state of the art. Psychological Methods, 7(2), 147-177.
van Buuren (2007). Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16, 219-242.
von Hippel (2009). How to impute interactions, squares and other transformed variables. Sociological Methodology, 39, 265-291.
von Hippel and Lynch (2013). Efficiency gains from using auxiliary variables in imputation. Cornell University Library.
von Hippel (2013). Should a normal imputation model be modified to impute skewed variables? Sociological Methods & Research, 42(1), 105-138.
White et al. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377-399.
Young and Johnson (2011). Imputing the missing Y's: Implications for survey producers and survey users. Proceedings of the AAPOR Conference Abstracts, pp. 6242-6248.

Imputation of categorical and continuous data - multivariate normal vs chained equations

Question: Generally speaking, would you say that standard methods of multiple imputation (e.g. those available in PROC MI) have difficulty handling models with mixed (continuous and categorical) data? Or would you think (generally) that the multivariate normality assumption is robust in the context of MI for handling continuous and categorical missing data?

Answer: Opinion on this is somewhat mixed. A fair bit of work has been done on how to impute categorical data using the MVN model, and some papers have shown you can do quite well, provided you use so-called adaptive rounding methods for rounding the continuous imputed data. For more on this, see: CA Bernaards, TR Belin, JL Schafer. Robustness of a multivariate normal approximation for imputation of incomplete binary data. Statistics in Medicine 2007; 26:1368-1382. Lee and Carlin found that both chained equations and imputation via an MVN model worked well, even with some binary and ordinal variables: KJ Lee and JB Carlin. Multiple Imputation for Missing Data: Fully Conditional Specification Versus Multivariate Normal Imputation. American Journal of Epidemiology 2010; 171:624-632. In contrast, a paper by van Buuren concluded that the chained equations (also known as fully conditional specification, FCS) approach is preferable in situations with a mixture of continuous and categorical data: S van Buuren. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research 2007; 16:219-242. My personal opinion is that the chained equations approach is preferable with a mixture of continuous and categorical data.